Large Language Models (LLMs) Prefer Reinforcement Learning over Supervised Learning for Finetuning: Here’s Why


Welcome to the fascinating world of Large Language Models and Reinforcement Learning! In recent months, Generative Artificial Intelligence has advanced rapidly, and LLMs are leading that growth. These models are exceptional and can carry out tasks that were once only possible for humans. In this blog post, we will explore how reinforcement learning is used to fine-tune LLMs such as ChatGPT, the Pathways Language Model (PaLM), and Chinchilla. We’ll also examine why Reinforcement Learning is preferred over Supervised Learning for fine-tuning. Get ready to dive deep into the world of AI and explore these exciting concepts with us!

Reinforcement learning is a feedback-driven machine learning method in which an agent learns to complete a task by taking actions and analyzing the rewards those actions produce. LLMs use RL for fine-tuning, and ChatGPT’s remarkable performance is largely due to Reinforcement Learning from Human Feedback (RLHF). ChatGPT uses RLHF to reduce biases and to train a reward model that estimates the quality of a generated response, rather than just a ranking score.
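
To make the reward-model idea concrete, here is a minimal, hypothetical sketch in PyTorch of a model that scores a whole response with a single scalar and is trained on human preference pairs. The tiny GRU architecture, the sizes, and the function names are illustrative assumptions, not ChatGPT’s actual implementation.

```python
# Minimal sketch (not OpenAI's implementation): a reward model that maps a
# tokenized response to one scalar "quality" score. Sizes and architecture
# are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    def __init__(self, vocab_size: int = 50_000, hidden: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.score_head = nn.Linear(hidden, 1)  # a scalar quality score, not a rank

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(token_ids)                   # (batch, seq, hidden)
        _, h = self.encoder(x)                      # final hidden state summarizes the response
        return self.score_head(h[-1]).squeeze(-1)   # (batch,) estimated response quality

def preference_loss(score_preferred: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    # Trained on human preference pairs: the preferred response should score higher.
    return -F.logsigmoid(score_preferred - score_rejected).mean()
```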

So why not use Supervised Learning for fine-tuning? The first reason is that Supervised Learning only learns to predict ranks at the output level, while RLHF trains the model to actually produce coherent, high-quality responses. Sebastian Raschka, an AI and ML researcher, has noted that the SL approach does not refine the quality of the generated response the way RLHF does. The second reason concerns Raschka’s suggested workaround: reformulate the task as a constrained optimization problem in Supervised Learning by combining the output text loss with a reward score term. Even then, this approach only succeeds if the question-answer (QA) pairs are produced correctly.
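
Below is a hedged sketch of what such a combined supervised objective might look like: the usual next-token cross-entropy plus a weighted reward term. The `reward_model` argument, `lambda_reward`, and the tensor shapes are assumptions for illustration; this is not Raschka’s or OpenAI’s actual formulation.

```python
# Sketch of a combined objective: imitation loss plus a reward term.
# All names and shapes are illustrative assumptions.
import torch
import torch.nn.functional as F

def combined_loss(logits, target_ids, response_ids, reward_model, lambda_reward=0.1):
    # Standard token-level cross-entropy on the reference answer (the SL part).
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)), target_ids.view(-1))
    # Sequence-level quality estimate of the generated response (the reward part).
    reward = reward_model(response_ids).mean()
    # Minimizing this trades off imitation against estimated response quality.
    # Note: sampled token ids give no gradient path from the reward back to the
    # policy, which is one reason this reformulation is hard to get right.
    return ce - lambda_reward * reward
```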

The third reason for not opting for SL is its use of cross-entropy to optimize a token-level loss. A change to an individual token may have only a minimal impact on the overall loss, yet a single token can alter the meaning of an entire response, and generating coherent conversations is a complex problem, so token-level supervision alone may not be sufficient. Empirical studies back this up: the 2020 paper “Learning to Summarize from Human Feedback” showed that RLHF outperforms SL because it optimizes a cumulative reward over the whole generated text rather than independent token-level errors.
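
The following sketch contrasts the two objectives under simplified, assumed tensor shapes: token-level cross-entropy averages per-token errors, while a REINFORCE-style RLHF objective weights every token of a sampled response by a single sequence-level reward.

```python
# Illustrative contrast between token-level and sequence-level objectives.
# Shapes are assumptions: logits (batch, seq, vocab), target_ids (batch, seq),
# log_probs_of_sampled_tokens (batch, seq), sequence_reward (batch,).
import torch
import torch.nn.functional as F

def token_level_sl_loss(logits, target_ids):
    # Averages per-token errors; one wrong token barely moves the loss,
    # even if it flips the meaning of the whole reply.
    return F.cross_entropy(logits.view(-1, logits.size(-1)), target_ids.view(-1))

def sequence_level_rl_loss(log_probs_of_sampled_tokens, sequence_reward):
    # REINFORCE-style objective: one reward for the full response scales the
    # log-probability of every token in it, so whole-answer coherence matters.
    return -(sequence_reward * log_probs_of_sampled_tokens.sum(dim=-1)).mean()
```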

LLMs like InstructGPT and ChatGPT use both Supervised Learning and Reinforcement Learning. The combination of the two is crucial for attaining optimal performance: the SL stage helps the model learn the basic structure and content of the task, while the RLHF stage refines the model’s responses for improved accuracy.
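
The sketch below makes the ordering of the two stages explicit. The stage functions are hypothetical placeholders rather than a real training API; they simply stand in for the SL and RLHF loops described above.

```python
# High-level ordering of the two stages. All functions are placeholders
# (they return the model unchanged) used only to show the sequencing.
def supervised_finetune(model, demonstrations):
    """Stage 1 (SL): fit the model to human-written prompt/response pairs."""
    return model  # placeholder: a real loop would minimize cross-entropy here

def train_reward_model(model, preference_rankings):
    """Learn a scalar reward from human rankings of candidate responses."""
    return model  # placeholder

def rlhf_finetune(model, reward_model, prompts):
    """Stage 2 (RL): optimize the SFT model against the reward model (e.g., with PPO)."""
    return model  # placeholder

def align_llm(base_model, demonstrations, preference_rankings, prompts):
    sft_model = supervised_finetune(base_model, demonstrations)
    reward_model = train_reward_model(sft_model, preference_rankings)
    return rlhf_finetune(sft_model, reward_model, prompts)
```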

In conclusion, using RLHF for LLM fine-tuning has shown better results than SL. It is essential to consider the context and coherence of the entire conversation, which RLHF handles better than SL. We hope this blog post has given you a better understanding of the fascinating world of LLMs, Reinforcement Learning, and how they help models produce human-like responses. Stay tuned for more fascinating insights into the world of AI and Machine Learning.
