Jan 21, 2024
Self-Rewarding LLMs
An explanation of self-rewarding language models, where models generate prompts, judge responses, and iteratively train from their own feedback.
Researchers from Meta and New York University have presented Self-Rewarding Language Models, a new approach, or even a "paradigm shift", in how language models are trained. By generating its own training data and evaluating its own responses, the model is meant to keep improving itself from one training round to the next.
It is widely known that aligning LLMs to human preference data, e.g., with Reinforcement Learning from Human Feedback (RLHF), can improve the performance of pretrained models. In RLHF, a reward model is first trained from these human preferences. The reward model is then frozen and used to train the LLM with RL, e.g., via PPO. A recent alternative is to skip training a reward model altogether and use the human preferences directly to train the LLM, as in Direct Preference Optimization. In both cases, the approach is bottlenecked by the size and quality of the human preference data and, in the case of RLHF, also by the quality of the frozen reward model trained from that data.
Methodology
The methodology is iterative. In each round, the model generates new prompts, creates responses to them, evaluates those responses, and is then trained on the resulting self-generated preference dataset. With each iteration, both the model's ability to generate high-quality responses and its ability to judge them are expected to improve. A single round looks like this:
- First, instead of relying on a manually prepared training dataset, the model itself generates new training prompts. These prompts are meant to be diverse and to cover a wide range of topics and writing styles.
- Once the model has generated responses to its own prompts, it enters the evaluation phase. In this stage, the model acts as its own judge and assigns a reward to each response. The reward assignment is not arbitrary: it is based on criteria set by the researchers, so that the model's self-evaluation stays aligned with the desired performance metrics.
- Next, the responses and their self-assigned rewards are turned into a new preference dataset, generated entirely by the model itself. This preference dataset then serves as the basis for further training.
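The last two steps can be sketched in a few lines of Python. This is a minimal illustration, not the researchers' code: `PreferencePair`, `build_pair`, and `toy_judge` are hypothetical names, the best-versus-worst pairing rule is an assumption about how scores become preferences, and the judge is a toy stand-in for the model scoring its own answers.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # response the judge scored highest
    rejected: str  # response the judge scored lowest


def build_pair(
    prompt: str,
    responses: List[str],
    judge: Callable[[str, str], float],
) -> Optional[PreferencePair]:
    """Score every candidate response and pair the best with the worst."""
    if len(responses) < 2:
        return None  # a preference needs at least two candidates
    scored = [(judge(prompt, r), r) for r in responses]
    best = max(scored, key=lambda s: s[0])
    worst = min(scored, key=lambda s: s[0])
    if best[0] == worst[0]:
        return None  # identical scores carry no preference signal
    return PreferencePair(prompt, chosen=best[1], rejected=worst[1])


def toy_judge(prompt: str, response: str) -> float:
    # Hypothetical stand-in for the model judging itself:
    # one point per word, capped at 5.
    return float(min(len(response.split()), 5))


pair = build_pair(
    "Name a prime number.",
    ["7", "7 is a prime number.", "Seven, I think."],
    toy_judge,
)
print(pair.chosen)    # 7 is a prime number.
print(pair.rejected)  # 7
```

In a real setup the judge would be the same language model prompted with the researchers' scoring criteria, and prompts whose candidates all receive the same score would simply contribute nothing to the dataset.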
The model is then trained with Direct Preference Optimization (DPO) on this dataset, learning to favor the responses it previously rated highly over the ones it rated poorly. Repeating this DPO step across iterations is what lets the model improve both at following instructions and at evaluating responses. Essentially, the model learns from its own judgments.
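To make the training signal concrete, here is a small sketch of the standard DPO loss for a single preference pair. It works on plain floats (summed log-probabilities of each response) rather than on a real model; the function name `dpo_loss` and the default `beta=0.1` are assumptions for illustration, since the post does not state the hyperparameters used.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; controls how far the policy may drift
) -> float:
    """DPO loss for one (chosen, rejected) pair.

    Each argument is the log-probability of a full response under either
    the model being trained (policy) or the frozen reference model.
    """
    # How much more the policy prefers "chosen" over "rejected",
    # relative to the reference model.
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: loss is log(2).
print(dpo_loss(-1.0, -1.0, -1.0, -1.0))
# Policy already leans toward the chosen response: loss is lower.
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

The loss shrinks as the policy raises the likelihood of the chosen response and lowers that of the rejected one relative to the reference, which is exactly the preference the model assigned to itself in the judging step.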
Results
The Self-Rewarding Language Models approach shows the ability to iteratively improve through self-generated rewards and preference data.
The models perform well on the AlpacaEval 2.0 leaderboard: the model from the last iteration outperformed several existing models, including Claude 2, Gemini Pro, and GPT-4 0613. This is particularly notable because the Self-Rewarding model started from a small set of seed data and generated its own targets and rewards for the later iterations.
Not every iteration contributed equally, though. In head-to-head comparisons, the first iteration (M1) did not significantly change instruction-following performance relative to the Supervised Fine-Tuning (SFT) baseline. The second iteration (M2) showed a notable improvement, winning 55.5% of comparisons against M1 and 49.2% against the SFT baseline. The third iteration (M3) continued this trend, outperforming M2 and winning 62.5% of comparisons against the SFT baseline.
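For readers unfamiliar with head-to-head numbers, the sketch below shows how a "wins" percentage of this kind can be computed. It assumes each prompt ends in a win, a tie, or a loss for the first model; the function name `head_to_head` and the four outcomes in the example are made up for illustration and are not data from the paper.

```python
from collections import Counter
from typing import Dict, List


def head_to_head(outcomes: List[str]) -> Dict[str, float]:
    """Turn per-prompt outcomes ('win', 'tie', 'loss') into percentages."""
    if not outcomes:
        raise ValueError("need at least one comparison")
    counts = Counter(outcomes)
    total = len(outcomes)
    return {k: 100.0 * counts[k] / total for k in ("win", "tie", "loss")}


# Toy example with four judged prompts (illustrative only).
print(head_to_head(["win", "win", "tie", "loss"]))
# {'win': 50.0, 'tie': 25.0, 'loss': 25.0}
```

Under this reading, a win rate just below 50% can still indicate a clear improvement, because the remaining comparisons are split between ties and losses rather than being all losses.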
Conclusion
Self-Rewarding LLMs generate their own feedback, which makes them more autonomous than RLHF-trained models, which depend heavily on external human feedback. The flip side is that the model's performance depends heavily on the quality of the self-generated training data. If the initial iterations produce low-quality data, this could limit the model's improvement in later iterations. Likewise, if the model develops or acquires systematic biases or errors in the early stages, they might be reinforced through subsequent iterations. The self-rewarding approach is still in its early stages, and further exploration is needed, including safety evaluations and a better understanding of the limits of iterative training. Nonetheless, the paper opens doors for future research and development.