Jan 7, 2024
Language Models May Not Be Few-Shot Anymore
A look at task contamination and why benchmark gains in zero-shot and few-shot LLM evaluation may be misleading.
LLMs are well known for their ability on zero-shot and few-shot tasks, often outperforming dedicated models on diverse benchmarks. However, a recent study on task contamination in LLMs offers a critical perspective: these impressive benchmark results might be skewed by test data or task examples that the model saw during training.
Understanding task contamination
Contamination, also referred to as data contamination or data leakage, occurs when test examples are used during model training. This leads to higher performance on familiar data than on unseen data. The paper distinguishes two primary forms:
- Test data contamination: test data examples, together with their labels, are included in the model's pre-training data.
- Task contamination: task training examples are included in the pre-training data, undermining the validity of zero-shot or few-shot evaluations.
In zero-shot evaluation, the model has seen no examples of the task. In few-shot (or N-shot) evaluation, where N is a small number, the model has seen N examples of the task.
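To make the distinction concrete, here is a minimal sketch of how a zero-shot and an N-shot prompt might be assembled. The `build_prompt` helper, the `Input:`/`Label:` template, and the sentiment examples are all hypothetical and only illustrate the idea; the paper's actual prompts are not reproduced here.

```python
from typing import Sequence, Tuple


def build_prompt(
    instruction: str,
    query: str,
    demonstrations: Sequence[Tuple[str, str]] = (),
) -> str:
    """Build an N-shot prompt, where N = len(demonstrations) (0 means zero-shot)."""
    parts = [instruction]
    # Each demonstration is a solved example shown to the model in the prompt.
    for text, label in demonstrations:
        parts.append(f"Input: {text}\nLabel: {label}")
    # The query is left unlabeled for the model to complete.
    parts.append(f"Input: {query}\nLabel:")
    return "\n\n".join(parts)


instruction = "Classify the sentiment as positive or negative."

# Zero-shot: the prompt contains no solved examples.
zero_shot = build_prompt(instruction, "The plot was dull.")

# Two-shot: the prompt contains N = 2 solved examples before the query.
two_shot = build_prompt(
    instruction,
    "The plot was dull.",
    demonstrations=[("I loved it.", "positive"), ("Terrible acting.", "negative")],
)
```

If the model already saw labeled examples of this task during pre-training, the `zero_shot` prompt is zero-shot in name only, which is exactly the concern the paper raises.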
Methodology
The authors proposed four different methods for measuring task contamination:
- Training data inspection: Analyzing training data for the presence of task training examples.
- Task example extraction: Extracting task examples from existing models.
- Membership inference: Determining if model-generated content for a specific input is an exact match of the original sample.
- Chronological analysis: Evaluating models trained at different times on datasets with known release dates, seeking chronological evidence of contamination.
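The membership inference check in the list above can be sketched roughly as follows. This is an illustration under assumptions, not the authors' implementation: `generate` stands for any function that maps an input to model output, `toy_model` is a hypothetical stand-in for a real model call, and stripping surrounding whitespace before comparing is a choice made here.

```python
from typing import Callable, Iterable, Tuple


def exact_match_rate(
    generate: Callable[[str], str],
    examples: Iterable[Tuple[str, str]],
) -> float:
    """Fraction of (input, reference) pairs where the output equals the reference."""
    examples = list(examples)
    if not examples:
        return 0.0
    # Assumption: surrounding whitespace is ignored; everything else must match exactly.
    hits = sum(generate(x).strip() == ref.strip() for x, ref in examples)
    return hits / len(examples)


# Hypothetical stand-in for a real model: it has "memorized" one dataset example.
memorized = {"Translate to French: cat": "chat"}


def toy_model(prompt: str) -> str:
    return memorized.get(prompt, "unknown")


dataset = [
    ("Translate to French: cat", "chat"),
    ("Translate to French: dog", "chien"),
]
rate = exact_match_rate(toy_model, dataset)  # one of two examples is reproduced
```

A high exact-match rate on a generation task is strong evidence that the model has seen the data, but a low rate proves nothing, since a contaminated model need not reproduce samples verbatim.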
The first three methods, while precise, suffer from low recall: the absence of contamination evidence does not guarantee that there is no contamination. Chronological analysis, on the other hand, offers high recall but low precision: it can flag contamination effectively, but the trend it detects may also be caused by other factors that affect performance.
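One way to picture chronological analysis is to compare how far a model sits above a simple baseline on datasets released before versus after its training cutoff. The sketch below assumes a hypothetical `Result` record and a hypothetical cutoff date, and all numbers are invented for illustration; none of them come from the paper.

```python
from datetime import date
from statistics import mean
from typing import Dict, List, NamedTuple, Optional


class Result(NamedTuple):
    dataset: str
    released: date   # dataset release date
    accuracy: float  # model accuracy on the dataset
    baseline: float  # majority-baseline accuracy on the same dataset


def margin_by_period(results: List[Result], cutoff: date) -> Dict[str, Optional[float]]:
    """Mean margin over baseline for datasets released before vs. after the cutoff."""
    before = [r.accuracy - r.baseline for r in results if r.released <= cutoff]
    after = [r.accuracy - r.baseline for r in results if r.released > cutoff]
    return {
        "before": mean(before) if before else None,
        "after": mean(after) if after else None,
    }


# Invented toy numbers, for illustration only.
results = [
    Result("old_a", date(2019, 5, 1), accuracy=0.75, baseline=0.5),
    Result("old_b", date(2020, 2, 1), accuracy=0.875, baseline=0.625),
    Result("new_a", date(2022, 3, 1), accuracy=0.5, baseline=0.5),
    Result("new_b", date(2023, 1, 1), accuracy=0.625, baseline=0.5),
]
margins = margin_by_period(results, cutoff=date(2021, 9, 1))
```

A clearly larger margin on the older datasets is consistent with contamination, but it is not proof: older datasets may simply be easier or better covered in general web text, which is why this method has low precision.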
Results
The authors found evidence that some LLMs have seen task examples during pre-training for a range of tasks. Task contamination potentially inflates the zero-shot or few-shot performance of closed-source models, making them unreliable baselines in these settings, especially models enhanced with instruction fine-tuning or RLHF. Moreover, where there was no task contamination, LLMs did not show statistically significant improvements over majority baselines in either zero-shot or few-shot settings. This suggests that the performance increase over time in GPT-3 series models on various tasks is likely attributable to task contamination.
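For readers unfamiliar with the term, a majority baseline always predicts the most frequent label. The sketch below computes it and checks whether a model's accuracy is significantly higher. The summary above does not say which significance test the authors used, so the one-sided exact binomial test here is an assumption, as are the 0.05 threshold and the toy label counts.

```python
from collections import Counter
from math import comb
from typing import Sequence


def majority_baseline_accuracy(labels: Sequence[str]) -> float:
    """Accuracy obtained by always predicting the most frequent label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


def binomial_p_value(correct: int, total: int, baseline: float) -> float:
    """One-sided exact binomial test: P(X >= correct) if true accuracy equals baseline."""
    return sum(
        comb(total, k) * baseline**k * (1 - baseline) ** (total - k)
        for k in range(correct, total + 1)
    )


# Invented toy test set: 60 negative and 40 positive examples.
gold = ["neg"] * 60 + ["pos"] * 40
baseline = majority_baseline_accuracy(gold)

# Suppose a model gets 64 of the 100 examples right (an invented figure).
p_value = binomial_p_value(correct=64, total=len(gold), baseline=baseline)
beats_baseline = p_value < 0.05  # assumed significance threshold
```

This treats the baseline accuracy as a fixed known value, which is a simplification. The point is only that a few points of accuracy above the majority class on a small test set may not be a statistically meaningful gain.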
Conclusions
The research does not necessarily imply that LLMs are "bad"; they still perform very well. However, it exposes problems with existing benchmarks and with how LLMs are compared. Investigating task contamination seems necessary if we want to better understand LLM performance.