Dec 11, 2023

Who Would Learn Faster? A Chicken or Vision Transformer?

A NeurIPS study comparing newborn chicks and Vision Transformers on view-invariant object recognition from limited visual experience.

A recent NeurIPS study tested whether Vision Transformers (ViTs) are more data-hungry than living organisms. Newborn chicks served as the biological "model": they were raised in controlled environments, and the study focused on their ability to recognize objects from different viewpoints.

The chicks were exposed to a single 3D object rotating through a 60° viewpoint range in their visual environment for one week. In the test phase, they were assessed on whether they could recognize this object from new views.

How the experiment was replicated for ViTs

The visual experiences of the chicks were replicated in a virtual controlled-rearing chamber. An agent in this virtual environment captured first-person images, which were then used to train the ViTs with a self-supervised algorithm (ViT-CoT) that leverages time as a teaching signal.
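To make "time as a teaching signal" concrete, here is a minimal Python sketch of how positive pairs might be drawn from a first-person frame sequence: frames that occur close together in time are treated as views of the same object. The helper name `temporal_pairs` and the `window` parameter are illustrative assumptions, not taken from the study's code.

```python
def temporal_pairs(num_frames, window=2):
    """Return (anchor, positive) index pairs for frames at most `window` steps apart.

    Hypothetical helper: the window size is an assumption for illustration,
    not a value reported in the study.
    """
    pairs = []
    for i in range(num_frames):
        # Pair frame i with each later frame that falls inside the window.
        last = min(i + window, num_frames - 1)
        for j in range(i + 1, last + 1):
            pairs.append((i, j))
    return pairs


if __name__ == "__main__":
    # Five consecutive frames, pairing each with its next two neighbours.
    print(temporal_pairs(5, window=2))
```

Frames that never share a window can then serve as negatives, so the model needs no labels, only the order in which the images were captured.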

During training, each image was split into patches that were fed to the ViT encoder, and the learning objective was contrastive learning through time.
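The two ingredients mentioned here, patch extraction and a contrastive objective, can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the study's implementation. The loss is a generic InfoNCE-style stand-in, the `temperature` value is arbitrary, and the vectors passed to `contrastive_loss` stand in for embeddings that a real ViT encoder would produce. All function names are hypothetical.

```python
import math


def patchify(image, patch):
    """Split a 2D image (list of rows) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            patches.append([image[top + r][left + c]
                            for r in range(patch) for c in range(patch)])
    return patches


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def contrastive_loss(anchor, positive, negatives, temperature=0.5):
    """InfoNCE-style loss (an assumed stand-in for the study's objective).

    The loss is small when the anchor is more similar to the positive
    (a temporally close frame) than to the negatives (distant frames).
    """
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


if __name__ == "__main__":
    image = [[r * 4 + c for c in range(4)] for r in range(4)]
    print(patchify(image, 2))  # four 2x2 patches, each flattened to length 4
    print(contrastive_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]]))
```

Minimizing a loss of this shape pulls embeddings of temporally adjacent frames together and pushes temporally distant ones apart. This is one plausible way for a rotating object's different views to end up with similar representations.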

Results

ViT-CoTs performed on par with or better than chicks across the 11 viewpoint ranges. Performance was similar across the small, medium, and large ViT-CoT architectures; the one exception was the smallest ViT-CoT architecture, which showed a drop in performance.

The number of training images significantly impacted performance. Untrained ViTs performed worst, and performance improved gradually as the number of training images increased. This improvement pattern was consistent across architecture sizes, suggesting that larger ViT-CoTs were not more data-hungry than smaller ones.

The VideoMAE algorithm also performed well, matching the performance of newborn chicks. This demonstrated that different ViT models could spontaneously develop view-invariant object features when trained on first-person views similar to those available to newborn chicks.

CNNs, when equipped with a biologically plausible time-based learning objective, outperformed ViTs, which might be attributed to the strong architectural inductive bias present in CNNs. Despite this, ViTs successfully learned view-invariant object features in the same impoverished visual environments as the newborn chicks.

Takeaway

The study has its limitations but still shows us how we can compare living vision systems to artificial ones. I find this study interesting, although I hope the tests were performed safely on animals, without causing them any harm.

The study: Are Vision Transformers More Data Hungry Than Newborn Visual Systems?

Original post: LinkedIn