Google AI Researchers Propose Method for Highly Efficient and Stable Training of 22B-Parameter ViT (ViT-22B)


Are you looking for the latest breakthrough in vision transformers? Look no further! In this blog post, we will be discussing ViT-22B, a Transformer-based encoder model with 22 billion parameters, the largest vision transformer reported to date. Through careful design and architecture improvements, ViT-22B achieves state-of-the-art performance on a range of benchmarks.

We will start off by looking at why the rise of vision transformers has been so significant. We will then discuss the techniques used to create ViT-22B, such as parallel layers and normalization of queries and keys. We will also look at the performance of ViT-22B on various tasks, such as classification and dense output tasks. Lastly, we will look at the model’s properties and how they more closely match human perception.

So, let’s dive into the world of vision transformers and explore the wonders of ViT-22B!

# The Rise of Vision Transformers

The rise of vision transformers has been driven by larger datasets, scalable infrastructure, and innovative training techniques. At large scale, language models have significantly outpaced vision models in terms of emergent capabilities. The largest dense language model has 540B parameters, while the largest dense vision model has only 4B parameters; even a moderately sized, entry-level competitive language model typically has more than 10B parameters.

# Innovative Techniques Used to Create ViT-22B

ViT-22B is a Transformer-based encoder model that uses parallel layers, query/key (QK) normalization, and omitted biases to increase efficiency and training stability at scale. Its architecture is similar to that of the original Vision Transformer. To increase efficiency, the MLP and attention blocks are applied in parallel instead of sequentially. To improve training stability, LayerNorm is applied to the queries and keys before the dot-product attention is computed, and biases are excluded from the LayerNorms and the QKV projections.
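To make these ideas concrete, here is a minimal PyTorch sketch of a parallel Transformer block with QK normalization and bias-free projections. The dimensions, the shared input LayerNorm across both branches, and the class name are illustrative assumptions, not the exact ViT-22B configuration described in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParallelViTBlock(nn.Module):
    """Illustrative parallel Transformer block (sketch, not the official ViT-22B code).

    The attention and MLP branches read the same normalized input and their
    outputs are summed, instead of being applied sequentially. Queries and
    keys are LayerNorm'd before the dot-product attention, and biases are
    omitted from the LayerNorms and the QKV/output projections.
    """

    def __init__(self, dim=1024, num_heads=16, mlp_ratio=4):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.norm = nn.LayerNorm(dim, bias=False)          # shared input norm (assumption)
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)     # bias-free QKV projection
        self.proj = nn.Linear(dim, dim, bias=False)        # bias-free output projection
        self.q_norm = nn.LayerNorm(self.head_dim, bias=False)  # QK normalization
        self.k_norm = nn.LayerNorm(self.head_dim, bias=False)
        self.mlp = nn.Sequential(                          # MLP branch, run in parallel
            nn.Linear(dim, mlp_ratio * dim, bias=False),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim, bias=False),
        )

    def forward(self, x):                                  # x: (batch, tokens, dim)
        B, N, D = x.shape
        h = self.norm(x)

        # --- attention branch ---
        q, k, v = self.qkv(h).chunk(3, dim=-1)
        q = q.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        q, k = self.q_norm(q), self.k_norm(k)              # normalize queries and keys
        attn = F.scaled_dot_product_attention(q, k, v)
        attn = self.proj(attn.transpose(1, 2).reshape(B, N, D))

        # --- MLP branch, computed in parallel with attention ---
        mlp = self.mlp(h)

        # residual plus the sum of both parallel branches
        return x + attn + mlp
```

Because the two branches no longer depend on each other's output, their linear layers can be fused or overlapped on accelerators, which is where the efficiency gain at scale comes from.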

# Performance of ViT-22B

ViT-22B obtains 89.5% accuracy on ImageNet even when used only as a frozen visual feature extractor. In the zero-shot setting, it reaches 85.9% accuracy on ImageNet using a text tower trained to match its visual features. Moreover, ViT-22B is an excellent teacher: using it as a distillation target, the researchers train a ViT-B student that scores a state-of-the-art 88.6% on ImageNet. This performance is accompanied by improvements in reliability, uncertainty estimation, and fairness tradeoffs. Lastly, the model’s behavior more closely matches human perception, yielding a previously unseen shape bias of 87%.
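The "frozen visual feature extractor" result means the backbone weights are never updated; only a lightweight head is trained on top of its features. Below is a minimal sketch of that linear-probing setup. The `backbone` here is a hypothetical stand-in module, not ViT-22B itself, and the dimensions and hyperparameters are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a pretrained backbone such as ViT-22B:
# any module mapping images to a feature vector works the same way here.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 1024))

# Freeze the backbone: only the linear head is trained.
for p in backbone.parameters():
    p.requires_grad = False
backbone.eval()

head = nn.Linear(1024, 1000)                      # e.g. 1000 ImageNet classes
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_step(images, labels):
    with torch.no_grad():                         # no gradients through the frozen extractor
        feats = backbone(images)
    logits = head(feats)
    loss = loss_fn(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The appeal of this setup is cost: the expensive 22B-parameter forward pass can be cached once per image, and only the small head needs to be optimized for each downstream task.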

# Conclusion

ViT-22B is a remarkable vision transformer model that achieves state-of-the-art performance on a range of benchmarks. Through careful design and architecture improvements, ViT-22B increases efficiency and training stability at scale. Its performance on classification and dense output tasks is remarkable. Moreover, its behavior more closely matches human perception, yielding a previously unseen shape bias of 87%.

Check out the Paper. All Credit For This Research Goes To the Researchers on This Project. Also, don’t forget to join our 14k+ ML SubReddit, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.
