data2vec - A General Framework for Self-supervised Learning in Speech, Vision and Language

Paper

Notes while Reading

Summary

The authors present a general framework for self-supervised learning across modalities, using a Transformer backbone in a student-teacher setup: the student receives a masked view of the input and is trained to predict contextualized latent representations that the teacher produces from the full, unmasked input. The method matches or outperforms the prior state of the art on specific benchmarks in all three modalities (speech, vision, and language).
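The core training loop can be sketched in miniature. This is a hypothetical illustration, not the paper's implementation: scalar lists stand in for Transformer weights and representations, `tau` is an assumed EMA decay, and the smooth-L1 regression on masked positions mirrors the loss the paper describes.

```python
# Hypothetical sketch of data2vec-style student-teacher training.
# Scalar lists stand in for Transformer weights / latent representations.

def ema_update(teacher, student, tau=0.999):
    """Teacher weights track the student via an exponential moving average."""
    return [tau * t + (1.0 - tau) * s for t, s in zip(teacher, student)]

def masked_prediction_loss(student_out, teacher_targets, mask):
    """Smooth-L1-style regression, computed only at masked positions."""
    losses = []
    for s, t, masked in zip(student_out, teacher_targets, mask):
        if not masked:
            continue  # unmasked positions contribute no loss
        d = abs(s - t)
        # smooth L1: quadratic near zero, linear for large errors
        losses.append(0.5 * d * d if d < 1.0 else d - 0.5)
    return sum(losses) / max(len(losses), 1)

# Toy step: the student regresses the teacher's targets where the input
# was masked, then the teacher is nudged toward the student's weights.
loss = masked_prediction_loss(
    student_out=[0.5, 3.0, 1.0],
    teacher_targets=[1.0, 1.0, 1.0],
    mask=[True, True, False],
)
teacher = ema_update(teacher=[1.0, 2.0], student=[0.0, 0.0], tau=0.9)
```

Because the targets are contextualized latents rather than raw pixels, words, or waveforms, the same objective applies unchanged across all three modalities.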