Masked Autoencoders are Scalable Vision Learners

Paper

Notes while Reading

Summary

This paper introduces a method for self-supervised pretraining on images: an asymmetric autoencoder operating on heavily masked inputs (75% of patches removed), where the encoder processes only the visible patches and a lightweight decoder reconstructs the image from the encoded patches plus learned mask tokens, trained with a pixel-wise MSE loss computed on the masked patches. It outperforms the previous SOTA among models pretrained only on ImageNet-1K, with significantly less compute required.
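The masking and loss described above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: `random_masking` and `mse_on_masked` are hypothetical helper names, and the encoder/decoder networks are omitted entirely.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_masking(patches, mask_ratio=0.75):
    """Keep a random subset of patches (MAE's default masks 75%).

    patches: (num_patches, dim) array of flattened image patches.
    Returns the visible patches plus kept/masked index arrays.
    """
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    perm = rng.permutation(n)
    keep_idx = np.sort(perm[:n_keep])
    mask_idx = np.sort(perm[n_keep:])
    return patches[keep_idx], keep_idx, mask_idx

def mse_on_masked(pred, target, mask_idx):
    """Pixel-wise MSE, computed only on the masked patches."""
    diff = pred[mask_idx] - target[mask_idx]
    return float((diff ** 2).mean())

# Toy example: a 224x224 image split into 16x16 patches
# gives 196 patches, each flattened to 16*16*3 = 768 values.
patches = rng.standard_normal((196, 768))
visible, keep_idx, mask_idx = random_masking(patches)
print(visible.shape)  # the encoder would see only 49 of 196 patches
```

The key efficiency win in the paper comes from the encoder running on just the visible 25% of patches, while the small decoder handles the full token set.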