Surgical Fine-Tuning Improves Adaptation to Distribution Shifts

[Paper Figure 1]

This paper presents “surgical fine-tuning”, which selectively fine-tunes certain blocks of a pretrained model, as a method for improving adaptation to distribution shifts given a small amount of labeled data from the target distribution. The authors find that for input-level shifts, tuning only the early layers performs best; for feature-level shifts (i.e., different subgroups of the same classes appearing in the source and target domains), tuning intermediate layers works better; and for label-level shifts (e.g., spurious correlations), tuning only the last layer works best. They validate this across 7 datasets (CIFAR-C, ImageNet-C, Living-17, Entity-30, Waterbirds, CIFAR-Flip, and CelebA), mostly using ImageNet-pretrained ResNets.

The authors then provide a theoretical analysis and framework based on a two-layer neural network, explaining why tuning the first layer works better under input-level shift and tuning the last layer works better under label-level shift.

Finally, the authors propose Auto-RGN (Relative Gradient Norm) and Auto-SNR (Signal-to-Noise Ratio) as criteria for freezing or down-weighting layers automatically during fine-tuning. Auto-RGN assigns a per-tensor learning rate based on the ratio of each tensor's gradient norm to its parameter norm, while Auto-SNR chooses which layers to freeze by thresholding the SNR after the values are normalized to the range 0-1. Auto-RGN is found to work better, and the authors further investigate which layers Auto-RGN identifies as important for each type of distribution shift. Suggested future work includes closing the gap between Auto-RGN and the best surgical fine-tuning result by better understanding which layers relate most strongly to which distribution shifts.
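To make the mechanics concrete, here is a minimal PyTorch sketch (not the authors' code) of the two ingredients described above: freezing all but one chosen block for surgical fine-tuning, and computing Auto-RGN-style per-tensor learning rates from the ratio of gradient norm to parameter norm. The function names and the substring-based layer selection are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn


def select_surgical_params(model: nn.Module, block_substring: str):
    """Freeze every parameter except those whose name contains
    `block_substring` (e.g. "layer1" for the first block of a
    torchvision ResNet), and return the trainable parameters."""
    for name, param in model.named_parameters():
        param.requires_grad = block_substring in name
    return [p for p in model.parameters() if p.requires_grad]


def auto_rgn_param_groups(model: nn.Module, base_lr: float):
    """Build optimizer parameter groups whose learning rates are scaled
    by each tensor's relative gradient norm, RGN = ||grad|| / ||param||,
    normalized so the tensor with the largest RGN keeps `base_lr`.
    Call after loss.backward() so gradients are populated."""
    rgns, params = {}, {}
    for name, param in model.named_parameters():
        if param.grad is not None:
            rgns[name] = (param.grad.norm() / (param.norm() + 1e-12)).item()
            params[name] = param
    max_rgn = max(rgns.values())
    return [
        {"params": [params[name]], "lr": base_lr * rgn / max_rgn}
        for name, rgn in rgns.items()
    ]
```

As a usage sketch (assuming a torchvision ResNet), `select_surgical_params(model, "layer1")` would restrict training to the first residual block, and the list returned by `auto_rgn_param_groups` can be passed directly to an optimizer such as `torch.optim.SGD`, with the groups refreshed periodically as the RGN values change during training.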

Notes and Questions:

[Figure 2]