MobileNet Baselines

Community Article Published July 26, 2024

Those who follow me know that I can't resist an opportunity to update an old baseline.

When the MobileNet-V4 paper came out I noted that they re-ran their MobileNet-V1 baseline to get a 74% ImageNet accuracy. The original models were around 71%. That's quite a jump.

Intrigued, I looked more closely at their recipe for the 'small' model, which used unusual optimizer hparams: the AdamW beta1 was lowered from the default 0.9 to 0.6, taking it closer to RMSProp. Additionally, dropout and augmentation were fairly high for a smaller model, paired with a very long epoch count (9600 ImageNet-1k epochs in their case).
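To get a feel for why a lower beta1 moves AdamW toward RMSProp, here is a minimal pure-Python sketch of just the first-moment (momentum) average from Adam. It is a toy on scalar gradients, not the actual training code, and the helper names (`first_moment`, `effective_window`, `steps_to_flip`) are made up for illustration. With beta1 = 0 there is no gradient averaging at all, which is the plain RMSProp-style behaviour.

```python
def first_moment(grads, beta1):
    """Bias-corrected Adam first-moment estimate for a list of scalar gradients."""
    m = 0.0
    estimates = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1.0 - beta1) * g          # exponential moving average
        estimates.append(m / (1.0 - beta1 ** t))   # Adam's bias correction
    return estimates


def effective_window(beta1):
    """Rough number of recent steps the average 'remembers': 1 / (1 - beta1)."""
    return 1.0 / (1.0 - beta1)


def steps_to_flip(beta1, steps_before=10, steps_after=10):
    """Steps the averaged gradient needs to change sign after the raw gradient flips."""
    grads = [1.0] * steps_before + [-1.0] * steps_after
    after_flip = first_moment(grads, beta1)[steps_before:]
    return next(i for i, v in enumerate(after_flip, start=1) if v < 0)


for beta1 in (0.9, 0.6, 0.0):
    print(f"beta1={beta1}: window ~{effective_window(beta1):.1f} steps, "
          f"sign flips after {steps_to_flip(beta1)} step(s)")
```

In this toy, the default 0.9 averages over roughly 10 steps, while 0.6 averages over roughly 2.5, so the update direction follows recent gradients much more closely.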

I set out to try these hparams myself in timm, initially by training a reproduction of MobileNet-V4-Small, where I successfully hit 73.8% at 2400 epochs (instead of 9600). I then took a crack at MobileNet-V1, as I'd never had that model in timm.

My MobileNet-V1 run just finished: 3600 ImageNet-1k epochs, with a 75.4% top-1 accuracy on ImageNet at the 224x224 train resolution (76% at 256x256), no distillation and no additional data. The OOD dataset scores on ImageNet-V2, Sketch, etc. seem pretty solid, so it doesn't appear to be a gross overfit. Weights here: https://huggingface.co./timm/mobilenetv1_100.ra4_e3600_r224_in1k

Comparing to some other MobileNets:

I decided to give the old EfficientNet-B0 a go with these hparams: 78.6% top-1 accuracy. To put that in perspective, the B0 trainings by top-1 are:

So a pure ImageNet-1k training, with no distillation and no extra data, landed just a hair under the very impressive NoisyStudent models, which had access to unlabeled JFT data. Additionally, the OOD test set scores are holding up relative to NoisyStudent, which is also impressive. I actually think this recipe could be tweaked to push the B0 to 79%. The accuracy improvement petered out early on this run, so there is room for improvement with a tweak to the aug+reg.

What were my differences from the MobileNet-V4 hparams? Well, for one, I used timm. If you read the Supplementary Material, section A, of the ResNet Strikes Back paper, I detailed a number of fixes and improvements over the default RandAugment that's used in all TensorFlow and most JAX-based trainings I'm aware of. I feel some of the issues in the original are detrimental to great training. Other differences?

So, the theme I've visited many times (ResNet Strikes Back, https://huggingface.co./collections/timm/searching-for-better-vit-baselines-663eb74f64f847d2f35a9c19, and many timm weights) continues to hold: there is a lot of wiggle room for improving old results through better training regimens.

I wonder, in 7-8 years' time, how much can be added to today's SOTA 100+B dense transformer architectures with better recipes and training techniques?