Tiny Test Models

Community Article Published October 2, 2024

I've recently trained a set of tiny test models (https://huggingface.co/collections/timm/timm-tiny-test-models-66f18bd70518277591a86cef) on ImageNet-1k covering several of the most popular architecture families.

It takes ~10 seconds to download all 13 pretrained weights and run one step of inference on each w/ a ratty old CPU (but a fast internet connection). This allows quick verification of model functionality, from pretrained weight download through every API feature of the full size models. The test models differ from full size models in that they have a lower default resolution, typically 1 block per stage, and very narrow widths.

This is all well and good, but would anyone have any interest in these outside of tests? Well, this is where you come in. These are some of the smallest models that are decently trained on ImageNet-1k. They use a recent training recipe adapted from MobileNet-v4 (Conv-Small), a good recipe for squeezing accuracy from small models. The top-1 accuracies are by no means impressive, but the models do fine-tune well on small datasets, and I imagine they could work quite well for some reduced-resource (embedded) applications or as part of reinforcement learning vision policies.

Let me know if you find any good applications for them outside of tests. Here's a summary of the model results. The models were trained natively at 160x160, and most see a small pickup when evaluated at 192x192 by leveraging the train-test resolution discrepancy.

ImageNet Accuracy

model img_size top1 top5 param_count(M) norm
test_vit3.r160_in1k 192 58.116 81.876 0.93 LN
test_vit3.r160_in1k 160 56.894 80.748 0.93 LN
test_convnext3.r160_in1k 192 54.558 79.356 0.47 LN
test_convnext2.r160_in1k 192 53.62 78.636 0.48 LN
test_convnext2.r160_in1k 160 53.51 78.526 0.48 LN
test_convnext3.r160_in1k 160 53.328 78.318 0.47 LN
test_convnext.r160_in1k 192 48.532 74.944 0.27 LN
test_nfnet.r160_in1k 192 48.298 73.446 0.38 WS
test_convnext.r160_in1k 160 47.764 74.152 0.27 LN
test_nfnet.r160_in1k 160 47.616 72.898 0.38 WS
test_efficientnet.r160_in1k 192 47.164 71.706 0.36 BN
test_efficientnet_evos.r160_in1k 192 46.924 71.53 0.36 EVOS
test_byobnet.r160_in1k 192 46.688 71.668 0.46 BN
test_efficientnet_evos.r160_in1k 160 46.498 71.006 0.36 EVOS
test_efficientnet.r160_in1k 160 46.454 71.014 0.36 BN
test_byobnet.r160_in1k 160 45.852 70.996 0.46 BN
test_efficientnet_ln.r160_in1k 192 44.538 69.974 0.36 LN
test_efficientnet_gn.r160_in1k 192 44.448 69.75 0.36 GN
test_efficientnet_ln.r160_in1k 160 43.916 69.404 0.36 LN
test_efficientnet_gn.r160_in1k 160 43.88 69.162 0.36 GN
test_vit2.r160_in1k 192 43.454 69.798 0.46 LN
test_resnet.r160_in1k 192 42.376 68.744 0.47 BN
test_vit2.r160_in1k 160 42.232 68.982 0.46 LN
test_vit.r160_in1k 192 41.984 68.64 0.37 LN
test_resnet.r160_in1k 160 41.578 67.956 0.47 BN
test_vit.r160_in1k 160 40.946 67.362 0.37 LN
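
To make the pickup at 192x192 concrete, the sketch below subtracts the 160x160 top-1 from the 192x192 top-1 for each model. The numbers are copied from the table above; the dictionary layout and variable names are only illustrative.

```python
# Top-1 accuracy (%) copied from the table above: model -> (top1 @ 160, top1 @ 192).
top1 = {
    "test_vit3": (56.894, 58.116),
    "test_convnext3": (53.328, 54.558),
    "test_convnext2": (53.51, 53.62),
    "test_convnext": (47.764, 48.532),
    "test_nfnet": (47.616, 48.298),
    "test_efficientnet": (46.454, 47.164),
    "test_efficientnet_evos": (46.498, 46.924),
    "test_byobnet": (45.852, 46.688),
    "test_efficientnet_ln": (43.916, 44.538),
    "test_efficientnet_gn": (43.88, 44.448),
    "test_vit2": (42.232, 43.454),
    "test_resnet": (41.578, 42.376),
    "test_vit": (40.946, 41.984),
}

# Gain in top-1 (percentage points) from evaluating at 192x192 instead of 160x160.
gains = {name: round(at_192 - at_160, 3) for name, (at_160, at_192) in top1.items()}

# Print models from largest to smallest gain.
for name, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:24s} +{gain:.3f}")
```

On these numbers every model gains at 192x192, from about 0.11 points (test_convnext2) to about 1.23 points (test_convnext3).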

Throughput @ 160x160 w/ torch.compile, mode='max-autotune', PyTorch 2.4.1, RTX4090

model infer_samples_per_sec train_samples_per_sec
test_vit 300560.67 87518.73
test_vit2 254514.84 70132.93
test_convnext 216367.11 50905.24
test_convnext3 200783.46 49074.48
test_byobnet 199426.55 49487.12
test_convnext2 196727.0 48119.64
test_efficientnet 181404.48 43546.96
test_efficientnet_ln 173432.33 33280.66
test_efficientnet_evos 169177.92 39684.92
test_vit3 163786.54 44318.45
test_efficientnet_gn 158421.02 44226.92
test_resnet 153289.49 28341.52
test_nfnet 80837.46 16907.38

Throughput @ 160x160 w/ torch.compile, mode='reduce-overhead', PyTorch 2.4.1, RTX4090

model infer_samples_per_sec train_samples_per_sec
test_vit 274007.61 86652.08
test_vit2 231651.39 68993.91
test_byobnet 197767.6 48633.6
test_convnext 184134.55 46879.08
test_efficientnet 170239.18 42812.1
test_efficientnet_ln 166604.2 31946.88
test_efficientnet_evos 163667.41 42222.59
test_vit3 161792.13 45354.67
test_convnext2 160601.75 43187.22
test_convnext3 160494.65 44304.95
test_efficientnet_gn 155447.85 42003.28
test_resnet 150790.14 27286.95
test_nfnet 78314.21 15282.57

Throughput @ 160x160 w/ torch.compile, mode='default', PyTorch 2.4.1, RTX4090

Output of python benchmark.py --amp --model 'test_*' --fast-norm --torchcompile:

model infer_samples_per_sec train_samples_per_sec
test_efficientnet 192256.16 30972.05
test_efficientnet_ln 186221.3 28402.3
test_efficientnet_evos 180578.68 32651.59
test_convnext3 179679.28 34998.59
test_byobnet 177707.5 32309.83
test_efficientnet_gn 169962.75 31801.23
test_convnext2 166527.39 37168.73
test_resnet 157618.18 25159.21
test_vit 146050.34 38321.33
test_convnext 138397.51 27930.18
test_vit2 116394.63 26856.88
test_vit3 89157.52 21656.06
test_nfnet 71030.73 14720.19
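
One way to read the throughput tables together is as a per-model speedup of mode='max-autotune' over mode='default'. The sketch below does that division for inference throughput, using values copied from the tables above. It assumes the runs are otherwise comparable (the full command is only listed for the 'default' run), so treat the ratios as rough.

```python
# Inference samples/sec copied from the tables above:
# model -> (mode='default', mode='max-autotune').
infer = {
    "test_vit": (146050.34, 300560.67),
    "test_vit2": (116394.63, 254514.84),
    "test_vit3": (89157.52, 163786.54),
    "test_convnext": (138397.51, 216367.11),
    "test_convnext2": (166527.39, 196727.0),
    "test_convnext3": (179679.28, 200783.46),
    "test_byobnet": (177707.5, 199426.55),
    "test_efficientnet": (192256.16, 181404.48),
    "test_efficientnet_ln": (186221.3, 173432.33),
    "test_efficientnet_evos": (180578.68, 169177.92),
    "test_efficientnet_gn": (169962.75, 158421.02),
    "test_resnet": (157618.18, 153289.49),
    "test_nfnet": (71030.73, 80837.46),
}

# Ratio > 1 means 'max-autotune' was faster than 'default' for that model.
speedup = {name: tuned / default for name, (default, tuned) in infer.items()}

for name, ratio in sorted(speedup.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:24s} {ratio:.2f}x")
```

On these numbers the ViTs benefit most (roughly 1.8x to 2.2x for inference), while the EfficientNet variants and test_resnet come out slightly slower under 'max-autotune' than under 'default'.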

Details

The model names above give some hint as to what they are, but I did explore some 'unique' architecture variations that are worth mentioning for any who might try them.

test_byobnet

A ByobNet (mix of EfficientNet / ResNet / DarkNet blocks)

  • stage blocks = 1 * EdgeResidual (FusedMBConv), 1 * DarkBlock, 1 * ResNeXt Basic (group_size=32), 1 * ResNeXt Bottle (group_size=64)
  • channels = 32, 64, 128, 256
  • se_ratio = .25 (active in all blocks)
  • act_layer = ReLU
  • norm_layer = BatchNorm

test_convnext

A ConvNeXt

  • stage depths = 1, 2, 4, 2
  • channels = 24, 32, 48, 64
  • DW kernel_size = 7, 7, 7, 7
  • act_layer = GELU (tanh approx)
  • norm_layer = LayerNorm

test_convnext2

A ConvNeXt

  • stage depths = 1, 1, 1, 1
  • channels = 32, 64, 96, 128
  • DW kernel_size = 7, 7, 7, 7
  • act_layer = GELU (tanh approx)
  • norm_layer = LayerNorm

test_convnext3

A ConvNeXt w/ SiLU and varying kernel size

  • stage depths = 1, 1, 1, 1
  • channels = 32, 64, 96, 128
  • DW kernel_size = 7, 5, 5, 3
  • act_layer = SiLU
  • norm_layer = LayerNorm

test_efficientnet

An EfficientNet w/ V2 block mix

  • stage blocks = 1 * ConvBnAct, 2 * EdgeResidual (FusedMBConv), 2 * InvertedResidual (MBConv) w/ SE
  • channels = 16, 24, 32, 48, 64
  • kernel_size = 3x3 for all
  • expansion = 4x for all
  • stem_size = 24
  • act_layer = SiLU
  • norm_layer = BatchNorm

test_efficientnet_gn

An EfficientNet w/ V2 block mix and GroupNorm (group_size=8)

  • See above but with norm_layer=GroupNorm

test_efficientnet_ln

An EfficientNet w/ V2 block mix and LayerNorm

  • See above but with norm_layer=LayerNorm

test_efficientnet_evos

An EfficientNet w/ V2 block mix and EvoNorm-S

  • See above but with EvoNormS for norm + act

test_nfnet

A NormFree Net:

  • 4-stages, 1 block per stage
  • channels = 32, 64, 96, 128
  • group_size = 8
  • bottle_ratio = 0.25
  • se_ratio = 0.25
  • act_layer = SiLU
  • norm_layer = no norm, Scaled Weight Standardization is part of Convolution
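
For readers unfamiliar with the idea, here is a minimal PyTorch sketch of a convolution with Scaled Weight Standardization built in. It is not timm's implementation: the class name `ScaledWSConv2d`, the `gamma=1.0` default, and the `eps` value are assumptions for illustration (in NF-Net style models `gamma` is normally chosen to match the activation).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ScaledWSConv2d(nn.Conv2d):
    """Conv2d whose weights are standardized per output channel on every forward.

    Illustrative sketch only; names and defaults are assumptions.
    """

    def __init__(self, *args, gamma: float = 1.0, eps: float = 1e-6, **kwargs):
        super().__init__(*args, **kwargs)
        # Learnable per-output-channel gain, initialised to 1.
        self.gain = nn.Parameter(torch.ones(self.out_channels, 1, 1, 1))
        # Fixed scale gamma / sqrt(fan_in), with fan_in = in_channels_per_group * kh * kw.
        self.scale = gamma * self.weight[0].numel() ** -0.5
        self.eps = eps

    def standardized_weight(self) -> torch.Tensor:
        # Zero mean and unit variance over each output channel's fan-in, then rescale.
        mean = self.weight.mean(dim=(1, 2, 3), keepdim=True)
        var = self.weight.var(dim=(1, 2, 3), keepdim=True, unbiased=False)
        return (self.weight - mean) * torch.rsqrt(var + self.eps) * (self.gain * self.scale)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.conv2d(x, self.standardized_weight(), self.bias,
                        self.stride, self.padding, self.dilation, self.groups)


# 32 channels split into 4 groups, i.e. a group size of 8 channels.
conv = ScaledWSConv2d(32, 32, kernel_size=3, padding=1, groups=4)
out = conv(torch.randn(2, 32, 20, 20))
```

Because the standardization lives in the weights rather than in a separate layer, the block needs no BatchNorm or LayerNorm after the convolution, which is the property the norm_layer line above refers to.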

test_resnet

A ResNet w/ mixed blocks:

  • stage blocks = 1 * BasicBlock, 1 * BasicBlock, 1 * BottleNeck, 1 * BasicBlock
  • channels = 32, 48, 48, 96
  • deep 3x3 stem (aka ResNet-D)
  • avg pool in downsample (aka ResNet-D)
  • stem_width = 16
  • act_layer = ReLU
  • norm_layer = BatchNorm

test_vit

A vanilla ViT w/ class token:

  • patch_size = 16
  • embed_dim = 64
  • num_heads = 2
  • mlp_ratio = 3
  • depth = 6
  • act_layer = GELU
  • norm_layer = LayerNorm
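
As a sanity check on the param_count column, the hyperparameters above can be turned into a parameter count by hand. The sketch below assumes a standard ViT layout that the list does not spell out: 160x160 RGB input, biases on all linear layers, a learned position embedding that also covers the class token, and a 1000-class head. The function name is illustrative.

```python
def vit_param_count(img_size, patch_size, embed_dim, depth, mlp_ratio,
                    num_classes=1000, in_chans=3):
    """Count parameters of a plain class-token ViT (assumed layout, see lead-in)."""
    num_patches = (img_size // patch_size) ** 2
    hidden = int(embed_dim * mlp_ratio)

    # Patch embedding is a conv with kernel = stride = patch_size (weight + bias).
    patch_embed = in_chans * patch_size ** 2 * embed_dim + embed_dim
    cls_token = embed_dim
    pos_embed = (num_patches + 1) * embed_dim  # +1 for the class token

    layer_norm = 2 * embed_dim  # weight + bias
    # Attention: fused qkv projection plus output projection, both with bias.
    attn = (embed_dim * 3 * embed_dim + 3 * embed_dim) + (embed_dim * embed_dim + embed_dim)
    # MLP: two linear layers with bias.
    mlp = (embed_dim * hidden + hidden) + (hidden * embed_dim + embed_dim)
    block = 2 * layer_norm + attn + mlp

    head = embed_dim * num_classes + num_classes
    # Final layer_norm term is the norm applied before the classifier head.
    return patch_embed + cls_token + pos_embed + depth * block + layer_norm + head


total = vit_param_count(img_size=160, patch_size=16, embed_dim=64, depth=6, mlp_ratio=3)
print(total, f"{total / 1e6:.2f}M")  # 371240 0.37M
```

Under those assumptions the total is 371,240 parameters, in line with the 0.37 listed for test_vit in the accuracy table. Note that num_heads does not enter the count, and that the 1000-class head alone accounts for 65,000 of the parameters.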

test_vit2

A ViT w/ global avg pool, 1 reg token, layer-scale (like timm SBB ViTs https://huggingface.co/collections/timm/searching-for-better-vit-baselines-663eb74f64f847d2f35a9c19):

  • patch_size = 16
  • embed_dim = 64
  • num_heads = 2
  • mlp_ratio = 3
  • depth = 8
  • act_layer = GELU
  • norm_layer = LayerNorm

test_vit3

A ViT w/ attention-pool, 1 reg token, layer-scale.

  • patch_size = 16
  • embed_dim = 96
  • num_heads = 3
  • mlp_ratio = 2
  • depth = 9
  • act_layer = GELU
  • norm_layer = LayerNorm