VanillaNet: the Power of Minimalism in Deep Learning
Abstract
At the heart of foundation models is the philosophy of "more is different", exemplified by the astonishing success in computer vision and natural language processing. However, the challenges of optimization and inherent complexity of transformer models call for a paradigm shift towards simplicity. In this study, we introduce VanillaNet, a neural network architecture that embraces elegance in design. By avoiding high depth, shortcuts, and intricate operations like self-attention, VanillaNet is refreshingly concise yet remarkably powerful. Each layer is carefully crafted to be compact and straightforward, with nonlinear activation functions pruned after training to restore the original architecture. VanillaNet overcomes the challenges of inherent complexity, making it ideal for resource-constrained environments. Its easy-to-understand and highly simplified architecture opens new possibilities for efficient deployment. Extensive experimentation demonstrates that VanillaNet delivers performance on par with renowned deep neural networks and vision transformers, showcasing the power of minimalism in deep learning. This visionary journey of VanillaNet has significant potential to redefine the landscape and challenge the status quo of foundation models, setting a new path for elegant and effective model design. Pre-trained models and codes are available at https://github.com/huawei-noah/VanillaNet and https://gitee.com/mindspore/models/tree/master/research/cv/vanillanet.
Community
Proposes VanillaNet: a simple, concise, and robust architecture that avoids sophisticated components (in contrast to ResNet's residual connections, ViT's attention complexity, and Swin Transformer's custom CUDA modules).

Architecture: a plain 6-layer CNN (convolutional neural network). A 4x4 convolution with stride 4 forms the stem; then three modules/stages, each combining stride-2 max-pooling (halving height and width) with a 1x1 convolution/linear layer (doubling the channels); then average pooling followed by a fully connected layer for classification. A sketch is given in the code below.

Deep training: each activation function is blended with an identity mapping as a residual-like alternative for non-linearity, with a mixing weight (a hyperparameter) that increases linearly with the training epoch so the extra non-linearity fades out; after training, batch norm is folded into the convolution layers and the consecutive 1x1 conv layers are merged (weights and biases combined), as in the second sketch below.

Series-informed activation function: stacks several activation functions concurrently within each layer, each with its own scale and bias, which increases non-linearity and lets the activation aggregate information from its spatial neighborhood at low computational complexity.

Visualizes and contrasts the attention learned by VanillaNet against ResNet using GradCAM++. Achieves ImageNet classification performance comparable to well-known networks (better than ViT, with Swin-T coming close) at much lower latency, i.e. higher inference speed thanks to the simple design. On COCO detection and segmentation (with the RetinaNet and Mask R-CNN frameworks, compared against a Swin-T backbone), it reaches higher FPS with slightly better AP. The appendix gives architecture and training details (LAMB optimizer). The code also has TensorRT ports. From Huawei and the University of Sydney.
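The architecture, the series-informed activation, and the deep-training schedule are compact enough to sketch. The following is a minimal PyTorch sketch based on the summary above; the class names (`SeriesActivation`, `VanillaBlock`, `VanillaNetSketch`), the channel widths, and the `set_deep_training_lambda` helper are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SeriesActivation(nn.Module):
    """Series-informed activation: a ReLU followed by a learnable depthwise
    convolution, so the stacked activation terms (each with its own scale and
    bias) also aggregate information from a small spatial neighborhood."""

    def __init__(self, channels, n=3):
        super().__init__()
        k = 2 * n + 1
        self.weight = nn.Parameter(torch.randn(channels, 1, k, k) * 0.02)
        self.bias = nn.Parameter(torch.zeros(channels))
        self.bn = nn.BatchNorm2d(channels)
        self.pad = n
        self.channels = channels

    def forward(self, x):
        x = F.relu(x)
        # The depthwise conv plays the role of the per-term scales and biases
        # summed over the neighborhood.
        x = F.conv2d(x, self.weight, self.bias, padding=self.pad, groups=self.channels)
        return self.bn(x)


class VanillaBlock(nn.Module):
    """One stage. During deep training there are two 1x1 convs separated by an
    activation that is annealed to the identity (lambda_act: 0 -> 1); once it
    is linear, the pair can be merged into the single conv of the deployed
    network. The series activation at the end is kept."""

    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv1 = nn.Conv2d(c_in, c_out, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(c_out)
        self.conv2 = nn.Conv2d(c_out, c_out, kernel_size=1)
        self.bn2 = nn.BatchNorm2d(c_out)
        self.act = SeriesActivation(c_out)
        self.lambda_act = 0.0  # deep-training mixing weight, set each epoch

    def forward(self, x):
        x = self.bn1(self.conv1(x))
        # (1 - lambda) * relu(x) + lambda * x: fully non-linear at the start
        # of training, pure identity at the end.
        x = (1 - self.lambda_act) * F.relu(x) + self.lambda_act * x
        x = self.bn2(self.conv2(x))
        x = F.max_pool2d(x, kernel_size=2, stride=2)  # halve height and width
        return self.act(x)


class VanillaNetSketch(nn.Module):
    """Roughly VanillaNet-6: a 4x4/stride-4 stem, three stages, then global
    average pooling and a linear classifier. Widths are placeholders."""

    def __init__(self, num_classes=1000, widths=(128, 256, 512, 1024)):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, widths[0], kernel_size=4, stride=4),
            nn.BatchNorm2d(widths[0]),
            SeriesActivation(widths[0]),
        )
        self.stages = nn.ModuleList(
            [VanillaBlock(widths[i], widths[i + 1]) for i in range(3)]
        )
        self.head = nn.Linear(widths[-1], num_classes)

    def set_deep_training_lambda(self, epoch, total_epochs):
        # lambda = epoch / total_epochs, so the auxiliary non-linearity fades
        # out linearly over training.
        lam = min(epoch / total_epochs, 1.0)
        for blk in self.stages:
            blk.lambda_act = lam

    def forward(self, x):
        x = self.stem(x)
        for blk in self.stages:
            x = blk(x)
        x = F.adaptive_avg_pool2d(x, 1).flatten(1)
        return self.head(x)
```

A training loop would call `model.set_deep_training_lambda(epoch, total_epochs)` at the start of each epoch; at deployment the annealed branches and batch norms are folded away, as sketched next.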
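The deployment-time merging mentioned in the summary (fold batch norm into the preceding convolution, then collapse two consecutive 1x1 convolutions once the activation between them has become the identity) is ordinary weight folding. A hedged sketch, with illustrative helper names (`fold_bn_into_conv`, `merge_1x1_convs`) rather than functions from the released code:

```python
import torch
import torch.nn as nn


@torch.no_grad()
def fold_bn_into_conv(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Return a conv whose weight and bias absorb the BatchNorm statistics."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding,
                      groups=conv.groups, bias=True)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)  # gamma / sigma
    fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
    bias = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.copy_((bias - bn.running_mean) * scale + bn.bias)
    return fused


@torch.no_grad()
def merge_1x1_convs(conv1: nn.Conv2d, conv2: nn.Conv2d) -> nn.Conv2d:
    """Collapse conv2(conv1(x)) into one 1x1 conv; valid only when the
    activation between them has been annealed to the identity."""
    merged = nn.Conv2d(conv1.in_channels, conv2.out_channels, kernel_size=1, bias=True)
    w1 = conv1.weight.flatten(1)  # (c_mid, c_in)
    w2 = conv2.weight.flatten(1)  # (c_out, c_mid)
    merged.weight.copy_((w2 @ w1).reshape(conv2.out_channels, conv1.in_channels, 1, 1))
    b1 = conv1.bias if conv1.bias is not None else torch.zeros(conv1.out_channels)
    b2 = conv2.bias if conv2.bias is not None else torch.zeros(conv2.out_channels)
    merged.bias.copy_(w2 @ b1 + b2)
    return merged
```

Both helpers only rearrange existing weights, so the deployed network computes the same function as the trained one (up to BatchNorm's inference-mode statistics).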