ariG23498 (HF staff) committed on
Commit 37a98d4 · verified · 1 Parent(s): e9cc507

add model card

Files changed (1):
  1. README.md +165 -1

README.md CHANGED
1
  ---
2
  library_name: transformers
3
+ license: apache-2.0
4
+ datasets:
5
+ - HuggingFaceM4/the_cauldron
6
+ - HuggingFaceM4/Docmatix
7
+ pipeline_tag: image-text-to-text
8
+ language:
9
+ - en
10
  base_model:
11
  - HuggingFaceTB/SmolLM2-1.7B-Instruct
12
  - google/siglip-so400m-patch14-384
13
  ---
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/SmolVLM.png" width="800" height="auto" alt="SmolVLM">

# SmolVLM

SmolVLM is a compact open multimodal model that accepts arbitrary sequences of image and text inputs and produces text outputs. Designed for efficiency, SmolVLM can answer questions about images, describe visual content, create stories grounded in multiple images, or function as a pure language model without visual inputs. Its lightweight architecture makes it suitable for on-device applications while maintaining strong performance on multimodal tasks.

## Model Summary

- **Developed by:** Hugging Face 🤗
- **Model type:** Multimodal model (image + text)
- **Language(s) (NLP):** English
- **License:** Apache 2.0
- **Architecture:** Based on [Idefics3](https://huggingface.co/HuggingFaceM4/Idefics3-8B-Llama3) (see technical summary)

## Resources

- **Demo:** [SmolVLM Demo](https://huggingface.co/spaces/HuggingFaceTB/SmolVLM)
- **Blog:** [Blog post](https://huggingface.co/blog/smolvlm)

## Uses

SmolVLM can be used for inference on multimodal (image + text) tasks where the input comprises text queries along with one or more images. Text and images can be interleaved arbitrarily, enabling tasks like image captioning, visual question answering, and storytelling based on visual content. The model does not support image generation.

To fine-tune SmolVLM on a specific task, you can follow the fine-tuning tutorial.
<!-- todo: add link to fine-tuning tutorial -->

### Technical Summary

SmolVLM leverages the lightweight SmolLM2 language model to provide a compact yet powerful multimodal experience. It introduces several changes compared to the earlier Idefics models:

- **Image compression:** SmolVLM applies more aggressive image compression than Idefics3, enabling faster inference and lower RAM usage.
- **Visual token encoding:** SmolVLM uses 81 visual tokens to encode image patches of size 384×384. Larger images are divided into 384×384 patches, each encoded separately, which improves efficiency without compromising performance.

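For intuition on how the visual-token budget grows with image size, here is a rough back-of-the-envelope sketch. It assumes the Idefics3-style splitting described above (resize so the longest edge is N×384, split into 384×384 crops, plus one downscaled global image, which is an assumption here); the exact processor logic may differ slightly.

```python
import math

def estimate_visual_tokens(width: int, height: int, n: int = 4, tokens_per_crop: int = 81) -> int:
    """Rough estimate of visual tokens for one image (not the exact processor logic)."""
    longest_edge = n * 384                     # default n=4 -> 1536 px
    scale = longest_edge / max(width, height)  # shrink so the longest edge fits
    cols = math.ceil(width * scale / 384)      # number of 384-px crops horizontally
    rows = math.ceil(height * scale / 384)     # ... and vertically
    num_crops = cols * rows + 1                # +1 for the downscaled global image (assumption)
    return num_crops * tokens_per_crop

print(estimate_visual_tokens(2048, 1536))  # 4x3 grid + global image -> 13 * 81 = 1053 tokens
```
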
More details about the training and architecture are available in our technical report.

### How to get started

You can use transformers to load, run inference with, and fine-tune SmolVLM.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load images
image1 = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")
image2 = load_image("https://huggingface.co/spaces/merve/chameleon-7b/resolve/main/bee.jpg")

# Initialize processor and model
processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM-Base")
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-Base",
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2" if DEVICE == "cuda" else "eager",
).to(DEVICE)

# Create input messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "image"},
            {"type": "text", "text": "Can you describe the two images?"}
        ]
    },
]

# Prepare inputs
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image1, image2], return_tensors="pt")
inputs = inputs.to(DEVICE)

# Generate outputs
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)
print(generated_texts[0])
"""
User:<image>Can you describe the two images?
Assistant: I can describe the first one, but I can't describe the second one.
"""
```

### Model optimizations

**Precision**: For better performance, load and run the model in half precision (`torch.float16` or `torch.bfloat16`) if your hardware supports it.

```python
from transformers import AutoModelForVision2Seq
import torch

model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-Base",
    torch_dtype=torch.bfloat16
).to("cuda")
```

You can also load SmolVLM with 4-bit or 8-bit quantization using bitsandbytes, torchao, or Quanto. Refer to [this page](https://huggingface.co/docs/transformers/en/main_classes/quantization) for other options.

```python
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig
import torch

quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-Base",
    quantization_config=quantization_config,
)
```
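
The snippet above shows 8-bit loading; for 4-bit, a minimal sketch with bitsandbytes could look like the following (the NF4 settings below are common choices, not values taken from this card):

```python
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig
import torch

# 4-bit NF4 quantization with bfloat16 compute; these options are illustrative defaults.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-Base",
    quantization_config=quantization_config,
)
```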

**Vision Encoder Efficiency**: Adjust the image resolution by passing `size={"longest_edge": N*384}` when initializing the processor, where N is your desired value. The default `N=4` works well and results in input images of size 1536×1536. For documents, `N=5` might be beneficial. Decreasing N saves GPU memory and is appropriate for lower-resolution images; it is also useful if you want to fine-tune on videos.
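
As a minimal sketch of the setting described above (the keyword argument mirrors the hint in this section; double-check it against your installed transformers version):

```python
from transformers import AutoProcessor

# Resize images so the longest edge is N * 384 pixels.
# N = 3 trades some resolution for lower GPU memory use.
N = 3
processor = AutoProcessor.from_pretrained(
    "HuggingFaceTB/SmolVLM-Base",
    size={"longest_edge": N * 384},
)
```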

## Misuse and Out-of-scope Use

SmolVLM is not intended for high-stakes scenarios or critical decision-making processes that affect an individual's well-being or livelihood. The model may produce content that appears factual but may not be accurate. Misuse includes, but is not limited to:

- Prohibited Uses:
  - Evaluating or scoring individuals (e.g., in employment, education, credit)
  - Critical automated decision-making
  - Generating unreliable factual content
- Malicious Activities:
  - Spam generation
  - Disinformation campaigns
  - Harassment or abuse
  - Unauthorized surveillance

### License

SmolVLM is built on [the shape-optimized SigLIP](https://huggingface.co/google/siglip-so400m-patch14-384) as its image encoder and [SmolLM2](https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct) as its text decoder.

We release the SmolVLM checkpoints under the Apache 2.0 license.

## Training Details

### Training Data

The training data comes from [The Cauldron](https://huggingface.co/datasets/HuggingFaceM4/the_cauldron) and [Docmatix](https://huggingface.co/datasets/HuggingFaceM4/Docmatix), with an emphasis on document understanding (25%) and image captioning (18%), while maintaining balanced coverage of other crucial capabilities such as visual reasoning, chart comprehension, and general instruction following.

<img src="https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct/resolve/main/mixture_the_cauldron.png" alt="Training data mixture" style="width:90%;" />

## Evaluation

| Model              | MMMU (val) | MathVista (testmini) | MMStar (val) | DocVQA (test) | TextVQA (val) | Min GPU RAM required (GB) |
|--------------------|------------|----------------------|--------------|---------------|---------------|---------------------------|
| SmolVLM            | 38.8       | 44.6                 | 42.1         | 81.6          | 72.7          | 5.02                      |
| Qwen-VL 2B         | 41.1       | 47.8                 | 47.5         | 90.1          | 79.7          | 13.70                     |
| InternVL2 2B       | 34.3       | 46.3                 | 49.8         | 86.9          | 73.4          | 10.52                     |
| PaliGemma 3B 448px | 34.9       | 28.7                 | 48.3         | 32.2          | 56.0          | 6.72                      |
| moondream2         | 32.4       | 24.3                 | 40.3         | 70.5          | 65.2          | 3.87                      |
| MiniCPM-V-2        | 38.2       | 39.8                 | 39.1         | 71.9          | 74.1          | 7.88                      |
| MM1.5 1B           | 35.8       | 37.2                 | 0.0          | 81.0          | 72.5          | NaN                       |