osanseviero, pcuenq committed on
Commit 9ef6114
1 Parent(s): 8324361

Update README (#14)


- Update README (0c461edc7386276fd5044f0dd692b85b9b9f9aef)


Co-authored-by: Pedro Cuenca <[email protected]>

Files changed (1)
  1. README.md +22 -22
README.md CHANGED
@@ -212,10 +212,10 @@ The Llama 3.2-Vision collection of multimodal large language models (LLMs) is a
 
  **Model Architecture:** Llama 3.2-Vision is built on top of Llama 3.1 text-only model, which is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety. To support image recognition tasks, the Llama 3.2-Vision model uses a separately trained vision adapter that integrates with the pre-trained Llama 3.1 language model. The adapter consists of a series of cross-attention layers that feed image encoder representations into the core LLM.
 
- | | Training Data | Params | Input modalities | Output modalities | Context length | GQA | Token count | Knowledge cutoff |
+ | | Training Data | Params | Input modalities | Output modalities | Context length | GQA | Data volume | Knowledge cutoff |
  | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- |
  | Llama 3.2-Vision | (Image, text) pairs | 11B (10.6) | Text \+ Image | Text | 128k | Yes | 6B (image, text) pairs | December 2023 |
- | | | 90B (88.8) | Text \+ Image | Text | 128k | Yes | | |
+ | Llama 3.2-Vision | (Image, text) pairs | 90B (88.8) | Text \+ Image | Text | 128k | Yes | 6B (image, text) pairs | December 2023 |
 
  **Supported Languages:** For text only tasks, English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai are officially supported. Llama 3.2 has been trained on a broader collection of languages than these 8 supported languages. Note for image+text applications, English is the only language supported.
 
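As context for the model-architecture paragraph above: the adapter it describes injects image-encoder representations into the language model through cross-attention. The sketch below is a minimal, hypothetical illustration of one such layer; the module names, dimensions, and gating scheme are assumptions made for clarity, not Meta's implementation.

```python
import torch
import torch.nn as nn

class CrossAttentionAdapterLayer(nn.Module):
    """Toy cross-attention adapter: text hidden states attend to image features."""

    def __init__(self, hidden_size: int, vision_size: int, num_heads: int):
        super().__init__()
        # Project image-encoder outputs into the LLM's hidden dimension.
        self.vision_proj = nn.Linear(vision_size, hidden_size)
        self.norm = nn.LayerNorm(hidden_size)
        self.cross_attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        # Zero-initialized gate so the adapter starts as (almost) a no-op
        # around the pre-trained text model.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_hidden: torch.Tensor, image_features: torch.Tensor) -> torch.Tensor:
        # text_hidden:    (batch, text_len, hidden_size)    from the pre-trained text stack
        # image_features: (batch, image_tokens, vision_size) from the image encoder
        vision = self.vision_proj(image_features)
        attended, _ = self.cross_attn(self.norm(text_hidden), vision, vision)
        # Gated residual keeps the original text pathway intact.
        return text_hidden + torch.tanh(self.gate) * attended

# Example shapes only; the released models use far larger dimensions.
layer = CrossAttentionAdapterLayer(hidden_size=4096, vision_size=1280, num_heads=32)
text_hidden = torch.randn(1, 16, 4096)
image_features = torch.randn(1, 64, 1280)
out = layer(text_hidden, image_features)  # (1, 16, 4096)
```

The gated residual is one common way to add cross-attention layers without disturbing a frozen language model; the actual Llama 3.2-Vision adapter is described in Meta's Llama 3 technical material rather than in this commit.
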
@@ -329,31 +329,31 @@ In this section, we report the results for Llama 3.2-Vision models on standard a
 
  | Category | Benchmark | \# Shots | Metric | Llama 3.2 11B | Llama 3.2 90B |
  | ----- | ----- | ----- | ----- | ----- | ----- |
- | Image Understanding | VQAv2 (test-dev, 30k) | 0 | Zero-shot Accuracy | 66.83 | 73.64 |
- | | Text VQA (val) | 0 | Zero-shot Relaxed accuracy | 73.14 | 73.52 |
- | | DocVQA (val, unseen) | 0 | Zero-shot Average Normalized Levenshtein Similarity (ANLS) | 62.26 | 70.65 |
- | Visual Reasoning | MMMU (val, 0-shot) | 0 | Zero-shot Micro Average Accuracy | 41.67 | 49.33 |
- | | ChartQA (test) | 0 | Zero-shot Accuracy | 39.4 | 54.16 |
- | | InfographicsQA (val, unseen) | 0 | Zero-shot Average Normalized Levenshtein Similarity (ANLS) | 43.21 | 56.79 |
- | | AI2 Diagram (test) | 0 | Zero-shot Accuracy | 62.37 | 75.26 |
+ | Image Understanding | VQAv2 (val) | 0 | Accuracy | 66.8 | 73.6 |
+ | | Text VQA (val) | 0 | Relaxed accuracy | 73.1 | 73.5 |
+ | | DocVQA (val, unseen) | 0 | ANLS | 62.3 | 70.7 |
+ | Visual Reasoning | MMMU (val, 0-shot) | 0 | Micro average accuracy | 41.7 | 49.3 |
+ | | ChartQA (test) | 0 | Accuracy | 39.4 | 54.2 |
+ | | InfographicsQA (val, unseen) | 0 | ANLS | 43.2 | 56.8 |
+ | | AI2 Diagram (test) | 0 | Accuracy | 62.4 | 75.3 |
 
  ### Instruction Tuned Models
 
  | Modality | Capability | Benchmark | \# Shots | Metric | Llama 3.2 11B | Llama 3.2 90B |
  | ----- | :---: | ----- | :---: | :---: | ----- | ----- |
- | Image | College-level Problems and Mathematical Reasoning | MMMU (val, CoT) | 0 | micro avg accuracy | 50.7 | 60.3 |
- | | | MMMU-Pro, Standard (10 opts, test) | 0 | accuracy | 33.0 | 45.2 |
- | | | MMMU-Pro, Vision (test) | 0 | accuracy | 23.7 | 33.8 |
- | | | MathVista (testmini) | 0 | accuracy | 51.5 | 57.3 |
- | | Charts and Diagram Understanding | ChartQA (test, CoT) | 0 | relaxed accuracy | 83.4 | 85.5 |
- | | | AI2 Diagram (test) | 0 | accuracy | 91.1 | 92.3 |
+ | Image | College-level Problems and Mathematical Reasoning | MMMU (val, CoT) | 0 | Micro average accuracy | 50.7 | 60.3 |
+ | | | MMMU-Pro, Standard (10 opts, test) | 0 | Accuracy | 33.0 | 45.2 |
+ | | | MMMU-Pro, Vision (test) | 0 | Accuracy | 23.7 | 33.8 |
+ | | | MathVista (testmini) | 0 | Accuracy | 51.5 | 57.3 |
+ | | Charts and Diagram Understanding | ChartQA (test, CoT) | 0 | Relaxed accuracy | 83.4 | 85.5 |
+ | | | AI2 Diagram (test) | 0 | Accuracy | 91.1 | 92.3 |
  | | | DocVQA (test) | 0 | ANLS | 88.4 | 90.1 |
- | | General Visual Question Answering | VQAv2 (test) | 0 | accuracy | 75.2 | 78.1 |
+ | | General Visual Question Answering | VQAv2 (test) | 0 | Accuracy | 75.2 | 78.1 |
  | | | | | | | |
- | Text | General | MMLU | 0 | macro\_avg/acc | [73.0](https://docs.google.com/spreadsheets/d/1b3IrobU5rTfbxtR-lfEMn7Fb41_YO3kfhrPmuP-g6ys/edit?gid=688970324#gid=688970324&range=E19) | [86.0](https://docs.google.com/spreadsheets/d/1b3IrobU5rTfbxtR-lfEMn7Fb41_YO3kfhrPmuP-g6ys/edit?gid=688970324#gid=688970324&range=H19) |
- | | Math | MATH (CoT) | 0 | final\_em | [51.9](https://docs.google.com/spreadsheets/d/1b3IrobU5rTfbxtR-lfEMn7Fb41_YO3kfhrPmuP-g6ys/edit?gid=688970324#gid=688970324&range=E25) | [68.0](https://docs.google.com/spreadsheets/d/1b3IrobU5rTfbxtR-lfEMn7Fb41_YO3kfhrPmuP-g6ys/edit?gid=688970324#gid=688970324&range=H25) |
- | | Reasoning | GPQA | 0 | acc | [32.8](https://docs.google.com/spreadsheets/d/1b3IrobU5rTfbxtR-lfEMn7Fb41_YO3kfhrPmuP-g6ys/edit?gid=688970324#gid=688970324&range=E27) | [46.7](https://docs.google.com/spreadsheets/d/1b3IrobU5rTfbxtR-lfEMn7Fb41_YO3kfhrPmuP-g6ys/edit?gid=688970324#gid=688970324&range=H27) |
- | | Multilingual | MGSM (CoT) | 0 | em | [68.9](https://docs.google.com/spreadsheets/d/1b3IrobU5rTfbxtR-lfEMn7Fb41_YO3kfhrPmuP-g6ys/edit?gid=688970324#gid=688970324&range=E33) | [86.9](https://docs.google.com/spreadsheets/d/1b3IrobU5rTfbxtR-lfEMn7Fb41_YO3kfhrPmuP-g6ys/edit?gid=688970324#gid=688970324&range=H33) |
+ | Text | General | MMLU (CoT) | 0 | Macro\_avg/acc | 73.0 | 86.0 |
+ | | Math | MATH (CoT) | 0 | Final\_em | 51.9 | 68.0 |
+ | | Reasoning | GPQA | 0 | Accuracy | 32.8 | 46.7 |
+ | | Multilingual | MGSM (CoT) | 0 | em | 68.9 | 86.9 |
 
  ## Responsibility & Safety
 
@@ -399,8 +399,8 @@ In addition to our safety work above, we took extra care on measuring and/or mit
 
  **2\. Child Safety:** Child Safety risk assessments were conducted using a team of experts, to assess the model’s capability to produce outputs that could result in Child Safety risks and inform on any necessary and appropriate risk mitigations via fine tuning. We leveraged those expert red teaming sessions to expand the coverage of our evaluation benchmarks through Llama 3 model development. For Llama 3, we conducted new in-depth sessions using objective based methodologies to assess the model risks along multiple attack vectors including the additional languages Llama 3 is trained on. We also partnered with content specialists to perform red teaming exercises assessing potentially violating content while taking account of market specific nuances or experiences.
 
- **3\. Cyber Attacks:** Our cyber attack uplift study investigated whether LLMs can enhance human capabilities in hacking tasks, both in terms of skill level and speed.
- Our attack automation study focused on evaluating the capabilities of LLMs when used as autonomous agents in cyber offensive operations, specifically in the context of ransomware attacks. This evaluation was distinct from previous studies that considered LLMs as interactive assistants. The primary objective was to assess whether these models could effectively function as independent agents in executing complex cyber-attacks without human intervention.
+ **3\. Cyber Attacks:** For Llama 3.1 405B, our cyber attack uplift study investigated whether LLMs can enhance human capabilities in hacking tasks, both in terms of skill level and speed.
+ Our attack automation study focused on evaluating the capabilities of LLMs when used as autonomous agents in cyber offensive operations, specifically in the context of ransomware attacks. This evaluation was distinct from previous studies that considered LLMs as interactive assistants. The primary objective was to assess whether these models could effectively function as independent agents in executing complex cyber-attacks without human intervention. Because Llama 3.2’s vision capabilities are not generally germane to cyber uplift, we believe that the testing conducted for Llama 3.1 also applies to Llama 3.2.
 
  ### Community