osanseviero, pcuenq committed on
Commit 9ef6114
1 Parent(s): 8324361

Update README (#14)


- Update README (0c461edc7386276fd5044f0dd692b85b9b9f9aef)


Co-authored-by: Pedro Cuenca <[email protected]>

Files changed (1)
  1. README.md +22 -22
README.md CHANGED
@@ -212,10 +212,10 @@ The Llama 3.2-Vision collection of multimodal large language models (LLMs) is a
 
  **Model Architecture:** Llama 3.2-Vision is built on top of Llama 3.1 text-only model, which is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety. To support image recognition tasks, the Llama 3.2-Vision model uses a separately trained vision adapter that integrates with the pre-trained Llama 3.1 language model. The adapter consists of a series of cross-attention layers that feed image encoder representations into the core LLM.
 
- | | Training Data | Params | Input modalities | Output modalities | Context length | GQA | Token count | Knowledge cutoff |
+ | | Training Data | Params | Input modalities | Output modalities | Context length | GQA | Data volume | Knowledge cutoff |
  | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- |
  | Llama 3.2-Vision | (Image, text) pairs | 11B (10.6) | Text \+ Image | Text | 128k | Yes | 6B (image, text) pairs | December 2023 |
- | | | 90B (88.8) | Text \+ Image | Text | 128k | Yes | | |
+ | Llama 3.2-Vision | (Image, text) pairs | 90B (88.8) | Text \+ Image | Text | 128k | Yes | 6B (image, text) pairs | December 2023 |
 
  **Supported Languages:** For text only tasks, English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai are officially supported. Llama 3.2 has been trained on a broader collection of languages than these 8 supported languages. Note for image+text applications, English is the only language supported.
 
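As context for the model-architecture paragraph above: the adapter it describes injects image-encoder representations into the language model through cross-attention. The sketch below is a minimal, hypothetical illustration of one such layer; the module names, dimensions, and gating scheme are assumptions made for clarity, not Meta's implementation.

```python
import torch
import torch.nn as nn

class CrossAttentionAdapterLayer(nn.Module):
    """Toy cross-attention adapter: text hidden states attend to image features."""

    def __init__(self, hidden_size: int, vision_size: int, num_heads: int):
        super().__init__()
        # Project image-encoder outputs into the LLM's hidden dimension.
        self.vision_proj = nn.Linear(vision_size, hidden_size)
        self.norm = nn.LayerNorm(hidden_size)
        self.cross_attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        # Zero-initialized gate so the adapter starts as (almost) a no-op
        # around the pre-trained text model.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_hidden: torch.Tensor, image_features: torch.Tensor) -> torch.Tensor:
        # text_hidden:    (batch, text_len, hidden_size)    from the pre-trained text stack
        # image_features: (batch, image_tokens, vision_size) from the image encoder
        vision = self.vision_proj(image_features)
        attended, _ = self.cross_attn(self.norm(text_hidden), vision, vision)
        # Gated residual keeps the original text pathway intact.
        return text_hidden + torch.tanh(self.gate) * attended

# Example shapes only; the released models use far larger dimensions.
layer = CrossAttentionAdapterLayer(hidden_size=4096, vision_size=1280, num_heads=32)
text_hidden = torch.randn(1, 16, 4096)
image_features = torch.randn(1, 64, 1280)
out = layer(text_hidden, image_features)  # (1, 16, 4096)
```

The gated residual is one common way to add cross-attention layers without disturbing a frozen language model; the actual Llama 3.2-Vision adapter is described in Meta's Llama 3 technical material rather than in this commit.
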
@@ -329,31 +329,31 @@ In this section, we report the results for Llama 3.2-Vision models on standard a
 
  | Category | Benchmark | \# Shots | Metric | Llama 3.2 11B | Llama 3.2 90B |
  | ----- | ----- | ----- | ----- | ----- | ----- |
- | Image Understanding | VQAv2 (test-dev, 30k) | 0 | Zero-shot Accuracy | 66.83 | 73.64 |
- | | Text VQA (val) | 0 | Zero-shot Relaxed accuracy | 73.14 | 73.52 |
- | | DocVQA (val, unseen) | 0 | Zero-shot Average Normalized Levenshtein Similarity (ANLS) | 62.26 | 70.65 |
- | Visual Reasoning | MMMU (val, 0-shot) | 0 | Zero-shot Micro Average Accuracy | 41.67 | 49.33 |
- | | ChartQA (test) | 0 | Zero-shot Accuracy | 39.4 | 54.16 |
- | | InfographicsQA (val, unseen) | 0 | Zero-shot Average Normalized Levenshtein Similarity (ANLS) | 43.21 | 56.79 |
- | | AI2 Diagram (test) | 0 | Zero-shot Accuracy | 62.37 | 75.26 |
+ | Image Understanding | VQAv2 (val) | 0 | Accuracy | 66.8 | 73.6 |
+ | | Text VQA (val) | 0 | Relaxed accuracy | 73.1 | 73.5 |
+ | | DocVQA (val, unseen) | 0 | ANLS | 62.3 | 70.7 |
+ | Visual Reasoning | MMMU (val, 0-shot) | 0 | Micro average accuracy | 41.7 | 49.3 |
+ | | ChartQA (test) | 0 | Accuracy | 39.4 | 54.2 |
+ | | InfographicsQA (val, unseen) | 0 | ANLS | 43.2 | 56.8 |
+ | | AI2 Diagram (test) | 0 | Accuracy | 62.4 | 75.3 |
 
  ### Instruction Tuned Models
 
  | Modality | Capability | Benchmark | \# Shots | Metric | Llama 3.2 11B | Llama 3.2 90B |
  | ----- | :---: | ----- | :---: | :---: | ----- | ----- |
- | Image | College-level Problems and Mathematical Reasoning | MMMU (val, CoT) | 0 | micro avg accuracy | 50.7 | 60.3 |
- | | | MMMU-Pro, Standard (10 opts, test) | 0 | accuracy | 33.0 | 45.2 |
- | | | MMMU-Pro, Vision (test) | 0 | accuracy | 23.7 | 33.8 |
- | | | MathVista (testmini) | 0 | accuracy | 51.5 | 57.3 |
- | | Charts and Diagram Understanding | ChartQA (test, CoT) | 0 | relaxed accuracy | 83.4 | 85.5 |
- | | | AI2 Diagram (test) | 0 | accuracy | 91.1 | 92.3 |
+ | Image | College-level Problems and Mathematical Reasoning | MMMU (val, CoT) | 0 | Micro average accuracy | 50.7 | 60.3 |
+ | | | MMMU-Pro, Standard (10 opts, test) | 0 | Accuracy | 33.0 | 45.2 |
+ | | | MMMU-Pro, Vision (test) | 0 | Accuracy | 23.7 | 33.8 |
+ | | | MathVista (testmini) | 0 | Accuracy | 51.5 | 57.3 |
+ | | Charts and Diagram Understanding | ChartQA (test, CoT) | 0 | Relaxed accuracy | 83.4 | 85.5 |
+ | | | AI2 Diagram (test) | 0 | Accuracy | 91.1 | 92.3 |
  | | | DocVQA (test) | 0 | ANLS | 88.4 | 90.1 |
- | | General Visual Question Answering | VQAv2 (test) | 0 | accuracy | 75.2 | 78.1 |
+ | | General Visual Question Answering | VQAv2 (test) | 0 | Accuracy | 75.2 | 78.1 |
  | | | | | | | |
- | Text | General | MMLU | 0 | macro\_avg/acc | [73.0](https://docs.google.com/spreadsheets/d/1b3IrobU5rTfbxtR-lfEMn7Fb41_YO3kfhrPmuP-g6ys/edit?gid=688970324#gid=688970324&range=E19) | [86.0](https://docs.google.com/spreadsheets/d/1b3IrobU5rTfbxtR-lfEMn7Fb41_YO3kfhrPmuP-g6ys/edit?gid=688970324#gid=688970324&range=H19) |
- | | Math | MATH (CoT) | 0 | final\_em | [51.9](https://docs.google.com/spreadsheets/d/1b3IrobU5rTfbxtR-lfEMn7Fb41_YO3kfhrPmuP-g6ys/edit?gid=688970324#gid=688970324&range=E25) | [68.0](https://docs.google.com/spreadsheets/d/1b3IrobU5rTfbxtR-lfEMn7Fb41_YO3kfhrPmuP-g6ys/edit?gid=688970324#gid=688970324&range=H25) |
- | | Reasoning | GPQA | 0 | acc | [32.8](https://docs.google.com/spreadsheets/d/1b3IrobU5rTfbxtR-lfEMn7Fb41_YO3kfhrPmuP-g6ys/edit?gid=688970324#gid=688970324&range=E27) | [46.7](https://docs.google.com/spreadsheets/d/1b3IrobU5rTfbxtR-lfEMn7Fb41_YO3kfhrPmuP-g6ys/edit?gid=688970324#gid=688970324&range=H27) |
- | | Multilingual | MGSM (CoT) | 0 | em | [68.9](https://docs.google.com/spreadsheets/d/1b3IrobU5rTfbxtR-lfEMn7Fb41_YO3kfhrPmuP-g6ys/edit?gid=688970324#gid=688970324&range=E33) | [86.9](https://docs.google.com/spreadsheets/d/1b3IrobU5rTfbxtR-lfEMn7Fb41_YO3kfhrPmuP-g6ys/edit?gid=688970324#gid=688970324&range=H33) |
+ | Text | General | MMLU (CoT) | 0 | Macro\_avg/acc | 73.0 | 86.0 |
+ | | Math | MATH (CoT) | 0 | Final\_em | 51.9 | 68.0 |
+ | | Reasoning | GPQA | 0 | Accuracy | 32.8 | 46.7 |
+ | | Multilingual | MGSM (CoT) | 0 | em | 68.9 | 86.9 |
 
  ## Responsibility & Safety
 
@@ -399,8 +399,8 @@ In addition to our safety work above, we took extra care on measuring and/or mit
 
  **2\. Child Safety:** Child Safety risk assessments were conducted using a team of experts, to assess the model’s capability to produce outputs that could result in Child Safety risks and inform on any necessary and appropriate risk mitigations via fine tuning. We leveraged those expert red teaming sessions to expand the coverage of our evaluation benchmarks through Llama 3 model development. For Llama 3, we conducted new in-depth sessions using objective based methodologies to assess the model risks along multiple attack vectors including the additional languages Llama 3 is trained on. We also partnered with content specialists to perform red teaming exercises assessing potentially violating content while taking account of market specific nuances or experiences.
 
- **3\. Cyber Attacks:** Our cyber attack uplift study investigated whether LLMs can enhance human capabilities in hacking tasks, both in terms of skill level and speed.
- Our attack automation study focused on evaluating the capabilities of LLMs when used as autonomous agents in cyber offensive operations, specifically in the context of ransomware attacks. This evaluation was distinct from previous studies that considered LLMs as interactive assistants. The primary objective was to assess whether these models could effectively function as independent agents in executing complex cyber-attacks without human intervention.
+ **3\. Cyber Attacks:** For Llama 3.1 405B, our cyber attack uplift study investigated whether LLMs can enhance human capabilities in hacking tasks, both in terms of skill level and speed.
+ Our attack automation study focused on evaluating the capabilities of LLMs when used as autonomous agents in cyber offensive operations, specifically in the context of ransomware attacks. This evaluation was distinct from previous studies that considered LLMs as interactive assistants. The primary objective was to assess whether these models could effectively function as independent agents in executing complex cyber-attacks without human intervention. Because Llama 3.2’s vision capabilities are not generally germane to cyber uplift, we believe that the testing conducted for Llama 3.1 also applies to Llama 3.2.
 
  ### Community