---
library_name: transformers
base_model:
- meta-llama/Meta-Llama-3-8B-Instruct
---

# Model Card for Llama3-8B-1.58

### Llama3-8B-1.58 Models

The **Llama3-8B-1.58** models are large language models fine-tuned on the **BitNet 1.58b architecture**, starting from the base model **Llama-3-8B-Instruct**.

For a deeper dive into the methods and results, check out our [blog post](https://).


## Model Details

### Model Sources

- **Repository:** [Model](https://huggingface.co/HF1BitLLM/Llama3-8B-1.58-Sigmoid-k100-10B-tokens)
- **Paper:** [The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits](https://arxiv.org/abs/2402.17764)


## How to Get Started with the Model

You can load and test the model with Transformers by following the steps below.

Start by installing the Transformers build that includes the configuration needed to load BitNet models:
```bash
pip install git+https://github.com/huggingface/transformers.git@refs/pull/33410/head
```

Then load the model and run a short generation:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the 1.58-bit model; the tokenizer comes from the base Llama 3 Instruct model
model = AutoModelForCausalLM.from_pretrained("HF1BitLLM/Llama3-8B-1.58-Linear-10B-tokens", device_map="cuda", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

input_text = "Daniel went back to the garden. Mary travelled to the kitchen. Sandra journeyed to the kitchen. Sandra went to the hallway. John went to the bedroom. Mary went back to the garden. Where is Mary?\nAnswer:"

input_ids = tokenizer.encode(input_text, return_tensors="pt").cuda()
# max_new_tokens limits only the continuation; max_length would also count the prompt tokens
output = model.generate(input_ids, max_new_tokens=10, do_sample=False)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)
```

## Training Details

### Training Data

The model was trained on a subset of [FineWeb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu).

### Training Process

1. **Starting Point**
   - Initialized from Llama3 8B weights

2. **Training Duration**
   - Fine-tuned for 5,000 steps

3. **Dataset**
   - FineWeb-edu dataset

4. **Batch Size**
   - 2 million tokens per step
   - Total tokens: 5,000 steps * 2 million tokens = 10 billion tokens

5. **Lambda Scheduler**
   - Used a sigmoid lambda scheduler for warmup quantization
   - Lambda value: `1 / (1 + exp(-k * (step / 1000 - 0.5)))`
   - This gradually introduced quantization over the first 1,000 steps

6. **Learning Rate**
   - Base learning rate: 1e-4

7. **Performance**
   - Reached competitive benchmark results despite fine-tuning on only 10 billion tokens
   - Outperformed some models trained on much larger datasets (e.g., BitNet 7B trained on 100B tokens)

8. **Evaluation**
   - Regular evaluations using various metrics
   - Metrics included perplexity, MMLU scores, and other standard benchmarks

9. **Quantization**
   - 1.58-bit (ternary) quantization for weights
   - Activations quantized to 8-bit precision

10. **Key Findings**
    - Warmup quantization (sigmoid or linear lambda scheduler) proved crucial for performance
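
The lambda formula in step 5 can be sketched in plain Python as below. This is an illustrative sketch, not the training code: the function names are hypothetical, the default `k = 100` is only inferred from the `Sigmoid-k100` repository name, and the `blend` helper shows one plausible way a lambda value could interpolate between full-precision and quantized weights.

```python
import math


def sigmoid_lambda(step: int, k: float = 100.0, warmup_steps: int = 1000) -> float:
    """Lambda from the card's formula: 1 / (1 + exp(-k * (step / warmup_steps - 0.5))).

    k=100 and warmup_steps=1000 are assumed defaults for illustration.
    """
    z = k * (step / warmup_steps - 0.5)
    # Numerically stable sigmoid: avoid exp() of a large positive number.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def blend(full_precision: float, quantized: float, lam: float) -> float:
    """Assumed interpolation: lam=0 keeps the original value, lam=1 uses the quantized one."""
    return full_precision + lam * (quantized - full_precision)


if __name__ == "__main__":
    for step in (0, 250, 500, 750, 1000):
        print(step, sigmoid_lambda(step))
```

With a large `k`, lambda stays near 0 early in warmup, crosses 0.5 at the midpoint (step 500), and saturates near 1 by step 1,000, so quantization is switched on over a narrow window rather than ramped evenly.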

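The quantization described in step 9 can be illustrated with a small pure-Python sketch. It assumes the absmean weight scaling and absmax activation scaling described in the BitNet b1.58 paper cited above; the function names, the `eps` value, and the use of flat lists instead of tensors are assumptions made for readability, not the actual training implementation.

```python
def quantize_weights_ternary(weights, eps=1e-5):
    """Map weights to {-1, 0, 1} using an absmean scale (assumed, per the cited paper)."""
    gamma = sum(abs(w) for w in weights) / len(weights)  # mean absolute value
    scale = gamma + eps
    # Scale, round to the nearest integer, then clip to the ternary range.
    quantized = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantized, scale


def quantize_activations_int8(activations, eps=1e-5):
    """Map activations to signed 8-bit integers using an absmax scale (assumed)."""
    absmax = max(abs(a) for a in activations)
    scale = 127.0 / max(absmax, eps)
    quantized = [max(-128, min(127, round(a * scale))) for a in activations]
    return quantized, scale


if __name__ == "__main__":
    q_w, w_scale = quantize_weights_ternary([0.9, -1.1, 0.05, -0.02, 2.0])
    q_a, a_scale = quantize_activations_int8([0.5, -1.0, 0.25])
    print(q_w, w_scale)
    print(q_a, a_scale)
```

Each weight takes one of three values, which carries log2(3) ≈ 1.58 bits of information and gives the format its name.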

## Evaluation

The models were evaluated on the nanotron checkpoints using LightEval:

![results](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/1.58llm_extreme_quantization/metrics_comparison_updated.png)



## Citation

```bibtex
@misc{,
      title={1.58-Bit LLM: A New Era of Extreme Quantization}, 
      author={Mohamed Mekkouri and Marc Sun and Leandro von Werra and Thomas Wolf},
      year={2024},
}
```