File size: 8,927 Bytes
a365ae2 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 |
---
language:
- en
tags:
- llama
- llm
- fine-tuning
- fill-in-the-middle
- instruction-following
license: apache-2.0
datasets:
- mlabonne/FineTome-100k
- mlfoundations/dclm-baseline-1.0-parquet
- wikimedia/wikipedia
- bigcode/starcoderdata
pipeline_tag: text-generation
---
# Custom LLM with Full Fine-Tuning
## Model Overview
This project implements a custom-trained language model based on the Meta-Llama-3.1-8B architecture. Unlike the previous version which used a high-rank adapter, this model employs full fine-tuning for enhanced learning capacity across a variety of tasks.
- **Developer:** Eric Florenzano
- **Model Type:** Large Language Model (LLM)
- **Language(s):** English, with a focus on Python for code-related tasks
- **License:** Apache-2.0
- **Base Model:** meta-llama/Meta-Llama-3.1-8B
## Unique Training Approach
This model is trained directly on a mixture of high-quality datasets for general text and code completion tasks, as well as instruction-following. Key features include:
- **Full Fine-Tuning:** Unlike the previous LoRA approach, this version uses full fine-tuning to update all model parameters.
- **Diverse Dataset Mixture:** Combines pretraining and instruction datasets for comprehensive language understanding.
- **Multi-Format Instruction Tuning:** Alternates between ChatML and Llama Chat templates for flexible instruction-following.
- **Contextual Data Prefixing:** Uses source information to address data imbalance during training.
- **Fill-in-the-Middle (FIM) Training:** Incorporates FIM tasks for enhanced context understanding.
## Training Data
The model is trained on a blend of high-quality data sources:
- **FineTome-100k:** High-quality instruction-tuned data for general language tasks.
- **dclm-baseline-1.0-parquet:** Apple's pretraining corpus for text completion/prediction.
- **English, Spanish, and French Wikipedia:** For broad language understanding.
- **Starcoder:** High-quality Python-focused code dataset for code completion tasks.
## Training Procedure
### Setup
```bash
pip install -U transformers accelerate trl wandb wheel packaging peft bitsandbytes liger-kernel flash_attn
```
## Key Features
1. **Full Fine-Tuning:** Updates all model parameters for comprehensive learning.
2. **8-bit AdamW Optimizer:** Uses `adamw_bnb_8bit` for memory-efficient training.
3. **Flash Attention 2:** Implements `flash_attention_2` for faster training.
4. **Gradient Checkpointing:** Enables training with limited GPU memory.
5. **Liger and Packing:** Utilizes `use_liger=true` and `packing=true` for efficient data handling.
6. **BFloat16 Precision:** Uses `bfloat16` for balanced precision and performance.
## Advanced Training Techniques
This model incorporates several advanced training techniques to enhance its capabilities:
### 1. Fill-in-the-Middle (FIM) Capability
FIM allows the model to complete text when given both a prefix and a suffix, making it particularly useful for tasks like code completion, text infilling, and context-aware generation.
#### Using FIM with the Model
To use the FIM capability, structure your input with special tokens:
- `<|fim_start|>`: Marks the start of the FIM input
- `<|fim_marker|>`: Separates the prefix from the suffix
- `<|fim_gen|>`: Indicates where the generated content should begin
- `<|fim_end|>`: Marks the end of the FIM input
Example FIM input:
```
<|fim_start|>{prefix}<|fim_marker|>{suffix}<|fim_gen|>
```
The model will generate content to replace `<|fim_gen|>`, filling in the middle between the prefix and suffix.
### 2. Reverse Prediction and Instruction Backtranslation
This technique enhances the model's context understanding by training it to predict previous parts of a conversation or text. It's also known as instruction backtranslation.
#### How it works:
1. The model is given a snippet of conversation or text.
2. It's then tasked with predicting what came before this snippet.
3. This process helps the model understand context, conversation flow, and logical progression of ideas.
#### Benefits:
- Improved context understanding
- Enhanced ability to maintain coherent, contextually appropriate conversations
- Better grasp of cause-and-effect relationships in text
#### Example use case:
Input:
```
Human: Thank you for the information about Paris. Can you recommend some popular tourist attractions there?
```
Task: Predict the previous exchange in this conversation.
Possible model output:
```
Human: What's the capital of France?
Assistant: The capital of France is Paris. It's known as the "City of Light" and is famous for its art, culture, and historic landmarks.
Human: Thank you for the information about Paris. Can you recommend some popular tourist attractions there?
```
### 3. Meta-FIM
Meta-FIM applies the Fill-in-the-Middle technique to larger chunks of text, including entire conversations or documents. This improves the model's ability to handle complex, nested contexts.
#### Benefits:
- Enhanced understanding of long-range dependencies in text
- Improved ability to maintain coherence across longer contexts
- Better performance on tasks requiring integration of information from multiple parts of a document or conversation
#### Example:
```
<|fim_start|>Human: What's the weather like today?
Assistant: I'm sorry, but I don't have access to real-time weather information. Could you please provide your location?<|fim_marker|>Human: Thank you for the information about Paris. Can you recommend some popular tourist attractions there?<|fim_gen|>Human: I'm in Paris, France.
Assistant: Ah, Paris! While I can't provide real-time weather information, I can tell you that Paris generally has a temperate climate. May I suggest checking a local weather website or app for the most up-to-date information?
Human: That's a good idea, thanks. While we're on the topic of Paris, can you tell me about some famous landmarks?
Assistant: Certainly! Paris is known for its iconic landmarks. Here are a few famous ones:
1. Eiffel Tower
2. Louvre Museum
3. Notre-Dame Cathedral
4. Arc de Triomphe
5. Sacré-Cœur Basilica<|fim_end|>
```
In this example, the model needs to understand and generate a coherent conversation that fits between the given start and end points.
## Evaluation
| Tasks |Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----------------|-------|----------------|-----:|-----------|---|-----:|---|------|
|tinyBenchmarks | N/A| | | | | | | |
| - tinyArc | 0|none | 25|acc_norm |↑ |0.5791|± | N/A|
| - tinyGSM8k | 0|flexible-extract| 5|exact_match|↑ |0.5536|± | N/A|
| | |strict-match | 5|exact_match|↑ |0.5536|± | N/A|
| - tinyHellaswag | 0|none | 10|acc_norm |↑ |0.8391|± | N/A|
| - tinyMMLU | 0|none | 0|acc_norm |↑ |0.6377|± | N/A|
| - tinyTruthfulQA| 0|none | 0|acc |↑ |0.4914|± | N/A|
| - tinyWinogrande| 0|none | 5|acc_norm |↑ |0.7608|± | N/A|
### Training Command
```bash
python sft_14.py \
--run_name="llama3.1-8b-continued3" \
--model_name_or_path="meta-llama/Meta-Llama-3.1-8B" \
--dataset_name="mlfoundations/dclm-baseline-1.0-parquet,mlabonne/FineTome-100k" \
--report_to="wandb" \
--optim="adamw_bnb_8bit" \
--lr_scheduler_type="cosine" \
--max_steps=100000 \
--max_seq_length=64000 \
--learning_rate=0.00001 \
--attn_implementation="flash_attention_2" \
--save_strategy="steps" \
--save_steps 50 \
--save_total_limit=10 \
--per_device_train_batch_size=1 \
--per_device_eval_batch_size=1 \
--gradient_accumulation_steps=8 \
--logging_steps=1 \
--num_train_epochs=1 \
--push_to_hub \
--hub_model_id="ericflo/Llama-3.1-8B-ContinuedTraining3-FFT" \
--hub_strategy="all_checkpoints" \
--gradient_checkpointing \
--use_liger=true \
--packing=true \
--torch_dtype="bfloat16" \
--output_dir="continuedtraining3_output"
```
## Intended Uses
This model is designed for:
- Text Completion and Generation
- Code Completion (especially Python)
- Instruction Following
- General Language Understanding
- Context-Aware Text Infilling (using FIM)
## Limitations and Biases
- The model may exhibit biases present in the training data.
- It lacks real-time knowledge beyond its training data.
- Should not be used for critical decision-making without human oversight.
## Technical Specifications
- **Base Model:** meta-llama/Meta-Llama-3.1-8B
- **Training Approach:** Full Fine-Tuning
- **Library:** Hugging Face Transformers and TRL
## Contact
For inquiries about this model, please contact Eric Florenzano through the [model repository](https://huggingface.co./ericflo/Llama-3.1-8B-ContinuedTraining3-FFT). |