File size: 8,927 Bytes

a365ae2

---
language:
  - en
tags:
  - llama
  - llm
  - fine-tuning
  - fill-in-the-middle
  - instruction-following
license: apache-2.0
datasets:
  - mlabonne/FineTome-100k
  - mlfoundations/dclm-baseline-1.0-parquet
  - wikimedia/wikipedia
  - bigcode/starcoderdata
pipeline_tag: text-generation
---

# Custom LLM with Full Fine-Tuning

## Model Overview

This project implements a custom-trained language model based on the Meta-Llama-3.1-8B architecture. Unlike the previous version which used a high-rank adapter, this model employs full fine-tuning for enhanced learning capacity across a variety of tasks.

- **Developer:** Eric Florenzano
- **Model Type:** Large Language Model (LLM)
- **Language(s):** English, with a focus on Python for code-related tasks
- **License:** Apache-2.0
- **Base Model:** meta-llama/Meta-Llama-3.1-8B

## Unique Training Approach

This model is trained directly on a mixture of high-quality datasets for general text and code completion tasks, as well as instruction-following. Key features include:

- **Full Fine-Tuning:** Unlike the previous LoRA approach, this version uses full fine-tuning to update all model parameters.
- **Diverse Dataset Mixture:** Combines pretraining and instruction datasets for comprehensive language understanding.
- **Multi-Format Instruction Tuning:** Alternates between ChatML and Llama Chat templates for flexible instruction-following.
- **Contextual Data Prefixing:** Uses source information to address data imbalance during training.
- **Fill-in-the-Middle (FIM) Training:** Incorporates FIM tasks for enhanced context understanding.

## Training Data

The model is trained on a blend of high-quality data sources:

- **FineTome-100k:** High-quality instruction-tuned data for general language tasks.
- **dclm-baseline-1.0-parquet:** Apple's pretraining corpus for text completion/prediction.
- **English, Spanish, and French Wikipedia:** For broad language understanding.
- **Starcoder:** High-quality Python-focused code dataset for code completion tasks.

## Training Procedure

### Setup

```bash
pip install -U transformers accelerate trl wandb wheel packaging peft bitsandbytes liger-kernel flash_attn
```

## Key Features

1. **Full Fine-Tuning:** Updates all model parameters for comprehensive learning.
2. **8-bit AdamW Optimizer:** Uses `adamw_bnb_8bit` for memory-efficient training.
3. **Flash Attention 2:** Implements `flash_attention_2` for faster training.
4. **Gradient Checkpointing:** Enables training with limited GPU memory.
5. **Liger and Packing:** Utilizes `use_liger=true` and `packing=true` for efficient data handling.
6. **BFloat16 Precision:** Uses `bfloat16` for balanced precision and performance.

## Advanced Training Techniques

This model incorporates several advanced training techniques to enhance its capabilities:

### 1. Fill-in-the-Middle (FIM) Capability

FIM allows the model to complete text when given both a prefix and a suffix, making it particularly useful for tasks like code completion, text infilling, and context-aware generation.

#### Using FIM with the Model

To use the FIM capability, structure your input with special tokens:

- `<|fim_start|>`: Marks the start of the FIM input
- `<|fim_marker|>`: Separates the prefix from the suffix
- `<|fim_gen|>`: Indicates where the generated content should begin
- `<|fim_end|>`: Marks the end of the FIM input

Example FIM input:
```
<|fim_start|>{prefix}<|fim_marker|>{suffix}<|fim_gen|>
```

The model will generate content to replace `<|fim_gen|>`, filling in the middle between the prefix and suffix.

### 2. Reverse Prediction and Instruction Backtranslation

This technique enhances the model's context understanding by training it to predict previous parts of a conversation or text. It's also known as instruction backtranslation.

#### How it works:
1. The model is given a snippet of conversation or text.
2. It's then tasked with predicting what came before this snippet.
3. This process helps the model understand context, conversation flow, and logical progression of ideas.

#### Benefits:
- Improved context understanding
- Enhanced ability to maintain coherent, contextually appropriate conversations
- Better grasp of cause-and-effect relationships in text

#### Example use case:
Input:
```
Human: Thank you for the information about Paris. Can you recommend some popular tourist attractions there?
```
Task: Predict the previous exchange in this conversation.

Possible model output:
```
Human: What's the capital of France?
Assistant: The capital of France is Paris. It's known as the "City of Light" and is famous for its art, culture, and historic landmarks.
Human: Thank you for the information about Paris. Can you recommend some popular tourist attractions there?
```

### 3. Meta-FIM

Meta-FIM applies the Fill-in-the-Middle technique to larger chunks of text, including entire conversations or documents. This improves the model's ability to handle complex, nested contexts.

#### Benefits:
- Enhanced understanding of long-range dependencies in text
- Improved ability to maintain coherence across longer contexts
- Better performance on tasks requiring integration of information from multiple parts of a document or conversation

#### Example:
```
<|fim_start|>Human: What's the weather like today?
Assistant: I'm sorry, but I don't have access to real-time weather information. Could you please provide your location?<|fim_marker|>Human: Thank you for the information about Paris. Can you recommend some popular tourist attractions there?<|fim_gen|>Human: I'm in Paris, France.
Assistant: Ah, Paris! While I can't provide real-time weather information, I can tell you that Paris generally has a temperate climate. May I suggest checking a local weather website or app for the most up-to-date information?
Human: That's a good idea, thanks. While we're on the topic of Paris, can you tell me about some famous landmarks?
Assistant: Certainly! Paris is known for its iconic landmarks. Here are a few famous ones:
1. Eiffel Tower
2. Louvre Museum
3. Notre-Dame Cathedral
4. Arc de Triomphe
5. Sacré-Cœur Basilica<|fim_end|>
```

In this example, the model needs to understand and generate a coherent conversation that fits between the given start and end points.

## Evaluation

|      Tasks      |Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----------------|-------|----------------|-----:|-----------|---|-----:|---|------|
|tinyBenchmarks   |    N/A|                |      |           |   |      |   |      |
| - tinyArc       |      0|none            |    25|acc_norm   |↑  |0.5791|±  |   N/A|
| - tinyGSM8k     |      0|flexible-extract|     5|exact_match|↑  |0.5536|±  |   N/A|
|                 |       |strict-match    |     5|exact_match|↑  |0.5536|±  |   N/A|
| - tinyHellaswag |      0|none            |    10|acc_norm   |↑  |0.8391|±  |   N/A|
| - tinyMMLU      |      0|none            |     0|acc_norm   |↑  |0.6377|±  |   N/A|
| - tinyTruthfulQA|      0|none            |     0|acc        |↑  |0.4914|±  |   N/A|
| - tinyWinogrande|      0|none            |     5|acc_norm   |↑  |0.7608|±  |   N/A|

### Training Command

```bash
python sft_14.py \
    --run_name="llama3.1-8b-continued3" \
    --model_name_or_path="meta-llama/Meta-Llama-3.1-8B" \
    --dataset_name="mlfoundations/dclm-baseline-1.0-parquet,mlabonne/FineTome-100k" \
    --report_to="wandb" \
    --optim="adamw_bnb_8bit" \
    --lr_scheduler_type="cosine" \
    --max_steps=100000 \
    --max_seq_length=64000 \
    --learning_rate=0.00001 \
    --attn_implementation="flash_attention_2" \
    --save_strategy="steps" \
    --save_steps 50 \
    --save_total_limit=10 \
    --per_device_train_batch_size=1 \
    --per_device_eval_batch_size=1 \
    --gradient_accumulation_steps=8 \
    --logging_steps=1 \
    --num_train_epochs=1 \
    --push_to_hub \
    --hub_model_id="ericflo/Llama-3.1-8B-ContinuedTraining3-FFT" \
    --hub_strategy="all_checkpoints" \
    --gradient_checkpointing \
    --use_liger=true \
    --packing=true \
    --torch_dtype="bfloat16" \
    --output_dir="continuedtraining3_output"
```

## Intended Uses

This model is designed for:

- Text Completion and Generation
- Code Completion (especially Python)
- Instruction Following
- General Language Understanding
- Context-Aware Text Infilling (using FIM)

## Limitations and Biases

- The model may exhibit biases present in the training data.
- It lacks real-time knowledge beyond its training data.
- Should not be used for critical decision-making without human oversight.

## Technical Specifications

- **Base Model:** meta-llama/Meta-Llama-3.1-8B
- **Training Approach:** Full Fine-Tuning
- **Library:** Hugging Face Transformers and TRL

## Contact

For inquiries about this model, please contact Eric Florenzano through the [model repository](https://huggingface.co./ericflo/Llama-3.1-8B-ContinuedTraining3-FFT).