---
base_model: Qwen/Qwen2.5-7B-Instruct
library_name: peft
datasets:
- gbharti/finance-alpaca
- sujet-ai/Sujet-Finance-Instruct-177k
tags:
- krx
---
# Qwen 2.5 7B Instruct Model Fine-tuning
This repository contains code for fine-tuning the Qwen 2.5 7B Instruct model using Amazon SageMaker. The project uses QLoRA (Quantized Low-Rank Adaptation) for efficient fine-tuning of large language models.
## Project Structure

```
.
├── scripts/
│   ├── train.py
│   ├── tokenization_qwen2.py
│   ├── requirements.txt
│   └── bootstrap.sh
├── sagemaker_train.py
└── README.md
```
## Prerequisites
- Amazon SageMaker access
- Hugging Face account and access token
- AWS credentials configured
- Python 3.10+
## Environment Setup
The project uses the following key dependencies:
- PyTorch 2.1.0
- Transformers (latest from main branch)
- Accelerate >= 0.27.0
- PEFT >= 0.6.0
- BitsAndBytes >= 0.41.0
## Model Configuration

- Base Model: `Qwen/Qwen2.5-7B-Instruct`
- Training Method: QLoRA (4-bit quantization)
- Instance Type: `ml.p5.48xlarge`
- Distribution Strategy: PyTorch DDP (see the launcher sketch below)
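
A minimal launcher sketch consistent with this configuration is shown below. The Hugging Face DLC version strings, the IAM role, the distribution key, and the hyperparameter subset are illustrative assumptions, not values taken from `sagemaker_train.py`.

```python
# Sketch: launching the fine-tuning job on SageMaker (illustrative values only).
from sagemaker.huggingface import HuggingFace

role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # assumption: your execution role

estimator = HuggingFace(
    entry_point="train.py",
    source_dir="scripts",
    role=role,
    instance_type="ml.p5.48xlarge",
    instance_count=1,
    transformers_version="4.36",          # assumption: any DLC recent enough for Qwen2.5
    pytorch_version="2.1",
    py_version="py310",
    distribution={"pytorchddp": {"enabled": True}},  # PyTorch DDP; key may vary by SDK version
    hyperparameters={"epochs": 3, "learning_rate": 1e-5},
    environment={"HF_TOKEN": "<your-hf-token>"},
)

estimator.fit()
```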
## Training Configuration

### Hyperparameters
```python
{
    'epochs': 3,
    'per_device_train_batch_size': 4,
    'gradient_accumulation_steps': 8,
    'learning_rate': 1e-5,
    'max_steps': 1000,
    'bf16': True,
    'max_length': 2048,
    'gradient_checkpointing': True,
    'optim': 'adamw_torch',
    'lr_scheduler_type': 'cosine',
    'warmup_ratio': 0.1,
    'weight_decay': 0.01,
    'max_grad_norm': 0.3
}
```
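
For reference, these values map onto `transformers.TrainingArguments` roughly as sketched below; `max_length` is applied at tokenization time rather than here, and the output directory is an assumption (SageMaker's default model path).

```python
from transformers import TrainingArguments

# Sketch: the hyperparameters above expressed as TrainingArguments.
training_args = TrainingArguments(
    output_dir="/opt/ml/model",          # assumption: SageMaker's default output location
    num_train_epochs=3,                  # 'epochs'
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=1e-5,
    max_steps=1000,
    bf16=True,
    gradient_checkpointing=True,
    optim="adamw_torch",
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    weight_decay=0.01,
    max_grad_norm=0.3,
    logging_steps=10,                    # see "Training Process" below
    save_steps=50,
)
```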
### Environment Variables

The training environment is configured with optimizations for distributed training and memory management (a representative sketch follows the list):
- CUDA device configuration
- Memory optimization settings
- EFA (Elastic Fabric Adapter) configuration for distributed training
- Hugging Face token and cache settings
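
The sketch below illustrates what such an environment block can look like; the variable values are assumptions, not the exact settings used in `sagemaker_train.py`.

```python
# Sketch: illustrative training-environment variables (values are assumptions).
environment = {
    # CUDA / memory management
    "PYTORCH_CUDA_ALLOC_CONF": "max_split_size_mb:512",  # reduce allocator fragmentation
    # EFA (Elastic Fabric Adapter) for high-bandwidth inter-GPU/node communication
    "FI_PROVIDER": "efa",
    "FI_EFA_USE_DEVICE_RDMA": "1",
    "NCCL_DEBUG": "INFO",
    # Hugging Face token and cache settings
    "HF_TOKEN": "<your-hf-token>",
    "HF_HOME": "/tmp/hf_cache",
}
```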
## Training Process

1. **Environment Preparation**
   - Creates `requirements.txt` with the necessary dependencies
   - Generates `bootstrap.sh` for the Transformers installation
   - Sets up the SageMaker training configuration
2. **Model Loading** (sketched below)
   - Loads the base Qwen 2.5 7B model with 4-bit quantization
   - Configures BitsAndBytes for quantization
   - Prepares the model for k-bit training
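
   A minimal sketch of this loading path using the standard Transformers/PEFT QLoRA pattern follows; the LoRA rank, alpha, dropout, and target modules are assumptions, since they are not specified in this README.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization for QLoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Make the quantized model trainable (enables input grads, upcasts norms, etc.)
model = prepare_model_for_kbit_training(model)

# LoRA adapter configuration -- rank/alpha/targets are illustrative assumptions
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)
```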
3. **Dataset Processing** (sketched below)
   - Uses the Sujet Finance dataset
   - Formats conversations in the Qwen2 chat format
   - Applies tokenization with a maximum length of 2048 tokens
   - Implements data preprocessing with parallel processing
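
   The preprocessing step might look roughly like the sketch below; the column names (`user_prompt`, `answer`) and the `train` split are assumptions about the dataset schema, so verify the actual fields before reusing this.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
dataset = load_dataset("sujet-ai/Sujet-Finance-Instruct-177k", split="train")

def format_and_tokenize(example):
    # "user_prompt" / "answer" are placeholder column names -- check the dataset schema
    messages = [
        {"role": "user", "content": example["user_prompt"]},
        {"role": "assistant", "content": example["answer"]},
    ]
    text = tokenizer.apply_chat_template(messages, tokenize=False)
    return tokenizer(text, truncation=True, max_length=2048)

tokenized = dataset.map(
    format_and_tokenize,
    remove_columns=dataset.column_names,
    num_proc=8,  # parallel preprocessing
)
```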
4. **Training** (sketched below)
   - Implements gradient checkpointing for memory efficiency
   - Uses a cosine learning rate schedule with warmup
   - Saves checkpoints every 50 steps
   - Logs training metrics every 10 steps
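
Putting the previous sketches together, the training step roughly amounts to the standard `Trainer` loop; the data collator choice below is an assumption, and `model`, `training_args`, `tokenizer`, and `tokenized` refer to the earlier sketches.

```python
from transformers import Trainer, DataCollatorForLanguageModeling

# Causal-LM collator (mlm=False); the repository may use a custom collator instead.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

trainer = Trainer(
    model=model,             # QLoRA-wrapped model from the loading sketch
    args=training_args,      # TrainingArguments shown earlier (save_steps=50, logging_steps=10)
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
trainer.save_model("/opt/ml/model")  # assumption: SageMaker's default model output path
```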
## Monitoring and Metrics
The training process tracks the following metrics:
- Training loss
- Evaluation loss
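
When launching through SageMaker, these losses can also be surfaced as CloudWatch metrics by passing `metric_definitions` to the estimator; the regexes below are assumptions about the Trainer's log format.

```python
# Sketch: capture train/eval loss from the training logs as SageMaker metrics.
metric_definitions = [
    {"Name": "train:loss", "Regex": r"'loss': ([0-9\.]+)"},
    {"Name": "eval:loss", "Regex": r"'eval_loss': ([0-9\.]+)"},
]
# e.g. HuggingFace(..., metric_definitions=metric_definitions)
```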
## Error Handling
The implementation includes comprehensive error handling and logging:
- Environment validation
- Dataset preparation verification
- Training process monitoring
- Detailed error messages and stack traces
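
A simplified sketch of this pattern follows; the step functions are placeholders for the repository's actual logic.

```python
import logging
import traceback

logger = logging.getLogger(__name__)

def validate_environment():
    """Placeholder: check GPUs, the HF token, and expected paths."""
    logger.info("Environment validated")

def prepare_dataset():
    """Placeholder: load and verify the training dataset."""
    logger.info("Dataset prepared")

def run_training():
    """Placeholder: the actual fine-tuning loop."""
    logger.info("Training finished")

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    try:
        validate_environment()
        prepare_dataset()
        run_training()
    except Exception:
        # Emit a detailed message and full stack trace before failing the job.
        logger.error("Training failed:\n%s", traceback.format_exc())
        raise
```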
## Usage
- Configure AWS credentials and SageMaker role
- Set up Hugging Face token
- Run the training script:

  ```bash
  python sagemaker_train.py
  ```
## Custom Components

### Custom Tokenizer

The project includes a custom implementation of the Qwen2 tokenizer (`tokenization_qwen2.py`) with:
- Special token handling
- Unicode normalization
- Vocabulary management
- Input preparation for model training
## Notes
- The training script is optimized for the ml.p5.48xlarge instance type
- Uses PyTorch Distributed Data Parallel for training
- Implements gradient checkpointing for memory optimization
- Includes automatic retry mechanism for training failures
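
One way such a retry wrapper can look is sketched below; the retry count and backoff are assumptions, and `estimator` refers to the launcher sketch earlier in this README.

```python
import time

def fit_with_retries(launch_fn, max_retries=3, backoff_seconds=60):
    """Retry a training launch a few times before giving up (policy values are assumptions)."""
    for attempt in range(1, max_retries + 1):
        try:
            return launch_fn()
        except Exception as exc:
            if attempt == max_retries:
                raise
            print(f"Attempt {attempt} failed ({exc}); retrying in {backoff_seconds}s...")
            time.sleep(backoff_seconds)

# Usage: fit_with_retries(lambda: estimator.fit())
```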
## License
[Add License Information]