---
base_model: Qwen/Qwen2.5-7B-Instruct
library_name: peft
datasets:
- gbharti/finance-alpaca
- sujet-ai/Sujet-Finance-Instruct-177k
tags:
- krx
---
# Qwen 2.5 7B Instruct Model Fine-tuning
This repository contains code for fine-tuning the Qwen 2.5 7B Instruct model using Amazon SageMaker. The project uses QLoRA (Quantized Low-Rank Adaptation) for efficient fine-tuning of large language models.
## Project Structure

```
.
├── scripts/
│   ├── train.py
│   ├── tokenization_qwen2.py
│   ├── requirements.txt
│   └── bootstrap.sh
├── sagemaker_train.py
└── README.md
```
## Prerequisites
- Amazon SageMaker access
- Hugging Face account and access token
- AWS credentials configured
- Python 3.10+
## Environment Setup
The project uses the following key dependencies:
- PyTorch 2.1.0
- Transformers (latest from main branch)
- Accelerate >= 0.27.0
- PEFT >= 0.6.0
- BitsAndBytes >= 0.41.0
## Model Configuration

- Base Model: `Qwen/Qwen2.5-7B-Instruct`
- Training Method: QLoRA (4-bit quantization)
- Instance Type: `ml.p5.48xlarge`
- Distribution Strategy: PyTorch DDP (see the launcher sketch below)
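
A minimal launcher sketch consistent with this configuration is shown below. The Hugging Face DLC version strings, the IAM role, the distribution key, and the hyperparameter subset are illustrative assumptions, not values taken from `sagemaker_train.py`.

```python
# Sketch: launching the fine-tuning job on SageMaker (illustrative values only).
from sagemaker.huggingface import HuggingFace

role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # assumption: your execution role

estimator = HuggingFace(
    entry_point="train.py",
    source_dir="scripts",
    role=role,
    instance_type="ml.p5.48xlarge",
    instance_count=1,
    transformers_version="4.36",          # assumption: any DLC recent enough for Qwen2.5
    pytorch_version="2.1",
    py_version="py310",
    distribution={"pytorchddp": {"enabled": True}},  # PyTorch DDP; key may vary by SDK version
    hyperparameters={"epochs": 3, "learning_rate": 1e-5},
    environment={"HF_TOKEN": "<your-hf-token>"},
)

estimator.fit()
```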
## Training Configuration

### Hyperparameters
```python
{
    'epochs': 3,
    'per_device_train_batch_size': 4,
    'gradient_accumulation_steps': 8,
    'learning_rate': 1e-5,
    'max_steps': 1000,
    'bf16': True,
    'max_length': 2048,
    'gradient_checkpointing': True,
    'optim': 'adamw_torch',
    'lr_scheduler_type': 'cosine',
    'warmup_ratio': 0.1,
    'weight_decay': 0.01,
    'max_grad_norm': 0.3
}
```
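
For reference, these values map onto `transformers.TrainingArguments` roughly as sketched below; `max_length` is applied at tokenization time rather than here, and the output directory is an assumption (SageMaker's default model path).

```python
from transformers import TrainingArguments

# Sketch: the hyperparameters above expressed as TrainingArguments.
training_args = TrainingArguments(
    output_dir="/opt/ml/model",          # assumption: SageMaker's default output location
    num_train_epochs=3,                  # 'epochs'
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=1e-5,
    max_steps=1000,
    bf16=True,
    gradient_checkpointing=True,
    optim="adamw_torch",
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    weight_decay=0.01,
    max_grad_norm=0.3,
    logging_steps=10,                    # see "Training Process" below
    save_steps=50,
)
```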
### Environment Variables

The training environment is configured with optimizations for distributed training and memory management (a representative sketch follows the list):
- CUDA device configuration
- Memory optimization settings
- EFA (Elastic Fabric Adapter) configuration for distributed training
- Hugging Face token and cache settings
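
The sketch below illustrates what such an environment block can look like; the variable values are assumptions, not the exact settings used in `sagemaker_train.py`.

```python
# Sketch: illustrative training-environment variables (values are assumptions).
environment = {
    # CUDA / memory management
    "PYTORCH_CUDA_ALLOC_CONF": "max_split_size_mb:512",  # reduce allocator fragmentation
    # EFA (Elastic Fabric Adapter) for high-bandwidth inter-GPU/node communication
    "FI_PROVIDER": "efa",
    "FI_EFA_USE_DEVICE_RDMA": "1",
    "NCCL_DEBUG": "INFO",
    # Hugging Face token and cache settings
    "HF_TOKEN": "<your-hf-token>",
    "HF_HOME": "/tmp/hf_cache",
}
```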
## Training Process

1. **Environment Preparation**
   - Creates `requirements.txt` with the necessary dependencies
   - Generates `bootstrap.sh` for the Transformers installation
   - Sets up the SageMaker training configuration
2. **Model Loading** (sketched below)
   - Loads the base Qwen 2.5 7B model with 4-bit quantization
   - Configures BitsAndBytes for quantization
   - Prepares the model for k-bit training
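
   A minimal sketch of this loading path using the standard Transformers/PEFT QLoRA pattern follows; the LoRA rank, alpha, dropout, and target modules are assumptions, since they are not specified in this README.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization for QLoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Make the quantized model trainable (enables input grads, upcasts norms, etc.)
model = prepare_model_for_kbit_training(model)

# LoRA adapter configuration -- rank/alpha/targets are illustrative assumptions
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)
```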
3. **Dataset Processing** (sketched below)
   - Uses the Sujet Finance dataset
   - Formats conversations in the Qwen2 chat format
   - Applies tokenization with a maximum length of 2048 tokens
   - Implements data preprocessing with parallel processing
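
   The preprocessing step might look roughly like the sketch below; the column names (`user_prompt`, `answer`) and the `train` split are assumptions about the dataset schema, so verify the actual fields before reusing this.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
dataset = load_dataset("sujet-ai/Sujet-Finance-Instruct-177k", split="train")

def format_and_tokenize(example):
    # "user_prompt" / "answer" are placeholder column names -- check the dataset schema
    messages = [
        {"role": "user", "content": example["user_prompt"]},
        {"role": "assistant", "content": example["answer"]},
    ]
    text = tokenizer.apply_chat_template(messages, tokenize=False)
    return tokenizer(text, truncation=True, max_length=2048)

tokenized = dataset.map(
    format_and_tokenize,
    remove_columns=dataset.column_names,
    num_proc=8,  # parallel preprocessing
)
```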
4. **Training** (sketched below)
   - Implements gradient checkpointing for memory efficiency
   - Uses a cosine learning rate schedule with warmup
   - Saves checkpoints every 50 steps
   - Logs training metrics every 10 steps
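
Putting the previous sketches together, the training step roughly amounts to the standard `Trainer` loop; the data collator choice below is an assumption, and `model`, `training_args`, `tokenizer`, and `tokenized` refer to the earlier sketches.

```python
from transformers import Trainer, DataCollatorForLanguageModeling

# Causal-LM collator (mlm=False); the repository may use a custom collator instead.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

trainer = Trainer(
    model=model,             # QLoRA-wrapped model from the loading sketch
    args=training_args,      # TrainingArguments shown earlier (save_steps=50, logging_steps=10)
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
trainer.save_model("/opt/ml/model")  # assumption: SageMaker's default model output path
```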
## Monitoring and Metrics
The training process tracks the following metrics:
- Training loss
- Evaluation loss
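
When launching through SageMaker, these losses can also be surfaced as CloudWatch metrics by passing `metric_definitions` to the estimator; the regexes below are assumptions about the Trainer's log format.

```python
# Sketch: capture train/eval loss from the training logs as SageMaker metrics.
metric_definitions = [
    {"Name": "train:loss", "Regex": r"'loss': ([0-9\.]+)"},
    {"Name": "eval:loss", "Regex": r"'eval_loss': ([0-9\.]+)"},
]
# e.g. HuggingFace(..., metric_definitions=metric_definitions)
```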
## Error Handling
The implementation includes comprehensive error handling and logging:
- Environment validation
- Dataset preparation verification
- Training process monitoring
- Detailed error messages and stack traces
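
A simplified sketch of this pattern follows; the step functions are placeholders for the repository's actual logic.

```python
import logging
import traceback

logger = logging.getLogger(__name__)

def validate_environment():
    """Placeholder: check GPUs, the HF token, and expected paths."""
    logger.info("Environment validated")

def prepare_dataset():
    """Placeholder: load and verify the training dataset."""
    logger.info("Dataset prepared")

def run_training():
    """Placeholder: the actual fine-tuning loop."""
    logger.info("Training finished")

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    try:
        validate_environment()
        prepare_dataset()
        run_training()
    except Exception:
        # Emit a detailed message and full stack trace before failing the job.
        logger.error("Training failed:\n%s", traceback.format_exc())
        raise
```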
## Usage
- Configure AWS credentials and SageMaker role
- Set up Hugging Face token
- Run the training script:

  ```bash
  python sagemaker_train.py
  ```
## Custom Components

### Custom Tokenizer

The project includes a custom implementation of the Qwen2 tokenizer (`tokenization_qwen2.py`) with:
- Special token handling
- Unicode normalization
- Vocabulary management
- Input preparation for model training
## Notes
- The training script is optimized for the ml.p5.48xlarge instance type
- Uses PyTorch Distributed Data Parallel for training
- Implements gradient checkpointing for memory optimization
- Includes automatic retry mechanism for training failures
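
One way such a retry wrapper can look is sketched below; the retry count and backoff are assumptions, and `estimator` refers to the launcher sketch earlier in this README.

```python
import time

def fit_with_retries(launch_fn, max_retries=3, backoff_seconds=60):
    """Retry a training launch a few times before giving up (policy values are assumptions)."""
    for attempt in range(1, max_retries + 1):
        try:
            return launch_fn()
        except Exception as exc:
            if attempt == max_retries:
                raise
            print(f"Attempt {attempt} failed ({exc}); retrying in {backoff_seconds}s...")
            time.sleep(backoff_seconds)

# Usage: fit_with_retries(lambda: estimator.fit())
```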
## License
[Add License Information]