Phi-3 Model with Extended Vocabulary and Fine-Tuning for Japanese
Overview
This project is a proof of concept that extends the base vocabulary of the Phi-3 model and then applies supervised fine-tuning to teach it a new language, Japanese. Despite the very small custom dataset, the gain in Japanese language understanding is substantial.
Model Details
- Base Model: Phi-3
- Objective: Extend the base vocabulary and fine-tune for Japanese language understanding.
- Dataset: Custom dataset of 1,000 entries generated using ChatGPT-4.
- Language: Japanese
Dataset
The dataset used for this project was generated with the assistance of ChatGPT-4. It comprises 1,000 entries, carefully curated to cover a diverse range of topics and linguistic structures.
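The exact schema of the entries is not documented here; as an illustration only, a single instruction-style entry might look like the sketch below (the field names and content are hypothetical, not the project's actual format):

```python
# Hypothetical example of one dataset entry; the project's actual
# schema and field names are not documented, so these are assumptions.
example_entry = {
    "instruction": "次の文を英語に翻訳してください。",  # "Translate the following sentence into English."
    "input": "今日はとても良い天気です。",              # "The weather is very nice today."
    "output": "The weather is very nice today.",
}
```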
Training
Vocabulary Extension
The base vocabulary of the Phi-3 model was extended with new Japanese tokens. This step is crucial because the stock tokenizer often splits Japanese text into many short pieces, so dedicated Japanese tokens give the model shorter sequences and more natural units to learn from, enabling it to comprehend and generate Japanese text more effectively.
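A minimal sketch of what this step can look like with the Hugging Face transformers library follows; the checkpoint name and the token list are assumptions, not the project's recorded choices:

```python
# Sketch of vocabulary extension, assuming the Hugging Face
# "transformers" library and the public Phi-3 Mini checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "microsoft/Phi-3-mini-4k-instruct"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical new tokens; in practice these would come from a
# tokenizer trained on a Japanese corpus.
new_tokens = ["こんにちは", "ありがとう", "日本語"]
num_added = tokenizer.add_tokens(new_tokens)

# Grow the embedding matrix so the new token ids have rows. The new
# rows are randomly initialized and must be learned during fine-tuning.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; vocabulary size is now {len(tokenizer)}")
```

Because the new embedding rows start from random initialization, the fine-tuning stage described next is what actually makes the added tokens useful.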
Fine-Tuning
Supervised fine-tuning was performed on the extended model using the custom dataset; this also trains the randomly initialized embeddings of the newly added tokens. Despite the small dataset, the model improved noticeably at understanding and generating Japanese text.
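A minimal sketch of the fine-tuning loop with the Hugging Face Trainer is shown below; the data file name, hyperparameters, and output directory are illustrative assumptions, and `tokenizer` and `model` are the extended objects from the previous sketch:

```python
# Minimal supervised fine-tuning sketch; hyperparameters and file
# names are assumptions, not the project's recorded settings.
from datasets import load_dataset
from transformers import (
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

dataset = load_dataset("json", data_files="japanese_sft_1000.jsonl")["train"]

def to_text(example):
    # Concatenate prompt and answer into a single training string.
    return {"text": f"{example['instruction']}\n{example['output']}"}

def tokenize(example):
    return tokenizer(example["text"], truncation=True, max_length=512)

tokenized = (
    dataset.map(to_text)
           .map(tokenize, remove_columns=dataset.column_names + ["text"])
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="phi3-ja-sft",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        num_train_epochs=3,
        learning_rate=2e-5,
        logging_steps=10,
    ),
    train_dataset=tokenized,
    # mlm=False gives standard next-token (causal LM) labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

# Save both the model and the tokenizer so the extended vocabulary
# travels with the checkpoint.
trainer.save_model("phi3-ja-sft")
tokenizer.save_pretrained("phi3-ja-sft")
```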
Results
Even with the limited dataset and the modest number of added tokens, the fine-tuned model demonstrated substantial gains over the base model in Japanese language understanding and generation.
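As a quick qualitative check (not a formal evaluation), the fine-tuned model can be loaded and prompted in Japanese; the directory name below matches the hypothetical training sketch above:

```python
# Load the fine-tuned model from the (assumed) output directory and
# generate a short Japanese completion.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("phi3-ja-sft")
model = AutoModelForCausalLM.from_pretrained("phi3-ja-sft")

prompt = "日本の首都はどこですか？"  # "What is the capital of Japan?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```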
Future Work
- Dataset Expansion: Increase the size and diversity of the dataset to further enhance model performance.
- Evaluation: Conduct comprehensive evaluation and benchmarking against standard Japanese language tasks.
- Optimization: Improve the model's training and inference efficiency.