Update README.md
language:
- en
---

# Eli: A Bilingual Hindi-English Large Language Model

## Introduction

Eli is an open-source bilingual Hindi-English Large Language Model (LLM) designed to bridge the linguistic gap between Hindi and English, and a step toward broadening the scope of LLMs to more languages.

## Purpose Behind Eli

**Why We Built Eli:**

- **Language Adaptation:** Enhance language adaptability within LLMs for Hindi and English.
- **Efficient Training:** Train and fine-tune on a compact dataset of 1 billion tokens.
- **Optimized Processes:** Identify and implement the most efficient training processes.
- **World Knowledge Acquisition:** Observe how the model acquires and processes world knowledge.
- **Training Method Optimization:** Optimize training methods tailored to each development stage.

## Development Stages

### Pre-training

- **Objective:** Familiarize Eli with a newly enriched vocabulary.
- **Method:** Full-weight pre-training on a 500-million-token corpus using 2x A100 GPUs, taking about 25 hours.
- **Outcome:** Improved Hindi token prediction and generation capabilities.
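As a rough sanity check, the quoted budget implies the following per-GPU throughput (simple arithmetic from the numbers above, not a measured figure):

```python
# Back-of-envelope throughput implied by the pre-training figures above:
# 500 million tokens on 2x A100 GPUs in about 25 hours.
tokens = 500_000_000
gpus = 2
hours = 25

tokens_per_gpu_second = tokens / (gpus * hours * 3600)
print(f"~{tokens_per_gpu_second:,.0f} tokens/sec per GPU")  # ~2,778
```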

### Bilingual Next Token Prediction and Translation

- **Inspired By:** The OpenHathi series by Sarvam.ai.
- **Dataset:** 200,000 tokens, translated with IndicTrans2.
- **Method:** Alternating sentences between Hindi and English for enhanced alignment and balanced exposure.
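The alternation scheme can be sketched in a few lines; a minimal illustration (the Hindi strings below are placeholder translations standing in for IndicTrans2 output):

```python
# Sketch of the sentence-alternation scheme described above: pair each
# English sentence with its pre-computed Hindi translation and interleave
# them, so every training example exposes the model to both languages.

def interleave(english_sentences, hindi_sentences):
    """Alternate English and Hindi sentences into one training text."""
    mixed = []
    for en, hi in zip(english_sentences, hindi_sentences):
        mixed.append(en)
        mixed.append(hi)
    return " ".join(mixed)

en = ["The weather is nice today.", "Let us go for a walk."]
hi = ["आज मौसम अच्छा है।", "चलो टहलने चलते हैं।"]  # placeholder translations

print(interleave(en, hi))
```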

### Bilingual Instruct Fine-tuning

- **Objective:** Enhance model responsiveness in both English and Hindi.
- **Method:** Supervised fine-tuning with low-rank adaptation (LoRA) using various instruction datasets.
- **Outcome:** A fine-tuned model available on Hugging Face, with a 4-bit quantized version for hands-on experience.
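Low-rank adaptation trains a small additive update W + (α/r)·B·A on top of frozen base weights; a toy NumPy sketch of the idea (illustrative shapes only, not Eli's actual training code):

```python
import numpy as np

# Toy illustration of LoRA: instead of updating the full d_out x d_in
# weight W, train two small factors B (d_out x r) and A (r x d_in) and
# add their scaled product to the frozen W in the forward pass.
d_out, d_in, r, alpha = 8, 16, 2, 4

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))   # frozen base weight
A = rng.standard_normal((r, d_in))       # trainable rank-r factor
B = np.zeros((d_out, r))                 # trainable, zero-initialized

# B = 0 means the adapter starts as a no-op: the model initially
# behaves exactly like the base model.
W_eff = W + (alpha / r) * B @ A
x = rng.standard_normal(d_in)
assert np.allclose(W_eff @ x, W @ x)

# Trainable parameters: r*(d_in + d_out) for the adapter vs d_in*d_out
# for a full-weight update.
print(r * (d_in + d_out), "adapter params vs", d_in * d_out, "full params")
```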

### DPO Fine-tuning

- **Objective:** Refine model preferences using Direct Preference Optimization (DPO).
- **Method:** Translation and fine-tuning with the Anthropic/hh-rlhf dataset.
- **Outcome:** Comprehensive evaluation is ongoing.
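DPO needs no separate reward model: each preference pair contributes the loss −log σ(β[(log πθ(y_w) − log π_ref(y_w)) − (log πθ(y_l) − log π_ref(y_l))]). A minimal sketch with made-up log-probabilities:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen_margin - rejected_margin))),
    where each margin is the policy log-prob minus the reference log-prob."""
    logits = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# Made-up sequence log-probabilities for one (chosen, rejected) pair:
loss = dpo_loss(pi_chosen=-12.0, pi_rejected=-15.0,
                ref_chosen=-13.0, ref_rejected=-14.0, beta=0.1)
print(f"{loss:.4f}")  # 0.5981
```

As the policy assigns relatively more probability to the chosen response than the reference does, the loss falls below log 2 and keeps shrinking.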

## Learnings and Future Directions

**Challenges:**

- **World Knowledge:** Occasional hallucinations in response to specific queries.
- **Translation:** Requires more training data for nuanced translations.
- **Fine-tuning:** Future iterations will choose between full-weight and LoRA fine-tuning based on further tests.

**What's Next:**

- **Romanized Hindi:** Incorporate Romanized Hindi for added linguistic versatility.
- **Continuous Learning:** Refine data pipelines, increase the training dataset to 10-15 billion Hindi tokens, and improve efficiency.

## Generate

```python
# Minimal generation sketch. The model id is assumed from the Hugging Face
# links in the Conclusion; adjust dtype/device settings for your hardware.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Cognitive-Lab/LLama3-Gaja-Hindi-8B-base-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

messages = [
    # System prompt from the original README (shown truncated in the diff context).
    {"role": "system", "content": "You are Eli, an AI assistant created by NeoHumans-ai and trained ..."},
    {"role": "user", "content": "नमस्ते! अपने बारे में बताओ।"},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

## Benchmarks

Coming soon.

## Conclusion

Eli is designed to handle multi-turn chat conversations and understands Hinglish, making it highly effective for bilingual and code-mixed language contexts. Explore Eli's capabilities on Hugging Face and experience the model firsthand on [chat.cognitivelab.in](https://chat.cognitivelab.in/).

Weights and datasets are available on Hugging Face:

- [Base Model](https://huggingface.co/Cognitive-Lab/LLama3-Gaja-Hindi-8B-base-v0.1)
- [Hindi Instruct Dataset](https://huggingface.co/datasets/Cognitive-Lab/Hindi-Instruct-dataset)

Stay tuned for more updates as we continue to evolve and enrich Eli.