Commit 807c7a4 by AdithyaSK (1 parent: 5a0c3e3)

Update README.md

  1. README.md +63 -2
README.md CHANGED
@@ -9,9 +9,60 @@ language:
  - en
  ---

- # Eli

- ## Overview

  ## Generate
  ```python
@@ -158,3 +209,13 @@ system prompt = `You are Eli, an AI assistant created by NeoHumans-ai and traine

  ## Benchmarks
  coming soon

  - en
  ---

+ # Eli: A Bilingual Hindi-English Large Language Model

+ ## Introduction
+
+ Eli is an open-source bilingual Hindi-English Large Language Model (LLM) designed to bridge the linguistic gap between Hindi and English. It is an early effort to extend LLMs to more languages.
+
+ ## Purpose Behind Eli
+
+ **Why We Built Eli:**
+
+ - **Language Adaptation:** Improve how well LLMs adapt to Hindi and English.
+ - **Efficient Training:** Train and fine-tune on a compact dataset of 1 billion tokens.
+ - **Optimized Processes:** Identify and implement the most efficient training processes.
+ - **World Knowledge Acquisition:** Observe how the model acquires and processes world knowledge.
+ - **Training Method Optimization:** Optimize training methods tailored to each development stage.
+
+ ## Development Stages
+
+ ### Pre-training
+
+ - **Objective:** Familiarize Eli with a newly enriched vocabulary.
+ - **Method:** Full-weight pre-training on a 500-million-token corpus using 2xA100 GPUs, taking about 25 hours.
+ - **Outcome:** Improved Hindi token prediction and generation capabilities.
+
+ ### Bilingual Next Token Prediction and Translation
+
+ - **Inspired By:** The OpenHathi series by Sarvam.ai.
+ - **Dataset:** 200,000 tokens, with translation using IndicTrans2.
+ - **Method:** Alternating sentences between Hindi and English for enhanced alignment and balanced exposure.
+
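The alternating-sentence method above can be sketched as follows. This is a minimal illustration, assuming each English sentence already has an aligned Hindi translation (for example, one produced by IndicTrans2); the function name `interleave_bilingual` and the sample sentences are hypothetical and do not come from the Eli training pipeline.

```python
def interleave_bilingual(english_sentences, hindi_sentences, start_with="en"):
    """Build one training text that switches language on every sentence.

    Both lists are assumed to be aligned: hindi_sentences[i] is the
    translation of english_sentences[i].
    """
    if len(english_sentences) != len(hindi_sentences):
        raise ValueError("sentence lists must be aligned one-to-one")
    if start_with not in ("en", "hi"):
        raise ValueError("start_with must be 'en' or 'hi'")

    mixed = []
    for i, (en, hi) in enumerate(zip(english_sentences, hindi_sentences)):
        # Even positions use the starting language, odd positions the other one.
        use_english = (i % 2 == 0) == (start_with == "en")
        mixed.append(en if use_english else hi)
    return " ".join(mixed)


# Hypothetical aligned sentences, for illustration only.
english = ["The sky is blue.", "Birds are flying.", "It is morning."]
hindi = ["आकाश नीला है।", "पक्षी उड़ रहे हैं।", "सुबह हो गई है।"]

print(interleave_bilingual(english, hindi))
```

Generating each document twice, once with `start_with="en"` and once with `start_with="hi"`, is one way to give every sentence exposure in both languages.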
+ ### Bilingual Instruct Fine-tuning
+
+ - **Objective:** Enhance model responsiveness in both English and Hindi.
+ - **Method:** Supervised fine-tuning with low-rank adaptation (LoRA) using various instruction datasets.
+ - **Outcome:** A fine-tuned model available on Hugging Face, with a 4-bit quantized version for hands-on experience.
+
+ ### DPO Fine-tuning
+
+ - **Objective:** Refine model preferences using Direct Preference Optimization.
+ - **Method:** Translation and fine-tuning with the Anthropic/hh-rlhf dataset.
+ - **Outcome:** Comprehensive evaluation is ongoing.
+
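For reference, the per-example objective that Direct Preference Optimization minimizes can be written in a few lines. This is a sketch of the standard DPO loss, not Eli's training code: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions, and the inputs are summed log-probabilities of the chosen and rejected responses under the policy being trained and under a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for a single preference pair.

    loss = -log(sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))))
    where each term is a sequence log-probability.
    """
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_shift - rejected_shift)

    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy matches the reference, the loss equals log(2).
print(dpo_loss(-1.0, -2.0, -1.0, -2.0))
```

The loss falls below log(2) once the policy raises the chosen response's likelihood relative to the rejected one, compared with the reference model.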
+ ## Learnings and Future Directions
+
+ **Challenges:**
+
+ - **World Knowledge:** Occasional hallucinations in response to specific queries.
+ - **Translation:** Requires more training data for nuanced translations.
+ - **Fine-tuning:** Future iterations will balance full-weight and LoRA fine-tuning based on further tests.
+
+ **What's Next:**
+
+ - **Romanized Hindi:** Incorporate Romanized Hindi for added linguistic versatility.
+ - **Continuous Learning:** Refine data pipelines, increase the training dataset to 10-15 billion Hindi tokens, and improve efficiency.

  ## Generate
  ```python


  ## Benchmarks
  coming soon
+
+ ## Conclusion
+
+ Eli is designed to handle multi-turn chat conversations and understands Hinglish, making it well suited to bilingual and code-mixed language contexts. Explore Eli’s capabilities on Hugging Face and try the model on [chat.cognitivelab.in](https://chat.cognitivelab.in/).
+
+ Weights and datasets are available on Hugging Face:
+ - [Base Model](https://huggingface.co/Cognitive-Lab/LLama3-Gaja-Hindi-8B-base-v0.1)
+ - [Instruct Dataset](https://huggingface.co/datasets/Cognitive-Lab/Hindi-Instruct-dataset)
+
+ Stay tuned for more updates as we continue to evolve and enrich Eli.