Update README.md
language:
- en
---

# Eli: A Bilingual Hindi-English Large Language Model

## Introduction

Eli is an open-source bilingual Hindi-English Large Language Model (LLM) designed to bridge the linguistic gap between Hindi and English, and a step toward broadening the scope of LLMs to more languages.

## Purpose Behind Eli

**Why We Built Eli:**

- **Language Adaptation:** Enhance language adaptability within LLMs for Hindi and English.
- **Efficient Training:** Train and fine-tune on a compact dataset of 1 billion tokens.
- **Optimized Processes:** Identify and implement the most efficient training processes.
- **World Knowledge Acquisition:** Observe how the model acquires and processes world knowledge.
- **Training Method Optimization:** Optimize training methods tailored to each development stage.

## Development Stages

### Pre-training

- **Objective:** Familiarize Eli with a newly enriched vocabulary.
- **Method:** Full-weight pre-training on a 500-million-token corpus using 2x A100 GPUs, taking about 25 hours.
- **Outcome:** Improved Hindi token prediction and generation capabilities.
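As a rough sanity check, the quoted budget implies the following per-GPU throughput (simple arithmetic from the numbers above, not a measured figure):

```python
# Back-of-envelope throughput implied by the pre-training figures above:
# 500 million tokens on 2x A100 GPUs in about 25 hours.
tokens = 500_000_000
gpus = 2
hours = 25

tokens_per_gpu_second = tokens / (gpus * hours * 3600)
print(f"~{tokens_per_gpu_second:,.0f} tokens/sec per GPU")  # ~2,778
```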

### Bilingual Next Token Prediction and Translation

- **Inspired By:** The OpenHathi series by Sarvam.ai.
- **Dataset:** 200,000 tokens, translated with IndicTrans2.
- **Method:** Alternating sentences between Hindi and English for enhanced alignment and balanced exposure.
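The alternation scheme can be sketched in a few lines; a minimal illustration (the Hindi strings below are placeholder translations standing in for IndicTrans2 output):

```python
# Sketch of the sentence-alternation scheme described above: pair each
# English sentence with its pre-computed Hindi translation and interleave
# them, so every training example exposes the model to both languages.

def interleave(english_sentences, hindi_sentences):
    """Alternate English and Hindi sentences into one training text."""
    mixed = []
    for en, hi in zip(english_sentences, hindi_sentences):
        mixed.append(en)
        mixed.append(hi)
    return " ".join(mixed)

en = ["The weather is nice today.", "Let us go for a walk."]
hi = ["आज मौसम अच्छा है।", "चलो टहलने चलते हैं।"]  # placeholder translations

print(interleave(en, hi))
```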

### Bilingual Instruct Fine-tuning

- **Objective:** Enhance model responsiveness in both English and Hindi.
- **Method:** Supervised fine-tuning with low-rank adaptation (LoRA) using various instruction datasets.
- **Outcome:** A fine-tuned model available on Hugging Face, with a 4-bit quantized version for hands-on experience.
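Low-rank adaptation trains a small additive update W + (α/r)·B·A on top of frozen base weights; a toy NumPy sketch of the idea (illustrative shapes only, not Eli's actual training code):

```python
import numpy as np

# Toy illustration of LoRA: instead of updating the full d_out x d_in
# weight W, train two small factors B (d_out x r) and A (r x d_in) and
# add their scaled product to the frozen W in the forward pass.
d_out, d_in, r, alpha = 8, 16, 2, 4

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))   # frozen base weight
A = rng.standard_normal((r, d_in))       # trainable rank-r factor
B = np.zeros((d_out, r))                 # trainable, zero-initialized

# B = 0 means the adapter starts as a no-op: the model initially
# behaves exactly like the base model.
W_eff = W + (alpha / r) * B @ A
x = rng.standard_normal(d_in)
assert np.allclose(W_eff @ x, W @ x)

# Trainable parameters: r*(d_in + d_out) for the adapter vs d_in*d_out
# for a full-weight update.
print(r * (d_in + d_out), "adapter params vs", d_in * d_out, "full params")
```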

### DPO Fine-tuning

- **Objective:** Refine model preferences using Direct Preference Optimization (DPO).
- **Method:** Translation and fine-tuning with the Anthropic/hh-rlhf dataset.
- **Outcome:** Comprehensive evaluation is ongoing.
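DPO needs no separate reward model: each preference pair contributes the loss −log σ(β[(log πθ(y_w) − log π_ref(y_w)) − (log πθ(y_l) − log π_ref(y_l))]). A minimal sketch with made-up log-probabilities:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen_margin - rejected_margin))),
    where each margin is the policy log-prob minus the reference log-prob."""
    logits = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# Made-up sequence log-probabilities for one (chosen, rejected) pair:
loss = dpo_loss(pi_chosen=-12.0, pi_rejected=-15.0,
                ref_chosen=-13.0, ref_rejected=-14.0, beta=0.1)
print(f"{loss:.4f}")  # 0.5981
```

As the policy assigns relatively more probability to the chosen response than the reference does, the loss falls below log 2 and keeps shrinking.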

## Learnings and Future Directions

**Challenges:**

- **World Knowledge:** Occasional hallucinations in response to specific queries.
- **Translation:** Requires more training data for nuanced translations.
- **Fine-tuning:** Future iterations will choose between full-weight and LoRA fine-tuning based on further tests.

**What's Next:**

- **Romanized Hindi:** Incorporate Romanized Hindi for added linguistic versatility.
- **Continuous Learning:** Refine data pipelines, increase the training dataset to 10-15 billion Hindi tokens, and improve efficiency.

## Generate

```python
# Minimal generation sketch. The model id is assumed from the Hugging Face
# links in the Conclusion; adjust dtype/device settings for your hardware.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Cognitive-Lab/LLama3-Gaja-Hindi-8B-base-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

messages = [
    # System prompt from the original README (shown truncated in the diff context).
    {"role": "system", "content": "You are Eli, an AI assistant created by NeoHumans-ai and trained ..."},
    {"role": "user", "content": "नमस्ते! अपने बारे में बताओ।"},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

## Benchmarks

Coming soon.

## Conclusion

Eli is designed to handle multi-turn chat conversations and understands Hinglish, making it highly effective for bilingual and code-mixed language contexts. Explore Eli's capabilities on Hugging Face and experience the model firsthand on [chat.cognitivelab.in](https://chat.cognitivelab.in/).

Weights and datasets are available on Hugging Face:

- [Base Model](https://huggingface.co/Cognitive-Lab/LLama3-Gaja-Hindi-8B-base-v0.1)
- [Hindi Instruct Dataset](https://huggingface.co/datasets/Cognitive-Lab/Hindi-Instruct-dataset)

Stay tuned for more updates as we continue to evolve and enrich Eli.