---
library_name: transformers
tags:
- hindi
- bilingual
license: llama3
language:
- hi
- en
---

# Eli: A Bilingual Hindi-English Large Language Model

## Introduction

Eli is an open-source bilingual Hindi-English Large Language Model (LLM) designed to bridge the linguistic gap between Hindi and English. Developed with careful attention to detail, Eli represents a pioneering effort to broaden the scope of LLMs to diverse languages.

## Purpose Behind Eli

**Why We Built Eli:**

- **Language Adaptation:** Enhance language adaptability within LLMs for Hindi and English.
- **Efficient Training:** Train and fine-tune on a compact dataset of 1 billion tokens.
- **Optimized Processes:** Identify and implement the most efficient training processes.
- **World Knowledge Acquisition:** Observe how the model acquires and processes world knowledge.
- **Training Method Optimization:** Optimize training methods tailored to each development stage.

## Development Stages

### Pre-training

- **Objective:** Familiarize Eli with a newly enriched vocabulary.
- **Method:** Full-weight pre-training on a 500-million-token corpus using 2x A100 GPUs, taking about 25 hours.
- **Outcome:** Improved Hindi token prediction and generation capabilities.

### Bilingual Next Token Prediction and Translation

- **Inspired By:** The OpenHathi series by Sarvam.ai.
- **Dataset:** 200,000 tokens, translated using IndicTrans2.
- **Method:** Alternating sentences between Hindi and English for enhanced alignment and balanced exposure (a minimal sketch of this interleaving appears at the end of this section).

### Bilingual Instruct Fine-tuning

- **Objective:** Enhance model responsiveness in both English and Hindi.
- **Method:** Supervised fine-tuning with low-rank adaptation (LoRA) using various instruction datasets.
- **Outcome:** A fine-tuned model available on Hugging Face, along with a 4-bit quantized version for hands-on experimentation.

### DPO Fine-tuning

- **Objective:** Refine model preferences using Direct Preference Optimization (DPO).
- **Method:** Translation and fine-tuning with the Anthropic/hh-rlhf dataset.
- **Outcome:** Comprehensive evaluation is ongoing.
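The exact data-preparation pipeline for the bilingual next-token-prediction stage is not published here; the snippet below is only a minimal sketch of the sentence-interleaving idea described above, with a hypothetical helper name and toy data. In the actual pipeline, the translated halves would come from IndicTrans2.

```python
# Minimal sketch (not the released pipeline): alternate sentences from a
# Hindi document with their English translations so that each training
# example exposes the model to both languages in the same context window.

def interleave_bilingual(hindi_sentences, english_sentences):
    """Interleave sentence-aligned Hindi and English sentences into one text."""
    interleaved = []
    for hi_sent, en_sent in zip(hindi_sentences, english_sentences):
        interleaved.append(hi_sent)
        interleaved.append(en_sent)
    return " ".join(interleaved)

# Toy example; a real run would use IndicTrans2 to produce the translations.
hindi = ["एली एक द्विभाषी भाषा मॉडल है।", "यह हिंदी और अंग्रेज़ी दोनों में उत्तर दे सकता है।"]
english = ["Eli is a bilingual language model.", "It can answer in both Hindi and English."]

print(interleave_bilingual(hindi, english))
```

Alternating at the sentence level keeps the two languages aligned within a single training sequence, which is the "balanced exposure" this stage aims for.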
## Learnings and Future Directions

**Challenges:**

- **World Knowledge:** Occasional hallucinations in response to specific queries.
- **Translation:** Requires more training data for nuanced translations.
- **Fine-tuning:** Future iterations will balance full-weight and LoRA fine-tuning based on further tests.

**What's Next:**

- **Romanized Hindi:** Incorporate Romanized Hindi for added linguistic versatility.
- **Continuous Learning:** Refine data pipelines, increase the training dataset to 10-15 billion Hindi tokens, and improve efficiency.

## Generate

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Neohumans-ai/Eli", torch_dtype=torch.bfloat16
).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("Neohumans-ai/Eli", trust_remote_code=True)

messages = [
    {
        "role": "system",
        "content": "You are Eli, an AI assistant created by NeoHumans-ai and trained on top of Llama 3 Large language model (LLM), proficient in English and Hindi. You can respond in both languages based on the user's request.",
    },
    {"role": "user", "content": "Who are you?"},
]

# Build the Llama 3 chat prompt and move it to the GPU
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to("cuda")

outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    eos_token_id=tokenizer.convert_tokens_to_ids("<|eot_id|>"),
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)

# Decode only the newly generated tokens (everything after the prompt)
response = outputs[0][input_ids.shape[-1]:]
print(tokenizer.decode(response, skip_special_tokens=True))
```
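The instruct fine-tuning stage above also mentions a 4-bit quantized version of Eli. If you would rather quantize the full-precision weights on the fly (for example, to fit the model on a smaller GPU), a minimal sketch using the `bitsandbytes` integration in `transformers` is shown below. This assumes `bitsandbytes` is installed and loads the same `Neohumans-ai/Eli` repository used above; it is not necessarily how the published 4-bit checkpoint was produced.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# On-the-fly NF4 quantization; an illustrative alternative to downloading a
# pre-quantized checkpoint, not a reproduction of the published one.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "Neohumans-ai/Eli",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Neohumans-ai/Eli", trust_remote_code=True)
```

The generation calls from the surrounding examples work unchanged; only the `from_pretrained` call differs.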
## Multi-turn Chat

To use the Eli model in a multi-turn conversation, you can follow the example code below:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import GenerationConfig, TextStreamer

model = AutoModelForCausalLM.from_pretrained(
    "Neohumans-ai/Eli", torch_dtype=torch.bfloat16
).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("Neohumans-ai/Eli", trust_remote_code=True)

# Conversation history, starting with the system prompt
messages = [
    {
        "role": "system",
        "content": "You are Eli, an AI assistant created by NeoHumans-ai and trained on top of Llama 3 Large language model (LLM), proficient in English and Hindi. You can respond in both languages based on the user's request.",
    },
]


# Add the user's input to the history, generate a streamed reply,
# and append the assistant's answer back to the history
def process_user_input(user_input):
    global messages

    # Add the user's input to the conversation history
    messages.append({"role": "user", "content": user_input})

    # Render the history into the Llama 3 chat format
    prompt_formatted_message = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=False
    )

    # Configure generation parameters
    generation_config = GenerationConfig(
        repetition_penalty=1.2,
        max_new_tokens=8000,
        temperature=0.2,
        top_p=0.95,
        top_k=40,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.convert_tokens_to_ids("<|eot_id|>"),
        pad_token_id=tokenizer.pad_token_id,
        do_sample=True,
        use_cache=True,
        return_dict_in_generate=True,
        output_attentions=False,
        output_hidden_states=False,
        output_scores=False,
    )

    streamer = TextStreamer(tokenizer)
    # The template already contains <|begin_of_text|>, so skip adding special tokens
    batch = tokenizer(
        prompt_formatted_message.strip(), return_tensors="pt", add_special_tokens=False
    )

    print("\033[32mResponse: \033[0m")  # Green "Response:" label before the streamed output

    # Generate the response (streamed to stdout as it is produced)
    generated = model.generate(
        inputs=batch["input_ids"].to("cuda"),
        generation_config=generation_config,
        streamer=streamer,
    )

    # Decode the full sequence (prompt + response)
    assistant_response = tokenizer.decode(generated["sequences"].cpu().tolist()[0])

    # Locate the last assistant header and the final <|eot_id|> token
    assistant_start_index = assistant_response.rfind("<|start_header_id|>assistant<|end_header_id|>")
    eot_index = assistant_response.rfind("<|eot_id|>")

    # Extract the text between the last assistant header and the final <|eot_id|>
    if assistant_start_index != -1 and eot_index != -1:
        final_response = assistant_response[
            assistant_start_index + len("<|start_header_id|>assistant<|end_header_id|>") : eot_index
        ]
    else:
        # Fall back to the full decoded text if the markers are not found
        final_response = assistant_response

    # Append the extracted response to the conversation history
    messages.append({"role": "assistant", "content": final_response})


# Main interaction loop: an empty input ends the chat
while True:
    print("=================================================================================")
    user_input = input("Input: ")  # Prompt the user for input
    if not user_input.strip():
        break
    process_user_input(user_input)
```

## Prompt Format

System prompt:

`You are Eli, an AI assistant created by NeoHumans-ai and trained on top of Llama 3 Large language model (LLM), proficient in English and Hindi. You can respond in both languages based on the user's request.`

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{{ system_prompt }}<|eot_id|><|start_header_id|>user<|end_header_id|>

{{ user_message_1 }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{{ model_answer_1 }}<|eot_id|><|start_header_id|>user<|end_header_id|>

{{ user_message_2 }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
```

## Benchmarks

Coming soon.

## Conclusion

Eli is designed to handle multi-turn chat conversations and understands Hinglish, making it well suited to bilingual and code-mixed language contexts. Explore Eli's capabilities on Hugging Face and experience the model firsthand at [chat.cognitivelab.in](https://chat.cognitivelab.in/).

Weights and datasets are available on Hugging Face:

- [Base Model](https://huggingface.co./Cognitive-Lab/LLama3-Gaja-Hindi-8B-base-v0.1)
- [Hindi Instruct Dataset](https://huggingface.co./datasets/Cognitive-Lab/Hindi-Instruct-dataset)

Stay tuned for more updates as we continue to evolve and enrich Eli.