---
language:
- en
- hi
pipeline_tag: translation
---

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6402581706c715b93401f1e2/F_jewCiFKR4e8PiLC15C7.png)


## Model Details
This is a merge of the LoRA trained by nateraw on an English-to-Hinglish translation dataset, originally for Llama2-7B, applied to OpenHathi-7B-Base. Since OpenHathi saw more Hindi data in its pretraining than Llama2, the translations are significantly better.
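
For reference, a merge like this can be reproduced with the peft library. Below is a minimal sketch, assuming the base model is `sarvamai/OpenHathi-7B-Hi-v0.1-Base`; the adapter ID is a placeholder for nateraw's English-to-Hinglish LoRA, not a real repository name:

```python
import torch
from peft import PeftModel
from transformers import LlamaForCausalLM

# Load the OpenHathi base model in bfloat16.
base = LlamaForCausalLM.from_pretrained(
    "sarvamai/OpenHathi-7B-Hi-v0.1-Base", torch_dtype=torch.bfloat16
)

# Attach the LoRA adapter ("<english-to-hinglish-lora>" is a placeholder ID).
model = PeftModel.from_pretrained(base, "<english-to-hinglish-lora>")

# Bake the adapter weights into the base weights and save the merged model.
model = model.merge_and_unload()
model.save_pretrained("OpenHathi-7B-English-to-Hinglish")
```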

## Prompting
You can use the prompt template provided by nateraw:

`"Translate from english to hinglish:\n{{en}}\n---\nTranslation:\n"`


**Sample code**:
```python
from transformers import LlamaForCausalLM, AutoTokenizer
import torch

device = "cuda:0"
tokenizer = AutoTokenizer.from_pretrained("akashgoel-id/OpenHathi-7B-English-to-Hinglish")
model = LlamaForCausalLM.from_pretrained(
    "akashgoel-id/OpenHathi-7B-English-to-Hinglish", torch_dtype=torch.bfloat16
).to(device)

# {en} is filled in below via str.format.
PROMPT_TEMPLATE = "Translate from english to hinglish:\n{en}\n---\nTranslation:\n"

while True:
    user_input = input("Enter input: ")
    prompt = PROMPT_TEMPLATE.format(en=user_input)
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    # max_length=500 caps the prompt and generated tokens together.
    generate_ids = model.generate(inputs.input_ids, max_length=500)
    print(tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0])
```
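
Note that `max_length` bounds the prompt and the generated continuation together, so long inputs leave less room for the translation; passing `max_new_tokens` to `generate` instead bounds only the newly generated tokens and may be preferable for long inputs.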




## Limitations
The model is still not very good with idioms, as the examples below show:

1) Input: When it rains, it pours

   Response: Jab baarish hoti hai, to baarish hoti hai

   Evaluation: This translation is quite literal and misses the idiomatic meaning. The original phrase implies that when something happens, it often happens to an extreme degree. The provided translation simply states that "when it rains, it rains," which doesn't capture the essence of the idiom.

2) Input: Don't count your chickens before they hatch

   Response: Apne murgon ko ande se pahle na ginein

   Evaluation: This is a reasonable translation of the idiom, maintaining the metaphorical meaning about not making plans based on assumptions of uncertain future events.

3) Input: Biting off more than you can chew

   Response: Aap jo chaba sakte hain usse adhik kaatna

   Evaluation: This translation captures the literal aspect of biting and chewing but may not fully convey the idiomatic sense of taking on a task that is too big or difficult to handle.

4) Input: The ball is in your court

   Response: Gend aapke court mein hai

   Evaluation: This translation effectively communicates the meaning of the idiom, which is about it being someone else's turn to make a decision or take an action.

5) Input: Beating around the bush

   Response: Bush ke chaaron or peetna

   Evaluation: This is a literal translation and doesn't capture the idiomatic meaning of avoiding the main point or not speaking directly about a subject. The phrase "Ghumaphira ke baat karna" would be more appropriate.


## Next steps
1) The model seems to be highly censored, since it is based on Llama2. A next step would be to reduce this censorship by fine-tuning on less filtered data, similar to what WizardLM did for Llama2.
2) Fine-tune on idioms to improve idiomatic translations.