---
language:
- hu
- en
- zh
tags:
- text-generation
license: cc-by-nc-4.0
widget:
- text: Elmesélek egy történetet a nyelvtechnológiáról.
---

# PULI GPTrio (6.7 billion parameters)

For further details, see [our demo site](https://juniper.nytud.hu/demo/gptrio).

- Hungarian-English-Chinese trilingual GPT-NeoX model (6.7 billion parameters)
- Trained with EleutherAI's [GPT-NeoX framework](https://github.com/EleutherAI/gpt-neox)
- Checkpoint: 410 000 steps

## Dataset

- Hungarian: 41 508 933 801 words (314 GB)
- English: 61 906 491 82 words (391 GB)
- GitHub: 6 018 366 documents (33 GB)
- Chinese: 98 693 705 456 Chinese characters (340 GB), plus 12 072 234 774 non-Chinese tokens
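
As a quick back-of-the-envelope check (not an official figure), the per-corpus sizes listed above add up to roughly a terabyte of raw training text:

```python
# Per-corpus sizes in GB, copied from the dataset list above
sizes_gb = {"Hungarian": 314, "English": 391, "GitHub": 33, "Chinese": 340}

total_gb = sum(sizes_gb.values())
print(total_gb)  # 1078 GB in total
```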

## Limitations

- max_seq_length = 2048
- float16
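
Since the checkpoint is stored in float16 (2 bytes per parameter), a rough lower bound on the memory needed just to hold the 6.7 billion weights can be sketched as follows; actual usage is higher once activations and the KV cache are included:

```python
# Rough weight-only memory estimate for the float16 checkpoint
n_params = 6.7e9        # 6.7 billion parameters
bytes_per_param = 2     # float16 = 2 bytes per parameter
weight_gb = n_params * bytes_per_param / 1e9
print(f"{weight_gb:.1f} GB")  # about 13.4 GB for the weights alone
```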

## Citation

If you use this model, please cite the following paper:

```bibtex
@inproceedings{yang-puli-gptrio,
    title = {Mono- and multilingual GPT-3 models for Hungarian},
    booktitle = {Text, Speech, and Dialogue - 26th International Conference, {TSD} 2023, Proceedings},
    year = {2023},
    publisher = {Springer},
    series = {Lecture Notes in Computer Science},
    address = {Plzeň, Czech Republic},
    author = {Yang, Zijian Győző and Laki, László János and Váradi, Tamás and Prószéky, Gábor},
    pages = {Accepted}
}
```

## Usage

```python
from transformers import GPTNeoXForCausalLM, GPTNeoXTokenizerFast

model = GPTNeoXForCausalLM.from_pretrained("NYTK/PULI-GPTrio")
tokenizer = GPTNeoXTokenizerFast.from_pretrained("NYTK/PULI-GPTrio")
prompt = "Elmesélek egy történetet a nyelvtechnológiáról."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

gen_tokens = model.generate(
    input_ids,
    do_sample=True,
    temperature=0.9,
    max_length=100,
)

gen_text = tokenizer.batch_decode(gen_tokens)[0]
print(gen_text)
```
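The `temperature=0.9` argument passed to `generate` above rescales the logits before sampling. A minimal, model-free illustration (plain Python, with hypothetical logit values) of how a lower temperature sharpens the output distribution:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits scaled by 1/temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical next-token logits
warm = softmax_with_temperature(logits, 0.9)
cold = softmax_with_temperature(logits, 0.1)
print(warm)  # probability mass fairly spread across tokens
print(cold)  # nearly all mass on the top token
```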

## Usage with pipeline

```python
from transformers import pipeline, GPTNeoXForCausalLM, GPTNeoXTokenizerFast

model = GPTNeoXForCausalLM.from_pretrained("NYTK/PULI-GPTrio")
tokenizer = GPTNeoXTokenizerFast.from_pretrained("NYTK/PULI-GPTrio")
prompt = "Elmesélek egy történetet a nyelvtechnológiáról."
generator = pipeline(task="text-generation", model=model, tokenizer=tokenizer)

print(generator(prompt)[0]["generated_text"])
```