---
license: apache-2.0
language:
- uk
- en
---

# ukr-t5-small

A compact mT5-small model fine-tuned for Ukrainian language tasks, retaining its base English understanding.

## Model Description

* **Base Model:** mT5-small
* **Fine-tuning Data:** Leipzig Corpora Collection (English & Ukrainian news from 2023)
* **Tasks:**
  * Text summarization (Ukrainian)
  * Text generation (Ukrainian)
  * Other Ukrainian-centric NLP tasks

## Technical Details

* **Model Size:** 300 MB
* **Framework:** Transformers (Hugging Face)

## Usage

**Installation**

```bash
pip install transformers
```

**Loading the Model**

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# "path/to/ukr-t5-small" is a placeholder; replace it with the Hub model ID
# or the path to a local checkpoint
tokenizer = AutoTokenizer.from_pretrained("path/to/ukr-t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("path/to/ukr-t5-small")
```
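
As an alternative to the manual tokenize/generate/decode steps shown below, the Transformers `pipeline` helper wraps them in a single call. A minimal sketch, reusing the placeholder path above; whether the `summarize:` task prefix is required depends on how the model was fine-tuned, so it is included here to match the example that follows:

```python
from transformers import pipeline

# "path/to/ukr-t5-small" is a placeholder for the actual model ID or local path;
# the summarization pipeline handles tokenization, generation, and decoding
summarizer = pipeline("summarization", model="path/to/ukr-t5-small")

result = summarizer("summarize: (Text in Ukrainian here)", max_length=128, num_beams=4)
print(result[0]["summary_text"])
```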

**Example: Summarization**

```python
text = "(Text in Ukrainian here)"

# Prepend the task prefix, tokenize, and generate a summary with beam search
inputs = tokenizer("summarize: " + text, return_tensors="pt", max_length=512, truncation=True)
summary_ids = model.generate(inputs["input_ids"], num_beams=4, max_length=128)

# Decode the generated token IDs back into text
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(summary)
```
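
The Tasks list above also mentions open-ended text generation. Below is a hypothetical sketch reusing the model and tokenizer loaded earlier; the card does not document a prompt format for generation, so the plain prompt and the sampling parameters are assumptions:

```python
prompt = "(Ukrainian prompt here)"

# Tokenize the prompt; the absence of a task prefix is an assumption,
# since the card does not specify one for generation
inputs = tokenizer(prompt, return_tensors="pt", max_length=512, truncation=True)

# Sample rather than beam-search for more varied generated text
output_ids = model.generate(
    inputs["input_ids"],
    max_length=128,
    do_sample=True,
    top_p=0.95,
    temperature=0.8,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```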

## Limitations

* The model's focus is on Ukrainian text processing, so performance on purely English tasks may fall below that of general-purpose T5-small models.
* Further fine-tuning may be required for optimal results on specific NLP tasks.

## Dataset Credits

This model was fine-tuned on the Leipzig Corpora Collection (2023 English and Ukrainian news subsets). For full licensing and usage information for the original dataset, please refer to the [Leipzig Corpora Collection website](https://wortschatz.uni-leipzig.de/en/download).

## Ethical Considerations

* NLP models can reflect biases present in their training data. Be mindful of this when using this model for applications that have real-world impact.
* It's important to test this model thoroughly across a variety of Ukrainian language samples to evaluate its reliability and fairness.