---
license: apache-2.0
language:
- uk
- en
---

# ukr-t5-small

A compact mT5-small model fine-tuned for Ukrainian-language tasks, retaining its base English understanding.

## Model Description

* **Base Model:** mT5-small
* **Fine-tuning Data:** Leipzig Corpora Collection (English & Ukrainian news from 2023)
* **Tasks:**
  * Text summarization (Ukrainian)
  * Text generation (Ukrainian)
  * Other Ukrainian-centric NLP tasks

## Technical Details

* **Model Size:** 300 MB
* **Framework:** Transformers (Hugging Face)

## Usage

**Installation**

```bash
# the mT5 tokenizer typically requires sentencepiece; a backend such as PyTorch is also needed
pip install transformers sentencepiece torch
```

**Loading the Model**

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Replace "path/to/ukr-t5-small" with the local path or Hub ID of this model
tokenizer = AutoTokenizer.from_pretrained("path/to/ukr-t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("path/to/ukr-t5-small")
```
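
**Optional: Running on GPU**

A minimal sketch, not part of the original card: moving the model to a GPU when one is available. Any tokenized inputs must be moved to the same device before generation.

```python
import torch

# Optional: run on a GPU when one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Tokenized inputs must live on the same device before calling generate, e.g.:
# inputs = {k: v.to(device) for k, v in inputs.items()}
```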

**Example: Summarization**

```python
text = "(Text in Ukrainian here)"

# Prepend the task prefix, then tokenize
inputs = tokenizer("summarize: " + text, return_tensors="pt", max_length=512, truncation=True)
summary_ids = model.generate(inputs["input_ids"], num_beams=4, max_length=128)

# Decode the generated summary
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(summary)
```
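
**Example: Text Generation (sketch)**

Since Ukrainian text generation is listed among the model's tasks, here is a minimal sketch reusing the tokenizer and model loaded above. The `generate:` task prefix and the sampling settings are assumptions rather than documented behavior; substitute the prefix actually used during fine-tuning.

```python
prompt = "(Ukrainian prompt here)"

# Prepend the (assumed) task prefix and tokenize
inputs = tokenizer("generate: " + prompt, return_tensors="pt", max_length=512, truncation=True)

# Sampling tends to give more varied text than beam search for open-ended generation
output_ids = model.generate(
    inputs["input_ids"],
    do_sample=True,
    top_p=0.95,
    max_length=128,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```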

## Limitations

* The model is focused on Ukrainian text processing, so performance on purely English tasks may fall below that of general-purpose T5-small models.
* Further fine-tuning may be required for optimal results on specific downstream NLP tasks.

## Dataset Credits

This model was fine-tuned on the Leipzig Corpora Collection. For full licensing and usage information for the original dataset, please refer to the [Leipzig Corpora Collection website](https://wortschatz.uni-leipzig.de/en/download).

## Ethical Considerations

* NLP models can reflect biases present in their training data. Be mindful of this when using this model in applications with real-world impact.
* Test this model thoroughly across a variety of Ukrainian-language samples to evaluate its reliability and fairness.