Update README.md
This model is [t5-base](https://huggingface.co/t5-base) fine-tuned on the [190k Medium Articles](https://www.kaggle.com/datasets/fabiochiusano/medium-articles) dataset for predicting article tags using the article textual content as input.
# How to use the model

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import nltk
nltk.download('punkt')

tokenizer = AutoTokenizer.from_pretrained("fabiochiu/t5-base-tag-generation")
model = AutoModelForSeq2SeqLM.from_pretrained("fabiochiu/t5-base-tag-generation")

text = """
Python is a high-level, interpreted, general-purpose programming language. Its
design philosophy emphasizes code readability with the use of significant
indentation. Python is dynamically-typed and garbage-collected.
"""

inputs = tokenizer([text], max_length=512, truncation=True, return_tensors="pt")
output = model.generate(**inputs, num_beams=8, do_sample=True, min_length=10,
                        max_length=64)
decoded_output = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
tags = list(set(decoded_output.strip().split(", ")))

print(tags)
# ['Programming', 'Code', 'Software Development', 'Programming Languages',
#  'Software', 'Developer', 'Python', 'Software Engineering', 'Science',
#  'Engineering', 'Technology', 'Computer Science', 'Coding', 'Digital', 'Tech',
#  'Python Programming']
```

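The `list(set(...))` step above deduplicates the generated tags but discards their generation order. An order-preserving variant (a small sketch, not part of the model card itself; the `parse_tags` name is our own) could look like:

```python
def parse_tags(decoded_output):
    """Split a comma-separated tag string into unique tags, keeping order."""
    seen = set()
    tags = []
    for tag in decoded_output.strip().split(", "):
        if tag and tag not in seen:
            seen.add(tag)
            tags.append(tag)
    return tags

print(parse_tags("Python, Coding, Python, Tech"))
# ['Python', 'Coding', 'Tech']
```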
## Data cleaning
The dataset is composed of Medium articles and their tags. However, each Medium article can have at most five tags, so the author has to choose the tags he or she believes fit best (mainly for SEO purposes). As a result, an article with the "Python" tag may not have the "Programming Languages" tag, even though the former implies the latter.
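One way to mitigate this inconsistency in post-processing is to expand each tag set with the tags it implies. The implication map and `expand_tags` helper below are hypothetical illustrations of the idea, not the cleaning actually applied to this dataset:

```python
# Hypothetical tag-implication map: "Python" implies "Programming Languages",
# which in turn implies "Programming". Illustrative only.
IMPLIES = {
    "Python": ["Programming Languages"],
    "Programming Languages": ["Programming"],
}

def expand_tags(tags):
    """Add every tag transitively implied by a tag already present."""
    expanded = set(tags)
    frontier = list(tags)
    while frontier:
        for implied in IMPLIES.get(frontier.pop(), []):
            if implied not in expanded:
                expanded.add(implied)
                frontier.append(implied)
    return sorted(expanded)

print(expand_tags(["Python"]))
# ['Programming', 'Programming Languages', 'Python']
```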