fabiochiu committed on
Commit
a28a1f2
1 Parent(s): c5f2a16

Update README.md

Files changed (1)
  1. README.md +28 -0
README.md CHANGED
@@ -14,6 +14,34 @@ widget:

  This model is [t5-base](https://huggingface.co/t5-base) fine-tuned on the [190k Medium Articles](https://www.kaggle.com/datasets/fabiochiusano/medium-articles) dataset for predicting article tags using the article textual content as input.

+ # How to use the model
+ ```
+ from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
+ import nltk
+ nltk.download('punkt')
+
+ tokenizer = AutoTokenizer.from_pretrained("fabiochiu/t5-base-tag-generation")
+ model = AutoModelForSeq2SeqLM.from_pretrained("fabiochiu/t5-base-tag-generation")
+
+ text = """
+ Python is a high-level, interpreted, general-purpose programming language. Its
+ design philosophy emphasizes code readability with the use of significant
+ indentation. Python is dynamically-typed and garbage-collected.
+ """
+
+ inputs = tokenizer([text], max_length=512, truncation=True, return_tensors="pt")
+ output = model.generate(**inputs, num_beams=8, do_sample=True, min_length=10,
+                         max_length=64)
+ decoded_output = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
+ tags = list(set(decoded_output.strip().split(", ")))
+
+ print(tags)
+ # ['Programming', 'Code', 'Software Development', 'Programming Languages',
+ # 'Software', 'Developer', 'Python', 'Software Engineering', 'Science',
+ # 'Engineering', 'Technology', 'Computer Science', 'Coding', 'Digital', 'Tech',
+ # 'Python Programming']
+ ```
+
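The snippet added in this commit deduplicates tags with `list(set(...))`, which does not preserve the order in which the model produced them, and because `do_sample=True` the generated string itself can differ between runs. If a stable order matters, a minimal sketch of an order-preserving alternative could look like the following. The helper name `parse_tags` and the example string are hypothetical and not part of the README.

```python
def parse_tags(decoded_output: str) -> list[str]:
    """Split a comma-separated tag string and drop duplicates,
    keeping the order in which each tag first appears."""
    seen = set()
    tags = []
    for raw in decoded_output.split(","):
        tag = raw.strip()
        # Skip empty fragments (e.g. from a trailing comma) and repeats.
        if tag and tag not in seen:
            seen.add(tag)
            tags.append(tag)
    return tags


# Hypothetical decoded string, standing in for
# tokenizer.batch_decode(output, skip_special_tokens=True)[0]
example = "Python, Programming, Python, Software Development, Programming"
print(parse_tags(example))
# ['Python', 'Programming', 'Software Development']
```

In the README snippet, this would replace the `tags = list(set(...))` line with `tags = parse_tags(decoded_output)`.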
  ## Data cleaning

  The dataset is composed of Medium articles and their tags. However, each Medium article can have at most five tags, so the author needs to choose what they believe are the best tags (mainly for SEO-related purposes). This means that an article with the "Python" tag may not have the "Programming Languages" tag, even though the former implies the latter.
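To make the implication problem concrete, here is a small sketch of how implied tags could be added back to an article's tag list. This is an illustration only, not the cleaning procedure used for this model: the `IMPLIED_TAGS` mapping and the `expand_tags` helper are hypothetical and hand-written for the example.

```python
from collections import deque

# Hypothetical child -> parent mapping, written by hand for illustration.
# It is not taken from the dataset or from the model card.
IMPLIED_TAGS = {
    "Python": ["Programming Languages"],
    "Programming Languages": ["Programming"],
}


def expand_tags(tags):
    """Return the original tags followed by every tag they imply,
    directly or transitively, without duplicates."""
    expanded = []
    seen = set()
    queue = deque(tags)
    while queue:
        tag = queue.popleft()
        if tag in seen:
            continue
        seen.add(tag)
        expanded.append(tag)
        # Enqueue the tags implied by this one, if any.
        queue.extend(IMPLIED_TAGS.get(tag, []))
    return expanded


print(expand_tags(["Python", "Data Science"]))
# ['Python', 'Data Science', 'Programming Languages', 'Programming']
```

Note that an expanded list can exceed Medium's five-tag limit, which is exactly why the implied tags are missing from the raw data.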