---
language:
- tr
tags:
- roberta
license: cc-by-nc-sa-4.0
---

# RoBERTweetTurkCovid (uncased)

Pretrained model on the Turkish language using a masked language modeling (MLM) objective. The model is uncased. The pretraining corpus is a collection of Turkish tweets related to COVID-19. The details of the data can be found in this paper:
https://arxiv.org/...

The model architecture is similar to RoBERTa-base (12 layers, 12 attention heads, and a hidden size of 768). The tokenization algorithm is WordPiece, with a vocabulary size of 30k.

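For illustration only, these numbers roughly correspond to the following `transformers` configuration (a hypothetical sketch, not the authors' actual pretraining configuration):
```python
from transformers import RobertaConfig

# Hypothetical sketch of the architecture described above;
# not the authors' actual pretraining configuration.
config = RobertaConfig(
    vocab_size=30000,        # 30k WordPiece vocabulary
    hidden_size=768,         # hidden size 768
    num_hidden_layers=12,    # 12 layers
    num_attention_heads=12,  # 12 attention heads
)
```
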
The details of pretraining can be found in this paper:
https://arxiv.org/...

The following code can be used for model loading and tokenization; the example max length (768) can be changed:
```python
from transformers import AutoModel, PreTrainedTokenizerFast

# Load the pretrained model
model = AutoModel.from_pretrained([model_path])
# For sequence classification:
# from transformers import AutoModelForSequenceClassification
# model = AutoModelForSequenceClassification.from_pretrained([model_path], num_labels=[num_classes])

# Load the tokenizer from its tokenizer file and register the special tokens
tokenizer = PreTrainedTokenizerFast(tokenizer_file=[file_path])
tokenizer.mask_token = "[MASK]"
tokenizer.cls_token = "[CLS]"
tokenizer.sep_token = "[SEP]"
tokenizer.pad_token = "[PAD]"
tokenizer.unk_token = "[UNK]"
tokenizer.bos_token = "[CLS]"
tokenizer.eos_token = "[SEP]"
tokenizer.model_max_length = 768  # example max length; can be changed
```
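
As a rough, hypothetical usage sketch (assuming the tokenizer set up above, and loading the weights through `AutoModelForMaskedLM` instead of `AutoModel` so that the language-modeling head is available), a masked token could be predicted like this; the input sentence is only an illustrative placeholder:
```python
import torch
from transformers import AutoModelForMaskedLM

# Illustrative only: [model_path] is a placeholder, as above.
mlm_model = AutoModelForMaskedLM.from_pretrained([model_path])

# Placeholder input sentence containing a [MASK] token
inputs = tokenizer("kovid asisi [MASK] oldu", return_tensors="pt")
with torch.no_grad():
    logits = mlm_model(**inputs).logits

# Take the highest-scoring token at the [MASK] position
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_positions].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```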

### BibTeX entry and citation info
```bibtex
@article{}
```