rmroczkowski committed on
Commit
2323f05
1 Parent(s): 0c90cb1

Create README.md

  1. README.md +58 -0
README.md ADDED
@@ -0,0 +1,58 @@
+ ---
+ language: pl
+ tags:
+ - T5
+ - translation
+ - summarization
+ - question answering
+ - reading comprehension
+ datasets:
+ - ccnet
+ - nkjp
+ - wikipedia
+ - open subtitles
+ - free readings
+ license: cc-by-4.0
+ ---
+
+ # plT5 Small
+ **plT5** models are T5-based language models trained on Polish corpora. The models were optimized for the original T5 denoising target.
+
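To illustrate the denoising target mentioned above: in T5-style span corruption, contiguous token spans in the input are replaced by sentinel tokens, and the target lists each sentinel followed by the tokens it hides. The sketch below is a minimal illustration, not the plT5 training code. It assumes whitespace-separated words in place of subword tokens, hand-picked span positions instead of random sampling, and the `<extra_id_N>` sentinel naming used by the Hugging Face T5 tokenizer; `span_corrupt` is a hypothetical helper.

```python
from typing import List, Sequence, Tuple


def span_corrupt(
    tokens: Sequence[str], spans: Sequence[Tuple[int, int]]
) -> Tuple[str, str]:
    """Mask each half-open (start, end) token span with a sentinel.

    Returns the corrupted input and the matching denoising target.
    Spans are assumed to be non-overlapping.
    """
    inputs: List[str] = []
    targets: List[str] = []
    cursor = 0
    for i, (start, end) in enumerate(sorted(spans)):
        sentinel = f"<extra_id_{i}>"  # assumed sentinel naming
        inputs.extend(tokens[cursor:start])  # keep the text before the span
        inputs.append(sentinel)              # hide the span in the input
        targets.append(sentinel)             # announce the span in the target
        targets.extend(tokens[start:end])    # then spell out the hidden tokens
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return " ".join(inputs), " ".join(targets)


tokens = "Ala ma kota , a kot ma mleko .".split()
corrupted, target = span_corrupt(tokens, [(1, 3), (6, 7)])
print(corrupted)  # Ala <extra_id_0> , a kot <extra_id_1> mleko .
print(target)     # <extra_id_0> ma kota <extra_id_1> ma <extra_id_2>
```

During pre-training the model receives the corrupted sequence and learns to produce the target sequence.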
+ ## Corpus
+ plT5 was trained on six corpora available for the Polish language:
+
+ | Corpus | Tokens | Documents |
+ | :------ | ------: | ------: |
+ | [CCNet Middle](https://github.com/facebookresearch/cc_net) | 3243M | 7.9M |
+ | [CCNet Head](https://github.com/facebookresearch/cc_net) | 2641M | 7.0M |
+ | [National Corpus of Polish](http://nkjp.pl/index.php?page=14&lang=1)| 1357M | 3.9M |
+ | [Open Subtitles](http://opus.nlpl.eu/OpenSubtitles-v2018.php) | 1056M | 1.1M |
+ | [Wikipedia](https://dumps.wikimedia.org/) | 260M | 1.4M |
+ | [Wolne Lektury](https://wolnelektury.pl/) | 41M | 5.5k |
+
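For a sense of scale, the figures in the corpus table can be added up. The snippet below is a small sketch that copies the table values by hand; the variable names are illustrative, and because the table entries are rounded, the totals are approximate.

```python
# Values copied from the corpus table: (tokens in millions, documents).
corpora = {
    "CCNet Middle": (3243, 7_900_000),
    "CCNet Head": (2641, 7_000_000),
    "National Corpus of Polish": (1357, 3_900_000),
    "Open Subtitles": (1056, 1_100_000),
    "Wikipedia": (260, 1_400_000),
    "Wolne Lektury": (41, 5_500),
}

total_tokens_m = sum(tokens_m for tokens_m, _ in corpora.values())
total_docs = sum(docs for _, docs in corpora.values())

# Share of each corpus in the overall token count.
for name, (tokens_m, _) in corpora.items():
    print(f"{name:<26} {tokens_m / total_tokens_m:6.1%}")

print(f"Total tokens: {total_tokens_m}M")  # 8598M
print(f"Total documents: {total_docs:,}")
```

By token count, the two CCNet subsets together make up roughly two thirds of the training data.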
+ ## Tokenizer
+ The training dataset was tokenized into subwords using a SentencePiece unigram model with
+ a vocabulary size of 50k tokens.
+
+ ## Usage
+ Example code:
+ ```python
+ from transformers import AutoTokenizer, AutoModel
+
+ tokenizer = AutoTokenizer.from_pretrained("allegro/plT5-small")  # SentencePiece-based tokenizer
+ model = AutoModel.from_pretrained("allegro/plT5-small")  # base encoder-decoder, no LM head
+ ```
+
+ ## License
+ CC BY 4.0
+
+ ## Citation
+ If you use this model, please cite the following paper:
+ ```
+
+ ```
+
+ ## Authors
+ The model was trained by [**Machine Learning Research Team at Allegro**](https://ml.allegro.tech/) and [**Linguistic Engineering Group at Institute of Computer Science, Polish Academy of Sciences**](http://zil.ipipan.waw.pl/).
+
+ You can contact us at: <a href="mailto:[email protected]">[email protected]</a>