---
language: pl
tags:
- T5
- translation
- summarization
- question answering
- reading comprehension
datasets:
- ccnet
- nkjp
- wikipedia
- open subtitles
- free readings
license: cc-by-4.0
---
# plT5 Base

plT5 models are T5-based language models trained on Polish corpora. The models were optimized for the original T5 denoising objective.
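As an illustration, the denoising objective replaces random spans of the input with sentinel tokens and trains the model to reconstruct the masked spans in order. A minimal sketch of the span-corruption format (the Polish sentence is an invented example, not taken from the training corpus):

```python
# T5 span corruption: random spans are replaced with sentinels
# <extra_id_0>, <extra_id_1>, ... in the input; the target lists the
# masked spans in order. Original sentence (invented example):
# "Wczoraj poszedłem do sklepu i kupiłem chleb."
corrupted_input = "Wczoraj <extra_id_0> do sklepu i kupiłem <extra_id_1>."
denoising_target = "<extra_id_0> poszedłem <extra_id_1> chleb <extra_id_2>"
```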
## Corpus

plT5 was trained on six different corpora available for the Polish language:
| Corpus | Tokens | Documents |
|---|---|---|
| CCNet Middle | 3243M | 7.9M |
| CCNet Head | 2641M | 7.0M |
| National Corpus of Polish | 1357M | 3.9M |
| Open Subtitles | 1056M | 1.1M |
| Wikipedia | 260M | 1.4M |
| Wolne Lektury | 41M | 5.5k |
## Tokenizer

The training dataset was tokenized into subwords using a SentencePiece unigram model with a vocabulary size of 50k tokens.
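A short sketch of inspecting the resulting subword segmentation, assuming the tokenizer ships with the model under the Hugging Face id `allegro/plt5-base` used in the usage example below (the sentence is an arbitrary example):

```python
from transformers import AutoTokenizer

# Load the 50k-token SentencePiece unigram tokenizer.
tokenizer = AutoTokenizer.from_pretrained("allegro/plt5-base")

# Inspect how a Polish sentence is split into subword pieces.
tokens = tokenizer.tokenize("Wszyscy ludzie rodzą się wolni i równi.")
print(tokens)
print(tokenizer.vocab_size)  # vocabulary size, ~50k
```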
## Usage

Example code:

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("allegro/plt5-base")
model = AutoModel.from_pretrained("allegro/plt5-base")
```
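Because the released checkpoint was trained only on the denoising objective, the raw model fills in sentinel spans rather than following task prompts. A minimal span in-filling sketch using the standard `T5ForConditionalGeneration` class (the input sentence is an arbitrary example; for downstream tasks the model is expected to be fine-tuned first):

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("allegro/plt5-base")
model = T5ForConditionalGeneration.from_pretrained("allegro/plt5-base")

# Ask the model to reconstruct the masked span (denoising-style inference).
text = "Stolicą Polski jest <extra_id_0>."
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```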
## License

CC BY 4.0
## Citation
If you use this model, please cite the following paper:
## Authors
The model was trained by the Machine Learning Research Team at Allegro and the Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences.
You can contact us at: [email protected]