|
--- |
|
license: cc-by-4.0 |
|
language: |
|
- pl |
|
library_name: transformers |
|
tags: |
|
- tokenizer |
|
- fast-tokenizer |
|
- polish |
|
datasets: |
|
- radlab/legal-mc4-pl |
|
- radlab/wikipedia-pl |
|
- radlab/kgr10 |
|
- clarin-knext/msmarco-pl |
|
- clarin-knext/fiqa-pl |
|
- clarin-knext/scifact-pl |
|
- clarin-knext/nfcorpus-pl |
|
--- |
|
|
|
This is a Polish fast tokenizer.
|
|
|
Number of documents used to train the tokenizer:
|
- 25 088 398 |
|
|
|
|
|
Sample usage with `transformers`:
|
|
|
```python
from transformers import AutoTokenizer

# Load the fast tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("radlab/polish-fast-tokenizer")

# Encode a sample sentence ("Ala has a cat and a dog") and decode it back to text
tokenizer.decode(tokenizer("Ala ma kota i psa").input_ids)
```
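
To see how a sentence is actually split, you can inspect the token strings, their ids, and the character offsets. This is a minimal sketch using standard `transformers` methods; the exact tokens shown depend on this tokenizer's vocabulary.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("radlab/polish-fast-tokenizer")

# Inspect the tokens and ids produced for a Polish sentence
encoding = tokenizer("Ala ma kota i psa")
print(tokenizer.convert_ids_to_tokens(encoding.input_ids))
print(encoding.input_ids)

# Fast (Rust-backed) tokenizers can also return character offsets
# that map each token back into the original text
encoding = tokenizer("Ala ma kota i psa", return_offsets_mapping=True)
print(encoding.offset_mapping)
```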