WikiText-WordLevel

This is a simple word-level tokenizer created using the Tokenizers library. It was trained for educational purposes on the combined train, validation, and test splits of the WikiText-103 corpus.

  • Tokenizer Type: Word-Level
  • Vocabulary Size: 75K
  • Special Tokens: <s> (start of sequence), </s> (end of sequence), <unk> (unknown token)
  • Normalization: NFC (Normalization Form Canonical Composition), Strip, Lowercase
  • Pre-tokenization: Whitespace
  • Code: wikitext-wordlevel.py
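
The configuration above can be reproduced with the training API of the Tokenizers library. The following is a minimal sketch only, not the actual wikitext-wordlevel.py script; the corpus file names and output path are placeholders.

from tokenizers import Tokenizer, normalizers
from tokenizers.models import WordLevel
from tokenizers.normalizers import NFC, Strip, Lowercase
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordLevelTrainer

# Word-level model with the unknown token declared up front
tokenizer = Tokenizer(WordLevel(unk_token='<unk>'))

# Normalization and pre-tokenization as listed above
tokenizer.normalizer = normalizers.Sequence([NFC(), Strip(), Lowercase()])
tokenizer.pre_tokenizer = Whitespace()

# 75K vocabulary with the special tokens; file names are placeholders
trainer = WordLevelTrainer(vocab_size=75_000, special_tokens=['<s>', '</s>', '<unk>'])
tokenizer.train(files=['wiki.train.tokens', 'wiki.valid.tokens', 'wiki.test.tokens'], trainer=trainer)

tokenizer.save('wikitext-wordlevel.json')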

Using the tokenizer is as simple as follows.

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained('dustalov/wikitext-wordlevel')

tokenizer.encode("I'll see you soon").ids     # => [68, 14, 2746, 577, 184, 595]
tokenizer.encode("I'll see you soon").tokens  # => ['i', "'", 'll', 'see', 'you', 'soon']
tokenizer.decode([68, 14, 2746, 577, 184, 595])  # => "i ' ll see you soon"
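
To use the tokenizer together with the Transformers library, the raw Tokenizer object can be wrapped in a PreTrainedTokenizerFast. This is a minimal sketch assuming the special tokens listed above; it is not part of the original code.

from transformers import PreTrainedTokenizerFast

# Expose the standard Transformers tokenizer API on top of the tokenizers object
wrapped = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    bos_token='<s>',
    eos_token='</s>',
    unk_token='<unk>',
)

wrapped("I'll see you soon")['input_ids']  # same word-level IDs as above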