WikiText-WordLevel

This is a simple word-level tokenizer created using the Tokenizers library. It was trained for educational purposes on the combined train, validation, and test splits of the WikiText-103 corpus.

  • Tokenizer Type: Word-Level
  • Vocabulary Size: 75K
  • Special Tokens: <s> (start of sequence), </s> (end of sequence), <unk> (unknown token)
  • Normalization: NFC (Normalization Form Canonical Composition), Strip, Lowercase
  • Pre-tokenization: Whitespace
  • Code: wikitext-wordlevel.py

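The configuration above can be reproduced with a short training script. The sketch below is a simplified illustration of that setup, not the actual wikitext-wordlevel.py; in particular, the input file paths are placeholders, since the real script's data loading is not reproduced here.

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.trainers import WordLevelTrainer
from tokenizers.normalizers import Sequence, NFC, Strip, Lowercase
from tokenizers.pre_tokenizers import Whitespace

# Word-level model that maps out-of-vocabulary words to <unk>
tokenizer = Tokenizer(WordLevel(unk_token='<unk>'))

# Normalization: NFC, strip leading/trailing whitespace, lowercase
tokenizer.normalizer = Sequence([NFC(), Strip(), Lowercase()])

# Split on whitespace and punctuation before counting words
tokenizer.pre_tokenizer = Whitespace()

trainer = WordLevelTrainer(vocab_size=75_000, special_tokens=['<s>', '</s>', '<unk>'])

# Placeholder paths standing in for the combined WikiText-103 splits
tokenizer.train(['wiki.train.tokens', 'wiki.valid.tokens', 'wiki.test.tokens'], trainer)

tokenizer.save('wikitext-wordlevel.json')

Note that the Whitespace pre-tokenizer also splits off punctuation, which is why the contraction in the example below decomposes into 'i', "'", and 'll'.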
Using the tokenizer is as simple as the following:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained('dustalov/wikitext-wordlevel')

tokenizer.encode("I'll see you soon").ids     # => [68, 14, 2746, 577, 184, 595]
tokenizer.encode("I'll see you soon").tokens  # => ['i', "'", 'll', 'see', 'you', 'soon']
tokenizer.decode([68, 14, 2746, 577, 184, 595])  # => "i ' ll see you soon"