Decoders

BPEDecoder

class tokenizers.decoders.BPEDecoder

( suffix = '</w>' )

Parameters

  • suffix (str, optional, defaults to </w>) — The suffix that was used to characterize an end-of-word. This suffix will be replaced by whitespace during decoding.

BPEDecoder Decoder
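
A minimal usage sketch (not part of the original reference), assuming the decode() helper that the Python bindings expose on decoder objects and made-up example tokens:

from tokenizers.decoders import BPEDecoder

decoder = BPEDecoder(suffix="</w>")
# The end-of-word suffix is turned back into whitespace between words,
# so these illustrative tokens should decode to roughly "hello world".
print(decoder.decode(["hel", "lo</w>", "wor", "ld</w>"]))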

ByteLevel

class tokenizers.decoders.ByteLevel

( )

ByteLevel Decoder

This decoder is to be used in tandem with the ByteLevel PreTokenizer.
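
A rough sketch of that pairing (not from the original page; the BPE model here is untrained and purely illustrative):

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import ByteLevel as ByteLevelPreTokenizer
from tokenizers.decoders import ByteLevel as ByteLevelDecoder

# Pair the byte-level pre-tokenizer with the matching decoder so that
# decoding maps the byte-level alphabet back to the original text.
tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = ByteLevelPreTokenizer()
tokenizer.decoder = ByteLevelDecoder()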

CTC

class tokenizers.decoders.CTC

( pad_token = '<pad>' word_delimiter_token = '|' cleanup = True )

Parameters

  • pad_token (str, optional, defaults to <pad>) — The pad token used by CTC to delimit a new token.
  • word_delimiter_token (str, optional, defaults to |) — The word delimiter token. It will be replaced by a space.
  • cleanup (bool, optional, defaults to True) — Whether to clean up some tokenization artifacts, mainly spaces before punctuation and some abbreviated English forms.

CTC Decoder
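
A minimal sketch (not part of the original reference), assuming the decode() helper on decoder objects and a made-up sequence of frame-level CTC outputs:

from tokenizers.decoders import CTC

decoder = CTC(pad_token="<pad>", word_delimiter_token="|", cleanup=True)
# Consecutive duplicates are collapsed, pad tokens are dropped, and the
# word delimiter becomes a space, so this should yield roughly "hello world".
frames = ["<pad>", "h", "h", "e", "l", "l", "<pad>", "l", "o", "|", "w", "o", "r", "l", "d"]
print(decoder.decode(frames))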

Metaspace

class tokenizers.decoders.Metaspace

( replacement = '▁' add_prefix_space = True )

Parameters

  • replacement (str, optional, defaults to ▁) — The replacement character. Must be exactly one character. By default we use the ▁ (U+2581) meta symbol (same as in SentencePiece).
  • add_prefix_space (bool, optional, defaults to True) — Whether to add a space to the first word if there isn't already one. This lets us treat "hello" exactly like "say hello".

Metaspace Decoder
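
A minimal sketch (not part of the original reference), assuming the decode() helper on decoder objects:

from tokenizers.decoders import Metaspace

decoder = Metaspace(replacement="▁", add_prefix_space=True)
# "▁" marks word boundaries (as in SentencePiece) and is turned back into
# spaces; the leading space is stripped, giving roughly "Hello my friend".
print(decoder.decode(["▁Hello", "▁my", "▁friend"]))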

WordPiece

class tokenizers.decoders.WordPiece

( prefix = '##' cleanup = True )

Parameters

  • prefix (str, optional, defaults to ##) — The prefix to use for subwords that are not a beginning-of-word.
  • cleanup (bool, optional, defaults to True) — Whether to clean up some tokenization artifacts, mainly spaces before punctuation and some abbreviated English forms.

WordPiece Decoder
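
A minimal sketch (not part of the original reference), assuming the decode() helper on decoder objects and made-up WordPiece tokens:

from tokenizers.decoders import WordPiece

decoder = WordPiece(prefix="##", cleanup=True)
# "##" continuation subwords are glued onto the previous token, and cleanup
# removes the space before "?", giving roughly "Hello how are you?".
print(decoder.decode(["Hell", "##o", "how", "are", "you", "?"]))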