Clarifying bos_token

#27
by taohoang - opened

Hi,

In the definition of DataCollatorSpeechSeq2SeqWithPadding in https://huggingface.co./blog/fine-tune-whisper, I am trying to understand the following part:


# if bos token is appended in previous tokenization step,
# cut bos token here as it's appended later anyway
if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
    labels = labels[:, 1:]

Where will the bos token be appended later during training?
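To check my reading of that condition, here is a minimal reproduction with a made-up label batch (the ids 11-23 are placeholders, not real tokenized output):

import torch
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-small", language="Hindi", task="transcribe")

# made-up padded label batch; in the real collator this comes from processor.tokenizer.pad(...)
labels = torch.tensor([
    [processor.tokenizer.bos_token_id, 11, 12, 13],
    [processor.tokenizer.bos_token_id, 21, 22, 23],
])

# the first column is only cut when every sequence in the batch starts with bos_token_id
if (labels[:, 0] == processor.tokenizer.bos_token_id).all().cpu().item():
    labels = labels[:, 1:]

print(labels)  # bos column removed only because the condition held for the whole batch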

After loading the tokenizer, it seems bos_token is <|endoftext|> instead of <|startoftranscript|>:


tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small", language="Hindi", task="transcribe")

Will this affect the bos_token check above?
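For reference, this is how I am inspecting the tokens (the <|startoftranscript|> lookup is only there for comparison):

from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small", language="Hindi", task="transcribe")

print(tokenizer.bos_token)                                       # prints <|endoftext|> for me
print(tokenizer.bos_token_id)
print(tokenizer.convert_tokens_to_ids("<|startoftranscript|>"))  # a different id, as far as I can tell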
