Proper capitalization and punctuation in the input results in erroneous output.

#4
by drmeir - opened

I find that the model works well when the input text is entirely lower-cased and unpunctuated. However, when some words are properly capitalized and some punctuation is present, the output is not what one would expect. Example:

Input:
Washington is in the U.S. Moscow is in Russia.

Output:
<unk>ashington is in the <unk>.<Unk>. <unk>oscow is in <unk>ussia,..

drmeir changed discussion title from Proper capitalization is cancelled in the output. to Proper capitalization and punctuation in the input results in erroneous output.

Upper-cased letters will be OOV (out of vocabulary) because the model and tokenizer were trained only on lower-cased text. OOVs are passed through to the output, as seen here.
Even if a fix were applied to map the unknown tokens back to their original characters, the model would still perform very poorly, given that it has never seen such sentences.
The same applies to punctuation in the input: the model was trained on unpunctuated text and will be confused when the input contains punctuation. Even though punctuation is covered by the tokenizer and is technically not OOV, the language model has not been trained to handle it.
The model was effectively trained to handle ASR output, so I'd strongly recommend against using it on partially upper-cased or punctuated text.
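The recommendation above amounts to normalizing the input before inference. A minimal sketch of such a preprocessing step (the exact normalization used at training time is an assumption here, not documented behavior):

```python
import re

def normalize_for_model(text: str) -> str:
    """Lowercase and strip punctuation so the input resembles the
    lower-cased, unpunctuated text the model was trained on.
    (The precise training-time normalization is an assumption.)"""
    text = text.lower()
    # Replace punctuation with spaces; keep letters, digits, whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse the whitespace runs left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()

print(normalize_for_model("Washington is in the U.S. Moscow is in Russia."))
# -> washington is in the u s moscow is in russia
```

Note that this mangles abbreviations like "U.S." into "u s", which is exactly the kind of token the follow-up question below is about.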

But what do we do with things like "24/7", "R&D", "9-11", etc., in the input text? There are potentially many such tokens, and it is hard to catch them all in preprocessing. Is it possible to output OOVs verbatim, as they appear in the input?
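One way to sketch the verbatim-passthrough idea from this question is to mask such tokens with lower-case placeholders before running the model and restore them afterwards. Everything here is illustrative: the regex, the placeholder scheme, and the assumption that the model leaves the placeholders intact.

```python
import re

# Tokens with internal symbols ("24/7", "R&D", "9-11") that should
# survive the model untouched. This pattern is purely illustrative.
SPECIAL = re.compile(r"\b\w+[&/\-]\w+\b")

def protect(text: str):
    """Replace special tokens with lower-case placeholders the model
    can treat as ordinary words; return the text and a restore table."""
    table = {}
    def repl(m):
        key = f"tok{len(table)}"
        table[key] = m.group(0)
        return key
    return SPECIAL.sub(repl, text), table

def restore(text: str, table: dict) -> str:
    """Put the original tokens back after the model has run."""
    for key, original in table.items():
        text = text.replace(key, original)
    return text

masked, table = protect("support is available 24/7 in R&D")
# masked == "support is available tok0 in tok1"
# ... run the model on `masked` here ...
restored = restore(masked, table)
# restored == "support is available 24/7 in R&D"
```

Whether this works in practice depends on the tokenizer not splitting the placeholders and the model copying them through unchanged, which would need testing against the actual checkpoint.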
