Model generates wrong output with example text

#5 opened by marcvw-sightengine

Hello there,

The model generates the wrong output for the example text. Maybe I missed something...

from transformers import pipeline
pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-big-fr-en")
print(pipe("J'ai adoré l'Angleterre."))

# expected output: I loved England.

Instead, it prints:

[{'translation_text': "''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''"}]

Other input texts also produce gibberish, and the same behaviour occurs with the other example code (using MarianMTModel and MarianTokenizer directly).

The same code with a smaller model ("Helsinki-NLP/opus-mt-fr-en") works fine.
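
For reference, the working comparison is just the same pipeline call with the smaller model (a minimal sketch; the commented output is what it is expected to print, per the expected translation above):

from transformers import pipeline

# Same call as above, but with the smaller fr-en model that behaves correctly.
pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
print(pipe("J'ai adoré l'Angleterre."))
# [{'translation_text': 'I loved England.'}]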

Hi, I was able to get a proper translation in the following way:

from transformers import MarianMTModel, MarianTokenizer

model_path = "/path/to/model/opus-mt-tc-big-fr-en"
tokenizer = MarianTokenizer.from_pretrained(model_path)
model = MarianMTModel.from_pretrained(model_path).to("cuda:0")

src_text = ["J'ai adoré l'Angleterre."]

# Greedy decoding (num_beams=1); the tokenized inputs are moved to the same GPU as the model.
translated = model.generate(
    **tokenizer(src_text, return_tensors="pt", padding=True).to("cuda:0"),
    max_length=16384,
    num_beams=1,
)

for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))

Then the response was:

I loved England.
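
In case you want to stay with the pipeline API: the translation pipeline forwards extra keyword arguments to model.generate, so the same generation settings can be passed there as well (a sketch, untested on this model; num_beams=1 and max_length simply mirror the snippet above):

from transformers import pipeline

pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-big-fr-en")
# Extra kwargs such as num_beams and max_length are forwarded to model.generate.
print(pipe("J'ai adoré l'Angleterre.", num_beams=1, max_length=16384))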
