Fine-tuned version of hmByT5 on the DE1, DE2, DE3 and DE7 parts of the ICDAR2019-POCR dataset to correct OCR mistakes. During fine-tuning, max_length was set to 350.
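A minimal sketch of how training pairs could be tokenized with that max_length; the column names ("ocr", "gt") and the use of this checkpoint's tokenizer are assumptions for illustration, not details taken from this card.

```python
from transformers import AutoTokenizer

# Placeholder tokenizer; for actual fine-tuning the base hmByT5 checkpoint would be used.
tokenizer = AutoTokenizer.from_pretrained("Var3n/hmByT5_anno")

def preprocess(batch):
    # "ocr" (noisy input) and "gt" (ground truth) are hypothetical column names.
    model_inputs = tokenizer(batch["ocr"], max_length=350, truncation=True)
    labels = tokenizer(text_target=batch["gt"], max_length=350, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Example call with a toy batch (placeholder strings):
features = preprocess({"ocr": ["noisy OCR line"], "gt": ["corrected line"]})
```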
Performance
SacreBLEU of the uncorrected OCR text in the eval dataset: 10.83
SacreBLEU of the model output on the eval dataset: 72.35
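A sketch of how such scores can be computed with the evaluate library, assuming "eval dataset" refers to the raw OCR text and "eval model" to the model's corrections; the strings below are placeholders, not real evaluation data.

```python
import evaluate

sacrebleu = evaluate.load("sacrebleu")

# Placeholders: replace with the eval split and the model's predictions.
ocr_lines = ["noisy OCR line"]            # raw OCR text ("eval dataset")
corrected_lines = ["model output line"]   # model predictions ("eval model")
references = [["ground truth line"]]      # one list of references per sample

print(sacrebleu.compute(predictions=ocr_lines, references=references)["score"])
print(sacrebleu.compute(predictions=corrected_lines, references=references)["score"])
```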
Usage
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Var3n/hmByT5_anno")
model = AutoModelForSeq2SeqLM.from_pretrained("Var3n/hmByT5_anno")

# A line of noisy OCR text from a historic German source
example_sentence = "Anvpreiſungq. Haupidepot für Wien: In der Stadt, obere Bräunerſtraße Nr. 1137 in der Varfüͤmerie-Handlung zur"

# ByT5 is byte-level, so the input length is a reasonable upper bound for the output length
input_ids = tokenizer(example_sentence, return_tensors="pt").input_ids
output = model.generate(input_ids, max_new_tokens=len(input_ids[0]), num_beams=4, do_sample=True)
text = tokenizer.decode(output[0], skip_special_tokens=True)
```
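For inputs longer than the training max_length, one possible approach (an assumption, not part of this card) is to split the text into chunks of at most 350 characters and correct each chunk separately; this continues from the snippet above.

```python
def correct(text: str, chunk_size: int = 350) -> str:
    # Naive fixed-size chunking; splitting mid-word may hurt correction quality.
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    corrected = []
    for chunk in chunks:
        ids = tokenizer(chunk, return_tensors="pt").input_ids
        out = model.generate(ids, max_new_tokens=len(ids[0]), num_beams=4)
        corrected.append(tokenizer.decode(out[0], skip_special_tokens=True))
    return "".join(corrected)

print(correct(example_sentence))
```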