SkitCon
/

gec-spanish-BARTO-SYNTHETIC

Text2Text Generation

Inference Endpoints

Model card Files Files and versions Community

gec-spanish-BARTO-SYNTHETIC / README.md

SkitCon's picture

Update README.md

e448d64 verified 3 months ago

|

history blame contribute delete

1.58 kB

	---
	language:
	- es
	metrics:
	- bleu
	base_model:
	- vgaraujov/bart-base-spanish
	pipeline_tag: text2text-generation
	library_name: transformers
	tags:
	- gec
	- spanish
	- seq2seq
	- bart
	- cows-l2h
	---

	This model has been trained on 80% of the COWS-L2H dataset and 80,984 SYNTHETICALLY-GENERATED errorful sentences for grammatical error correction of Spanish text. The corpus was sentencized, so the model has been fine-tuned for SENTENCE CORRECTION. This model will likely not perform well on an entire paragraph. To correct a paragraph, sentencize the text and run the model for each sentence.

	The synthetic data was generated based on a rule-based algorithm from well-formed Spanish sentences. The code for synthetic generaton is available in the Github repo for this project: https://github.com/SkitCon/synth_gec_es

	BLEU: 0.851 on COWS-L2H

	Example usage:

	```python
	from transformers import AutoTokenizer, BartForConditionalGeneration

	tokenizer = AutoTokenizer.from_pretrained("SkitCon/gec-spanish-BARTO-SYNTHETIC")
	model = BartForConditionalGeneration.from_pretrained("SkitCon/gec-spanish-BARTO-SYNTHETIC")

	input_sentences = ["Yo va al tienda.", "Espero que tú ganas."]

	tokenized_text = tokenizer(input_sentences, max_length=128, padding="max_length", truncation=True, return_tensors="pt")

	input_ids = tokenized_text["input_ids"].squeeze()
	attention_mask = tokenized_text["attention_mask"].squeeze()

	outputs = model.generate(input_ids=input_ids, attention_mask=attention_mask)

	for sentence in tokenizer.batch_decode(outputs, skip_special_tokens=True):
	print(sentence)
	```