library_name: transformers
tags:
- catalan
- natural-language-processing
- text-generation
- gpt2
license: mit
Model Card for CatGPT
CatGPT is a Catalan natural language model inspired by GPT-2. It is designed to generate coherent and contextually relevant text in Catalan. The model is intended primarily for educational and experimental purposes, providing a lightweight tool for exploring natural language processing in Catalan.
Model Details
Model Description
CatGPT follows the GPT-2 architecture but is trained from scratch on Catalan text. Its small size makes it accessible and easy to deploy, at the cost of output quality: it does not aim for high-performance text generation. The compact design keeps both training and inference efficient for Catalan-language work.
- Developed by: Roger Baiges
- Model type: Causal Language Model (GPT-2 based)
- Language(s): Catalan
- License: MIT
- Finetuned from model: None (trained from scratch)
Model Sources
- Repository: GitHub - CatGPT
- Demo: CatGPT Demo
Uses
Direct Use
CatGPT can be used as a text generator in Catalan. It's suitable for creating educational content, generating sample text, or experimenting with language modeling in Catalan.
Downstream Use
The model can be fine-tuned for specific tasks like text completion, dialogue systems, or creative writing in Catalan.
Out-of-Scope Use
This model is not suitable for tasks that require high accuracy or complex language understanding, such as legal or medical text generation. It is also not recommended for generating content that depends on a deep understanding of context or nuance.
Bias, Risks, and Limitations
Biases
As with most language models, CatGPT may reflect biases present in its training data. Because the training datasets consist primarily of web-scraped data, the model might inadvertently generate biased or inappropriate content.
Limitations
- The model's small size limits its ability to generate high-quality text.
- It may not perform well in generating text that requires nuanced understanding or in contexts outside the training data.
- The model may struggle with certain dialects or less common expressions in Catalan.
Recommendations
Users should monitor outputs for bias and inappropriate content. Fine-tuning with carefully curated data can help mitigate some biases.
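One simple way to monitor outputs is to screen generated text against a list of unwanted terms before showing it to users. The following is a minimal sketch of that idea, not part of CatGPT itself: the helper names `flag_output` and `filter_outputs` are hypothetical, and the two blocklist entries are placeholders that would need to be replaced with a carefully curated Catalan list.

```python
import re

# Hypothetical placeholder terms (lowercase); replace with a curated list.
BLOCKLIST = {"insult", "ofensa"}


def flag_output(text, blocklist=BLOCKLIST):
    """Return the sorted blocklisted terms found in text.

    Matching is case-insensitive and on whole words. Note that the
    pattern splits words at the Catalan middle dot (as in "l·l"), so
    blocklist entries containing it would not match.
    """
    words = set(re.findall(r"\w+", text.lower()))
    return sorted(words & set(blocklist))


def filter_outputs(texts, blocklist=BLOCKLIST):
    """Split generated texts into (kept, flagged) lists."""
    kept, flagged = [], []
    for text in texts:
        if flag_output(text, blocklist):
            flagged.append(text)
        else:
            kept.append(text)
    return kept, flagged
```

A word list like this only catches explicit terms. It does not detect subtler bias, so manual review of flagged and unflagged samples is still advisable.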
How to Get Started with the Model
To use CatGPT, you can load the model and tokenizer as follows:
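from transformers import AutoModelForCausalLM, AutoTokenizer
# Load the tokenizer and model weights from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("baiges/CatGPT")
model = AutoModelForCausalLM.from_pretrained("baiges/CatGPT")
# Tokenize a Catalan prompt into PyTorch tensors
input_text = "La intel·ligència artificial"
inputs = tokenizer(input_text, return_tensors="pt")
# Generate up to 100 tokens in total (prompt included), passing the attention mask explicitly
outputs = model.generate(inputs.input_ids, attention_mask=inputs.attention_mask, max_length=100, num_return_sequences=1)
# Decode the generated token IDs back into text
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)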
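The call above uses the default decoding settings, which can produce repetitive text with small models. As a hedged sketch, sampling-based decoding may give more varied continuations. The helper name `generate_samples` is hypothetical, and the values chosen for `top_k`, `top_p`, and `temperature` are illustrative assumptions rather than settings tuned or recommended for CatGPT.

```python
def generate_samples(model, tokenizer, prompt, num_samples=3, max_new_tokens=50):
    """Return several sampled continuations of prompt as decoded strings.

    model and tokenizer are expected to be the objects returned by
    AutoModelForCausalLM.from_pretrained and AutoTokenizer.from_pretrained.
    """
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        do_sample=True,          # sample instead of greedy decoding
        top_k=50,                # illustrative value (assumption)
        top_p=0.95,              # illustrative value (assumption)
        temperature=0.8,         # illustrative value (assumption)
        max_new_tokens=max_new_tokens,  # counts only newly generated tokens
        num_return_sequences=num_samples,
    )
    return [tokenizer.decode(seq, skip_special_tokens=True) for seq in outputs]
```

With `model` and `tokenizer` loaded as shown earlier, `generate_samples(model, tokenizer, "La intel·ligència artificial")` would return a list of three strings. Because sampling is random, the results differ between runs.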