metadata

library_name: transformers
tags:
  - catalan
  - natural-language-processing
  - text-generation
  - gpt2
license: mit

Model Card for CatGPT

CatGPT is a Catalan natural language model inspired by GPT-2. It is designed to generate coherent and contextually relevant text in Catalan. The model is intended primarily for educational and experimental purposes, providing a lightweight tool for exploring natural language processing in Catalan.

Model Details

Model Description

CatGPT follows the GPT-2 architecture but is trained from scratch on Catalan text. Its small size makes it easy to deploy and inexpensive to train and run, at the cost of generation quality: it does not aim to match the output of larger models.

Developed by: Roger Baiges
Model type: Causal Language Model (GPT-2 based)
Language(s): Catalan
License: MIT
Finetuned from model: None (trained from scratch)
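
To make the effect of model size concrete, the sketch below counts the parameters of a GPT-2 style decoder from its configuration. The formula follows the standard GPT-2 layout with tied input and output embeddings. This card does not state CatGPT's layer sizes, so the `hypothetical_small` configuration is an illustrative assumption, not the model's actual configuration.

```python
def gpt2_param_count(vocab_size: int, n_positions: int, n_embd: int, n_layer: int) -> int:
    """Count trainable parameters of a GPT-2 style decoder with tied embeddings."""
    # Token embeddings plus learned position embeddings.
    embeddings = vocab_size * n_embd + n_positions * n_embd
    # Per block: attention (4*d^2 + 4*d), MLP (8*d^2 + 5*d), two LayerNorms (4*d).
    per_block = 12 * n_embd ** 2 + 13 * n_embd
    # Final LayerNorm: one weight vector and one bias vector.
    final_layer_norm = 2 * n_embd
    return embeddings + n_layer * per_block + final_layer_norm


# Published GPT-2 small configuration, used here as a reference point.
gpt2_small = gpt2_param_count(vocab_size=50257, n_positions=1024, n_embd=768, n_layer=12)

# Hypothetical smaller configuration (NOT CatGPT's actual values).
hypothetical_small = gpt2_param_count(vocab_size=32000, n_positions=512, n_embd=384, n_layer=6)

print(f"GPT-2 small:        {gpt2_small:,}")  # 124,439,808
print(f"Hypothetical small: {hypothetical_small:,}")
```

Because the per-block term grows with the square of the embedding width, halving `n_embd` cuts each block to roughly a quarter of its size, which is why small configurations are so much cheaper to train and serve.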

Model Sources

Repository: GitHub - CatGPT
Demo: CatGPT Demo

Uses

Direct Use

CatGPT can be used as a text generator in Catalan. It's suitable for creating educational content, generating sample text, or experimenting with language modeling in Catalan.

Downstream Use

The model can be fine-tuned for specific tasks like text completion, dialogue systems, or creative writing in Catalan.
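
One common preprocessing step when fine-tuning a causal language model is to concatenate the tokenized corpus and cut it into fixed-length blocks. The following is a minimal sketch of that step in plain Python. The function name `group_into_blocks` and the block size are illustrative assumptions, and the integer list stands in for real tokenizer output.

```python
from typing import Dict, List


def group_into_blocks(token_ids: List[int], block_size: int) -> List[Dict[str, List[int]]]:
    """Split one long token stream into fixed-length training examples."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    # Drop the incomplete tail so every example has exactly block_size tokens.
    usable = (len(token_ids) // block_size) * block_size
    examples = []
    for start in range(0, usable, block_size):
        block = token_ids[start:start + block_size]
        # For causal LM training the labels are a copy of the inputs;
        # GPT-2 style models in transformers shift them internally.
        examples.append({"input_ids": block, "labels": list(block)})
    return examples


stream = list(range(10))  # stand-in for the token IDs of a tokenized corpus
blocks = group_into_blocks(stream, block_size=4)
print(len(blocks), blocks[0]["input_ids"])
```

The resulting dictionaries have the `input_ids` and `labels` keys that a causal language model's forward pass expects, so they can be batched and passed to a training loop of your choice.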

Out-of-Scope Use

This model is not suitable for tasks that require high accuracy or complex language understanding, such as legal or medical text generation. It is also not recommended for generating content that requires a deep understanding of context or nuance.

Bias, Risks, and Limitations

Biases

As with most language models, CatGPT may reflect biases present in its training data. Because the training datasets are primarily web-scraped, the model may generate biased or inappropriate content.

Limitations

The model's small size limits its ability to generate high-quality text.
It may perform poorly on text that requires nuanced understanding or on topics not covered by the training data.
The model may struggle with certain dialects or less common expressions in Catalan.

Recommendations

Users should monitor outputs for bias and inappropriate content. Fine-tuning with carefully curated data can help mitigate some biases.
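
As a starting point for monitoring, generated text can be checked against a user-maintained watchlist before it is shown or stored. The sketch below is a deliberately simple illustration: `flag_terms` and the example watchlist are hypothetical, and keyword matching alone will not catch subtle bias, so it complements rather than replaces human review.

```python
import re
from typing import Iterable, List


def flag_terms(text: str, watchlist: Iterable[str]) -> List[str]:
    """Return the watchlist terms that occur in text as whole words (case-insensitive)."""
    found = []
    for term in watchlist:
        # re.escape keeps characters such as "·" or "." literal inside the pattern.
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(term)
    return found


# Hypothetical watchlist; in practice this would be curated for the use case.
watchlist = ["paraula", "altra"]
flags = flag_terms("Aquest text conté una PARAULA marcada.", watchlist)
print(flags)
```

Outputs with a non-empty flag list can be routed to manual review instead of being discarded automatically.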

How to Get Started with the Model

To use CatGPT, you can load the model and tokenizer as follows:

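from transformers import AutoModelForCausalLM, AutoTokenizer

# Download the tokenizer and model weights from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("baiges/CatGPT")
model = AutoModelForCausalLM.from_pretrained("baiges/CatGPT")

# Tokenize the prompt into PyTorch tensors (input_ids and attention_mask).
input_text = "La intel·ligència artificial"
inputs = tokenizer(input_text, return_tensors="pt")

# Pass the attention mask along with the input IDs.
# max_length counts the prompt tokens plus the generated tokens.
outputs = model.generate(**inputs, max_length=100, num_return_sequences=1)

# Decode the first (and only) returned sequence back to text.
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)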
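
Unless the model's generation config specifies otherwise, the `generate` call above decodes greedily, which tends to produce repetitive text with small models. The sketch below shows one way to switch to sampling. The helper names and the default values are illustrative assumptions and have not been tuned for CatGPT; the keyword arguments themselves (`do_sample`, `max_new_tokens`, `temperature`, `top_k`, `top_p`) are standard `generate` parameters.

```python
def sampling_kwargs(max_new_tokens: int = 60, temperature: float = 0.8,
                    top_k: int = 50, top_p: float = 0.95) -> dict:
    """Build keyword arguments that switch generate() from greedy decoding to sampling."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if not 0 < top_p <= 1:
        raise ValueError("top_p must be in (0, 1]")
    return {
        "do_sample": True,                 # sample instead of taking the argmax
        "max_new_tokens": max_new_tokens,  # counts generated tokens only, not the prompt
        "temperature": temperature,        # below 1 sharpens the distribution
        "top_k": top_k,                    # keep only the k most likely tokens
        "top_p": top_p,                    # nucleus sampling threshold
    }


def generate_sample(prompt: str, model_id: str = "baiges/CatGPT") -> str:
    """Generate one sampled continuation; downloads the model on first use."""
    # Imported lazily so sampling_kwargs() works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, **sampling_kwargs())
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


kwargs = sampling_kwargs()
print(kwargs)
```

Calling `generate_sample("La intel·ligència artificial")` returns a different continuation on each run; lowering `temperature` or `top_p` makes the output more conservative.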