---
license: mit
datasets:
- abisee/cnn_dailymail
language:
- en
metrics:
- rouge
- bleu
base_model:
- google-t5/t5-small
pipeline_tag: summarization
library_name: transformers
---
|
# Model Card for t5_small Summarization Model
|
|
|
## Model Details

- Model Architecture: T5 (Text-to-Text Transfer Transformer)
- Variant: t5-small (about 60M parameters)
- Task: Abstractive text summarization
- Framework: Hugging Face Transformers
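
For orientation, the variant's dimensions can be read from its configuration. This is a minimal sketch using the public `t5-small` checkpoint; substitute this model's Hub repository ID if you want to inspect the fine-tuned weights instead:

```python
from transformers import AutoConfig, AutoModelForSeq2SeqLM

# t5-small layout: 6 layers per encoder/decoder stack, 8 attention heads, d_model = 512
config = AutoConfig.from_pretrained("t5-small")
print(config.num_layers, config.num_heads, config.d_model)

# Roughly 60M trainable parameters
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
print(sum(p.numel() for p in model.parameters()))
```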
|
|
|
## Training Data

- Dataset: CNN/DailyMail (see the loading sketch below)
- Content: News articles paired with human-written highlight summaries
- Size: Approximately 300,000 article-summary pairs
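
The dataset can be loaded directly with the `datasets` library (`pip install datasets`). A minimal sketch, assuming the standard `3.0.0` configuration and its `article`/`highlights` fields:

```python
from datasets import load_dataset

# CNN/DailyMail: news articles paired with highlight summaries
dataset = load_dataset("abisee/cnn_dailymail", "3.0.0")

print(dataset)                    # train/validation/test splits (~287k/13k/11k examples)
example = dataset["train"][0]
print(example["article"][:200])   # source news article
print(example["highlights"])      # reference summary
```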
|
|
|
## Training Procedure

- Fine-tuning method: Sequence-to-sequence fine-tuning with the Hugging Face Transformers library (a sketch follows this list)
- Hyperparameters:
  - Learning rate: 5e-5
  - Batch size: 8
  - Number of epochs: 3
  - Optimizer: AdamW
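
The exact training script is not included in this card. The following is a hedged sketch of how a comparable run could be set up with `Seq2SeqTrainer` using the hyperparameters above; the preprocessing details (the `summarize:` prefix and the 512/150 token limits) are assumptions, not a record of the actual run:

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq,
                          Seq2SeqTrainingArguments, Seq2SeqTrainer)

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
dataset = load_dataset("abisee/cnn_dailymail", "3.0.0")

def preprocess(batch):
    # T5 uses a task prefix; truncate articles and summaries (assumed limits)
    inputs = tokenizer(["summarize: " + a for a in batch["article"]],
                       max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["highlights"], max_length=150, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = dataset.map(preprocess, batched=True,
                        remove_columns=dataset["train"].column_names)

args = Seq2SeqTrainingArguments(
    output_dir="t5-small-cnn-dailymail",
    learning_rate=5e-5,                 # hyperparameters from the list above
    per_device_train_batch_size=8,
    num_train_epochs=3,
    optim="adamw_torch",
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```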
|
|
|
## How to Use

1. Install the Hugging Face Transformers library:

```bash
pip install transformers
```
|
|
|
2. Load the model:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# "t5-small" is the base checkpoint; replace it with this model's Hub repository ID
# to load the fine-tuned summarization weights.
tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
```
|
|
|
3. Generate a summary:

```python
input_text = "Your input text here"

# T5 expects the "summarize:" task prefix; inputs are truncated to the 512-token limit.
inputs = tokenizer("summarize: " + input_text, return_tensors="pt", max_length=512, truncation=True)
summary_ids = model.generate(inputs["input_ids"], max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(summary)
```
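
Alternatively, the same steps can be wrapped in the high-level `pipeline` API. A minimal sketch; as above, substitute this model's Hub ID for `t5-small` to use the fine-tuned weights:

```python
from transformers import pipeline

# The summarization pipeline applies the T5 task prefix automatically
summarizer = pipeline("summarization", model="t5-small", tokenizer="t5-small")
result = summarizer("Your input text here", max_length=150, min_length=40, truncation=True)
print(result[0]["summary_text"])
```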
|
|
|
## Evaluation

- Metric: ROUGE (Recall-Oriented Understudy for Gisting Evaluation); see the evaluation sketch below
- Exact scores are not reported here, but summarization models on this dataset are typically evaluated on:
  - ROUGE-1 (unigram overlap)
  - ROUGE-2 (bigram overlap)
  - ROUGE-L (longest common subsequence)
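
If you want to compute ROUGE yourself, the `evaluate` library (`pip install evaluate rouge_score`) provides it. This is a hedged sketch over a small slice of the test split; the generation settings mirror the usage code above and are not the exact evaluation script:

```python
import evaluate
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
rouge = evaluate.load("rouge")

# Small test slice for illustration; use the full split for a real evaluation
test_set = load_dataset("abisee/cnn_dailymail", "3.0.0", split="test[:16]")

predictions = []
for article in test_set["article"]:
    inputs = tokenizer("summarize: " + article, return_tensors="pt", max_length=512, truncation=True)
    ids = model.generate(inputs["input_ids"], max_length=150, min_length=40, num_beams=4)
    predictions.append(tokenizer.decode(ids[0], skip_special_tokens=True))

scores = rouge.compute(predictions=predictions, references=test_set["highlights"])
print(scores)  # rouge1, rouge2, rougeL, rougeLsum F-measures
```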
|
|
|
## Limitations

- Performance may be lower than that of larger T5 variants
- Optimized for news article summarization; may not perform as well on other text types
- Limited to input sequences of 512 tokens
- Generated summaries may sometimes contain factual inaccuracies
|
|
|
## Ethical Considerations

- May inherit biases present in the CNN/DailyMail dataset
- Not suitable for summarizing sensitive or critical information without human review
- Users should be aware of potential biases and inaccuracies in generated summaries
- Should not be used as a sole source of information for decision-making processes