---
library_name: transformers
license: mit
datasets:
- kimleang123/khmer-text-dataset
language:
- km
base_model:
- google/mt5-small
pipeline_tag: summarization
---
# Khmer mT5 Summarization Model

## 📌 Introduction
This repository contains a **fine-tuned mT5 model for Khmer text summarization**. The model is based on Google's [mT5-small](https://huggingface.co/google/mt5-small) and was fine-tuned on a dataset of Khmer texts paired with reference summaries.

Fine-tuning was performed using the Hugging Face `Trainer` API, optimizing the model to **generate concise and meaningful summaries of Khmer text**.

---

## 🚀 Model Details
- **Base Model:** `google/mt5-small`
- **Fine-tuned for:** Khmer text summarization
- **Training Dataset:** `kimleang123/khmer-text-dataset`
- **Framework:** Hugging Face `transformers`
- **Task Type:** Sequence-to-Sequence (Seq2Seq)
- **Input:** Khmer text (articles, paragraphs, or documents)
- **Output:** Summarized Khmer text
- **Training Hardware:** GPU (Tesla T4)
- **Evaluation Metric:** ROUGE Score

---

## 🔧 Installation & Setup
### 1️⃣ Install Dependencies
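Ensure you have `transformers`, `torch`, and `datasets` installed. Depending on your `transformers` version, the mT5 tokenizer may also need `sentencepiece`:
```bash
pip install transformers torch datasets sentencepiece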
```

### 2️⃣ Load the Model
To load the fine-tuned model and its tokenizer:
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "songhieng/khmer-mt5-summarization"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
```

---

## 📌 How to Use
### 1️⃣ Using Python Code
```python
def summarize_khmer(text, max_length=150):
    input_text = f"summarize: {text}"
    inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=512)
    summary_ids = model.generate(**inputs, max_length=max_length, num_beams=5, length_penalty=2.0, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

khmer_text = "កម្ពុជាមានប្រជាជនប្រមាណ ១៦ លាននាក់ ហើយវាគឺជាប្រទេសនៅតំបន់អាស៊ីអាគ្នេយ៍។"
summary = summarize_khmer(khmer_text)
print("🔹 Khmer Summary:", summary)
```
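
The tokenizer call above truncates the input at 512 tokens, so anything beyond that point in a long document is ignored. One possible workaround, sketched below, is to split the text on the Khmer sentence terminator `។` and summarize each chunk separately. The helper name `split_khmer_chunks` and the `max_chars` budget are hypothetical, and a character count is only a rough stand-in for the token limit.

```python
import re

KHAN = "។"  # Khmer sentence terminator (U+17D4)

def split_khmer_chunks(text, max_chars=1000):
    """Group sentences into chunks of roughly max_chars characters.

    max_chars is an illustrative budget, not a value taken from the model.
    """
    # Each match is a run of text followed by an optional terminator.
    pieces = re.findall(f"[^{KHAN}]+{KHAN}?", text)
    sentences = [p.strip() for p in pieces if p.strip()]

    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the budget: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    # A single sentence longer than max_chars still becomes its own chunk.
    return chunks
```

Each chunk can then be passed to the function above, for example `[summarize_khmer(c) for c in split_khmer_chunks(long_text)]`, where `long_text` is your own document.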

### 2️⃣ Using Hugging Face Pipeline
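For a simpler approach, use the `pipeline` API. The pipeline passes the text to the model as given and only adds the `summarize: ` prefix used above if the model's config defines one: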
```python
from transformers import pipeline

summarizer = pipeline("summarization", model="songhieng/khmer-mt5-summarization")
khmer_text = "កម្ពុជាមានប្រជាជនប្រមាណ ១៦ លាននាក់ ហើយវាគឺជាប្រទេសនៅតំបន់អាស៊ីអាគ្នេយ៍។"
summary = summarizer(khmer_text, max_length=150, min_length=30, do_sample=False)
print("🔹 Khmer Summary:", summary[0]['summary_text'])
```

### 3️⃣ Deploy as an API using FastAPI
You can create a simple API for summarization:
```python
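from fastapi import FastAPI
from pydantic import BaseModel

# Assumes `tokenizer` and `model` are already loaded as shown in "Load the Model"
app = FastAPI()

class SummarizeRequest(BaseModel):
    text: str

@app.post("/summarize/")
def summarize(request: SummarizeRequest):
    # A Pydantic model makes FastAPI read `text` from the JSON body instead of the query string
    inputs = tokenizer(f"summarize: {request.text}", return_tensors="pt", truncation=True, max_length=512)
    summary_ids = model.generate(**inputs, max_length=150, num_beams=5, length_penalty=2.0, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return {"summary": summary}

# Install with: pip install fastapi uvicorn
# Run with: uvicorn filename:app --reload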
```

---

## 📊 Model Evaluation
The model was evaluated using **ROUGE scores**, which measure how similar the generated summaries are to the ground truth summaries.

```python
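import numpy as np
import evaluate  # pip install evaluate rouge_score

# `load_metric` was removed from recent `datasets` releases; `evaluate` replaces it
rouge = evaluate.load("rouge")

def compute_metrics(pred):
    # Assumes pred.predictions holds generated token IDs, not logits
    labels_ids = pred.label_ids
    pred_ids = pred.predictions
    # If labels were padded with -100 for the loss, restore the pad token before decoding
    labels_ids = np.where(labels_ids != -100, labels_ids, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)
    return rouge.compute(predictions=decoded_preds, references=decoded_labels)

# `trainer` is the Trainer instance from fine-tuning, created with compute_metrics=compute_metrics
trainer.evaluate()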
```
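
To make the metric concrete, here is a minimal sketch of ROUGE-1 F1, the unigram-overlap variant of ROUGE. It is an illustration, not the implementation used to evaluate this model. The function name `rouge1_f1` and the pluggable `tokenize` argument are hypothetical. Khmer is written without spaces between words, so the choice of tokenizer strongly affects the score, and some ROUGE implementations tokenize with rules aimed at English text; it is worth checking how Khmer is being split before trusting a number.

```python
from collections import Counter

def rouge1_f1(prediction, reference, tokenize=str.split):
    """ROUGE-1 F1: harmonic mean of unigram precision and recall.

    tokenize defaults to whitespace splitting; pass a Khmer word
    segmenter (or `list` for characters) for more meaningful scores.
    """
    pred_counts = Counter(tokenize(prediction))
    ref_counts = Counter(tokenize(reference))
    # Clipped overlap: a token counts only as often as it occurs in both texts.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)
```

For instance, with whitespace tokens, a prediction sharing two of its three tokens with a three-token reference has precision and recall of 2/3 each, and therefore an F1 of 2/3.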

---

## 💾 Saving & Uploading the Model
After fine-tuning, the model and tokenizer were uploaded to the Hugging Face Hub:
```python
model.push_to_hub("songhieng/khmer-mt5-summarization")
tokenizer.push_to_hub("songhieng/khmer-mt5-summarization")
```
To download it later:
```python
model = AutoModelForSeq2SeqLM.from_pretrained("songhieng/khmer-mt5-summarization")
tokenizer = AutoTokenizer.from_pretrained("songhieng/khmer-mt5-summarization")
```

---

## 🎯 Summary
| **Feature** | **Details** |
|------------|------------|
| **Base Model** | `google/mt5-small` |
| **Task** | Summarization |
| **Language** | Khmer (ខ្មែរ) |
| **Dataset** | `kimleang123/khmer-text-dataset` |
| **Framework** | Hugging Face Transformers |
| **Evaluation Metric** | ROUGE Score |
| **Deployment** | Hugging Face Model Hub, API (FastAPI), Python Code |

---

## 🤝 Contributing
Contributions are welcome! Feel free to **open issues or submit pull requests** if you have improvements to suggest.

### 📬 Contact
If you have any questions, feel free to reach out via [Hugging Face Discussions](https://huggingface.co/) or create an issue in the repository.

📌 **Built for the Khmer NLP Community** 🇰🇭 🚀