---
library_name: transformers
license: mit
datasets:
- kimleang123/khmer-text-dataset
language:
- km
base_model:
- google/mt5-small
pipeline_tag: summarization
---

# Khmer mT5 Summarization Model

## 📌 Introduction

This repository contains a **fine-tuned mT5 model for Khmer text summarization**. The model is based on Google's [mT5-small](https://huggingface.co/google/mt5-small) and was fine-tuned on a dataset of Khmer texts paired with reference summaries.

Fine-tuning was performed with the Hugging Face `Trainer` API, optimizing the model to **generate concise and meaningful summaries of Khmer text**.

---

## 🚀 Model Details

- **Base Model:** `google/mt5-small`
- **Fine-tuned for:** Khmer text summarization
- **Training Dataset:** `kimleang123/khmer-text-dataset`
- **Framework:** Hugging Face `transformers`
- **Task Type:** Sequence-to-Sequence (Seq2Seq)
- **Input:** Khmer text (articles, paragraphs, or documents)
- **Output:** Summarized Khmer text
- **Training Hardware:** GPU (Tesla T4)
- **Evaluation Metric:** ROUGE Score

---

## 🔧 Installation & Setup

### 1️⃣ Install Dependencies

Ensure you have `transformers`, `torch`, and `datasets` installed:

```bash
pip install transformers torch datasets
```

### 2️⃣ Load the Model

To load and use the fine-tuned model:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "songhieng/khmer-mt5-summarization"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
```

---

## 📌 How to Use

### 1️⃣ Using Python Code

```python
# Uses the `tokenizer` and `model` loaded in the previous section
def summarize_khmer(text, max_length=150):
    # Prepend the task prefix and truncate the input to 512 tokens
    input_text = f"summarize: {text}"
    inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=512)
    # Beam search with 5 beams; `max_length` caps the summary length in tokens
    summary_ids = model.generate(**inputs, max_length=max_length, num_beams=5, length_penalty=2.0, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

# "Cambodia has a population of about 16 million and is a country in Southeast Asia."
khmer_text = "កម្ពុជាមានប្រជាជនប្រមាណ ១៦ លាននាក់ ហើយវាគឺជាប្រទេសនៅតំបន់អាស៊ីអាគ្នេយ៍។"
summary = summarize_khmer(khmer_text)
print("🔹 Khmer Summary:", summary)
```

### 2️⃣ Using Hugging Face Pipeline

For a simpler approach:

```python
from transformers import pipeline

summarizer = pipeline("summarization", model="songhieng/khmer-mt5-summarization")

khmer_text = "កម្ពុជាមានប្រជាជនប្រមាណ ១៦ លាននាក់ ហើយវាគឺជាប្រទេសនៅតំបន់អាស៊ីអាគ្នេយ៍។"
# Greedy/beam decoding (no sampling); lengths are measured in tokens
summary = summarizer(khmer_text, max_length=150, min_length=30, do_sample=False)
print("🔹 Khmer Summary:", summary[0]['summary_text'])
```

### 3️⃣ Deploy as an API using FastAPI

You can create a simple API for summarization:

```python
# Requires: pip install fastapi uvicorn
from fastapi import FastAPI
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "songhieng/khmer-mt5-summarization"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

app = FastAPI()

@app.post("/summarize/")
def summarize(text: str):
    # `text` is a plain `str` parameter, so FastAPI reads it from the query string
    inputs = tokenizer(f"summarize: {text}", return_tensors="pt", truncation=True, max_length=512)
    summary_ids = model.generate(**inputs, max_length=150, num_beams=5, length_penalty=2.0, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return {"summary": summary}

# Run with: uvicorn filename:app --reload
```

---

## 📊 Model Evaluation

The model was evaluated using **ROUGE scores**, which measure the n-gram overlap between the generated summaries and the ground-truth summaries.
The `compute_metrics` function below decodes the predictions and labels and scores them with ROUGE:

```python
# Requires: pip install evaluate rouge_score
# (`datasets.load_metric` has been removed from recent `datasets` releases)
import numpy as np
import evaluate

rouge = evaluate.load("rouge")

def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions
    # If labels are padded with -100 (the DataCollatorForSeq2Seq default),
    # swap in the pad token so they can be decoded
    labels_ids = np.where(labels_ids != -100, labels_ids, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)
    return rouge.compute(predictions=decoded_preds, references=decoded_labels)

# `trainer` is the Trainer instance used for fine-tuning (not shown here)
trainer.evaluate()
```

---

## 💾 Saving & Uploading the Model

After fine-tuning, the model was uploaded to the Hugging Face Hub:

```python
# Requires being logged in to the Hub with write access to the repository
model.push_to_hub("songhieng/khmer-mt5-summarization")
tokenizer.push_to_hub("songhieng/khmer-mt5-summarization")
```

To download it later:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("songhieng/khmer-mt5-summarization")
tokenizer = AutoTokenizer.from_pretrained("songhieng/khmer-mt5-summarization")
```

---

## 🎯 Summary

| **Feature** | **Details** |
|------------|------------|
| **Base Model** | `google/mt5-small` |
| **Task** | Summarization |
| **Language** | Khmer (ខ្មែរ) |
| **Dataset** | `kimleang123/khmer-text-dataset` |
| **Framework** | Hugging Face Transformers |
| **Evaluation Metric** | ROUGE Score |
| **Deployment** | Hugging Face Model Hub, API (FastAPI), Python Code |

---

## 🤝 Contributing

Contributions are welcome! Feel free to **open issues or submit pull requests** if you have improvements to suggest.

### 📬 Contact

If you have any questions, feel free to reach out via [Hugging Face Discussions](https://huggingface.co/) or create an issue in the repository.

📌 **Built for the Khmer NLP Community** 🇰🇭 🚀