---
license: mit
language:
- ko
- vi
metrics:
- bleu
base_model:
- facebook/mbart-large-50-many-to-many-mmt
pipeline_tag: translation
library_name: transformers
tags:
- mbart
- mbart-50
- text2text-generation
---

# Model Card for mbart-large-50-mmt-ko-vi

This model is fine-tuned from mBART-large-50 on a multilingual translation corpus of Korean legal documents for Korean-to-Vietnamese translation.

---

## Table of Contents

- [Model Card for mbart-large-50-mmt-ko-vi](#model-card-for-mbart-large-50-mmt-ko-vi)
- [Table of Contents](#table-of-contents)
- [Model Details](#model-details)
- [Model Description](#model-description)
- [Uses](#uses)
- [Direct Use](#direct-use)
- [Out-of-Scope Use](#out-of-scope-use)
- [Bias, Risks, and Limitations](#bias-risks-and-limitations)
- [Training Details](#training-details)
- [Training Data](#training-data)
- [Training Procedure](#training-procedure)
- [Preprocessing](#preprocessing)
- [Speeds, Sizes, Times](#speeds-sizes-times)
- [Training Hyperparameters](#training-hyperparameters)
- [Evaluation](#evaluation)
- [Testing Data](#testing-data)
- [Metrics](#metrics)
- [Results](#results)
- [Environmental Impact](#environmental-impact)
- [Technical Specifications](#technical-specifications)
- [Citation](#citation)
- [Model Card Contact](#model-card-contact)

---

## Model Details

### Model Description

- **Developed by:** Jaeyoon Myoung, Heewon Kwak
- **Shared by:** ofu
- **Model type:** Language model (Translation)
- **Language(s) (NLP):** Korean, Vietnamese
- **License:** MIT
- **Parent Model:** facebook/mbart-large-50-many-to-many-mmt

---

## Uses

### Direct Use

This model translates text from Korean to Vietnamese.

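A minimal inference sketch with the Transformers library is shown below. The repository id `ofu/mbart-large-50-mmt-ko-vi` and the decoding settings are illustrative assumptions; substitute the actual checkpoint path and your preferred generation parameters.

```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

# Assumed repository id; replace with the actual checkpoint path if it differs.
model_name = "ofu/mbart-large-50-mmt-ko-vi"

tokenizer = MBart50TokenizerFast.from_pretrained(model_name)
model = MBartForConditionalGeneration.from_pretrained(model_name)

# mBART-50 marks the source language on the tokenizer and forces the target
# language through forced_bos_token_id.
tokenizer.src_lang = "ko_KR"
text = "근로계약은 서면으로 체결하여야 한다."  # "An employment contract shall be concluded in writing."

encoded = tokenizer(text, return_tensors="pt")
generated = model.generate(
    **encoded,
    forced_bos_token_id=tokenizer.lang_code_to_id["vi_VN"],  # Vietnamese output
    num_beams=5,
    max_length=200,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```
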
### Out-of-Scope Use

This model is not suitable for translation tasks involving language pairs other than Korean-to-Vietnamese, including the reverse Vietnamese-to-Korean direction.

---

## Bias, Risks, and Limitations

The model may contain biases inherited from the training data and may produce inappropriate translations for sensitive topics. Because it was fine-tuned on legal documents, translation quality may also degrade on text from other domains.

---

## Training Details

### Training Data

The model was trained using multilingual translation data of Korean legal documents provided by AI Hub.

### Training Procedure

#### Preprocessing

- Removed unnecessary whitespace, special characters, and line breaks.

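The cleaning script itself is not published; the sketch below only illustrates this kind of cleanup, and the specific regular expressions and sample sentence are assumptions rather than the authors' code.

```python
import re

def clean_text(text: str) -> str:
    """Illustrative cleanup: normalize line breaks, drop stray symbols, collapse whitespace."""
    text = re.sub(r"[\r\n\t]+", " ", text)         # line breaks and tabs -> single space
    text = re.sub(r"[^\w\s.,;:!?%()-]", "", text)  # drop unexpected special characters (\w keeps Hangul)
    text = re.sub(r"\s{2,}", " ", text)            # collapse repeated whitespace
    return text.strip()

print(clean_text("제1조 (목적)\n\n  이 법은 ★ 근로조건의  기준을 정한다."))
# -> 제1조 (목적) 이 법은 근로조건의 기준을 정한다.
```
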
### Speeds, Sizes, Times

- **Training Time:** 1 hour 25 minutes (5,100 seconds) on an NVIDIA RTX 4090
- **Throughput:** ~3.51 samples/second (17,922 samples / 5,100 seconds)
- **Total Training Samples:** 17,922
- **Model Checkpoint Size:** Approximately 2.3 GB
- **Gradient Accumulation Steps:** 4
- **FP16 Mixed Precision Enabled:** Yes

### Training Hyperparameters

The following hyperparameters were used during training; an illustrative `Seq2SeqTrainingArguments` sketch follows the list:

- **learning_rate**: `0.0001`
- **train_batch_size**: `8` (per device)
- **eval_batch_size**: `8` (per device)
- **seed**: `42`
- **distributed_type**: single node (`_n_gpu=1`; no distributed training)
- **num_devices**: `1` (single NVIDIA RTX 4090)
- **gradient_accumulation_steps**: `4`
- **total_train_batch_size**: `32` (`train_batch_size * gradient_accumulation_steps`)
- **total_eval_batch_size**: `8` (evaluation does not use gradient accumulation)
- **optimizer**: AdamW (`optim=OptimizerNames.ADAMW_TORCH`)
- **lr_scheduler_type**: linear (`lr_scheduler_type=SchedulerType.LINEAR`)
- **lr_scheduler_warmup_steps**: `100`
- **num_epochs**: `3`

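For reference, these settings map onto Hugging Face `Seq2SeqTrainingArguments` roughly as sketched below; this is a reconstruction from the list above, and `output_dir` (plus any argument not listed) is a placeholder rather than a value from the original run.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="mbart-large-50-mmt-ko-vi",  # placeholder
    learning_rate=1e-4,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=4,  # effective train batch size: 8 * 4 = 32
    num_train_epochs=3,
    lr_scheduler_type="linear",
    warmup_steps=100,
    optim="adamw_torch",
    seed=42,
    fp16=True,                      # mixed precision, as reported above
    predict_with_generate=True,     # generate translations during evaluation
)
```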
---

## Evaluation

### Testing Data

The evaluation used a dataset partially extracted from Korean labor law precedents.

### Metrics

- BLEU (computed with sacrebleu; see the sketch below)

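Scores can be reproduced with sacrebleu (listed under Software). A minimal sketch with placeholder sentences, not the actual test set:

```python
import sacrebleu

# Placeholder data: model outputs and one aligned reference stream.
hypotheses = ["Hợp đồng lao động phải được giao kết bằng văn bản."]
references = [["Hợp đồng lao động phải được ký kết bằng văn bản."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```
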
### Results

- **BLEU Score:** 29.69
- **Accuracy:** 95.65%

---

## Environmental Impact

- **Hardware Type:** NVIDIA RTX 4090
- **Power Consumption:** ~450 W
- **Training Time:** 1 hour 25 minutes (1.42 hours)
- **Electricity Consumption:** ~0.639 kWh
- **Carbon Emission Factor (South Korea):** 0.459 kgCO₂/kWh
- **Estimated Carbon Emissions:** ~0.293 kgCO₂

These figures follow from 0.45 kW × 1.42 h ≈ 0.639 kWh and 0.639 kWh × 0.459 kgCO₂/kWh ≈ 0.293 kgCO₂.

---

## Technical Specifications

- **Model Architecture:**
  Based on mBART-large-50, a multilingual sequence-to-sequence Transformer designed for translation tasks. The architecture includes 12 encoder and 12 decoder layers with 1,024 hidden units (these values can be checked against the base checkpoint's configuration; see the sketch after this list).
- **Software:**
  - sacrebleu for evaluation
  - Hugging Face Transformers library for fine-tuning
  - Python 3.11.9 and PyTorch 2.4.0

- **Hardware:**
  An NVIDIA RTX 4090 with 24 GB VRAM was used for training and inference.

- **Tokenization and Preprocessing:**
  Tokenization uses the SentencePiece model shipped with mBART-large-50. Text preprocessing included removing special characters and unnecessary whitespace, and normalizing line breaks.

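The architecture and tokenizer details above can be checked directly against the base checkpoint; a minimal sketch (assuming the `transformers` package and access to the Hugging Face Hub):

```python
from transformers import AutoConfig, MBart50TokenizerFast

base = "facebook/mbart-large-50-many-to-many-mmt"

# Inspect the encoder/decoder depth and hidden size reported in this card.
config = AutoConfig.from_pretrained(base)
print(config.encoder_layers, config.decoder_layers, config.d_model, config.vocab_size)

# The SentencePiece tokenizer splits Korean text into subword pieces.
tokenizer = MBart50TokenizerFast.from_pretrained(base)
print(tokenizer.tokenize("근로기준법 제1조"))
```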
---

## Citation

Currently, there are no papers or blog posts available for this model.

---

## Model Card Contact

- **Contact Email:** [email protected] | [email protected]