---
license: mit
language:
- ko
- vi
metrics:
- bleu
base_model:
- facebook/mbart-large-50-many-to-many-mmt
pipeline_tag: translation
library_name: transformers
tags:
- mbart
- mbart-50
- text2text-generation
---
# Model Card for mbart-large-50-mmt-ko-vi
This model is fine-tuned from mBART-large-50 on multilingual translation data of Korean legal documents for the Korean-to-Vietnamese translation task.
---
## Table of Contents
- [Model Card for mbart-large-50-mmt-ko-vi](#model-card-for-mbart-large-50-mmt-ko-vi)
- [Table of Contents](#table-of-contents)
- [Model Details](#model-details)
- [Model Description](#model-description)
- [Uses](#uses)
- [Direct Use](#direct-use)
- [Out-of-Scope Use](#out-of-scope-use)
- [Bias, Risks, and Limitations](#bias-risks-and-limitations)
- [Training Details](#training-details)
- [Training Data](#training-data)
- [Training Procedure](#training-procedure)
- [Preprocessing](#preprocessing)
- [Speeds, Sizes, Times](#speeds-sizes-times)
- [Training Hyperparameters](#training-hyperparameters)
- [Evaluation](#evaluation)
- [Testing Data](#testing-data)
- [Metrics](#metrics)
- [Results](#results)
- [Environmental Impact](#environmental-impact)
- [Technical Specifications](#technical-specifications)
- [Citation](#citation)
- [Model Card Contact](#model-card-contact)
---
## Model Details
### Model Description
- **Developed by:** Jaeyoon Myoung, Heewon Kwak
- **Shared by:** ofu
- **Model type:** Language model (Translation)
- **Language(s) (NLP):** Korean, Vietnamese
- **License:** MIT
- **Parent Model:** facebook/mbart-large-50-many-to-many-mmt
---
## Uses
### Direct Use
This model is intended for translating text from Korean to Vietnamese.
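A minimal usage sketch with the standard mBART-50 Transformers API is shown below. The repository ID `ofu/mbart-large-50-mmt-ko-vi` is a placeholder (replace it with the actual Hub path), and the input sentence, beam size, and maximum length are illustrative, not reported inference settings.

```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

# Placeholder repository ID -- replace with the actual Hub path of this model.
model_id = "ofu/mbart-large-50-mmt-ko-vi"

tokenizer = MBart50TokenizerFast.from_pretrained(model_id, src_lang="ko_KR", tgt_lang="vi_VN")
model = MBartForConditionalGeneration.from_pretrained(model_id)

text = "근로자는 근로계약서를 서면으로 교부받을 권리가 있다."  # illustrative Korean legal-style sentence
inputs = tokenizer(text, return_tensors="pt")

# Force the decoder to start generating in Vietnamese.
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.lang_code_to_id["vi_VN"],
    num_beams=5,
    max_length=256,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```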
### Out-of-Scope Use
This model is not suitable for translation directions other than Korean to Vietnamese.
---
## Bias, Risks, and Limitations
The model may contain biases inherited from the training data and may produce inappropriate translations for sensitive topics.
---
## Training Details
### Training Data
The model was trained using multilingual translation data of Korean legal documents provided by AI Hub.
### Training Procedure
#### Preprocessing
- Removed unnecessary whitespace, special characters, and line breaks.
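The exact cleaning rules are not published; the sketch below is one plausible implementation of the listed steps (the punctuation whitelist in the regular expression is an assumption).

```python
import re

def clean_text(text: str) -> str:
    """Illustrative cleaning: normalize line breaks, drop special characters, collapse whitespace."""
    # Replace line breaks with spaces
    text = text.replace("\r\n", " ").replace("\n", " ")
    # Keep word characters (covers Hangul and Vietnamese letters), whitespace, and basic punctuation
    text = re.sub(r"[^\w\s.,!?%()\-]", " ", text)
    # Collapse runs of whitespace
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("근로자는\n  #임금을#   지급받는다."))  # -> 근로자는 임금을 지급받는다.
```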
### Speeds, Sizes, Times
- **Training Time:** 1 hour 25 minutes (5,100 seconds) on an NVIDIA RTX 4090
- **Throughput:** ~3.51 samples/second
- **Total Training Samples:** 17,922
- **Model Checkpoint Size:** Approximately 2.3GB
- **Gradient Accumulation Steps:** 4
- **FP16 Mixed Precision Enabled:** Yes
### Training Hyperparameters
The following hyperparameters were used during training:
- **learning_rate**: `0.0001`
- **train_batch_size**: `8` (per device)
- **eval_batch_size**: `8` (per device)
- **seed**: `42`
- **distributed_type**: `single-node` (since `_n_gpu=1` and no distributed training setup is indicated)
- **num_devices**: `1` (single NVIDIA GPU: RTX 4090)
- **gradient_accumulation_steps**: `4`
- **total_train_batch_size**: `32` (calculated as `train_batch_size * gradient_accumulation_steps`)
- **total_eval_batch_size**: `8` (evaluation does not use gradient accumulation)
- **optimizer**: `AdamW` (indicated by `optim=OptimizerNames.ADAMW_TORCH`)
- **lr_scheduler_type**: `linear` (indicated by `lr_scheduler_type=SchedulerType.LINEAR`)
- **lr_scheduler_warmup_steps**: `100`
- **num_epochs**: `3`
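For reference, these values correspond roughly to the `Seq2SeqTrainingArguments` sketch below; the output directory is an assumption, and options not listed above (logging, saving, evaluation strategy) are omitted.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./mbart-large-50-mmt-ko-vi",  # assumed path
    learning_rate=1e-4,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=4,   # effective train batch size: 8 * 4 = 32
    num_train_epochs=3,
    lr_scheduler_type="linear",
    warmup_steps=100,
    optim="adamw_torch",
    fp16=True,
    seed=42,
)
```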
---
## Evaluation
### Testing Data
The evaluation used a test set extracted in part from Korean labor-law precedents.
### Metrics
- BLEU (computed with sacreBLEU)
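A minimal scoring sketch over model outputs and reference translations might look like the following; the example sentences are illustrative, not taken from the test set.

```python
import sacrebleu

hypotheses = ["Người lao động có quyền nhận hợp đồng lao động bằng văn bản."]          # model outputs
references = [["Người lao động có quyền được nhận hợp đồng lao động bằng văn bản."]]   # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(round(bleu.score, 2))
```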
### Results
- **BLEU Score:** 29.69
- **Accuracy:** 95.65%
---
## Environmental Impact
- **Hardware Type:** NVIDIA RTX 4090
- **Power Consumption:** ~450W
- **Training Time:** 1 hour 25 minutes (1.42 hours)
- **Electricity Consumption:** ~0.639 kWh
- **Carbon Emission Factor (South Korea):** 0.459 kgCO₂/kWh
- **Estimated Carbon Emissions:** ~0.293 kgCO₂
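The estimate follows directly from the figures above: 0.45 kW × 1.42 h ≈ 0.639 kWh, and 0.639 kWh × 0.459 kgCO₂/kWh ≈ 0.293 kgCO₂.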
---
## Technical Specifications
- **Model Architecture:**
Based on mBART-large-50, a multilingual sequence-to-sequence Transformer designed for translation tasks. The architecture comprises 12 encoder and 12 decoder layers with a hidden size of 1,024; these figures can be checked with the sketch after this list.
- **Software:**
- sacrebleu for evaluation
- Hugging Face Transformers library for fine-tuning
- Python 3.11.9 and PyTorch 2.4.0
- **Hardware:**
NVIDIA RTX 4090 with 24GB VRAM was used for training and inference.
- **Tokenization and Preprocessing:**
Tokenization uses the SentencePiece tokenizer pre-trained with mBART-large-50. Text preprocessing included removing special characters and unnecessary whitespace, and normalizing line breaks.
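The architecture figures above can be checked against the published configuration of the parent model, for example:

```python
from transformers import AutoConfig

# Inspect the parent model's configuration (layer counts and hidden size).
config = AutoConfig.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
print(config.encoder_layers, config.decoder_layers, config.d_model)  # 12 12 1024
```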
---
## Citation
Currently, there are no papers or blog posts available for this model.
---
## Model Card Contact
- **Contact Email:** [email protected] | [email protected]