Marco-LLM-GLO
Introduction
Marco-LLM is a series of advanced multilingual language models designed to bridge the performance gap between high-resource and low-resource languages. This repository contains the 7-billion-parameter Marco-LLM base language model.
The model has undergone extensive multilingual continual pretraining on a diverse dataset of over 5 trillion tokens, with a particular focus on improving performance in low-resource languages while maintaining strong capabilities in high-resource languages such as English and Chinese.
Compared to state-of-the-art open-source language models, Marco-LLM demonstrates significant improvements on multilingual tasks, including machine translation, question answering, and reasoning across multiple languages. For more details, please refer to our Hugging Face page.
Model Details
Marco-LLM-GLO is a 7B-parameter model based on the Transformer architecture, continually pretrained from Qwen/Qwen2-7B. The key features of Marco-LLM are:
- Multilingual Training: The model is trained on a large-scale multilingual dataset covering 29 languages, including both high-resource languages (e.g., English, Chinese) and low-resource languages (e.g., Kazakh, Nepali).
- Enhanced Tokenizer: An improved tokenizer is used to better handle multilingual data, ensuring higher efficiency and accuracy in tokenization (see the sketch after this list).
- Post-Training: Marco-LLM supports various post-training methods, such as Supervised Fine-tuning (SFT) and Direct Preference Optimization (DPO), to further enhance performance for specific tasks and languages.
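As an illustrative sketch of the multilingual tokenizer (assuming the standard Hugging Face `transformers` API and the `AIDC-AI/Marco-LLM-GLO` model ID from this page; the sample sentences are placeholders), you can compare how text in a high-resource and a low-resource language is segmented:

```python
from transformers import AutoTokenizer

# Model ID taken from this page; adjust if you use a local copy.
tokenizer = AutoTokenizer.from_pretrained("AIDC-AI/Marco-LLM-GLO")

# Compare segmentation for a high-resource and a low-resource language.
samples = {
    "English": "Hello, how are you?",
    "Kazakh": "Сәлем, қалайсыз?",
}
for lang, text in samples.items():
    ids = tokenizer(text)["input_ids"]
    print(f"{lang}: {len(ids)} tokens -> {tokenizer.convert_ids_to_tokens(ids)}")
```

Fewer tokens per sentence in the low-resource language generally indicates more efficient coverage of that language's script and vocabulary.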
Usage
We do not advise using the base language model directly for text generation. Instead, apply post-training methods such as Supervised Fine-tuning (SFT), Reinforcement Learning from Human Feedback (RLHF), or continued pretraining to adapt the model to specific use cases.
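As a minimal sketch of that workflow (assuming the `transformers` and `accelerate` libraries; dtype and device placement here are assumptions about your hardware, not requirements of the model), the base model can be loaded as a starting point for post-training rather than called for generation directly:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "AIDC-AI/Marco-LLM-GLO"  # base model; intended as a starting point for SFT/DPO

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumption: bf16 weights fit your hardware
    device_map="auto",           # requires accelerate; places layers automatically
)

# From here, plug `model` and `tokenizer` into your preferred post-training
# pipeline (e.g., an SFT trainer over instruction data, or a DPO trainer
# over preference pairs) rather than calling model.generate() directly.
```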
Citation
If you find our work helpful, please consider citing it:
@article{marco-llm-2024,
  title={Marco-LLM: Bridging Languages via Massive Multilingual Training for Cross-Lingual Enhancement},
  journal={arXiv preprint arXiv:2412.04003},
  year={2024},
  url={https://arxiv.org/abs/2412.04003}
}
Model tree for AIDC-AI/Marco-LLM-GLO
Base model: Qwen/Qwen2-7B