
Marco-LLM-GLO

Introduction

Marco-LLM is a series of advanced multilingual language models designed to bridge the performance gap between high-resource and low-resource languages. This repository contains the Marco-LLM base language model with 7 billion parameters.

The model has undergone extensive multilingual continual pretraining on a diverse dataset containing over 5 trillion tokens, with a particular focus on enhancing performance in low-resource languages while maintaining strong capabilities in high-resource languages like English and Chinese.

Compared to state-of-the-art open-source language models, Marco-LLM demonstrates significant improvements in multilingual tasks, including machine translation, question answering, and reasoning across multiple languages. For more details, please refer to our Hugging Face page.

Model Details

Marco-LLM includes a 7B parameter model based on the Transformer architecture. The key features of Marco-LLM are:

  • Multilingual Training: The model is trained on a large-scale multilingual dataset covering 29 languages, including both high-resource languages (e.g., English, Chinese) and low-resource languages (e.g., Kazakh, Nepali).

  • Enhanced Tokenizer: An improved tokenizer is used to better handle multilingual data, ensuring higher efficiency and accuracy in tokenization (see the sketch after this list).

  • Post-Training: Marco-LLM supports various post-training methods, such as Supervised Fine-tuning (SFT) and Direct Preference Optimization (DPO), to further enhance performance for specific tasks and languages.
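
As a quick illustration of the multilingual tokenizer, the sketch below loads it with the Hugging Face transformers library and tokenizes sample sentences in a high-resource and a low-resource language. This is a minimal sketch assuming the standard AutoTokenizer API; the example sentences are arbitrary.

```python
from transformers import AutoTokenizer

# Load the tokenizer shipped with this repository (standard transformers API assumed).
tokenizer = AutoTokenizer.from_pretrained("AIDC-AI/Marco-LLM-GLO")

# Arbitrary example sentences in a high-resource (English) and a low-resource (Nepali) language.
samples = {
    "English": "Marco-LLM is a multilingual language model.",
    "Nepali": "माछापुच्छ्रे नेपालको एक प्रसिद्ध हिमाल हो।",
}

for language, text in samples.items():
    tokens = tokenizer.tokenize(text)
    print(f"{language}: {len(tokens)} tokens")
```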

Usage

Using this base language model directly for text generation is not advised. Instead, apply post-training methods such as Supervised Fine-tuning (SFT), Reinforcement Learning from Human Feedback (RLHF), or continued pretraining to adapt the model to specific use cases.
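
For reference, here is a minimal sketch of loading the base model and tokenizer as a starting point for post-training, assuming the standard transformers AutoModelForCausalLM API and bfloat16 weights (the checkpoint is stored in BF16).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "AIDC-AI/Marco-LLM-GLO"

# Load the tokenizer and the BF16 base checkpoint; device_map="auto" requires the accelerate package.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# From here, pass `model` and `tokenizer` into your SFT/DPO pipeline
# (e.g. the Hugging Face Trainer or the trl library) rather than calling generate() directly.
```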

Citation

If you find our work helpful, please cite it as follows.

@article{unique_identifier,
  title={Marco-LLM: Bridging Languages via Massive Multilingual Training for Cross-Lingual Enhancement},
  journal={arXiv},
  number={2412.04003},
  year={2024},
  url={https://arxiv.org/abs/2412.04003}
}

Model tree for AIDC-AI/Marco-LLM-GLO

Base model: Qwen/Qwen2-7B