---
license: apache-2.0
datasets:
- faur-ai/fulg
language:
- ro
---
# LLMic Model Card
[LLMic: Romanian Foundation Language Model](https://arxiv.org/abs/2501.07721)
## Model Summary
LLMic is a bilingual Romanian-English foundation model: a 3B-parameter dense
decoder-only Transformer based on the Llama 2 architecture.
## Architecture
| Parameter | Value |
|-----------|---------|
| Sequence Length | 2048 |
| Number of Layers | 24 |
| Embedding Size | 2,560 |
| FFN Hidden Size | 10,240 |
| Number of Heads | 20 |
| Number of KV Heads | 5 |
| Activation Function | SiLU |
| Position Encodings | RoPE (Θ=500,000) |
| Layer Norm | RMSNorm (ε=10⁻⁵) |
| Tied Embeddings | No |
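For orientation, the hyperparameters above map onto a standard Llama-style configuration in `transformers`. The snippet below is a sketch reconstructed from the table, not the official config shipped with the checkpoint:

```python
from transformers import LlamaConfig

# Hypothetical reconstruction of the architecture from the table above;
# the authoritative config is distributed with the faur-ai/LLMic checkpoint.
config = LlamaConfig(
    hidden_size=2560,              # Embedding Size
    intermediate_size=10240,       # FFN Hidden Size
    num_hidden_layers=24,          # Number of Layers
    num_attention_heads=20,        # Number of Heads
    num_key_value_heads=5,         # Grouped-query attention KV heads
    max_position_embeddings=2048,  # Sequence Length
    rope_theta=500000.0,           # RoPE Θ
    rms_norm_eps=1e-5,             # RMSNorm ε
    hidden_act="silu",             # SiLU activation
    tie_word_embeddings=False,     # Tied Embeddings: No
)
```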
## Intended Use
Our model is designed to accelerate research on Romanian language models, serving as a building block for generative AI applications.
## Use with transformers
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer

device = "cuda"
model_id = "faur-ai/LLMic"
prompt = "Capitala României este"  # "The capital of Romania is"

# Load the model and tokenizer from the Hugging Face Hub.
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Stream generated tokens to stdout as they are produced.
streamer = TextStreamer(tokenizer)

inputs = tokenizer.encode(
    prompt,
    add_special_tokens=False,
    return_tensors="pt",
).to(device)

outputs = model.generate(
    input_ids=inputs,
    streamer=streamer,
    temperature=0.8,
    do_sample=True,
)
```
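With `do_sample=True` and `temperature=0.8`, the completion is sampled and will vary between runs; pass `do_sample=False` for greedy decoding, or set `max_new_tokens` to bound the length of the generated continuation.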
## Data Overview
### Training Datasets
| Source | Size |
|---------|------|
| *Romanian (300B tokens)* | |
| Web Sources | 621 GB |
| Discussions, Curated & Parallel | 10 GB |
| *English (700B tokens)* | |
| FineWebEdu | -- |
| Dolma Subset | 109 GB |
#### Benchmark datasets
We evaluated LLMic on the WMT16 English-to-Romanian translation benchmark; the scores below are BLEU.
| Model | BLEU |
|--------|--------|
| LLMic | 41.01 |
| mBART | 38.50 |
| Llama-3.1-8B-Instruct | 29.02 |
| RoMistral-7b-Instruct | 27.70 |
| RoLlama3-8b-Instruct | 27.31 |
| Mistral-7B-Instruct-v0.2 | 26.19 |
| RoGemma-7b-Instruct | 25.96 |
| Gemma-1.1-7b-it | 25.48 |
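As an illustration only, the sketch below shows one way an English-to-Romanian BLEU score can be computed with the `datasets` and `sacrebleu` libraries; the prompt template, decoding settings, and sample size are assumptions, not the evaluation protocol used in the paper.

```python
# Hypothetical BLEU evaluation sketch on WMT16 En-Ro; the prompt template
# and decoding settings are assumptions, not the paper's exact protocol.
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
import sacrebleu

device = "cuda"
model_id = "faur-ai/LLMic"
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)

test = load_dataset("wmt16", "ro-en", split="test")

hypotheses, references = [], []
for example in test.select(range(100)):  # small sample for illustration
    src = example["translation"]["en"]
    ref = example["translation"]["ro"]
    # Illustrative translation prompt; the real setup may differ.
    prompt = f"English: {src}\nRomanian:"
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    # Decode only the newly generated tokens, dropping the prompt.
    completion = tokenizer.decode(
        out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    hypotheses.append(completion.strip().split("\n")[0])
    references.append(ref)

print(sacrebleu.corpus_bleu(hypotheses, [references]).score)
```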
## Citation
**BibTeX:**
```bibtex
@misc{bădoiu2025llmicromanianfoundationlanguage,
title={LLMic: Romanian Foundation Language Model},
author={Vlad-Andrei Bădoiu and Mihai-Valentin Dumitru and Alexandru M. Gherghescu and Alexandru Agache and Costin Raiciu},
year={2025},
eprint={2501.07721},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2501.07721},
}
```