---

license: apache-2.0
datasets:
- faur-ai/fulg
language:
- ro
---


# LLMic Model Card

[LLMic: Romanian Foundation Language Model](https://arxiv.org/abs/2501.07721)

## Model Summary

LLMic is a bilingual Romanian-English foundation model: a 3B-parameter dense
decoder-only Transformer based on the Llama2 architecture.

## Architecture

| Parameter | Value |
|-----------|---------|
| Sequence Length | 2048 |
| Number of Layers | 24 |
| Embedding Size | 2,560 |
| FFN Hidden Size | 10,240 |
| Number of Heads | 20 |
| Number of KV Heads | 5 |
| Activation Function | SiLU |
| Position Encodings | RoPE (Θ=500,000) |
| Layer Norm | RMSNorm (ε=10⁻⁵) |
| Tied Embeddings | No |
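
Because the model follows the Llama2 design, the table above maps directly onto a Llama-style configuration in `transformers`. The snippet below is only an illustrative sketch of that mapping; the authoritative values are the ones in the checkpoint's `config.json`.

```python
from transformers import LlamaConfig

# Illustrative mapping of the architecture table onto a Llama-style config.
# Not the official configuration shipped with the checkpoint.
config = LlamaConfig(
    max_position_embeddings=2048,  # sequence length
    num_hidden_layers=24,
    hidden_size=2560,              # embedding size
    intermediate_size=10240,       # FFN hidden size
    num_attention_heads=20,
    num_key_value_heads=5,         # grouped-query attention: 20 heads, 5 KV heads
    hidden_act="silu",
    rope_theta=500000.0,
    rms_norm_eps=1e-5,
    tie_word_embeddings=False,
)
print(config)
```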

## Intended Use

Our model is designed to accelerate research on Romanian language models, serving as a building block for generative AI applications.

## Use with transformers

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer

device = "cuda"
model_id = "faur-ai/LLMic"
prompt = "Capitala României este"

# Load the model and tokenizer from the Hugging Face Hub.
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)
streamer = TextStreamer(tokenizer)  # prints tokens to stdout as they are generated

# Tokenize the prompt.
inputs = tokenizer.encode(
    prompt,
    add_special_tokens=False,
    return_tensors='pt',
).to(device)

# Sample a continuation and stream it while generating.
outputs = model.generate(
    streamer=streamer,
    input_ids=inputs,
    temperature=0.8,
    do_sample=True,
)
```
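
For quick experiments, the high-level `pipeline` API also works; the generation settings below are illustrative, not tuned recommendations.

```python
from transformers import pipeline

# Illustrative only: sampling parameters are not official recommendations.
generator = pipeline("text-generation", model="faur-ai/LLMic", device=0)
result = generator(
    "Capitala României este",
    max_new_tokens=50,
    do_sample=True,
    temperature=0.8,
)
print(result[0]["generated_text"])
```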

## Data Overview 

### Training Datasets 

| Source | Size |
|---------|------|
| *Romanian (300B)* | |
| Web Sources | 621 GB |
| Discussions, Curated & Parallel | 10 GB |
| *English (700B)* | |
| FineWebEdu | -- |
| Dolma Subset | 109 GB |
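
The card metadata lists the `faur-ai/fulg` corpus among the training data. A minimal sketch for inspecting it with `datasets` follows; the split name and record schema are assumptions, so check the dataset card for the exact layout.

```python
from datasets import load_dataset

# Stream a few records instead of downloading the full corpus.
# The "train" split and field names are assumptions; see the dataset card.
fulg = load_dataset("faur-ai/fulg", split="train", streaming=True)
for example in fulg.take(3):
    print(example)
```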

#### Benchmark datasets

We evaluated LLMic on the WMT16 English-to-Romanian machine translation benchmark.

| Model | Score |
|--------|--------|
| LLMic | 41.01 |
| mBART | 38.50 |
| Llama-3.1-8B-Instruct | 29.02 |
| RoMistral-7b-Instruct | 27.70 |
| RoLlama3-8b-Instruct | 27.31 |
| Mistral-7B-Instruct-v0.2 | 26.19 |
| RoGemma-7b-Instruct | 25.96 |
| Gemma-1.1-7b-it | 25.48 |
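
For reference, here is a minimal sketch of how an English-to-Romanian evaluation on WMT16 could be run with `datasets` and `sacrebleu`. The zero-shot prompt format below is an assumption and likely differs from the exact protocol used in the paper.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
import sacrebleu

model_id = "faur-ai/LLMic"
model = AutoModelForCausalLM.from_pretrained(model_id).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# WMT16 English-Romanian test split.
test = load_dataset("wmt16", "ro-en", split="test")

hypotheses, references = [], []
for pair in test.select(range(100)):  # small subset for a quick check
    # Assumed prompt format; the paper's evaluation setup may differ.
    prompt = f"Translate to Romanian: {pair['translation']['en']}\n"
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    new_tokens = out[0][inputs["input_ids"].shape[1]:]
    hypotheses.append(tokenizer.decode(new_tokens, skip_special_tokens=True))
    references.append(pair["translation"]["ro"])

print(sacrebleu.corpus_bleu(hypotheses, [references]).score)
```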


## Citation

**BibTeX:**

```
@misc{bădoiu2025llmicromanianfoundationlanguage,
      title={LLMic: Romanian Foundation Language Model},
      author={Vlad-Andrei Bădoiu and Mihai-Valentin Dumitru and Alexandru M. Gherghescu and Alexandru Agache and Costin Raiciu},
      year={2025},
      eprint={2501.07721},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.07721},
}
```