---
library_name: transformers
license: mit
datasets:
- OLAIR/Open-R1-Ko-SFT-v2.0
language:
- ko
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-14B
---


## 1. Overview

**Model Name:** OLAIR/ko-r1-14b-v2.0.3  
**Model Type:** Large Language Model (LLM) for Korean language understanding and reasoning  
**Version:** 2.0.3

This model is designed to provide Korean language capabilities with a focus on reasoning tasks. It is a scaled-up version of [OLAIR/ko-r1-7b-v2.0.3](https://huggingface.co./OLAIR/ko-r1-7b-v2.0.3) for experimental purposes.
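
The model can be loaded with the standard Hugging Face Transformers text-generation workflow. The snippet below is a minimal sketch under that assumption; the example prompt and sampling settings are illustrative only, not official recommendations.

```python
# Minimal usage sketch (assumes transformers + accelerate are installed).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "OLAIR/ko-r1-14b-v2.0.3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Illustrative Korean reasoning prompt ("What number comes after 3, 6, 12, 24? Show your work.").
messages = [{"role": "user", "content": "3, 6, 12, 24 다음에 올 수는 무엇인가요? 풀이 과정을 보여주세요."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=1024, do_sample=True, temperature=0.6)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```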


## 2. Benchmark Performance

The model's performance has been evaluated using the HAE-RAE Reasoning Challenge (HRC), which measures reasoning abilities across five domains.


| Model                                 | Chemistry | Math   | Physics | Physics Word Puzzles | Puzzles | Average |
|---------------------------------------|-----------|--------|---------|----------------------|---------|---------|
| o1-2024-12-17                         | 57.14     | 78.18  | 77.78   | 80.00                | 84.62   | 75.54   |
| o3-mini-high                          | 57.14     | 81.82  | 77.78   | 70.00                | 69.23   | 71.19   |
| o3-mini-2025-01-31                     | 50.00     | 80.00  | 70.37   | 50.00                | 76.92   | 65.46   |
| o1-mini-2024-09-12                     | 42.86     | 56.36  | 70.37   | 60.00                | 15.38   | 48.99   |
| Deepseek-R1                           | 50.00     | 54.55  | 62.96   | 70.00                | 7.69    | 49.04   |
| gpt-4o-2024-11-20                      | 35.71     | 32.73  | 51.85   | 50.00                | 53.85   | 44.83   |
| **Ko-R1-14B-v2.0.3**                  | 28.57     | 50.90  | 48.14   | 30.00                | 30.77   | 37.68   |
| Exaone-3.5-32B-Instruct               | 21.43     | 30.91  | 25.93   | 50.00                | 38.46   | 33.35   |
| Qwen2.5-72B-Instruct                  | 35.71     | 30.91  | 51.85   | 20.00                | 23.08   | 32.31   |
| **Ko-R1-7B-v2.0.3**                   | 7.14      | 61.82  | 40.74   | 40.00                | 0.00    | 29.94   |
| Ko-R1-7B-v1                           | 7.14      | 63.64  | 37.04   | 40.00                | 0.00    | 29.56   |
| gpt-4o-mini-2024-07-18                 | 21.43     | 29.09  | 37.04   | 50.00                | 0.00    | 27.51   |
| UNIVA-Bllossom_DeepSeek-llama3.1-Bllossom-8B | 28.57     | 16.36  | 33.33   | 10.00                | 15.38   | 20.73   |


**Comparison of Average Scores**
Ko-R1-14B-v2.0.3 outperforms considerably larger non-reasoning models such as Exaone-3.5-32B-Instruct, Qwen2.5-72B-Instruct, and gpt-4o-mini-2024-07-18.
<p align="center"><img src="https://cdn-uploads.huggingface.co/production/uploads/60d3e619b8448e1785bbda2a/GA8UI4jr4GlSRrFJBa8nl.png" width="525"/></p>

**Training Rewards**
Even when trained on the same dataset, the 14B model reaches noticeably higher training rewards than the 7B model, suggesting that larger models learn more from identical data.
<p align="center"><img src="https://cdn-uploads.huggingface.co/production/uploads/60d3e619b8448e1785bbda2a/XAeOlEBnMSp9pQP0I1IU7.png" width="525"/></p>



## 3. Limitations

- The model can still fall into endless reasoning loops on certain Korean inputs. We are working on a fix; a possible mitigation is sketched below.
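
Until this is resolved, one practical mitigation (our suggestion, not an official workaround) is to bound the reasoning budget at generation time, reusing `model`, `tokenizer`, and `inputs` from the usage sketch above:

```python
# Sketch of a mitigation (assumption): cap generated tokens and discourage loops.
outputs = model.generate(
    inputs,
    max_new_tokens=2048,                    # hard cap on the reasoning budget
    eos_token_id=tokenizer.eos_token_id,    # stop cleanly at end-of-turn
    repetition_penalty=1.1,                 # gently penalize repetitive text
)
```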

## 4. ETC
**How to Cite**

```
To be added
```

**Contact**
```
[email protected]
```