1. Overview

Model Name: OLAIR/ko-r1-14b-v2.0.3
Model Type: Large Language Model (LLM) for Korean language understanding and reasoning
Version: 2.0.3

This model is designed to provide Korean language capabilities with a focus on reasoning tasks. It is an experimental scaled-up version of OLAIR/ko-r1-7b-v2.0.3.
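
The card does not include a usage snippet. A minimal sketch with the `transformers` library, assuming the model ships a standard chat template on the Hub (an assumption, not confirmed by the card), might look like:

```python
# Hypothetical usage sketch for OLAIR/ko-r1-14b-v2.0.3.
# Assumes the model follows the standard transformers causal-LM + chat-template
# interface; the exact prompt format is not documented in the card.

MODEL_ID = "OLAIR/ko-r1-14b-v2.0.3"

def build_messages(question: str) -> list[dict]:
    """Wrap a user question in the chat-message format used by apply_chat_template."""
    return [{"role": "user", "content": question}]

if __name__ == "__main__":
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    messages = build_messages("서울에서 부산까지의 거리를 추론해 보세요.")
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    # Reasoning models emit long chains of thought; leave generous headroom.
    outputs = model.generate(inputs, max_new_tokens=2048)
    print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```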

2. Benchmark Performance

The model's performance has been evaluated using the HAE-RAE Reasoning Challenge (HRC), which measures reasoning abilities across five domains.

| Model | Chemistry | Math | Physics | Word Puzzles | Puzzles | Average |
|---|---|---|---|---|---|---|
| o1-2024-12-17 | 57.14 | 78.18 | 77.78 | 80.00 | 84.62 | 75.54 |
| o3-mini-high | 57.14 | 81.82 | 77.78 | 70.00 | 69.23 | 71.19 |
| o3-mini-2025-01-31 | 50.00 | 80.00 | 70.37 | 50.00 | 76.92 | 65.46 |
| o1-mini-2024-09-12 | 42.86 | 56.36 | 70.37 | 60.00 | 15.38 | 48.99 |
| Deepseek-R1 | 50.00 | 54.55 | 62.96 | 70.00 | 7.69 | 49.04 |
| gpt-4o-2024-11-20 | 35.71 | 32.73 | 51.85 | 50.00 | 53.85 | 44.83 |
| Ko-R1-14B-v2.0.3 | 28.57 | 50.90 | 48.14 | 30.00 | 30.77 | 37.68 |
| Exaone-3.5-32B-Instruct | 21.43 | 30.91 | 25.93 | 50.00 | 38.46 | 33.35 |
| Qwen2.5-72B-Instruct | 35.71 | 30.91 | 51.85 | 20.00 | 23.08 | 32.31 |
| Ko-R1-7B-v2.0.3 | 7.14 | 61.82 | 40.74 | 40.00 | 0.00 | 29.94 |
| Ko-R1-7B-v1 | 7.14 | 63.64 | 37.04 | 40.00 | 0.00 | 29.56 |
| gpt-4o-mini-2024-07-18 | 21.43 | 29.09 | 37.04 | 50.00 | 0.00 | 27.51 |
| UNIVA-Bllossom_DeepSeek-llama3.1-Bllossom-8B | 28.57 | 16.36 | 33.33 | 10.00 | 15.38 | 20.73 |

Comparison of average scores: Ko-R1-14B-v2.0.3 outperforms substantially larger non-reasoning models such as Exaone-3.5-32B-Instruct, Qwen2.5-72B-Instruct, and gpt-4o-mini-2024-07-18.

Training rewards: Even when trained on the same datasets, the 14B model attains higher training rewards than its 7B counterpart, suggesting that larger models extract more from the same data.

3. Limitations

  • The model is still vulnerable to certain Korean-language inputs, which can trigger endless reasoning loops where the model fails to terminate its chain of thought. A fix is in progress.
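
Until that fix lands, one common workaround (our suggestion, not prescribed by the authors) is to cap the reasoning segment in post-processing. The sketch below assumes the model emits DeepSeek-R1-style `<think>...</think>` tags, which the card does not explicitly confirm:

```python
# Hypothetical mitigation for runaway reasoning loops: hard-cap the length of
# an unterminated <think> block so decoding can proceed to the final answer.
# The <think>/</think> tag convention is assumed (DeepSeek-R1 style), not
# documented in the model card.

def cap_thinking(text: str, max_think_chars: int = 4000) -> str:
    """Force-close a <think> section that exceeds the character budget."""
    start = text.find("<think>")
    if start == -1:
        return text  # no reasoning block present
    end = text.find("</think>", start)
    if end != -1 and end - start <= max_think_chars:
        return text  # reasoning terminated within budget; leave untouched
    # Truncate the runaway reasoning at the budget and close the tag.
    head = text[: start + len("<think>") + max_think_chars]
    return head + "\n</think>"
```

In a serving stack the same idea is usually enforced earlier, via a `max_new_tokens` limit or a stop-string on `</think>` during generation.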

4. Etc.

How to Cite

To be added

Contact

[email protected]