---
license: apache-2.0
language:
- ru
- en
base_model:
- jinaai/jina-embeddings-v3
---

## **JinaJudge: Proxy Judgement for Russian LLM Arena**

### **Description**
This model is trained to replicate the judgement patterns of GPT-4-1106-Preview in the [Russian LLM Arena](https://huggingface.co./spaces/Vikhrmodels/arenahardlb), enabling faster and more cost-effective evaluation of language models. While the focus is on Russian LLM evaluation, the model can also be used for English-centric models.

---

### **Model Details**
This is a small upgrade to the [kaleinaNyan/jina-v3-rullmarena-judge](https://huggingface.co./kaleinaNyan/jina-v3-rullmarena-judge) model:
- The number of decoder blocks was increased from 4 to 5.
- The hidden activation dimensionality was reduced from 1024 to 512 (via a projection layer after the JINA encoder).
- The resulting model size dropped from 614M to 589M parameters.
- I also tweaked some training hyperparameters, but the training data composition is the same.

Surprisingly, these changes gave a tangible performance improvement, so I decided to upload the model. As it turned out (after evaluation on the train set), the previous model was not expressive enough.

---

### **Evaluation**
The validation process was based on **existing judgements** from the Russian LLM Arena. These judgements were filtered and simplified to match the three-class structure used in training.

NOTE: values in parentheses show the improvement, in percentage points, over the previous model.

**Models evaluated**:
- **gemma-2-9b-it-sppo-iter3**
- **glm-4-9b-chat**
- **gpt-3.5-turbo-1106**
- **mistral-7b-instruct-v0.3**
- **storm-7b**

**Validation Performance**:
- **Accuracy**: 80.76% (+2.67)
- **Precision**: 78.56% (+2.74)
- **Recall**: 79.48% (+2.71)
- **F1-score**: 79.00% (+2.73)

For the **test** phase, new judgements were generated using GPT-4 for the `kolibri-mistral-0427-upd` model.

**Test Performance**:
- **Accuracy**: 82.72% (+2.64)
- **Precision**: 80.11% (+3.43)
- **Recall**: 82.42% (+4.69)
- **F1-score**: 81.18% (+4.10)

---

### **Usage Example**

```python
from transformers import AutoModel

jina = AutoModel.from_pretrained("kaleinaNyan/jina-v3-rullmarena-judge-300924", trust_remote_code=True)

prompt_template = """
{user_prompt}

{assistant_a}

{assistant_b}
""".strip()

user_prompt = "your prompt"
assistant_a = "assistant a response"
assistant_b = "assistant b response"

example = prompt_template.format(
    user_prompt=user_prompt,
    assistant_a=assistant_a,
    assistant_b=assistant_b,
)

# The model outputs one score per class: 0 = A wins, 1 = tie, 2 = B wins.
judgement = int(jina([example])[0].argmax())

judgement_map = {
    0: "A is better than B",
    1: "A == B",
    2: "B is better than A"
}

print(judgement_map[judgement])
```
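The snippet above judges a single pair. To compare two models over a whole prompt set, you can batch the comparisons and aggregate the three-way verdicts. Below is a minimal sketch built on the variables from the usage example; the `pairs` list is hypothetical, and it assumes the judge accepts a list of prompts (as in the single-example call above).

```python
from collections import Counter

# Hypothetical (prompt, answer_a, answer_b) triples; in practice these would
# be responses from the two models being compared on the same prompts.
pairs = [
    ("What is 2 + 2?", "4", "The answer is 4."),
    # ...
]

examples = [
    prompt_template.format(user_prompt=p, assistant_a=a, assistant_b=b)
    for p, a, b in pairs
]

# One three-way verdict per pair: 0 = A wins, 1 = tie, 2 = B wins.
verdicts = [int(logits.argmax()) for logits in jina(examples)]
counts = Counter(verdicts)

total = len(verdicts)
print(f"A wins: {counts[0] / total:.1%}, "
      f"ties: {counts[1] / total:.1%}, "
      f"B wins: {counts[2] / total:.1%}")
```

---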
### **Generated ranking**
The ranking was obtained using a modified version of the [Russian LLM Arena code](https://github.com/oKatanaaa/ru_llm_arena). All judgements were regenerated using the jina-judge model. A simplified sketch of the scoring procedure follows the table.

| Model                                | Score | 95% CI       | Average #Tokens |
|--------------------------------------|-------|--------------|-----------------|
| gpt-4-1106-preview                   | 81.6  | (-2.3, 3.0)  | 541             |
| gpt-4o-mini                          | 76.0  | (-2.7, 2.4)  | 448             |
| qwen-2.5-72b-it                      | 72.5  | (-3.6, 3.6)  | 557             |
| gemma-2-9b-it-sppo-iter3             | 72.1  | (-3.7, 3.6)  | 569             |
| gemma-2-27b-it                       | 71.1  | (-3.3, 3.2)  | 482             |
| gemma-2-9b-it                        | 70.8  | (-3.4, 3.5)  | 569             |
| t-lite-instruct-0.1                  | 68.3  | (-3.8, 4.5)  | 810             |
| suzume-llama-3-8b-multilingual-orpo  | 62.9  | (-3.9, 4.0)  | 682             |
| glm-4-9b-chat                        | 60.5  | (-3.9, 4.0)  | 516             |
| sfr-iterative-dpo-llama-3-8b-r       | 59.9  | (-4.0, 4.3)  | 682             |
| c4ai-command-r-v01                   | 56.9  | (-4.2, 3.8)  | 516             |
| phi-3-medium-4k-instruct             | 56.4  | (-2.8, 3.3)  | 566             |
| mistral-nemo-instruct-2407           | 56.1  | (-2.9, 3.4)  | 682             |
| yandex_gpt_pro                       | 51.7  | (-3.4, 3.4)  | 345             |
| suzume-llama-3-8b-multilingual       | 51.3  | (-3.4, 4.0)  | 489             |
| hermes-2-theta-llama-3-8b            | 50.9  | (-3.2, 3.4)  | 485             |
| starling-lm-7b-beta                  | 50.2  | (-3.3, 3.4)  | 495             |
| gpt-3.5-turbo-0125                   | 50.0  | (0.0, 0.0)   | 220             |
| llama-3-instruct-8b-sppo-iter3       | 49.8  | (-3.4, 4.0)  | 763             |
| llama-3-8b-saiga-suzume-ties         | 48.2  | (-4.1, 3.9)  | 569             |
| llama-3-smaug-8b                     | 46.6  | (-3.9, 3.8)  | 763             |
| vikhr-it-5.4-fp16-orpo-v2            | 46.6  | (-3.7, 4.0)  | 379             |
| aya-23-8b                            | 46.3  | (-3.8, 3.9)  | 571             |
| saiga-llama3-8b_v6                   | 45.5  | (-3.8, 3.9)  | 471             |
| vikhr-it-5.2-fp16-cp                 | 43.8  | (-3.9, 4.0)  | 543             |
| qwen2-7b-instruct                    | 43.7  | (-2.5, 2.7)  | 492             |
| openchat-3.5-0106                    | 43.4  | (-3.3, 3.7)  | 485             |
| gpt-3.5-turbo-1106                   | 41.7  | (-2.9, 3.5)  | 220             |
| kolibri-mistral-0427-upd             | 41.5  | (-3.2, 3.5)  | 551             |
| paralex-llama-3-8b-sft               | 40.6  | (-3.8, 3.3)  | 688             |
| mistral-7b-instruct-v0.3             | 40.3  | (-3.3, 3.4)  | 469             |
| llama-3-instruct-8b-simpo            | 40.2  | (-2.9, 3.7)  | 551             |
| gigachat_pro                         | 40.2  | (-3.2, 3.5)  | 294             |
| hermes-2-pro-llama-3-8b              | 39.5  | (-2.9, 3.4)  | 689             |
| vikhr-it-5.3-fp16-32k                | 39.5  | (-2.8, 3.2)  | 519             |
| openchat-3.6-8b-20240522             | 37.7  | (-3.3, 3.7)  | 409             |
| meta-llama-3-8b-instruct             | 37.5  | (-3.1, 3.5)  | 450             |
| kolibri-vikhr-mistral-0427           | 37.1  | (-3.1, 3.8)  | 488             |
| neural-chat-v3.3                     | 36.5  | (-2.7, 3.6)  | 523             |
| vikhr-it-5.1-fp16                    | 36.4  | (-3.5, 3.5)  | 448             |
| gigachat-lite                        | 36.0  | (-2.8, 3.0)  | 523             |
| saiga-7b                             | 25.9  | (-3.1, 3.7)  | 927             |
| storm-7b                             | 25.1  | (-3.6, 4.1)  | 419             |
| snorkel-mistral-pairrm-dpo           | 16.5  | (-3.8, 3.2)  | 773             |
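The scores above are arena-hard-style win rates against a fixed baseline (gpt-3.5-turbo-0125, pinned at 50.0 with a zero-width CI in the table): pairwise judgements are fit with a Bradley-Terry model, and each displayed score is the predicted win rate versus that baseline. Below is a minimal sketch of that idea; the function, model names, and toy battles are illustrative assumptions, not the actual scoring code, which lives in the linked ru_llm_arena repository (and also bootstraps the confidence intervals).

```python
import numpy as np

def bradley_terry_ratings(battles, models, lr=0.1, steps=2000):
    """Fit Bradley-Terry ratings by gradient ascent on pairwise outcomes.

    battles: iterable of (model_a, model_b, outcome), where outcome is
    1.0 if A won, 0.5 for a tie, and 0.0 if B won.
    """
    idx = {m: i for i, m in enumerate(models)}
    ratings = np.zeros(len(models))
    for _ in range(steps):
        grad = np.zeros_like(ratings)
        for a, b, outcome in battles:
            # P(A beats B) under the current ratings (logistic model).
            p = 1.0 / (1.0 + np.exp(ratings[idx[b]] - ratings[idx[a]]))
            grad[idx[a]] += outcome - p
            grad[idx[b]] -= outcome - p
        ratings += lr * grad / len(battles)
    return {m: ratings[idx[m]] for m in models}

# Toy battles; in practice each entry is one jina-judge verdict.
battles = [
    ("model_x", "baseline", 1.0),
    ("model_x", "baseline", 0.5),
    ("model_y", "baseline", 0.0),
    ("model_x", "model_y", 1.0),
]
ratings = bradley_terry_ratings(battles, ["baseline", "model_x", "model_y"])

# Displayed score = predicted win rate (in %) against the baseline;
# the baseline itself lands at exactly 50.0.
for m, r in ratings.items():
    score = 100.0 / (1.0 + np.exp(ratings["baseline"] - r))
    print(f"{m}: {score:.1f}")
```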