	---
	license: apache-2.0
	language:
	- ru
	- en
	base_model:
	- jinaai/jina-embeddings-v3
	---

	## JinaJudge: Proxy Judgement for Russian LLM Arena

	### Description
	This model is trained to replicate the judgement patterns of GPT-4-1106-Preview in the [Russian LLM Arena](https://huggingface.co/spaces/Vikhrmodels/arenahardlb), providing a faster and more cost-effective way to evaluate language models. Although it focuses on Russian LLM evaluation, it can also be used for English-centric models.

	---

	### Model Details

	This is a small upgrade to the [kaleinaNyan/jina-v3-rullmarena-judge](https://huggingface.co/kaleinaNyan/jina-v3-rullmarena-judge) model:
	- The number of decoder blocks increased from 4 to 5.
	- The hidden activation dimensionality was reduced from 1024 to 512 (via a projection layer after the Jina encoder).
	- The resulting model size went from 614M to 589M parameters.
	- I also tweaked some training hyperparameters, but the training data composition is the same.

	Surprisingly, these changes gave a tangible performance improvement, so I decided to upload the model. As it turned out (after evaluation on the training set), the previous model was not expressive enough.

	---

	### Evaluation
	Validation was based on existing judgements from the Russian LLM Arena. These judgements were filtered and simplified to match the three-class structure used in training.
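
The filtering and simplification step can be sketched as below. This is a minimal illustration, not the author's preprocessing code: the five-level verdict labels (`A>>B`, `A>B`, `A=B`, `B>A`, `B>>A`) are assumed to follow the Arena-Hard convention, the helper names `collapse_verdict` and `filter_and_simplify` are hypothetical, and the class indices mirror the `judgement_map` from the usage example.

```python
from typing import Optional

# Assumed five-level verdicts (Arena-Hard style) mapped to three classes:
# 0 = A is better, 1 = tie, 2 = B is better.
VERDICT_TO_CLASS = {
    "A>>B": 0,
    "A>B": 0,
    "A=B": 1,
    "B>A": 2,
    "B>>A": 2,
}


def collapse_verdict(verdict: Optional[str]) -> Optional[int]:
    """Return the three-class label, or None if the verdict is unusable."""
    if verdict is None:
        return None
    return VERDICT_TO_CLASS.get(verdict.strip())


def filter_and_simplify(verdicts):
    """Drop unusable verdicts and collapse the rest to three classes."""
    labels = (collapse_verdict(v) for v in verdicts)
    return [label for label in labels if label is not None]


raw = ["A>>B", "B>A", None, "A=B", "garbage", " B>>A "]
print(filter_and_simplify(raw))  # [0, 2, 1, 2]
```

Missing or malformed verdicts are dropped rather than mapped to a tie, so that parsing failures do not inflate the tie class.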

	NOTE: values in parentheses show the improvement relative to the previous model.

	Models evaluated:
	- gemma-2-9b-it-sppo-iter3
	- glm-4-9b-chat
	- gpt-3.5-turbo-1106
	- mistral-7b-instruct-v0.3
	- storm-7b

	Validation Performance:
	- Accuracy: 80.76% (+2.67)
	- Precision: 78.56% (+2.74)
	- Recall: 79.48% (+2.71)
	- F1-score: 79.00% (+2.73)

	For the test phase, new judgements were generated using GPT-4 for the `kolibri-mistral-0427-upd` model.

	Test Performance:
	- Accuracy: 82.72% (+2.64)
	- Precision: 80.11% (+3.43)
	- Recall: 82.42% (+4.69)
	- F1-score: 81.18% (+4.10)
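
For reference, the four metrics for a three-class judge can be computed as sketched below. The card does not say how precision, recall, and F1 are averaged across classes; macro averaging is an assumption here, and the function name `three_class_metrics` and the toy labels are illustrative only.

```python
def three_class_metrics(y_true, y_pred, num_classes=3):
    """Accuracy plus macro-averaged precision, recall and F1 (assumed averaging)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f1s = [], [], []
    for c in range(num_classes):
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    return {
        "accuracy": sum(1 for t, p in pairs if t == p) / len(pairs),
        "precision": sum(precisions) / num_classes,
        "recall": sum(recalls) / num_classes,
        "f1": sum(f1s) / num_classes,
    }


# Toy labels: 0 = A is better, 1 = tie, 2 = B is better.
reference = [0, 0, 1, 2, 2, 2]   # e.g. GPT-4 judgements
predicted = [0, 1, 1, 2, 2, 0]   # e.g. proxy judge outputs
for name, value in three_class_metrics(reference, predicted).items():
    print(f"{name}: {100 * value:.2f}%")
```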

	---

	### Usage Example

	```python
	from transformers import AutoModel

	jina = AutoModel.from_pretrained("kaleinaNyan/jina-v3-rullmarena-judge-300924", trust_remote_code=True)

	prompt_template = """
	<user prompt>
	{user_prompt}
	<end>
	<assistant A answer>
	{assistant_a}
	<end>
	<assistant B answer>
	{assistant_b}
	<end>
	""".strip()

	user_prompt = "your prompt"
	assistant_a = "assistant a response"
	assistant_b = "assistant b response"

	example = prompt_template.format(
	    user_prompt=user_prompt,
	    assistant_a=assistant_a,
	    assistant_b=assistant_b,
	)

	# Take the highest-scoring class for the first (and only) example.
	# .item() converts the argmax result to a plain int for the dict lookup.
	judgement = jina([example])[0].argmax().item()

	judgement_map = {
	    0: "A is better than B",
	    1: "A == B",
	    2: "B is better than A",
	}

	print(judgement_map[judgement])
	```
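
When judging many answer pairs at once, the formatting and decoding steps can be pulled into small helpers. The sketch below is an illustration under stated assumptions: `build_examples` and `decode_scores` are hypothetical names, and the score rows are hard-coded stand-ins so the snippet runs without downloading the model. In real use they would come from calling the model on the formatted examples, as in the example above.

```python
PROMPT_TEMPLATE = """
<user prompt>
{user_prompt}
<end>
<assistant A answer>
{assistant_a}
<end>
<assistant B answer>
{assistant_b}
<end>
""".strip()

JUDGEMENT_MAP = {0: "A is better than B", 1: "A == B", 2: "B is better than A"}


def build_examples(triples):
    """Format (user_prompt, assistant_a, assistant_b) triples for the judge."""
    return [
        PROMPT_TEMPLATE.format(user_prompt=u, assistant_a=a, assistant_b=b)
        for u, a, b in triples
    ]


def decode_scores(score_rows):
    """Map each row of three class scores to its verdict string."""
    verdicts = []
    for row in score_rows:
        scores = [float(s) for s in row]
        verdicts.append(JUDGEMENT_MAP[scores.index(max(scores))])
    return verdicts


triples = [
    ("What is 2 + 2?", "4", "5"),
    ("Name a prime number.", "7", "11"),
]
examples = build_examples(triples)

# Stand-in scores (assumption); replace with the model output for `examples`.
fake_scores = [[2.1, 0.3, -1.0], [0.2, 1.5, 0.1]]
print(decode_scores(fake_scores))  # ['A is better than B', 'A == B']
```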

	---

	### Generated ranking

	The ranking was obtained using a modified [Russian LLM Arena code](https://github.com/oKatanaaa/ru_llm_arena).
	All judgements were regenerated using the jina-judge model.

	| Model | Score | 95% CI | Average #Tokens |
	|--------------------------------------|-------|----------------------|-----------------|
	| gpt-4-1106-preview | 81.6 | (-2.3, 3.0) | 541 |
	| gpt-4.0-mini | 76.0 | (-2.7, 2.4) | 448 |
	| qwen-2.5-72b-it | 72.5 | (-3.6, 3.6) | 557 |
	| gemma-2-9b-it-sppo-iter3 | 72.1 | (-3.7, 3.6) | 569 |
	| gemma-2-27b-it | 71.1 | (-3.3, 3.2) | 482 |
	| gemma-2-9b-it | 70.8 | (-3.4, 3.5) | 569 |
	| t-lite-instruct-0.1 | 68.3 | (-3.8, 4.5) | 810 |
	| suzume-llama-3-8b-multilingual-orpo | 62.9 | (-3.9, 4.0) | 682 |
	| glm-4-9b-chat | 60.5 | (-3.9, 4.0) | 516 |
	| sfr-iterative-dpo-llama-3-8b-r | 59.9 | (-4.0, 4.3) | 682 |
	| c4ai-command-r-v01 | 56.9 | (-4.2, 3.8) | 516 |
	| phi-3-medium-4k-instruct | 56.4 | (-2.8, 3.3) | 566 |
	| mistral-nemo-instruct-2407 | 56.1 | (-2.9, 3.4) | 682 |
	| yandex_gpt_pro | 51.7 | (-3.4, 3.4) | 345 |
	| suzume-llama-3-8b-multilingual | 51.3 | (-3.4, 4.0) | 489 |
	| hermes-2-theta-llama-3-8b | 50.9 | (-3.2, 3.4) | 485 |
	| starling-1m-7b-beta | 50.2 | (-3.3, 3.4) | 495 |
	| gpt-3.5-turbo-0125 | 50.0 | (0.0, 0.0) | 220 |
	| llama-3-instruct-8b-sppo-iter3 | 49.8 | (-3.4, 4.0) | 763 |
	| llama-3-8b-saiga-suzume-ties | 48.2 | (-4.1, 3.9) | 569 |
	| llama-3-smaug-8b | 46.6 | (-3.9, 3.8) | 763 |
	| vikhr-it-5.4-fp16-orpo-v2 | 46.6 | (-3.7, 4.0) | 379 |
	| aya-23-8b | 46.3 | (-3.8, 3.9) | 571 |
	| saiga-llama3-8b_v6 | 45.5 | (-3.8, 3.9) | 471 |
	| vikhr-it-5.2-fp16-cp | 43.8 | (-3.9, 4.0) | 543 |
	| qwen2-7b-instruct | 43.7 | (-2.5, 2.7) | 492 |
	| opencchat-3.5-0106 | 43.4 | (-3.3, 3.7) | 485 |
	| gpt-3.5-turbo-1106 | 41.7 | (-2.9, 3.5) | 220 |
	| kolibri-mistral-0427-upd | 41.5 | (-3.2, 3.5) | 551 |
	| paralex-llama-3-8b-sft | 40.6 | (-3.8, 3.3) | 688 |
	| mistral-7b-instruct-v0.3 | 40.3 | (-3.3, 3.4) | 469 |
	| llama-3-instruct-8b-simpo | 40.2 | (-2.9, 3.7) | 551 |
	| gigachat_pro | 40.2 | (-3.2, 3.5) | 294 |
	| hermes-2-pro-llama-3-8b | 39.5 | (-2.9, 3.4) | 689 |
	| vikhr-it-5.3-fp16-32k | 39.5 | (-2.8, 3.2) | 519 |
	| opencchat-3.6-8b-2204522 | 37.7 | (-3.3, 3.7) | 409 |
	| meta-llama-3-8b-instruct | 37.5 | (-3.1, 3.5) | 450 |
	| kolibri-vikhr-mistral-0427 | 37.1 | (-3.1, 3.8) | 488 |
	| neural-chat-v3.3 | 36.5 | (-2.7, 3.6) | 523 |
	| vikhr-it-5.1-fp16 | 36.4 | (-3.5, 3.5) | 448 |
	| gigachat-lite | 36.0 | (-2.8, 3.0) | 523 |
	| saiga-7b | 25.9 | (-3.1, 3.7) | 927 |
	| storm-7b | 25.1 | (-3.6, 4.1) | 419 |
	| snorkel-mistral-pairrm-dpo | 16.5 | (-3.8, 3.2) | 773 |
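
To show how per-prompt judgements can turn into a score with a confidence interval, here is a simplified sketch. It is not the arena's scoring code: the linked repository may use a different method (for example a Bradley-Terry fit), whereas this sketch uses a plain win rate against a baseline, with ties counted as half a win, and a percentile bootstrap. It also assumes the evaluated model is always assistant A and the baseline is assistant B. The function names, the seed, and the toy judgements are hypothetical.

```python
import random


def win_rate(judgements):
    """Score in percent: 0 = model wins, 1 = tie (half a win), 2 = baseline wins."""
    points = {0: 1.0, 1: 0.5, 2: 0.0}
    return 100.0 * sum(points[j] for j in judgements) / len(judgements)


def bootstrap_ci(judgements, n_rounds=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval, reported as offsets from the point score."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(judgements)
    scores = sorted(
        win_rate([judgements[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_rounds)
    )
    lower = scores[int((alpha / 2) * n_rounds)]
    upper = scores[int((1 - alpha / 2) * n_rounds) - 1]
    point = win_rate(judgements)
    return point, (lower - point, upper - point)


# Toy data: 40 wins, 20 ties, 40 losses against the baseline.
judgements = [0] * 40 + [1] * 20 + [2] * 40
score, (low, high) = bootstrap_ci(judgements)
print(f"{score:.1f} ({low:+.1f}, {high:+.1f})")
```

Reporting the interval as offsets from the point score matches the layout of the 95% CI column in the table.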
