Hermes-2.5-Yi-1.5-9B-Chat

This model is a fine-tuned version of 01-ai/Yi-1.5-9B-Chat on the teknium/OpenHermes-2.5 dataset. I'm very happy with the results. The model now seems a lot smarter and more "aware" in certain situations (this is a first look, so I might change my opinion with more usage). It has quite a big edge on the AGIEval benchmark for models in its class. I plan to extend its context length to 32k with PoSE.

Model Details

Base Model: 01-ai/Yi-1.5-9B-Chat
Chat Template: ChatML
Dataset: teknium/OpenHermes-2.5
Sequence Length: 8192 tokens
Training:
Epochs: 1
Hardware: 4 Nodes x 4 NVIDIA A100 40GB GPUs
Duration: 48:32:13
Cluster: KIT SCC Cluster
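
As a rough sanity check on the training budget, the hardware and duration figures above can be combined into GPU-hours. This is a minimal sketch, assuming the listed duration is wall-clock time in HH:MM:SS and that all 16 GPUs were in use for the whole run; neither assumption is stated in the details above.

```python
# Rough compute budget from the figures listed in Model Details.
# Assumption: "Duration" is wall-clock time (HH:MM:SS) with every GPU busy throughout.
nodes = 4
gpus_per_node = 4
hours, minutes, seconds = (int(part) for part in "48:32:13".split(":"))

total_gpus = nodes * gpus_per_node
wall_clock_hours = hours + minutes / 60 + seconds / 3600
gpu_hours = total_gpus * wall_clock_hours

print(f"{total_gpus} GPUs x {wall_clock_hours:.2f} h = {gpu_hours:.0f} GPU-hours")
# prints: 16 GPUs x 48.54 h = 777 GPU-hours
```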

Benchmark n_shots=0

Benchmark	Score
ARC (Challenge)	52.47%
ARC (Easy)	81.65%
BoolQ	87.22%
HellaSwag	60.52%
OpenBookQA	33.60%
PIQA	81.12%
Winogrande	72.22%
AGIEval	38.46%
TruthfulQA	44.22%
MMLU	59.72%
IFEval	47.96%

For detailed benchmark results, including sub-categories and various metrics, please refer to the full benchmark table at the end of this README.

GGUF and Quantizations

llama.cpp b3166
juvi21/Hermes-2.5-Yi-1.5-9B-Chat-GGUF is available in:
F16, Q8_0, Q6_K, Q5_K_M, Q4_K_M, Q3_K_M, Q2_K

Usage

To use this model, you can load it using the Hugging Face Transformers library:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model weights and the matching tokenizer from the Hugging Face Hub
model_id = "juvi21/Hermes-2.5-Yi-1.5-9B-Chat"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Generate text
input_text = "What is the question to 42?"
inputs = tokenizer(input_text, return_tensors="pt")
# Set an explicit output budget instead of relying on the default generation length
outputs = model.generate(**inputs, max_new_tokens=256)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

ChatML

<|im_start|>system
{system_prompt}<|im_end|>
<|im_start|>user
Knock Knock, who is there?<|im_end|>
<|im_start|>assistant
Hi there! <|im_end|>
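
The template above can be assembled programmatically. The following is a minimal sketch in plain Python; `build_chatml_prompt` is a hypothetical helper (not part of any library), the system prompt is a placeholder, and the newline after each `<|im_end|>` is assumed from the layout shown above.

```python
def build_chatml_prompt(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML string."""
    parts = [
        f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        for message in messages
    ]
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the reply
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},  # placeholder
    {"role": "user", "content": "Knock Knock, who is there?"},
]
prompt = build_chatml_prompt(messages)
print(prompt)
```

If the tokenizer ships with a chat template, `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` from Transformers is the more usual route and should yield an equivalent string; the helper is only meant to make the format explicit.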

License

This model is released under the Apache 2.0 license.

Acknowledgements

Special thanks to:

Teknium for the great OpenHermes-2.5 dataset
01-ai for their great model
KIT SCC for FLOPS

Citation

If you use this model in your research, consider citing it. In any case, definitely cite NousResearch and 01-ai:

@misc{juvi21-hermes-2.5-yi-1.5-9b-chat,
  author = {juvi21},
  title = {Hermes-2.5-Yi-1.5-9B-Chat},
  year = {2024},
}

Full Benchmark Results

Tasks	Version	Filter	Metric		Value		Stderr
agieval	N/A	none	acc	↑	0.5381	±	0.0049
		none	acc_norm	↑	0.5715	±	0.0056
- agieval_aqua_rat	1	none	acc	↑	0.3858	±	0.0306
		none	acc_norm	↑	0.3425	±	0.0298
- agieval_gaokao_biology	1	none	acc	↑	0.6048	±	0.0338
		none	acc_norm	↑	0.6000	±	0.0339
- agieval_gaokao_chemistry	1	none	acc	↑	0.4879	±	0.0348
		none	acc_norm	↑	0.4106	±	0.0343
- agieval_gaokao_chinese	1	none	acc	↑	0.5935	±	0.0314
		none	acc_norm	↑	0.5813	±	0.0315
- agieval_gaokao_english	1	none	acc	↑	0.8235	±	0.0218
		none	acc_norm	↑	0.8431	±	0.0208
- agieval_gaokao_geography	1	none	acc	↑	0.7085	±	0.0323
		none	acc_norm	↑	0.6985	±	0.0326
- agieval_gaokao_history	1	none	acc	↑	0.7830	±	0.0269
		none	acc_norm	↑	0.7660	±	0.0277
- agieval_gaokao_mathcloze	1	none	acc	↑	0.0508	±	0.0203
- agieval_gaokao_mathqa	1	none	acc	↑	0.3761	±	0.0259
		none	acc_norm	↑	0.3590	±	0.0256
- agieval_gaokao_physics	1	none	acc	↑	0.4950	±	0.0354
		none	acc_norm	↑	0.4700	±	0.0354
- agieval_jec_qa_ca	1	none	acc	↑	0.6557	±	0.0150
		none	acc_norm	↑	0.5926	±	0.0156
- agieval_jec_qa_kd	1	none	acc	↑	0.7310	±	0.0140
		none	acc_norm	↑	0.6610	±	0.0150
- agieval_logiqa_en	1	none	acc	↑	0.5177	±	0.0196
		none	acc_norm	↑	0.4839	±	0.0196
- agieval_logiqa_zh	1	none	acc	↑	0.4854	±	0.0196
		none	acc_norm	↑	0.4501	±	0.0195
- agieval_lsat_ar	1	none	acc	↑	0.2913	±	0.0300
		none	acc_norm	↑	0.2696	±	0.0293
- agieval_lsat_lr	1	none	acc	↑	0.7196	±	0.0199
		none	acc_norm	↑	0.6824	±	0.0206
- agieval_lsat_rc	1	none	acc	↑	0.7212	±	0.0274
		none	acc_norm	↑	0.6989	±	0.0280
- agieval_math	1	none	acc	↑	0.0910	±	0.0091
- agieval_sat_en	1	none	acc	↑	0.8204	±	0.0268
		none	acc_norm	↑	0.8301	±	0.0262
- agieval_sat_en_without_passage	1	none	acc	↑	0.5194	±	0.0349
		none	acc_norm	↑	0.4806	±	0.0349
- agieval_sat_math	1	none	acc	↑	0.5864	±	0.0333
		none	acc_norm	↑	0.5409	±	0.0337
arc_challenge	1	none	acc	↑	0.5648	±	0.0145
		none	acc_norm	↑	0.5879	±	0.0144
arc_easy	1	none	acc	↑	0.8241	±	0.0078
		none	acc_norm	↑	0.8165	±	0.0079
boolq	2	none	acc	↑	0.8624	±	0.0060
hellaswag	1	none	acc	↑	0.5901	±	0.0049
		none	acc_norm	↑	0.7767	±	0.0042
ifeval	2	none	inst_level_loose_acc	↑	0.5156	±	N/A
		none	inst_level_strict_acc	↑	0.4748	±	N/A
		none	prompt_level_loose_acc	↑	0.3863	±	0.0210
		none	prompt_level_strict_acc	↑	0.3309	±	0.0202
mmlu	N/A	none	acc	↑	0.6942	±	0.0037
- abstract_algebra	0	none	acc	↑	0.4900	±	0.0502
- anatomy	0	none	acc	↑	0.6815	±	0.0402
- astronomy	0	none	acc	↑	0.7895	±	0.0332
- business_ethics	0	none	acc	↑	0.7600	±	0.0429
- clinical_knowledge	0	none	acc	↑	0.7132	±	0.0278
- college_biology	0	none	acc	↑	0.8056	±	0.0331
- college_chemistry	0	none	acc	↑	0.5300	±	0.0502
- college_computer_science	0	none	acc	↑	0.6500	±	0.0479
- college_mathematics	0	none	acc	↑	0.4100	±	0.0494
- college_medicine	0	none	acc	↑	0.6763	±	0.0357
- college_physics	0	none	acc	↑	0.5000	±	0.0498
- computer_security	0	none	acc	↑	0.8200	±	0.0386
- conceptual_physics	0	none	acc	↑	0.7489	±	0.0283
- econometrics	0	none	acc	↑	0.5877	±	0.0463
- electrical_engineering	0	none	acc	↑	0.6759	±	0.0390
- elementary_mathematics	0	none	acc	↑	0.6481	±	0.0246
- formal_logic	0	none	acc	↑	0.5873	±	0.0440
- global_facts	0	none	acc	↑	0.3900	±	0.0490
- high_school_biology	0	none	acc	↑	0.8613	±	0.0197
- high_school_chemistry	0	none	acc	↑	0.6453	±	0.0337
- high_school_computer_science	0	none	acc	↑	0.8300	±	0.0378
- high_school_european_history	0	none	acc	↑	0.8182	±	0.0301
- high_school_geography	0	none	acc	↑	0.8485	±	0.0255
- high_school_government_and_politics	0	none	acc	↑	0.8964	±	0.0220
- high_school_macroeconomics	0	none	acc	↑	0.7923	±	0.0206
- high_school_mathematics	0	none	acc	↑	0.4407	±	0.0303
- high_school_microeconomics	0	none	acc	↑	0.8655	±	0.0222
- high_school_physics	0	none	acc	↑	0.5298	±	0.0408
- high_school_psychology	0	none	acc	↑	0.8679	±	0.0145
- high_school_statistics	0	none	acc	↑	0.6898	±	0.0315
- high_school_us_history	0	none	acc	↑	0.8873	±	0.0222
- high_school_world_history	0	none	acc	↑	0.8312	±	0.0244
- human_aging	0	none	acc	↑	0.7085	±	0.0305
- human_sexuality	0	none	acc	↑	0.7557	±	0.0377
- humanities	N/A	none	acc	↑	0.6323	±	0.0067
- international_law	0	none	acc	↑	0.8099	±	0.0358
- jurisprudence	0	none	acc	↑	0.7685	±	0.0408
- logical_fallacies	0	none	acc	↑	0.7975	±	0.0316
- machine_learning	0	none	acc	↑	0.5179	±	0.0474
- management	0	none	acc	↑	0.8835	±	0.0318
- marketing	0	none	acc	↑	0.9017	±	0.0195
- medical_genetics	0	none	acc	↑	0.8000	±	0.0402
- miscellaneous	0	none	acc	↑	0.8225	±	0.0137
- moral_disputes	0	none	acc	↑	0.7283	±	0.0239
- moral_scenarios	0	none	acc	↑	0.4860	±	0.0167
- nutrition	0	none	acc	↑	0.7353	±	0.0253
- other	N/A	none	acc	↑	0.7287	±	0.0077
- philosophy	0	none	acc	↑	0.7170	±	0.0256
- prehistory	0	none	acc	↑	0.7346	±	0.0246
- professional_accounting	0	none	acc	↑	0.5638	±	0.0296
- professional_law	0	none	acc	↑	0.5163	±	0.0128
- professional_medicine	0	none	acc	↑	0.6875	±	0.0282
- professional_psychology	0	none	acc	↑	0.7092	±	0.0184
- public_relations	0	none	acc	↑	0.6727	±	0.0449
- security_studies	0	none	acc	↑	0.7347	±	0.0283
- social_sciences	N/A	none	acc	↑	0.7910	±	0.0072
- sociology	0	none	acc	↑	0.8060	±	0.0280
- stem	N/A	none	acc	↑	0.6581	±	0.0081
- us_foreign_policy	0	none	acc	↑	0.8900	±	0.0314
- virology	0	none	acc	↑	0.5301	±	0.0389
- world_religions	0	none	acc	↑	0.8012	±	0.0306
openbookqa	1	none	acc	↑	0.3280	±	0.0210
		none	acc_norm	↑	0.4360	±	0.0222
piqa	1	none	acc	↑	0.7982	±	0.0094
		none	acc_norm	↑	0.8074	±	0.0092
truthfulqa	N/A	none	acc	↑	0.4746	±	0.0116
		none	bleu_acc	↑	0.4700	±	0.0175
		none	bleu_diff	↑	0.3214	±	0.6045
		none	bleu_max	↑	22.5895	±	0.7122
		none	rouge1_acc	↑	0.4798	±	0.0175
		none	rouge1_diff	↑	0.0846	±	0.7161
		none	rouge1_max	↑	48.7180	±	0.7833
		none	rouge2_acc	↑	0.4149	±	0.0172
		none	rouge2_diff	↑	-0.4656	±	0.8375
		none	rouge2_max	↑	34.0585	±	0.8974
		none	rougeL_acc	↑	0.4651	±	0.0175
		none	rougeL_diff	↑	-0.2804	±	0.7217
		none	rougeL_max	↑	45.2232	±	0.7971
- truthfulqa_gen	3	none	bleu_acc	↑	0.4700	±	0.0175
		none	bleu_diff	↑	0.3214	±	0.6045
		none	bleu_max	↑	22.5895	±	0.7122
		none	rouge1_acc	↑	0.4798	±	0.0175
		none	rouge1_diff	↑	0.0846	±	0.7161
		none	rouge1_max	↑	48.7180	±	0.7833
		none	rouge2_acc	↑	0.4149	±	0.0172
		none	rouge2_diff	↑	-0.4656	±	0.8375
		none	rouge2_max	↑	34.0585	±	0.8974
		none	rougeL_acc	↑	0.4651	±	0.0175
		none	rougeL_diff	↑	-0.2804	±	0.7217
		none	rougeL_max	↑	45.2232	±	0.7971
- truthfulqa_mc1	2	none	acc	↑	0.3905	±	0.0171
- truthfulqa_mc2	2	none	acc	↑	0.5587	±	0.0156
winogrande	1	none	acc	↑	0.7388	±	0.0123

Groups	Version	Filter	Metric		Value		Stderr
agieval	N/A	none	acc	↑	0.5381	±	0.0049
		none	acc_norm	↑	0.5715	±	0.0056
mmlu	N/A	none	acc	↑	0.6942	±	0.0037
- humanities	N/A	none	acc	↑	0.6323	±	0.0067
- other	N/A	none	acc	↑	0.7287	±	0.0077
- social_sciences	N/A	none	acc	↑	0.7910	±	0.0072
- stem	N/A	none	acc	↑	0.6581	±	0.0081
truthfulqa	N/A	none	acc	↑	0.4746	±	0.0116
		none	bleu_acc	↑	0.4700	±	0.0175
		none	bleu_diff	↑	0.3214	±	0.6045
		none	bleu_max	↑	22.5895	±	0.7122
		none	rouge1_acc	↑	0.4798	±	0.0175
		none	rouge1_diff	↑	0.0846	±	0.7161
		none	rouge1_max	↑	48.7180	±	0.7833
		none	rouge2_acc	↑	0.4149	±	0.0172
		none	rouge2_diff	↑	-0.4656	±	0.8375
		none	rouge2_max	↑	34.0585	±	0.8974
		none	rougeL_acc	↑	0.4651	±	0.0175
		none	rougeL_diff	↑	-0.2804	±	0.7217
		none	rougeL_max	↑	45.2232	±	0.7971
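
When comparing these numbers against other models, the Stderr column can be turned into an approximate interval. The sketch below assumes a normal approximation (z = 1.96 for roughly 95% coverage), which is an assumption of this example rather than something the tables state; `confidence_interval` is a hypothetical helper name.

```python
def confidence_interval(value, stderr, z=1.96):
    """Return (low, high) for value +/- z * stderr (normal approximation)."""
    margin = z * stderr
    return value - margin, value + margin


# agieval acc row from the table above: 0.5381 +/- 0.0049
low, high = confidence_interval(0.5381, 0.0049)
print(f"agieval acc: 0.5381, approx. 95% CI [{low:.4f}, {high:.4f}]")
# prints: agieval acc: 0.5381, approx. 95% CI [0.5285, 0.5477]
```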
