---
language:
- en
license: apache-2.0
tags:
- distilabel
- dpo
- rlaif
- rlhf
- merge
- mergekit
datasets:
- argilla/distilabel-intel-orca-dpo-pairs
model-index:
- name: distilabeled-Marcoro14-7B-slerp-full
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: AI2 Reasoning Challenge (25-Shot)
      type: ai2_arc
      config: ARC-Challenge
      split: test
      args:
        num_few_shot: 25
    metrics:
    - type: acc_norm
      value: 70.65
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=argilla/distilabeled-Marcoro14-7B-slerp-full
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: HellaSwag (10-Shot)
      type: hellaswag
      split: validation
      args:
        num_few_shot: 10
    metrics:
    - type: acc_norm
      value: 87.55
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=argilla/distilabeled-Marcoro14-7B-slerp-full
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU (5-Shot)
      type: cais/mmlu
      config: all
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 65.33
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=argilla/distilabeled-Marcoro14-7B-slerp-full
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: TruthfulQA (0-shot)
      type: truthful_qa
      config: multiple_choice
      split: validation
      args:
        num_few_shot: 0
    metrics:
    - type: mc2
      value: 64.21
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=argilla/distilabeled-Marcoro14-7B-slerp-full
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: Winogrande (5-shot)
      type: winogrande
      config: winogrande_xl
      split: validation
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 82.0
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=argilla/distilabeled-Marcoro14-7B-slerp-full
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GSM8k (5-shot)
      type: gsm8k
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 70.66
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=argilla/distilabeled-Marcoro14-7B-slerp-full
      name: Open LLM Leaderboard
---
	# ⚗️ distilabeled Marcoro14 7B Slerp


	<p align="center">
	<a href="https://github.com/argilla-io/distilabel">
	<img src="https://raw.githubusercontent.com/argilla-io/distilabel/main/docs/assets/distilabel-badge-light.png" alt="Built with Distilabel" width="200" height="32"/>
	</a>
	</p>


	## Introduction

This model is a DPO fine-tune of [mlabonne/Marcoro14-7B-slerp](https://huggingface.co/mlabonne/Marcoro14-7B-slerp) on our new open dataset [argilla/distilabel-intel-orca-dpo-pairs](https://huggingface.co/datasets/argilla/distilabel-intel-orca-dpo-pairs). You can find more information about the "distilabeled" dataset in the [argilla/distilabeled-Hermes-2.5-Mistral-7B](https://huggingface.co/argilla/distilabeled-Hermes-2.5-Mistral-7B/blob/main/README.md#introduction) repo, and you can also visit [distilabel](https://github.com/argilla-io/distilabel).

The difference between this model and [argilla/distilabeled-Marcoro14-7B-slerp](https://huggingface.co/argilla/distilabeled-Marcoro14-7B-slerp)
is that this model has been fine-tuned for a whole epoch instead of 200 steps, so it has seen the whole dataset.

	## Training details

	As we did with [Notus](https://argilla.io/blog/notus7b/), we wanted a reproducible recipe to test the impact of data quality.

And we're lucky to have so many amazing folks in the open community contributing reproducible, easy-to-use training scripts and recipes. This time, [Maxime Labonne](https://twitter.com/maximelabonne) shared a [Colab](https://colab.research.google.com/drive/15iFBr1xWgztXvhrj5I9fBv20c7CFOPBE?usp=sharing) to fine-tune OpenHermes with DPO and Intel's original dataset, perfect! We just updated the base model to [mlabonne/Marcoro14-7B-slerp](https://huggingface.co/mlabonne/Marcoro14-7B-slerp), and applied the same dataset recipe we used for [argilla/distilabeled-Hermes-2.5-Mistral-7B](https://huggingface.co/argilla/distilabeled-Hermes-2.5-Mistral-7B/blob/main/README.md#introduction):

```python
from datasets import load_dataset

# Instead of this:
# dataset = load_dataset("Intel/orca_dpo_pairs", split="train")

# we did this
dataset = load_dataset("argilla/distilabel-intel-orca-dpo-pairs", split="train")

# Keep only non-tie pairs with a high chosen score that do not overlap the GSM8K train split
dataset = dataset.filter(
    lambda r:
        r["status"] != "tie" and
        r["chosen_score"] >= 8 and
        not r["in_gsm8k_train"]
)
```
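
To make the effect of the filter concrete, here is a minimal sketch that applies the same predicate to a few made-up rows in plain Python. The rows, their values, and the `keep` helper are hypothetical and only for illustration; the three column names are the ones used in the filter above.

```python
# Hypothetical rows for illustration; only the column names come from the filter above.
rows = [
    {"status": "unchanged", "chosen_score": 9.0, "in_gsm8k_train": False},  # kept
    {"status": "tie", "chosen_score": 9.0, "in_gsm8k_train": False},        # dropped: tie
    {"status": "swapped", "chosen_score": 7.0, "in_gsm8k_train": False},    # dropped: score below 8
    {"status": "unchanged", "chosen_score": 8.0, "in_gsm8k_train": True},   # dropped: in GSM8K train
]


def keep(r):
    """Same predicate as the dataset.filter call, written as a named function."""
    return r["status"] != "tie" and r["chosen_score"] >= 8 and not r["in_gsm8k_train"]


kept = [r for r in rows if keep(r)]
print(f"kept {len(kept)} of {len(rows)} rows")
```

A row has to pass all three conditions to survive, so a high score alone is not enough if the pair is a tie or overlaps the GSM8K train split.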

	## Benchmark results
	For benchmarking we used the famous "Nous" or "Teknium" benchmark. You can find below an overview, including our first experiment with a less ambitious dataset filtering (removing ties and `score>5`).

	For running the benchmark we used another awesome contribution from Maxime: [LLM AutoEval](https://github.com/mlabonne/llm-autoeval), check it out!

| Model | AGIEval | GPT4ALL | TruthfulQA | Bigbench | Average |
|-------|--------:|--------:|-----------:|---------:|--------:|
| [argilla/distilabeled-Marcoro14-7B-slerp-full](https://huggingface.co/argilla/distilabeled-Marcoro14-7B-slerp-full) | 45.17 | 76.59 | 64.68 | 48.15 | 58.65 |
| [argilla/distilabeled-Marcoro14-7B-slerp](https://huggingface.co/argilla/distilabeled-Marcoro14-7B-slerp) | 45.4 | 76.47 | 65.46 | 47.19 | 58.63 |
| [Marcoro14-7B-slerp](https://huggingface.co/mlabonne/Marcoro14-7B-slerp) | 44.66 | 76.24 | 64.15 | 45.64 | 57.67 |
| [argilla/distilabeled-Hermes-2.5-Mistral-7B](https://huggingface.co/argilla/distilabeled-Hermes-2.5-Mistral-7B) | 44.64 | 73.35 | 55.96 | 42.21 | 54.04 |
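
As a quick sanity check, the sketch below recomputes the Average column from the four per-suite scores and prints each model's gap to the Marcoro14-7B-slerp base. The numbers are copied from the table; the `scores` dictionary and the `average` helper are names introduced here for illustration, and the sketch assumes the average is a plain unweighted mean rounded to two decimals.

```python
# Scores copied from the table above, in order: AGIEval, GPT4ALL, TruthfulQA, Bigbench.
scores = {
    "distilabeled-Marcoro14-7B-slerp-full": [45.17, 76.59, 64.68, 48.15],
    "distilabeled-Marcoro14-7B-slerp": [45.4, 76.47, 65.46, 47.19],
    "Marcoro14-7B-slerp": [44.66, 76.24, 64.15, 45.64],
    "distilabeled-Hermes-2.5-Mistral-7B": [44.64, 73.35, 55.96, 42.21],
}


def average(values):
    """Unweighted mean, rounded to two decimals (assumed to match the table)."""
    return round(sum(values) / len(values), 2)


averages = {name: average(values) for name, values in scores.items()}
base = averages["Marcoro14-7B-slerp"]

for name, avg in averages.items():
    # The signed difference shows the gain or loss relative to the base model.
    print(f"{name}: {avg:.2f} ({avg - base:+.2f} vs. base)")
```

Under that assumption the recomputed averages agree with the table, and the full-epoch model comes out 0.98 points above the base model on this benchmark.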

	### Training Hardware

We used 1 x A100 80GB on RunPod for less than 2 hours.

	## Acknowledgements

	We'd like to thank the amazing open community and in particular:

* The Intel team for publishing a great open dataset and showing how well it worked in the first place.
	* Teknium and NousResearch for their awesome work and models.
	* Maxime for sharing such great resources.

# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_argilla__distilabeled-Marcoro14-7B-slerp-full).

| Metric                            | Value |
|-----------------------------------|------:|
| Avg.                              | 73.40 |
| AI2 Reasoning Challenge (25-Shot) | 70.65 |
| HellaSwag (10-Shot)               | 87.55 |
| MMLU (5-Shot)                     | 65.33 |
| TruthfulQA (0-shot)               | 64.21 |
| Winogrande (5-shot)               | 82.00 |
| GSM8k (5-shot)                    | 70.66 |
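
The "Avg." row can be reproduced from the six benchmark values. The short sketch below assumes it is an unweighted mean; the `results` dictionary is just a name used here for illustration.

```python
# Values copied from the leaderboard table above.
results = {
    "AI2 Reasoning Challenge (25-Shot)": 70.65,
    "HellaSwag (10-Shot)": 87.55,
    "MMLU (5-Shot)": 65.33,
    "TruthfulQA (0-shot)": 64.21,
    "Winogrande (5-shot)": 82.00,
    "GSM8k (5-shot)": 70.66,
}

# Unweighted mean over the six benchmarks (an assumption about how Avg. is computed).
avg = sum(results.values()) / len(results)
print(f"Avg. {avg:.2f}")  # prints "Avg. 73.40"
```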