## Nous Benchmark

| Model |AGIEval|GPT4All|TruthfulQA|Bigbench|Average|
|---------------------------------------------------------|------:|------:|---------:|-------:|------:|
|[Nandine-7b](https://huggingface.co./sethuiyer/Nandine-7b)| 43.54| 76.41| 61.73| 45.27| 56.74|
### AGIEval

| Task |Version| Metric |Value| |Stderr|
|------------------------------|------:|--------|----:|---|-----:|
|agieval_aqua_rat | 0|acc |23.62|± | 2.67|
| | |acc_norm|22.05|± | 2.61|
|agieval_logiqa_en | 0|acc |37.94|± | 1.90|
| | |acc_norm|38.71|± | 1.91|
|agieval_lsat_ar | 0|acc |26.09|± | 2.90|
| | |acc_norm|22.61|± | 2.76|
|agieval_lsat_lr | 0|acc |47.45|± | 2.21|
| | |acc_norm|50.00|± | 2.22|
|agieval_lsat_rc | 0|acc |60.97|± | 2.98|
| | |acc_norm|59.85|± | 2.99|
|agieval_sat_en | 0|acc |77.18|± | 2.93|
| | |acc_norm|77.67|± | 2.91|
|agieval_sat_en_without_passage| 0|acc |45.63|± | 3.48|
| | |acc_norm|45.15|± | 3.48|
|agieval_sat_math | 0|acc |35.91|± | 3.24|
| | |acc_norm|32.27|± | 3.16|

Average: 43.54%
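The AGIEval average above is consistent with a plain unweighted mean of the eight `acc_norm` scores. A minimal sketch to verify the arithmetic (the metric choice is inferred from the numbers in the table, not stated by the harness output):

```python
# acc_norm scores copied from the AGIEval table above.
agieval_acc_norm = {
    "agieval_aqua_rat": 22.05,
    "agieval_logiqa_en": 38.71,
    "agieval_lsat_ar": 22.61,
    "agieval_lsat_lr": 50.00,
    "agieval_lsat_rc": 59.85,
    "agieval_sat_en": 77.67,
    "agieval_sat_en_without_passage": 45.15,
    "agieval_sat_math": 32.27,
}

# Unweighted mean, rounded to two decimals like the table.
average = round(sum(agieval_acc_norm.values()) / len(agieval_acc_norm), 2)
print(average)  # 43.54
```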
### GPT4All

| Task |Version| Metric |Value| |Stderr|
|-------------|------:|--------|----:|---|-----:|
|arc_challenge| 0|acc |63.74|± | 1.40|
| | |acc_norm|63.99|± | 1.40|
|arc_easy | 0|acc |85.94|± | 0.71|
| | |acc_norm|83.50|± | 0.76|
|boolq | 1|acc |87.80|± | 0.57|
|hellaswag | 0|acc |67.50|± | 0.47|
| | |acc_norm|85.31|± | 0.35|
|openbookqa | 0|acc |38.20|± | 2.18|
| | |acc_norm|49.40|± | 2.24|
|piqa | 0|acc |82.97|± | 0.88|
| | |acc_norm|84.33|± | 0.85|
|winogrande | 0|acc |80.51|± | 1.11|

Average: 76.41%
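The GPT4All average mixes metrics: the reported 76.41 only reproduces if you take `acc_norm` where it is reported and fall back to `acc` for the tasks that report `acc` alone (boolq, winogrande). A sketch of that selection rule, inferred from the table rather than from any harness documentation:

```python
# (metric_used, value) per task: acc_norm when reported, otherwise acc.
gpt4all_scores = {
    "arc_challenge": ("acc_norm", 63.99),
    "arc_easy":      ("acc_norm", 83.50),
    "boolq":         ("acc",      87.80),
    "hellaswag":     ("acc_norm", 85.31),
    "openbookqa":    ("acc_norm", 49.40),
    "piqa":          ("acc_norm", 84.33),
    "winogrande":    ("acc",      80.51),
}

average = round(sum(v for _, v in gpt4all_scores.values()) / len(gpt4all_scores), 2)
print(average)  # 76.41
```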
### TruthfulQA

| Task |Version|Metric|Value| |Stderr|
|-------------|------:|------|----:|---|-----:|
|truthfulqa_mc| 1|mc1 |45.78|± | 1.74|
| | |mc2 |61.73|± | 1.54|

Average: 61.73%
### Bigbench

| Task |Version| Metric |Value| |Stderr|
|------------------------------------------------|------:|---------------------|----:|---|-----:|
|bigbench_causal_judgement | 0|multiple_choice_grade|57.89|± | 3.59|
|bigbench_date_understanding | 0|multiple_choice_grade|65.58|± | 2.48|
|bigbench_disambiguation_qa | 0|multiple_choice_grade|38.76|± | 3.04|
|bigbench_geometric_shapes | 0|multiple_choice_grade|20.06|± | 2.12|
| | |exact_str_match | 5.85|± | 1.24|
|bigbench_logical_deduction_five_objects | 0|multiple_choice_grade|30.20|± | 2.06|
|bigbench_logical_deduction_seven_objects | 0|multiple_choice_grade|20.71|± | 1.53|
|bigbench_logical_deduction_three_objects | 0|multiple_choice_grade|52.67|± | 2.89|
|bigbench_movie_recommendation | 0|multiple_choice_grade|43.60|± | 2.22|
|bigbench_navigate | 0|multiple_choice_grade|50.50|± | 1.58|
|bigbench_reasoning_about_colored_objects | 0|multiple_choice_grade|73.15|± | 0.99|
|bigbench_ruin_names | 0|multiple_choice_grade|46.65|± | 2.36|
|bigbench_salient_translation_error_detection | 0|multiple_choice_grade|25.25|± | 1.38|
|bigbench_snarks | 0|multiple_choice_grade|75.14|± | 3.22|
|bigbench_sports_understanding | 0|multiple_choice_grade|73.12|± | 1.41|
|bigbench_temporal_sequences | 0|multiple_choice_grade|47.20|± | 1.58|
|bigbench_tracking_shuffled_objects_five_objects | 0|multiple_choice_grade|23.04|± | 1.19|
|bigbench_tracking_shuffled_objects_seven_objects| 0|multiple_choice_grade|18.69|± | 0.93|
|bigbench_tracking_shuffled_objects_three_objects| 0|multiple_choice_grade|52.67|± | 2.89|

Average: 45.27%
Average score: 56.74%
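The overall score matches the unweighted mean of the four suite averages reported above; a quick check of the arithmetic:

```python
# Suite averages copied from the sections above.
suite_averages = {
    "AGIEval":    43.54,
    "GPT4All":    76.41,
    "TruthfulQA": 61.73,  # mc2
    "Bigbench":   45.27,
}

overall = round(sum(suite_averages.values()) / len(suite_averages), 2)
print(overall)  # 56.74
```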

Elapsed time: 01:47:54