---
license: cc
library_name: transformers
model-index:
  - name: SOLAR-10.7b-Instruct-truthy-dpo
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: AI2 Reasoning Challenge (25-Shot)
          type: ai2_arc
          config: ARC-Challenge
          split: test
          args:
            num_few_shot: 25
        metrics:
          - type: acc_norm
            value: 72.1
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co./spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/SOLAR-10.7b-Instruct-truthy-dpo
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: HellaSwag (10-Shot)
          type: hellaswag
          split: validation
          args:
            num_few_shot: 10
        metrics:
          - type: acc_norm
            value: 88.44
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co./spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/SOLAR-10.7b-Instruct-truthy-dpo
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU (5-Shot)
          type: cais/mmlu
          config: all
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 65.45
            name: accuracy
        source:
          url: >-
            https://huggingface.co./spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/SOLAR-10.7b-Instruct-truthy-dpo
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: TruthfulQA (0-shot)
          type: truthful_qa
          config: multiple_choice
          split: validation
          args:
            num_few_shot: 0
        metrics:
          - type: mc2
            value: 76.75
        source:
          url: >-
            https://huggingface.co./spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/SOLAR-10.7b-Instruct-truthy-dpo
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: Winogrande (5-shot)
          type: winogrande
          config: winogrande_xl
          split: validation
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 82.72
            name: accuracy
        source:
          url: >-
            https://huggingface.co./spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/SOLAR-10.7b-Instruct-truthy-dpo
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GSM8k (5-shot)
          type: gsm8k
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 59.21
            name: accuracy
        source:
          url: >-
            https://huggingface.co./spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/SOLAR-10.7b-Instruct-truthy-dpo
          name: Open LLM Leaderboard
---

# SOLAR-10.7b-Instruct-truthy-dpo


This model is a DPO finetune derived from upstageai/Solar-10.7b-Instruct-v0.1; the training process is described below.

## Process

  1. I finetuned upstageai/Solar-10.7b-Instruct-v0.1 with 1 epoch of Intel/orca_dpo_pairs (12.4k samples); a rough sketch of this step is shown after this list.
  2. I further finetuned that model with 3 epochs of jondurbin/truthy-dpo-v0.1 (1.04k samples).
  3. This process is experimental, and the base model linked above has been tested more thoroughly at this time.
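
For illustration, step 1 could look roughly like the following with the TRL library. This is a sketch rather than the exact training script; the base model id, batch settings, and `beta` are assumptions, and DPOTrainer argument names (e.g. `tokenizer` vs. `processing_class`) differ across `trl` versions.

```python
# Rough sketch of step 1 (DPO on Intel/orca_dpo_pairs), not the author's exact script.
# Assumes recent transformers/trl/datasets installs.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_model = "upstage/SOLAR-10.7B-Instruct-v1.0"  # assumed Hub id of the base model
model = AutoModelForCausalLM.from_pretrained(base_model)
tokenizer = AutoTokenizer.from_pretrained(base_model)

# Map orca_dpo_pairs columns onto the prompt/chosen/rejected schema DPOTrainer expects.
raw = load_dataset("Intel/orca_dpo_pairs", split="train")
train = raw.map(
    lambda ex: {"prompt": ex["question"], "chosen": ex["chosen"], "rejected": ex["rejected"]}
)

args = DPOConfig(
    output_dir="solar-orca-dpo",
    num_train_epochs=1,             # 1 epoch over the ~12.4k pairs, as described above
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,  # assumed; adjust to available memory
    beta=0.1,                       # assumed DPO temperature
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=train,
    tokenizer=tokenizer,
)
trainer.train()
```

Step 2 would repeat the same recipe on jondurbin/truthy-dpo-v0.1 for 3 epochs, starting from the checkpoint produced here.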

## GGUF

Available here
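
A quick way to try a GGUF quant locally is llama-cpp-python. This is a minimal sketch, assuming a quant has already been downloaded; the filename below is a placeholder, and ChatML is the prompt format used in the evaluations below.

```python
# Minimal sketch, assuming llama-cpp-python is installed and a GGUF quant is on disk.
from llama_cpp import Llama

llm = Llama(
    model_path="solar-10.7b-instruct-truthy-dpo.Q4_K_M.gguf",  # placeholder filename
    n_ctx=4096,
    chat_format="chatml",  # the evaluations below were run with ChatML prompts
)

out = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Briefly explain what DPO finetuning is."},
    ],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```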

## Evaluations

----Benchmark Complete----

- 2024-01-26 20:57:38
- Time taken: 25.4 mins
- Prompt Format: ChatML
- Model: macadeliccc/SOLAR-10.7b-Instruct-truthy-dpo-GGUF
- Score (v2): 74.11
- Parseable: 171.0

Batch completed. Time taken: 25.5 mins.
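
For reference, the ChatML format used in the benchmark above wraps each turn in `<|im_start|>`/`<|im_end|>` markers; a single-turn prompt looks roughly like this (generic illustration, not taken from the benchmark harness):

```python
# Illustrative ChatML prompt layout for a single-turn exchange.
chatml_prompt = (
    "<|im_start|>system\n"
    "You are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\n"
    "What is the capital of France?<|im_end|>\n"
    "<|im_start|>assistant\n"
)
```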

| Model                           | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|---------------------------------|--------:|--------:|-----------:|---------:|--------:|
| SOLAR-10.7b-Instruct-truthy-dpo |   48.69 |   73.82 |      76.81 |    45.71 |   61.26 |

### AGIEval

| Task                           | Version | Metric   | Value | Stderr |
|--------------------------------|--------:|----------|------:|--------|
| agieval_aqua_rat               |       0 | acc      | 27.95 | ± 2.82 |
|                                |         | acc_norm | 27.95 | ± 2.82 |
| agieval_logiqa_en              |       0 | acc      | 42.40 | ± 1.94 |
|                                |         | acc_norm | 42.24 | ± 1.94 |
| agieval_lsat_ar                |       0 | acc      | 25.65 | ± 2.89 |
|                                |         | acc_norm | 23.91 | ± 2.82 |
| agieval_lsat_lr                |       0 | acc      | 54.12 | ± 2.21 |
|                                |         | acc_norm | 54.51 | ± 2.21 |
| agieval_lsat_rc                |       0 | acc      | 69.89 | ± 2.80 |
|                                |         | acc_norm | 69.89 | ± 2.80 |
| agieval_sat_en                 |       0 | acc      | 80.10 | ± 2.79 |
|                                |         | acc_norm | 80.10 | ± 2.79 |
| agieval_sat_en_without_passage |       0 | acc      | 50.00 | ± 3.49 |
|                                |         | acc_norm | 49.51 | ± 3.49 |
| agieval_sat_math               |       0 | acc      | 42.27 | ± 3.34 |
|                                |         | acc_norm | 41.36 | ± 3.33 |

Average: 48.69%

### GPT4All

| Task          | Version | Metric   | Value | Stderr |
|---------------|--------:|----------|------:|--------|
| arc_challenge |       0 | acc      | 59.90 | ± 1.43 |
|               |         | acc_norm | 63.91 | ± 1.40 |
| arc_easy      |       0 | acc      | 80.85 | ± 0.81 |
|               |         | acc_norm | 78.16 | ± 0.85 |
| boolq         |       1 | acc      | 88.20 | ± 0.56 |
| hellaswag     |       0 | acc      | 68.34 | ± 0.46 |
|               |         | acc_norm | 86.39 | ± 0.34 |
| openbookqa    |       0 | acc      | 37.60 | ± 2.17 |
|               |         | acc_norm | 46.80 | ± 2.23 |
| piqa          |       0 | acc      | 78.84 | ± 0.95 |
|               |         | acc_norm | 78.78 | ± 0.95 |
| winogrande    |       0 | acc      | 74.51 | ± 1.22 |

Average: 73.82%

### TruthfulQA

| Task          | Version | Metric | Value | Stderr |
|---------------|--------:|--------|------:|--------|
| truthfulqa_mc |       1 | mc1    | 61.81 | ± 1.70 |
|               |         | mc2    | 76.81 | ± 1.42 |

Average: 76.81%

### Bigbench

| Task                                              | Version | Metric                | Value | Stderr |
|---------------------------------------------------|--------:|-----------------------|------:|--------|
| bigbench_causal_judgement                         |       0 | multiple_choice_grade | 50.53 | ± 3.64 |
| bigbench_date_understanding                       |       0 | multiple_choice_grade | 63.14 | ± 2.51 |
| bigbench_disambiguation_qa                        |       0 | multiple_choice_grade | 47.67 | ± 3.12 |
| bigbench_geometric_shapes                         |       0 | multiple_choice_grade | 26.18 | ± 2.32 |
|                                                   |         | exact_str_match       |  0.00 | ± 0.00 |
| bigbench_logical_deduction_five_objects           |       0 | multiple_choice_grade | 28.60 | ± 2.02 |
| bigbench_logical_deduction_seven_objects          |       0 | multiple_choice_grade | 21.29 | ± 1.55 |
| bigbench_logical_deduction_three_objects          |       0 | multiple_choice_grade | 47.33 | ± 2.89 |
| bigbench_movie_recommendation                     |       0 | multiple_choice_grade | 39.80 | ± 2.19 |
| bigbench_navigate                                 |       0 | multiple_choice_grade | 63.80 | ± 1.52 |
| bigbench_reasoning_about_colored_objects          |       0 | multiple_choice_grade | 59.05 | ± 1.10 |
| bigbench_ruin_names                               |       0 | multiple_choice_grade | 40.18 | ± 2.32 |
| bigbench_salient_translation_error_detection      |       0 | multiple_choice_grade | 46.69 | ± 1.58 |
| bigbench_snarks                                   |       0 | multiple_choice_grade | 65.19 | ± 3.55 |
| bigbench_sports_understanding                     |       0 | multiple_choice_grade | 72.41 | ± 1.42 |
| bigbench_temporal_sequences                       |       0 | multiple_choice_grade | 60.30 | ± 1.55 |
| bigbench_tracking_shuffled_objects_five_objects   |       0 | multiple_choice_grade | 25.76 | ± 1.24 |
| bigbench_tracking_shuffled_objects_seven_objects  |       0 | multiple_choice_grade | 17.43 | ± 0.91 |
| bigbench_tracking_shuffled_objects_three_objects  |       0 | multiple_choice_grade | 47.33 | ± 2.89 |

Average: 45.71%

Average score: 61.26%

Elapsed time: 02:16:03

## Open LLM Leaderboard Evaluation Results

Detailed results can be found here

| Metric                            | Value |
|-----------------------------------|------:|
| Avg.                              | 74.11 |
| AI2 Reasoning Challenge (25-Shot) | 72.10 |
| HellaSwag (10-Shot)               | 88.44 |
| MMLU (5-Shot)                     | 65.45 |
| TruthfulQA (0-shot)               | 76.75 |
| Winogrande (5-shot)               | 82.72 |
| GSM8k (5-shot)                    | 59.21 |