---
language:
  - en
license: apache-2.0
tags:
  - text-generation-inference
  - transformers
  - unsloth
  - llama
  - trl
  - sft
base_model: unsloth/llama-3.2-3b-instruct-bnb-4bit
datasets:
  - Lyte/Reasoning-Paused
pipeline_tag: text-generation
model-index:
  - name: Llama-3.2-3B-Overthinker
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: IFEval (0-Shot)
          type: HuggingFaceH4/ifeval
          args:
            num_few_shot: 0
        metrics:
          - type: inst_level_strict_acc and prompt_level_strict_acc
            value: 64.08
            name: strict accuracy
        source:
          url: >-
            https://huggingface.co./spaces/open-llm-leaderboard/open_llm_leaderboard?query=Lyte/Llama-3.2-3B-Overthinker
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: BBH (3-Shot)
          type: BBH
          args:
            num_few_shot: 3
        metrics:
          - type: acc_norm
            value: 20.1
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co./spaces/open-llm-leaderboard/open_llm_leaderboard?query=Lyte/Llama-3.2-3B-Overthinker
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MATH Lvl 5 (4-Shot)
          type: hendrycks/competition_math
          args:
            num_few_shot: 4
        metrics:
          - type: exact_match
            value: 2.64
            name: exact match
        source:
          url: >-
            https://huggingface.co./spaces/open-llm-leaderboard/open_llm_leaderboard?query=Lyte/Llama-3.2-3B-Overthinker
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GPQA (0-shot)
          type: Idavidrein/gpqa
          args:
            num_few_shot: 0
        metrics:
          - type: acc_norm
            value: 1.23
            name: acc_norm
        source:
          url: >-
            https://huggingface.co./spaces/open-llm-leaderboard/open_llm_leaderboard?query=Lyte/Llama-3.2-3B-Overthinker
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MuSR (0-shot)
          type: TAUR-Lab/MuSR
          args:
            num_few_shot: 0
        metrics:
          - type: acc_norm
            value: 3.9
            name: acc_norm
        source:
          url: >-
            https://huggingface.co./spaces/open-llm-leaderboard/open_llm_leaderboard?query=Lyte/Llama-3.2-3B-Overthinker
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU-PRO (5-shot)
          type: TIGER-Lab/MMLU-Pro
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 22.06
            name: accuracy
        source:
          url: >-
            https://huggingface.co./spaces/open-llm-leaderboard/open_llm_leaderboard?query=Lyte/Llama-3.2-3B-Overthinker
          name: Open LLM Leaderboard
---

Model Overview:

  • Training Data: This model was trained on a dataset with columns for initial reasoning, step-by-step thinking, verifications after each step, and final answers based on the full context. Is it better than the original base model? Hard to say without proper evaluations, and I don't have the resources to run them myself.

  • Context Handling: The model benefits from a large context window (at least 4k tokens, ideally up to 16k; it was trained on 32k-token sequences). It tends to "overthink," so a longer context gives it room to perform better.

  • Performance: Based on a handful of manual tests, the model seems to do best in conversational settings, especially mental health support, creative tasks, and explaining things. However, I encourage you to try it out yourself using this Colab Notebook.

  • Dataset Note: The publicly available dataset is only a partial version. The full dataset was originally designed for a custom Mixture of Experts (MoE) architecture, but I couldn't afford to run the full experiment. A quick way to load and inspect the public portion is sketched after this list.

  • Acknowledgment: Special thanks to KingNish for reigniting my passion to revisit this project. I almost abandoned it after my first attempt a month ago. Enjoy this experimental model!
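
The dataset note above mentions the public split of Lyte/Reasoning-Paused. As a quick sanity check, here is a minimal sketch for loading it and inspecting its columns (the split name and column layout are assumptions based on the description above, not verified against the dataset):

from datasets import load_dataset

# Load the public portion of the training data (split name assumed to be "train").
ds = load_dataset("Lyte/Reasoning-Paused", split="train")

# Expect columns roughly matching the description above: initial reasoning,
# step-by-step thinking, per-step verifications, and a final answer.
print(ds.column_names)
print(ds[0])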

Inference Code:

  • Feel free to make the initial reasoning, steps, and verifications collapsible, showing only the final answer for an o1-style feel; a small display sketch follows the inference code below.
  • Note: one feature of this setup is that you can control how many reasoning steps and verifications the model generates (the num_steps parameter below).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Lyte/Llama-3.2-3B-Overthinker"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

def generate_response(prompt, max_tokens=16384, temperature=0.8, top_p=0.95, repeat_penalty=1.1, num_steps=3):
    messages = [{"role": "user", "content": prompt}]

    # Generate the initial reasoning. add_reasoning_prompt is a custom kwarg
    # forwarded to this model's chat template (not a standard transformers argument).
    reasoning_template = tokenizer.apply_chat_template(messages, tokenize=False, add_reasoning_prompt=True)
    reasoning_inputs = tokenizer(reasoning_template, return_tensors="pt").to(model.device)

    reasoning_ids = model.generate(
        **reasoning_inputs,
        max_new_tokens=max_tokens // 3,
        do_sample=True,  # sampling must be enabled for temperature/top_p to take effect
        temperature=temperature,
        top_p=top_p,
        repetition_penalty=repeat_penalty
    )
    reasoning_output = tokenizer.decode(reasoning_ids[0, reasoning_inputs.input_ids.shape[1]:], skip_special_tokens=True)
    
    # Generate thinking (step-by-step and verifications). "reasoning" is a custom
    # role defined by this model's chat template, as are add_thinking_prompt and num_steps.
    messages.append({"role": "reasoning", "content": reasoning_output})
    thinking_template = tokenizer.apply_chat_template(messages, tokenize=False, add_thinking_prompt=True, num_steps=num_steps)
    thinking_inputs = tokenizer(thinking_template, return_tensors="pt").to(model.device)

    thinking_ids = model.generate(
        **thinking_inputs,
        max_new_tokens=max_tokens // 3,
        do_sample=True,
        temperature=temperature,
        top_p=top_p,
        repetition_penalty=repeat_penalty
    )
    thinking_output = tokenizer.decode(thinking_ids[0, thinking_inputs.input_ids.shape[1]:], skip_special_tokens=True)
    
    # Generate the final answer, conditioned on the custom "thinking" role.
    messages.append({"role": "thinking", "content": thinking_output})
    answer_template = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    answer_inputs = tokenizer(answer_template, return_tensors="pt").to(model.device)

    answer_ids = model.generate(
        **answer_inputs,
        max_new_tokens=max_tokens // 3,
        do_sample=True,
        temperature=temperature,
        top_p=top_p,
        repetition_penalty=repeat_penalty
    )
    answer_output = tokenizer.decode(answer_ids[0, answer_inputs.input_ids.shape[1]:], skip_special_tokens=True)
    return reasoning_output, thinking_output, answer_output

# Example usage: generate_response returns a (reasoning, thinking, answer) tuple.
prompt = "Explain the process of photosynthesis."
reasoning, thinking, answer = generate_response(prompt, num_steps=5)

print("Reasoning:", reasoning)
print("Steps & Verifications:", thinking)
print("Final Answer:", answer)

Uploaded model

  • Developed by: Lyte
  • License: apache-2.0
  • Finetuned from model: unsloth/llama-3.2-3b-instruct-bnb-4bit

This Llama model was trained 2x faster with Unsloth and Hugging Face's TRL library.
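
For reference, here is a minimal sketch of what an Unsloth + TRL SFT setup for this base model typically looks like. This is not the author's actual training script; the LoRA configuration, hyperparameters, and dataset_text_field are illustrative assumptions:

from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Load the 4-bit base model named in this card.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3.2-3b-instruct-bnb-4bit",
    max_seq_length=32768,  # the card says training used 32k-token sequences
    load_in_4bit=True,
)

# Attach LoRA adapters; r/alpha/target_modules are common defaults, not confirmed values.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=load_dataset("Lyte/Reasoning-Paused", split="train"),
    dataset_text_field="text",  # placeholder: the real field name is unknown
    max_seq_length=32768,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        learning_rate=2e-4,
        output_dir="outputs",
    ),
)
trainer.train()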

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

| Metric              | Value |
|---------------------|------:|
| Avg.                | 19.00 |
| IFEval (0-Shot)     | 64.08 |
| BBH (3-Shot)        | 20.10 |
| MATH Lvl 5 (4-Shot) |  2.64 |
| GPQA (0-shot)       |  1.23 |
| MuSR (0-shot)       |  3.90 |
| MMLU-PRO (5-shot)   | 22.06 |