CAMEL-33B-Combined-Data is a chat large language model obtained by finetuning the LLaMA-33B model on a total of 229K conversations collected through our CAMEL framework, 100K English public conversations from ShareGPT that can be found here, and 52K instructions from the Alpaca dataset that can be found here. We evaluate our model offline using EleutherAI's language model evaluation harness, the same harness used by Hugging Face's Open LLM Leaderboard. CAMEL-33B scores an average of 64.2.

Regarding the prompt format, we follow the same conversation template as LMSYS's FastChat Vicuna-13B-1.1. It assumes a conversation between a user and an AI assistant, with a </s> token separating turns at the end of every role message. More details can be found here.
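As a rough sketch of how such a prompt can be assembled (the exact system prompt and spacing follow FastChat's Vicuna v1.1 template and should be verified against the FastChat source; `build_vicuna_prompt` is a hypothetical helper, not part of any released code):

```python
def build_vicuna_prompt(turns):
    """Assemble a Vicuna-v1.1-style prompt.

    turns: list of (user_msg, assistant_msg) pairs; assistant_msg is None
    for the turn the model is about to complete. Completed assistant
    messages end with the </s> separator, as described above.
    """
    # System prompt as used by FastChat's Vicuna v1.1 template (assumption).
    system = (
        "A chat between a curious user and an artificial intelligence "
        "assistant. The assistant gives helpful, detailed, and polite "
        "answers to the user's questions."
    )
    parts = [system]
    for user_msg, assistant_msg in turns:
        parts.append(f" USER: {user_msg} ASSISTANT:")
        if assistant_msg is not None:
            parts.append(f" {assistant_msg}</s>")
    return "".join(parts)

# The prompt ends at "ASSISTANT:" so the model generates the next reply.
prompt = build_vicuna_prompt([("Hello!", "Hi there!"), ("Who are you?", None)])
```

The key detail is that only completed assistant turns carry the </s> separator; the final, open turn stops at `ASSISTANT:` so generation continues from there.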


license: cc-by-nc-4.0

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

| Metric              | Value |
|---------------------|-------|
| Avg.                | 50.79 |
| ARC (25-shot)       | 62.97 |
| HellaSwag (10-shot) | 83.83 |
| MMLU (5-shot)       | 58.98 |
| TruthfulQA (0-shot) | 50.21 |
| Winogrande (5-shot) | 78.3  |
| GSM8K (5-shot)      | 14.1  |
| DROP (3-shot)       | 7.12  |
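The leaderboard average appears to be the unweighted mean of the seven task scores, which can be checked directly:

```python
# Per-task scores from the table above.
scores = {
    "ARC (25-shot)": 62.97,
    "HellaSwag (10-shot)": 83.83,
    "MMLU (5-shot)": 58.98,
    "TruthfulQA (0-shot)": 50.21,
    "Winogrande (5-shot)": 78.3,
    "GSM8K (5-shot)": 14.1,
    "DROP (3-shot)": 7.12,
}

# Unweighted mean across all seven benchmarks.
avg = sum(scores.values()) / len(scores)
print(round(avg, 2))  # 50.79, matching the Avg. row
```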