dh-mc committed on
Commit 444a581 · 1 Parent(s): 4cd13da

https://github.com/mazzzystar/TurtleBenchmark

Files changed (23)
  1. .gitattributes +2 -0
  2. datasets/TurtleBenchmark/.gitignore +3 -0
  3. datasets/TurtleBenchmark/README.md +88 -0
  4. datasets/TurtleBenchmark/README_en.md +78 -0
  5. datasets/TurtleBenchmark/evaluation/.env.example +13 -0
  6. datasets/TurtleBenchmark/evaluation/chinese/data/cases.list +3 -0
  7. datasets/TurtleBenchmark/evaluation/chinese/data/results_0shot.json +3 -0
  8. datasets/TurtleBenchmark/evaluation/chinese/data/results_2shot.json +3 -0
  9. datasets/TurtleBenchmark/evaluation/chinese/data/sorted_cases.list +3 -0
  10. datasets/TurtleBenchmark/evaluation/chinese/data/sorted_cases.txt +3 -0
  11. datasets/TurtleBenchmark/evaluation/chinese/data/stories.json +3 -0
  12. datasets/TurtleBenchmark/evaluation/chinese/data/titles.txt +3 -0
  13. datasets/TurtleBenchmark/evaluation/chinese/evaluate.py +377 -0
  14. datasets/TurtleBenchmark/evaluation/chinese/imgs/Turtle-Benchmark-over-32stories.png +0 -0
  15. datasets/TurtleBenchmark/evaluation/chinese/imgs/Turtle-Benchmark-result.png +0 -0
  16. datasets/TurtleBenchmark/evaluation/chinese/imgs/average_model_accuracy_over_stories_2-shot.png +0 -0
  17. datasets/TurtleBenchmark/evaluation/chinese/model_configs.py +144 -0
  18. datasets/TurtleBenchmark/evaluation/chinese/prompt.py +77 -0
  19. datasets/TurtleBenchmark/evaluation/english/data/cases.list +3 -0
  20. datasets/TurtleBenchmark/evaluation/english/data/stories.json +3 -0
  21. datasets/TurtleBenchmark/evaluation/english/evaluate.py +620 -0
  22. datasets/TurtleBenchmark/evaluation/english/prompt.py +74 -0
  23. datasets/TurtleBenchmark/requirements.txt +3 -0
.gitattributes CHANGED
@@ -34,6 +34,8 @@ unsloth/**/* filter=lfs diff=lfs merge=lfs -text
34
  *.zip filter=lfs diff=lfs merge=lfs -text
35
  *.zst filter=lfs diff=lfs merge=lfs -text
36
  *.jsonl filter=lfs diff=lfs merge=lfs -text
37
+ *.json filter=lfs diff=lfs merge=lfs -text
38
+ *.list filter=lfs diff=lfs merge=lfs -text
39
  *.txt filter=lfs diff=lfs merge=lfs -text
40
  *tfevents* filter=lfs diff=lfs merge=lfs -text
41
  datasets/mgtv/ filter=lfs diff=lfs merge=lfs -text
datasets/TurtleBenchmark/.gitignore ADDED
@@ -0,0 +1,3 @@
1
+ .env
2
+ venv
3
+ logs_with_*
datasets/TurtleBenchmark/README.md ADDED
@@ -0,0 +1,88 @@
1
+ # 海龟 Benchmark
2
+
3
+ [English](./README_en.md)
4
+
5
+ 海龟 Benchmark 是一个新颖的、无法作弊的基准测试,专注于评估 LLM 的逻辑推理和上下文理解能力。评测数据集全部来自几千名真实用户在[海龟汤](https://www.tanghenre.com)游戏中的输入数据。
6
+
7
+ 初步测评结果见:[用 2 万条真人AI海龟汤数据评估大模型推理能力](https://mazzzystar.github.io/2024/08/09/turtle-benchmark-zh/)
8
+
9
+ ### 特点
10
+
11
+ - **无需背景知识**:不依赖背景知识和模型记忆能力,可以从 200 字以内的故事中获得作出判断所需的全部信息,让模型评测专注于推理能力。
12
+ - **客观且无偏见**:衡量猜测的正确性,结果是客观的、和人的感受无关。
13
+ - **可量化的结果**:明确、可测量的结果(正确/错误/未知),便于比较。
14
+ - **无法作弊**:使用真实用户提出的问题,并且随着线上游戏的进行,新数据会动态产生,使得作弊变得不可能。
15
+
16
+ ### 数据集
17
+
18
+ - 32 个独特的"海龟汤"故事。
19
+ - 1537 个来自用户问题的人工标注标签: 对(T)、错(F)、不相关(N)
20
+ - 我们的评估日志。
21
+
22
+ > [!IMPORTANT]
23
+ > 我们在标注时发现:存在部分样本既可以标注为错(F),也可以标注为不相关(N),因此我们将**错**(F)和**不相关**(N)进行了合并、不做区分,在计算准确率时也在代码中做了合并——这会降低难度,因此我们可能在未来,将类别标注重新变为三类:对、错、不相关,并在 [4448](https://github.com/mazzzystar/TurtleBenchmark/blob/dev/evaluation/chinese/data/sorted_cases.list) 条样本上重新标注、测试模型表现。
24
+
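For reference, a minimal sketch (not part of the committed files) of the merged scoring rule described in the note above, mirroring the check in `evaluation/chinese/evaluate.py`; labels are assumed to be the single characters `T`, `F`, `N`:

```python
# Merged scoring: 错 (F) and 不知道 (N) are treated as one negative class,
# so any non-"T" prediction is accepted when the ground truth is F or N.
def is_correct(ground_truth: str, prediction: str) -> bool:
    if ground_truth == "T":
        return prediction == "T"
    return prediction != "T"  # ground truth is F or N
```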
25
+ ### 使用方法
26
+
27
+ ```bash
28
+ cd evaluation
29
+
30
+ mv .env.example .env
31
+ # 添加API密钥。
32
+
33
+ # Evaluate Chinese or English.
34
+ cd chinese
35
+
36
+ # zero-shot,评估更快更省钱,默认 2-shot
37
+ python evaluate.py --shot 0
38
+ ```
39
+
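As an illustrative sketch (not part of this commit), the `--shot` flag selects one of the two prompt templates shipped in `prompt.py`, following the logic in `evaluate.py`:

```python
import argparse

from prompt import simple_system_prompt, system_prompt_with_2shots

parser = argparse.ArgumentParser(description="Run story understanding evaluation")
parser.add_argument("--shot", choices=["0", "2"], default="2", help="Number of shots (0 or 2)")
args = parser.parse_args()

# 2-shot includes two worked examples in the system prompt; 0-shot is cheaper and faster.
prompt_template = system_prompt_with_2shots if args.shot == "2" else simple_system_prompt
```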
40
+ ### 结果
41
+
42
+ #### 1. 总体准确率
43
+
44
+ 每个模型在所有测试案例中的总体准确率。
45
+
46
+ ![总体基准测试结果](/evaluation/chinese/imgs/Turtle-Benchmark-result.png)
47
+
48
+ #### 2. 故事的平均准确率
49
+
50
+ 为了减轻模型在某个具有大量测试样本的故事上表现不佳带来的偏差,我们分别计算了每个模型在所有 32 个故事中的准确率,并除以 32。
51
+
52
+ ![32个故事的结果](/evaluation/chinese/imgs/Turtle-Benchmark-over-32stories.png)
53
+
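A sketch (not part of the committed files) of this story-averaged metric, assuming the `all_cases_*.json` structure that `evaluate.py` writes (each case carries `story_title`, `ground_truth`, and a per-model `results` map):

```python
import json
from collections import defaultdict

def story_averaged_accuracy(path: str, model_name: str) -> float:
    """Average the per-story accuracies for one model, merging F and N as above."""
    with open(path, encoding="utf-8") as f:
        cases = json.load(f)
    correct, total = defaultdict(int), defaultdict(int)
    for case in cases:
        pred = case["results"].get(model_name)
        if pred not in ("T", "F", "N"):
            continue  # skip invalid or missing responses
        title, gt = case["story_title"], case["ground_truth"]
        total[title] += 1
        if (gt == "T" and pred == "T") or (gt in ("F", "N") and pred != "T"):
            correct[title] += 1
    per_story = [correct[t] / total[t] for t in total]
    return sum(per_story) / len(per_story) if per_story else 0.0
```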
54
+ #### 3. 性能图表
55
+
56
+ 这个散点图比较了 2-shot 学习场景中每个模型的总体准确率(x 轴)和平均故事准确率(y 轴)。
57
+
58
+ ![2-Shot学习性能](/evaluation/chinese/imgs/average_model_accuracy_over_stories_2-shot.png)
59
+
60
+ ### 评测
61
+
62
+ 根据这些结果,我们可以清楚地看到各种模型之间的性能差异:
63
+
64
+ 1. **第一梯队**:Claude 3.5 Sonnet 作为无可争议的领导者脱颖而出,明显优于所有其他模型。
65
+
66
+ 2. **第二梯队**:GPT-4o、Qwen-2(通义千问)、Moonshot AI(月之暗面)、Llama 3.1 405B 和 Minimax 构成第二梯队。虽然我们避免了进一步的细分,但在这个组内,按照所列顺序,性能明显下降。
67
+
68
+ 3. **第三梯队**:豆包(Doubao)、DeepSeek 和 Llama 3.1 70B 构成第三梯队。
69
+
70
+ 4. **第四梯队**:GPT-4o-mini 独自位于第四梯队。
71
+
72
+ 5. **过时**:GPT-3.5 的性能表明它在这个背景下不再具有竞争力。
73
+
74
+ 需要注意的是,这项评估只针对**模型的中文语言理解和推理能力**。未来,视资源和资金情况,我们计划将所有故事和测试问题翻译成英文,并使用英文提示重新运行测试。这将有助于消除可能归因于语言差异的任何性能差异。
75
+
76
+ ## TODO
77
+
78
+ - [ ] 将标注类别重新改为三类:T/F/N,并在[4448](https://github.com/mazzzystar/TurtleBenchmark/blob/dev/evaluation/chinese/data/sorted_cases.list) 条样本上重新标注、测试模型表现。
79
+ - [x] 将数据集和测试样例翻译成英文。
80
+ - [ ] 人工逐条二次确认翻译后的标注准确率。
81
+ - [ ] 使用英语 prompt,在英文模型上评测并给出结果。
82
+
83
+ ### 致谢
84
+
85
+ 衷心感谢:
86
+
87
+ - 五源资本 的**石允丰**(Steven Shi)为这项研究所需的 token 提供慷慨的财务支持。
88
+ - 实习生**赵乾之**(Jerry Zhao)和我一起手工标注了 26,000 条数据。
datasets/TurtleBenchmark/README_en.md ADDED
@@ -0,0 +1,78 @@
1
+ # Turtle Benchmark
2
+
3
+ [中文](./README.md)
4
+
5
+ Turtle Benchmark is a novel, uncheatable benchmark for evaluating Large Language Models (LLMs) based on the "Turtle Soup"(海龟汤) game, focusing on logical reasoning and contextual understanding.
6
+
7
+ ### Highlights
8
+
9
+ - **Objective and Unbiased**: Eliminates the need for background knowledge, focusing purely on reasoning abilities.
10
+ - **Quantifiable Results**: Clear, measurable outcomes (correct/incorrect/unknown) for easy comparison.
11
+ - **Constantly Evolving**: Uses real user-generated questions, making it impossible to "game" the system.
12
+ - **Language Understanding**: Tests the model's ability to comprehend context and make logical inferences.
13
+
14
+ ### Usage
15
+
16
+ ```bash
17
+ cd evaluation
18
+
19
+ mv .env.example .env
20
+ # add API key.
21
+
22
+ # Evaluate Chinese or English.
23
+ cd english
24
+
25
+ # 0-shot for fast & cheap, default: 2-shot.
26
+ python evaluate.py --shot 0
27
+ ```
28
+
29
+ ### Data
30
+
31
+ - 32 unique "Turtle Soup" stories.
32
+ - 1537 human-annotated labels from users' questions.
33
+ - Our evaluation log.
34
+
35
+ ### Results
36
+
37
+ #### 1. Overall Accuracy
38
+
39
+ The overall accuracy of each model across all test cases.
40
+
41
+ ![Overall Benchmark Results](/evaluation/chinese/imgs/Turtle-Benchmark-result.png)
42
+
43
+ #### 2. Average Accuracy Across Stories
44
+
45
+ To mitigate potential bias from models performing poorly on specific stories with a large number of test samples, we calculated the average accuracy for each model across all 32 stories individually.
46
+
47
+ ![Results Across 32 Stories](/evaluation/chinese/imgs/Turtle-Benchmark-over-32stories.png)
48
+
49
+ #### 3. Performance Chart
50
+
51
+ This scatter plot compares the overall accuracy (x-axis) with the average story accuracy (y-axis) for each model in the 2-shot learning scenario.
52
+
53
+ ![2-Shot Learning Performance](/evaluation/chinese/imgs/average_model_accuracy_over_stories_2-shot.png)
54
+
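A sketch (not part of the committed files) of how such a scatter plot can be reproduced once both metrics are computed; it assumes matplotlib is installed, which is not guaranteed by the repository's requirements:

```python
import matplotlib.pyplot as plt

def plot_accuracy_scatter(overall: dict, per_story: dict, out_path: str) -> None:
    """overall / per_story map a model name to an accuracy in [0, 1]."""
    fig, ax = plt.subplots()
    for name in overall:
        ax.scatter(overall[name], per_story[name])
        ax.annotate(name, (overall[name], per_story[name]))
    ax.set_xlabel("Overall accuracy")
    ax.set_ylabel("Average accuracy over 32 stories")
    fig.savefig(out_path, bbox_inches="tight")
```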
55
+ ### Interpretation
56
+
57
+ Based on these results, we can clearly see the performance differences among the various models:
58
+
59
+ 1. **First Tier**: Claude 3.5 Sonnet stands out as the undisputed leader, significantly outperforming all other models.
60
+
61
+ 2. **Second Tier**: GPT-4o, Qwen-2 (通义千问), Moonshot AI (月之暗面), Llama 3.1 405B, and Minimax form the second tier. While we've avoided further subdivisions, there's a noticeable decrease in performance within this group, following the order listed.
62
+
63
+ 3. **Third Tier**: Doubao (豆包), DeepSeek, and Llama 3.1 70B constitute the third tier.
64
+
65
+ 4. **Fourth Tier**: GPT-4o-mini stands alone in the fourth tier.
66
+
67
+ 5. **Obsolete**: GPT-3.5's performance suggests it's no longer competitive in this context.
68
+
69
+ It's important to note that this evaluation specifically targets the models' Chinese language understanding and reasoning capabilities. In the future, pending resources and funding, we plan to translate all stories and test questions into English and re-run the tests using English prompts. This will help eliminate any performance discrepancies that may be attributed to language differences.
70
+
71
+ ### Acknowledgments
72
+
73
+ We would like to express our gratitude to:
74
+
75
+ - **Steven Shi (石允丰)** from 5Y Capital for his generous financial support of the token usage required for this research.
76
+ - **Jerry Zhao (赵乾之)** for his invaluable assistance in annotating over 26,000 data points.
77
+
78
+ Your contributions have been instrumental in making this benchmark possible.
datasets/TurtleBenchmark/evaluation/.env.example ADDED
@@ -0,0 +1,13 @@
1
+ # This file contains the API keys for the various services used
2
+ # in the evaluation, you can delete the keys that are not needed.
3
+ OPENAI_API_KEY = "sk-"
4
+ DEEPSEEK_API_KEY = "sk-"
5
+ MOONSHOT_API_KEY = "sk-"
6
+ ANTHROPIC_API_KEY = "sk-"
7
+ ZHIPU_API_KEY = ""
8
+ LEPTON_API_KEY = ""
9
+ TOGETHER_API_KEY = ""
10
+ DOUBAO_API_KEY = ""
11
+ ANTHROPIC_API_KEY = ""
12
+ MINIMAX_GROUP_ID = ""
13
+ MINIMAX_API_KEY = ""
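After copying this template to `.env` and filling in the keys you need, the evaluation scripts pick them up through `python-dotenv`; a minimal sketch of that loading step, as done in `model_configs.py` (only the keys for the providers you actually evaluate have to be set):

```python
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

openai_key = os.getenv("OPENAI_API_KEY")
if not openai_key or openai_key == "sk-":
    raise RuntimeError("OPENAI_API_KEY is missing; edit .env before running evaluate.py")
```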
datasets/TurtleBenchmark/evaluation/chinese/data/cases.list ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d1e4f7f2e6b9eea02fae049cad459176239e79e583fbad2ba405a8b88a5a1d18
3
+ size 67259
datasets/TurtleBenchmark/evaluation/chinese/data/results_0shot.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:42f0990abd76801f5c918e9e5771a981f5a9f60fed5d94cd09ce6775c480be30
3
+ size 666692
datasets/TurtleBenchmark/evaluation/chinese/data/results_2shot.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c3d11da6b7f8c9332dba33d7936c29b8d8441de2daca71f1444ed4d33dc90f84
3
+ size 666710
datasets/TurtleBenchmark/evaluation/chinese/data/sorted_cases.list ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9d3ca17f0975ab55997c61969223547dac0da73e968c1d1a5b778cb1368e810d
3
+ size 183764
datasets/TurtleBenchmark/evaluation/chinese/data/sorted_cases.txt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c0c24879b4a67912493c8bd3d804fc3aac72213e5d84d9ebe7d806bae89de65e
3
+ size 67262
datasets/TurtleBenchmark/evaluation/chinese/data/stories.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b895306b193d5b1c1d4980239c0094a48b48b39d6a98f79508248c41a9855403
3
+ size 20329
datasets/TurtleBenchmark/evaluation/chinese/data/titles.txt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c85c3e0ab6f9f8ce57aaa65991f8a38d996efbf5643283f997e2baf1185ff672
3
+ size 339
datasets/TurtleBenchmark/evaluation/chinese/evaluate.py ADDED
@@ -0,0 +1,377 @@
1
+ import argparse
2
+ import concurrent.futures
3
+ import json
4
+ import os
5
+ import random
6
+ from functools import partial
7
+
8
+ import requests
9
+ from anthropic import Anthropic
10
+ from openai import OpenAI
11
+ from together import Together
12
+ from tqdm import tqdm
13
+
14
+ from model_configs import models
15
+ from prompt import simple_system_prompt, system_prompt_with_2shots
16
+
17
+
18
+ # Load stories
19
+ with open("data/stories.json", "r", encoding="utf-8") as f:
20
+ stories = json.load(f)
21
+
22
+ def load_test_cases(filename):
23
+ with open(filename, "r", encoding="utf-8") as f:
24
+ _test_cases = []
25
+ for line in f:
26
+ parts = line.strip().replace(" ", "").split("\t")
27
+ if len(parts) != 3:
28
+ print(f"Invalid test case: {line}")
29
+ continue
30
+ if parts[2] not in ["T", "F", "N"]:
31
+ print(f"Skipping line with invalid ground truth: {line}")
32
+ continue
33
+ _test_cases.append(parts)
34
+ return _test_cases
35
+
36
+ def starts_with_answer(response, answer):
37
+ return response.strip().lower().startswith(answer)
38
+
39
+ def call_api(model, prompt, user_input):
40
+ try:
41
+ if model["type"] == "openai":
42
+ if model["name"] == "Doubao-4k":
43
+ client = OpenAI(
44
+ api_key=model["config"]["apiKey"],
45
+ base_url=model["config"]["baseURL"]
46
+ )
47
+
48
+ messages = [
49
+ {"role": "system", "content": prompt},
50
+ {"role": "user", "content": user_input}
51
+ ]
52
+
53
+ response = client.chat.completions.create(
54
+ model=model["config"]["model"],
55
+ messages=messages,
56
+ max_tokens=model["config"]["maxTokens"],
57
+ temperature=model["config"]["temperature"],
58
+ top_p=model["config"]["top_p"],
59
+ stream=False
60
+ )
61
+
62
+ return response.choices[0].message.content
63
+ else:
64
+ url = model["config"]["baseURL"] + "/chat/completions"
65
+ headers = {
66
+ "Content-Type": "application/json",
67
+ "Authorization": f"Bearer {model['config']['apiKey']}"
68
+ }
69
+ data = {
70
+ "model": model["config"]["model"],
71
+ "messages": [
72
+ {"role": "system", "content": prompt},
73
+ {"role": "user", "content": user_input}
74
+ ],
75
+ "max_tokens": model["config"]["maxTokens"],
76
+ "temperature": model["config"]["temperature"],
77
+ }
78
+
79
+ if "top_p" in model["config"]:
80
+ data["top_p"] = model["config"]["top_p"]
81
+
82
+ response = requests.post(url, headers=headers, json=data)
83
+ if response.status_code != 200:
84
+ raise Exception(f"API call failed with status {response.status_code}: {response.text}")
85
+ result = response.json()
86
+ return result["choices"][0]["message"]["content"]
87
+
88
+ elif model["type"] == "together":
89
+ client = Together(api_key=model["config"]["apiKey"])
90
+
91
+ messages = [
92
+ {"role": "system", "content": prompt},
93
+ {"role": "user", "content": user_input}
94
+ ]
95
+
96
+ response = client.chat.completions.create(
97
+ model=model["config"]["model"],
98
+ messages=messages,
99
+ max_tokens=model["config"]["maxTokens"],
100
+ temperature=model["config"]["temperature"],
101
+ top_p=model["config"]["top_p"],
102
+ repetition_penalty=model["config"]["repetition_penalty"],
103
+ stop=model["config"]["stop"],
104
+ stream=False
105
+ )
106
+
107
+ return response.choices[0].message.content
108
+
109
+ elif model["type"] == "anthropic":
110
+ client = Anthropic(api_key=model["config"]["apiKey"])
111
+
112
+ message = client.messages.create(
113
+ model=model["config"]["model"],
114
+ max_tokens=model["config"]["maxTokens"],
115
+ temperature=model["config"]["temperature"],
116
+ system=prompt,
117
+ messages=[
118
+ {
119
+ "role": "user",
120
+ "content": [
121
+ {
122
+ "type": "text",
123
+ "text": user_input
124
+ }
125
+ ]
126
+ }
127
+ ]
128
+ )
129
+
130
+ return message.content[0].text
131
+
132
+ elif model["type"] == "minimax":
133
+ url = f"https://api.minimax.chat/v1/text/chatcompletion_v2?GroupId={model['config']['groupId']}"
134
+ headers = {
135
+ "Authorization": f"Bearer {model['config']['apiKey']}",
136
+ "Content-Type": "application/json"
137
+ }
138
+
139
+ payload = {
140
+ "model": model["config"]["model"],
141
+ "messages": [
142
+ {
143
+ "role": "system",
144
+ "name": "MM智能助理",
145
+ "content": prompt
146
+ },
147
+ {
148
+ "role": "user",
149
+ "content": user_input
150
+ }
151
+ ],
152
+ "tools": [],
153
+ "tool_choice": "none",
154
+ "stream": False,
155
+ "max_tokens": model["config"]["maxTokens"],
156
+ "temperature": model["config"]["temperature"],
157
+ "top_p": model["config"]["top_p"]
158
+ }
159
+
160
+ response = requests.post(url, headers=headers, json=payload)
161
+ if response.status_code != 200:
162
+ raise Exception(f"API call failed with status {response.status_code}: {response.text}")
163
+
164
+ result = response.json()
165
+ return result["choices"][0]["message"]["content"]
166
+
167
+ else:
168
+ raise ValueError(f"Unsupported model type: {model['type']}")
169
+ except Exception as e:
170
+ print(f"Error in call_api for model {model['name']}: {str(e)}")
171
+ return None
172
+
173
+ def call_api_with_timeout(model, prompt, user_input, timeout=20):
174
+ try:
175
+ return call_api(model, prompt, user_input)
176
+ except Exception as e:
177
+ print(f"Error in call_api for model {model['name']}: {str(e)}")
178
+ return None
179
+
180
+ def evaluate_models(models, test_cases, stories, shot_type):
181
+ results = {model['name']: {'correct': 0, 'total': 0} for model in models}
182
+ logs = {model['name']: [] for model in models}
183
+ challenging_cases = []
184
+ all_cases = []
185
+
186
+ # Determine the appropriate log folder based on shot_type
187
+ log_folder = f"logs_with_{shot_type}shots"
188
+ os.makedirs(log_folder, exist_ok=True)
189
+
190
+ # Find the last tested sample
191
+ last_tested = 0
192
+ for i in range(len(test_cases), 0, -1):
193
+ if os.path.exists(f"{log_folder}/all_cases_simple_prompt_{i}.json"):
194
+ with open(f"{log_folder}/all_cases_simple_prompt_{i}.json", "r", encoding="utf-8") as f:
195
+ all_cases = json.load(f)
196
+ last_tested = i
197
+ break
198
+
199
+ # Update results with previously tested samples
200
+ for case in all_cases:
201
+ for model_name, result in case['results'].items():
202
+ if result is not None:
203
+ results[model_name]['total'] += 1
204
+ if (case['ground_truth'] == "T" and result == "T") or \
205
+ ((case['ground_truth'] == "F" or case['ground_truth'] == "N") and result != "T"):
206
+ results[model_name]['correct'] += 1
207
+
208
+ # Start from the next untested sample
209
+ start_index = len(all_cases)
210
+
211
+ for i, (user_input, story_title, ground_truth) in enumerate(tqdm(test_cases[start_index:]), start_index + 1):
212
+ try:
213
+ story = next((s for s in stories if s["title"] == story_title), None)
214
+ if not story:
215
+ print(f"Story not found: {story_title}")
216
+ continue
217
+
218
+ # Use the appropriate prompt based on shot_type
219
+ if shot_type == "2":
220
+ prompt_template = system_prompt_with_2shots
221
+ else:
222
+ prompt_template = simple_system_prompt
223
+
224
+ prompt = prompt_template.replace("{surface}", story["surface"]).replace("{bottom}", story["bottom"])
225
+ gt_map = {"T": "对", "F": "错", "N": "不知道"}
226
+
227
+ case_results = {}
228
+ all_responses_valid = True
229
+
230
+ # Use ThreadPoolExecutor for concurrent API calls
231
+ with concurrent.futures.ThreadPoolExecutor(max_workers=len(models)) as executor:
232
+ future_to_model = {executor.submit(partial(call_api_with_timeout, timeout=20), model, prompt, user_input): model for model in models}
233
+ for future in concurrent.futures.as_completed(future_to_model):
234
+ model = future_to_model[future]
235
+ try:
236
+ response = future.result()
237
+ if response is None:
238
+ all_responses_valid = False
239
+ print(f"Timeout or error for model {model['name']}")
240
+ else:
241
+ case_results[model['name']] = response
242
+ except Exception as exc:
243
+ print(f'{model["name"]} generated an exception: {exc}')
244
+ all_responses_valid = False
245
+
246
+ # If any model timed out or had an error, skip this entire test case
247
+ if not all_responses_valid:
248
+ print(f"Skipping test case {i} due to timeout or error")
249
+ continue
250
+
251
+ # Process all responses
252
+ for model in models:
253
+ if model['name'] not in case_results:
254
+ continue
255
+ response = case_results[model['name']].strip().lower()
256
+
257
+ if starts_with_answer(response, "对") or starts_with_answer(response, "错") or starts_with_answer(response, "不知道"):
258
+ results[model['name']]['total'] += 1
259
+
260
+ # Save the actual model output
261
+ if starts_with_answer(response, "对"):
262
+ case_results[model['name']] = "T"
263
+ elif starts_with_answer(response, "错"):
264
+ case_results[model['name']] = "F"
265
+ else:
266
+ case_results[model['name']] = "N"
267
+
268
+ # Calculate accuracy (merging N and F)
269
+ if (ground_truth == "T" and case_results[model['name']] == "T") or \
270
+ ((ground_truth == "F" or ground_truth == "N") and case_results[model['name']] != "T"):
271
+ results[model['name']]['correct'] += 1
272
+ else:
273
+ # Print only wrong answers
274
+ print(f"Wrong Answer - Model: {model['name']}, Input: {user_input}, Response: {response}, GT: {gt_map[ground_truth]}, Model Output: {case_results[model['name']]}")
275
+ else:
276
+ # Handle invalid responses
277
+ case_results[model['name']] = "Invalid"
278
+ print(f"Invalid Response - Model: {model['name']}, Input: {user_input}, Response: {response}, GT: {gt_map[ground_truth]}, Model Output: {case_results[model['name']]}")
279
+
280
+ log_entry = {
281
+ "Input": user_input,
282
+ "Response": response,
283
+ "GT": gt_map[ground_truth],
284
+ "Model_Output": case_results[model['name']],
285
+ "Accuracy": f"{results[model['name']]['correct']}/{results[model['name']]['total']} ({results[model['name']]['correct']/max(results[model['name']]['total'], 1):.2f})"
286
+ }
287
+ logs[model['name']].append(log_entry)
288
+
289
+ case = {
290
+ "input": user_input,
291
+ "story_title": story_title,
292
+ "ground_truth": ground_truth,
293
+ "results": case_results
294
+ }
295
+
296
+ all_cases.append(case)
297
+
298
+ if any(result != "T" for result in case_results.values()):
299
+ challenging_cases.append(case)
300
+
301
+ # Save log and print accuracy ranking every 10 steps
302
+ if i % 10 == 0 or i == len(test_cases):
303
+ print(f"\nCurrent rankings after {i} items:")
304
+ current_results = [(name, res['correct'] / max(res['total'], 1), res['correct'], res['total'])
305
+ for name, res in results.items()]
306
+ current_results.sort(key=lambda x: x[1], reverse=True)
307
+
308
+ for rank, (name, accuracy, correct, total) in enumerate(current_results, 1):
309
+ print(f"{rank}. {name}: {accuracy:.2f} ({correct}/{total})")
310
+
311
+ # Update challenging cases file
312
+ with open(f"{log_folder}/challenging_cases_simple_prompt_{i}.json", "w", encoding="utf-8") as f:
313
+ json.dump(challenging_cases, f, ensure_ascii=False, indent=2)
314
+
315
+ # Update all cases file
316
+ with open(f"{log_folder}/all_cases_simple_prompt_{i}.json", "w", encoding="utf-8") as f:
317
+ json.dump(all_cases, f, ensure_ascii=False, indent=2)
318
+
319
+ except Exception as e:
320
+ print(f"Error processing test case {i}: {str(e)}")
321
+ continue
322
+
323
+ # Final update to challenging cases file
324
+ final_index = start_index + len(test_cases[start_index:])
325
+ with open(f"{log_folder}/challenging_cases_simple_prompt_{final_index}.json", "w", encoding="utf-8") as f:
326
+ json.dump(challenging_cases, f, ensure_ascii=False, indent=2)
327
+
328
+ # Final update to all cases file
329
+ with open(f"{log_folder}/all_cases_simple_prompt_{final_index}.json", "w", encoding="utf-8") as f:
330
+ json.dump(all_cases, f, ensure_ascii=False, indent=2)
331
+
332
+ return results, challenging_cases, all_cases
333
+
334
+ def save_all_cases(all_cases, output_file):
335
+ with open(output_file, "w", encoding="utf-8") as f:
336
+ json.dump(all_cases, f, ensure_ascii=False, indent=2)
337
+
338
+ print(f"All cases have been saved to {output_file}")
339
+
340
+ def parse_challenging_cases(input_file, output_file):
341
+ with open(input_file, 'r', encoding='utf-8') as f:
342
+ challenging_cases = json.load(f)
343
+
344
+ with open(output_file, 'w', encoding='utf-8') as f:
345
+ for case in challenging_cases:
346
+ user_input = case['input']
347
+ story_title = case['story_title']
348
+ ground_truth = case['ground_truth']
349
+ f.write(f"{user_input}\t{story_title}\t{ground_truth}\n")
350
+
351
+ print(f"Parsed challenging cases have been written to {output_file}")
352
+
353
+
354
+ def main():
355
+ # Parse command line arguments
356
+ parser = argparse.ArgumentParser(description="Run story understanding evaluation")
357
+ parser.add_argument("--shot", choices=["0", "2"], default="2", help="Number of shots (0 or 2)")
358
+ args = parser.parse_args()
359
+
360
+ _models = [model for model in models if model['name'] in ['DEEPSEEK', 'Kimi-Chat', 'GPT-4o-mini']]
361
+ test_cases = load_test_cases("data/cases.list")
362
+ _test_cases = random.sample(test_cases, k=100)
363
+ results, challenging_cases, all_cases = evaluate_models(_models, _test_cases, stories, args.shot)
364
+
365
+ final_results = [(name, res['correct'] / max(res['total'], 1), res['correct'], res['total'])
366
+ for name, res in results.items()]
367
+ final_results.sort(key=lambda x: x[1], reverse=True)
368
+
369
+ print(f"\nFinal Rankings ({args.shot}-shot):")
370
+ for rank, (name, accuracy, correct, total) in enumerate(final_results, 1):
371
+ print(f"{rank}. {name}: {accuracy:.2f} ({correct}/{total})")
372
+ log_folder = f"logs_with_{args.shot}shots"
373
+ print(f"Evaluation complete. Logs have been saved in the '{log_folder}' directory.")
374
+
375
+
376
+ if __name__ == "__main__":
377
+ main()
datasets/TurtleBenchmark/evaluation/chinese/imgs/Turtle-Benchmark-over-32stories.png ADDED
datasets/TurtleBenchmark/evaluation/chinese/imgs/Turtle-Benchmark-result.png ADDED
datasets/TurtleBenchmark/evaluation/chinese/imgs/average_model_accuracy_over_stories_2-shot.png ADDED
datasets/TurtleBenchmark/evaluation/chinese/model_configs.py ADDED
@@ -0,0 +1,144 @@
1
+ import os
2
+
3
+ from dotenv import load_dotenv
4
+
5
+ MAX_TOKENS = 5
6
+ # Load environment variables
7
+ load_dotenv()
8
+
9
+ # Define the models and their configurations
10
+ models = [
11
+ {
12
+ "name": "DEEPSEEK",
13
+ "config": {
14
+ "apiKey": os.getenv("DEEPSEEK_API_KEY"),
15
+ "baseURL": "https://api.deepseek.com",
16
+ "model": "deepseek-chat",
17
+ "maxTokens": MAX_TOKENS,
18
+ "temperature": 0.0,
19
+ "top_p": 1
20
+ },
21
+ "type": "openai"
22
+ },
23
+ {
24
+ "name": "GPT-3.5-Turbo",
25
+ "config": {
26
+ "apiKey": os.getenv("OPENAI_API_KEY"),
27
+ "baseURL": "https://api.openai.com/v1",
28
+ "model": "gpt-3.5-turbo",
29
+ "maxTokens": MAX_TOKENS,
30
+ "temperature": 0.0,
31
+ "top_p": 1
32
+ },
33
+ "type": "openai"
34
+ },
35
+ {
36
+ "name": "Kimi-Chat",
37
+ "config": {
38
+ "apiKey": os.getenv("MOONSHOT_API_KEY"),
39
+ "baseURL": "https://api.moonshot.cn/v1",
40
+ "model": "moonshot-v1-8k",
41
+ "maxTokens": MAX_TOKENS,
42
+ "temperature": 0.0,
43
+ "top_p": 1
44
+ },
45
+ "type": "openai"
46
+ },
47
+ {
48
+ "name": "GPT-4o",
49
+ "config": {
50
+ "apiKey": os.getenv("OPENAI_API_KEY"),
51
+ "baseURL": "https://api.openai.com/v1",
52
+ "model": "gpt-4o",
53
+ "maxTokens": MAX_TOKENS,
54
+ "temperature": 0.0,
55
+ "top_p": 1
56
+ },
57
+ "type": "openai"
58
+ },
59
+ {
60
+ "name": "GPT-4o-mini",
61
+ "config": {
62
+ "apiKey": os.getenv("OPENAI_API_KEY"),
63
+ "baseURL": "https://api.openai.com/v1",
64
+ "model": "gpt-4o-mini",
65
+ "maxTokens": MAX_TOKENS,
66
+ "temperature": 0.0,
67
+ "top_p": 1
68
+ },
69
+ "type": "openai"
70
+ },
71
+ {
72
+ "name": "Llama-3.1-405b",
73
+ "config": {
74
+ "apiKey": os.getenv("TOGETHER_API_KEY"),
75
+ "model": "meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo",
76
+ "maxTokens": MAX_TOKENS,
77
+ "temperature": 0.0,
78
+ "top_p": 1,
79
+ "repetition_penalty": 1,
80
+ "stop": ["<|eot_id|>"]
81
+ },
82
+ "type": "together"
83
+ },
84
+ {
85
+ "name": "Llama3.1-70b",
86
+ "config": {
87
+ "apiKey": os.getenv("TOGETHER_API_KEY"),
88
+ "model": "meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
89
+ "maxTokens": MAX_TOKENS,
90
+ "temperature": 0.0,
91
+ "top_p": 1,
92
+ "repetition_penalty": 1,
93
+ "stop": ["<|eot_id|>"]
94
+ },
95
+ "type": "together"
96
+ },
97
+ {
98
+ "name": "Qwen2-72B-Instruct",
99
+ "config": {
100
+ "apiKey": os.getenv("TOGETHER_API_KEY"),
101
+ "model": "Qwen/Qwen2-72B-Instruct",
102
+ "maxTokens": MAX_TOKENS,
103
+ "temperature": 0.0,
104
+ "top_p": 1,
105
+ "repetition_penalty": 1,
106
+ "stop": ["<|im_start|>", "<|im_end|>"]
107
+ },
108
+ "type": "together"
109
+ },
110
+ {
111
+ "name": "Doubao-4k",
112
+ "config": {
113
+ "apiKey": os.getenv("DOUBAO_API_KEY"),
114
+ "baseURL": "https://ark.cn-beijing.volces.com/api/v3",
115
+ "model": "ep-20240802142948-6vvc7", # Replace with the actual endpoint ID if different
116
+ "maxTokens": MAX_TOKENS,
117
+ "temperature": 0.0,
118
+ "top_p": 1
119
+ },
120
+ "type": "openai"
121
+ },
122
+ {
123
+ "name": "Claude-3.5-Sonnet",
124
+ "config": {
125
+ "apiKey": os.getenv("ANTHROPIC_API_KEY"),
126
+ "model": "claude-3-5-sonnet-20240620",
127
+ "maxTokens": MAX_TOKENS,
128
+ "temperature": 0.0,
129
+ },
130
+ "type": "anthropic"
131
+ },
132
+ {
133
+ "name": "MiniMax-ABAB6.5s",
134
+ "config": {
135
+ "groupId": os.getenv("MINIMAX_GROUP_ID"),
136
+ "apiKey": os.getenv("MINIMAX_API_KEY"),
137
+ "model": "abab6.5s-chat",
138
+ "maxTokens": MAX_TOKENS,
139
+ "temperature": 0.01, # must be (0, 1]
140
+ "top_p": 1
141
+ },
142
+ "type": "minimax"
143
+ },
144
+ ]
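The Chinese `evaluate.py` consumes this list by filtering on the `name` field; a small usage sketch (not part of the committed files), using names that appear in the configs above:

```python
from model_configs import models

# Evaluate only a subset of the configured models, as evaluate.py's main() does.
selected = {"DEEPSEEK", "Kimi-Chat", "GPT-4o-mini"}
_models = [m for m in models if m["name"] in selected]
for m in _models:
    print(m["name"], m["type"], m["config"]["model"])
```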
datasets/TurtleBenchmark/evaluation/chinese/prompt.py ADDED
@@ -0,0 +1,77 @@
1
+ simple_system_prompt = """
2
+ 你是一个游戏的裁判,这个游戏会给玩家展示<汤面>,并告诉你<汤底>,你需要根据<汤面>和<汤底>理解整个故事。玩家会根据<汤面>进行猜测,你需要判断玩家的猜测是否正确,请严格遵守你只回答指定三种答案:对、错、不知道。
3
+
4
+ ## 判定规则
5
+ - 玩家提出的猜测正确,或者答案是肯定的:请只回答"对",不要做任何解释。
6
+ - 玩家提出的猜测错误,或者答案是否定的:请只回答"错",不要做任何解释。
7
+ - 玩家提出的猜测,从<汤面>和<汤底>找不到答案,并且也无法通过推理得出此结论:请只回答"不知道",不要做任何解释。
8
+
9
+ ## 注意
10
+ 1. 玩家只能看到<汤面>,所以他是基于<汤面>进行猜测的,例如:玩家问“他喝的不是海龟汤”,是在问<汤面>中他喝的是不是海龟汤,即使<汤底>中他曾经喝过其他的汤,你也应该判定他在<汤面>中喝的是否是海龟汤。
11
+ 2. 凡是无法从提供的故事中得出的结论,都应该回答"不知道",例如:玩家提出的猜测是关于故事中的细节,而这些细节并没有在故事中提到,也无法通过推理得出此结论,那么你应该回答"不知道"。
12
+ 3. 严格遵守只回答指定三种答案:对、错、不知道。
13
+
14
+ ## 题目内容
15
+ ### 汤面
16
+ {surface}
17
+
18
+ ### 汤底
19
+ {bottom}
20
+
21
+ 现在,请判断以下玩家猜测:
22
+ """
23
+
24
+
25
+ system_prompt_with_2shots = """
26
+ 你是一个游戏的裁判,这个游戏会给玩家展示<汤面>,并告诉你<汤底>,你需要根据<汤面>和<汤底>理解整个故事。玩家会根据<汤面>进行猜测,你需要判断玩家的猜测是否正确,请严格遵守你只回答指定三种答案:对、错、不知道。
27
+
28
+ ## 判定规则
29
+ - 玩家提出的猜测正确,或者答案是肯定的:请只回答"对",不要做任何解释。
30
+ - 玩家提出的猜测错误,或者答案是否定的:请只回答"错",不要做任何解释。
31
+ - 玩家提出的猜测,从<汤面>和<汤底>找不到答案,并且也无法通过推理得出此结论:请只回答"不知道",不要做任何解释。
32
+
33
+ ## 注意
34
+ - 请充分理解整个故事的起因、经过和结局,并进行合乎逻辑的推断,如果无法从提供的故事中得出的结论,你应该回答"不知道",例如:玩家提出的猜测是关于故事中的细节,而这些细节并没有在故事中提到,也无法通过推理得出此结论,那么你应该回答"不知道"。
35
+ - 严格遵守只回答指定三种答案:对、错、不知道。
36
+
37
+ ## 示例
38
+
39
+ ### 示例1:打嗝男子
40
+ <汤面>
41
+ 一个男人走进一家酒吧,并向酒保要了一杯水。酒保却突然拿出一把手枪瞄准他,而男子竟只是笑著说:“谢谢你!”然后从容离开,请问发生了什么事?
42
+
43
+ <汤底>
44
+ 男子打嗝,他希望喝一杯水来改善状况。酒保意识到这一点,选择拿枪吓他,男子一紧张之下,打嗝自然消失,因而衷心感谢酒保后就离开了。
45
+
46
+ 可能的猜测及对应的回答:
47
+ 问:男人有慢性病吗? 答:不知道
48
+ 问:男人是被吓跑了吗 答:错
49
+ 问:酒保想杀死男人 答:错
50
+ 问:酒保是为了吓唬男人 答:对
51
+ 问:男子衷心感谢酒保 答:对
52
+
53
+ ### 示例2:四岁的妈妈
54
+ <汤面>
55
+ 幼儿园五岁的小朋友,竟然说她的妈妈只有四岁,我很疑惑,便提出了去她家家访,随后我在她家看到了让我惊恐的一幕...
56
+
57
+ <汤底>
58
+ 我在她家看到了一个个被铁链拴着的女人,而一旁站着一个长相凶狠丑陋的彪形大汉。幼儿园的那个小朋友突然露出了不属于她这个年龄该有的诡笑...原来她是二十五岁,只是从小得了一种长不大的病,而那个彪形大汉则是她的哥哥,她是为了给她哥哥找女人,便哄骗我们这些幼师来到家里进行迫害…而她那\"四岁的妈妈\",是已经被骗来四年的女人…
59
+
60
+ 可能的猜测及对应的回答:
61
+ 问:小朋友已经死了 答:错
62
+ 问:小朋友其实是成年人 答:对
63
+ 问:小朋友有精神分裂 答:不知道
64
+ 问:我会有危险 答:对
65
+
66
+
67
+ 现在开始主持:
68
+
69
+ ## 题目内容
70
+ ### 汤面
71
+ {surface}
72
+
73
+ ### 汤底
74
+ {bottom}
75
+
76
+ 现在,请判断以下玩家猜测:
77
+ """
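The judge is constrained to answer with a single word (对 / 错 / 不知道); a minimal sketch (not part of the committed files) of how `evaluate.py` maps that answer back to the dataset labels via a prefix check:

```python
def normalize_answer(response: str) -> str:
    """Map the judge's one-word answer to the dataset labels T / F / N."""
    text = response.strip().lower()
    if text.startswith("对"):
        return "T"
    if text.startswith("错"):
        return "F"
    if text.startswith("不知道"):
        return "N"
    return "Invalid"  # any other output is treated as an invalid response
```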
datasets/TurtleBenchmark/evaluation/english/data/cases.list ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1f9175bf4ffdf27bf36a3847a416ea9730115ba5aea4938787e37f20f0960e66
3
+ size 109980
datasets/TurtleBenchmark/evaluation/english/data/stories.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:7782461b007bd7404e2146358344179c3f0c575dc031384588e8efee6935899b
3
+ size 24669
datasets/TurtleBenchmark/evaluation/english/evaluate.py ADDED
@@ -0,0 +1,620 @@
1
+ import os
2
+ import json
3
+ import asyncio
4
+ import requests
5
+ import aiohttp
6
+ from prompt import simple_system_prompt, system_prompt_with_2shots
7
+ from dotenv import load_dotenv
8
+ from tqdm import tqdm
9
+ from openai import OpenAI
10
+ from anthropic import Anthropic
11
+ from together import Together
12
+ import concurrent.futures
13
+ from functools import partial
14
+ import threading
15
+ from tqdm import tqdm
16
+ import argparse
17
+ import time
18
+
19
+ # Load environment variables
20
+ load_dotenv()
21
+
22
+ MAX_TOKENS = 4
23
+
24
+ # Define the models and their configurations
25
+ models = [
26
+ # {
27
+ # "name": "Gemini-1.5-Pro",
28
+ # "config": {
29
+ # "apiKey": os.getenv("GEMINI_API_KEY"),
30
+ # "model": "gemini-1.5-pro",
31
+ # "maxTokens": MAX_TOKENS,
32
+ # "temperature": 0.0,
33
+ # },
34
+ # "type": "gemini"
35
+ # },
36
+ {
37
+ "name": "DEEPSEEK",
38
+ "config": {
39
+ "apiKey": os.getenv("DEEPSEEK_API_KEY"),
40
+ "baseURL": "https://api.deepseek.com",
41
+ "model": "deepseek-chat",
42
+ "maxTokens": MAX_TOKENS,
43
+ "temperature": 0.0,
44
+ "top_p": 0.7,
45
+ },
46
+ "type": "openai"
47
+ },
48
+ {
49
+ "name": "GPT-3.5-Turbo",
50
+ "config": {
51
+ "apiKey": os.getenv("OPENAI_API_KEY"),
52
+ "baseURL": "https://api.openai.com/v1",
53
+ "model": "gpt-3.5-turbo",
54
+ "maxTokens": MAX_TOKENS,
55
+ "temperature": 0.0,
56
+ "top_p": 0.7,
57
+ },
58
+ "type": "openai"
59
+ },
60
+ # {
61
+ # "name": "Kimi-Chat",
62
+ # "config": {
63
+ # "apiKey": os.getenv("MOONSHOT_API_KEY"),
64
+ # "baseURL": "https://api.moonshot.cn/v1",
65
+ # "model": "moonshot-v1-8k",
66
+ # "maxTokens": MAX_TOKENS,
67
+ # "temperature": 0.0,
68
+ # "top_p": 0.7,
69
+ # },
70
+ # "type": "openai"
71
+ # },
72
+ {
73
+ "name": "GPT-4o",
74
+ "config": {
75
+ "apiKey": os.getenv("OPENAI_API_KEY"),
76
+ "baseURL": "https://api.openai.com/v1",
77
+ "model": "gpt-4o-2024-05-13",
78
+ "maxTokens": MAX_TOKENS,
79
+ "temperature": 0.0,
80
+ "top_p": 0.7,
81
+ },
82
+ "type": "openai"
83
+ },
84
+ {
85
+ "name": "GPT-4o-mini",
86
+ "config": {
87
+ "apiKey": os.getenv("OPENAI_API_KEY"),
88
+ "baseURL": "https://api.openai.com/v1",
89
+ "model": "gpt-4o-mini",
90
+ "maxTokens": MAX_TOKENS,
91
+ "temperature": 0.0,
92
+ "top_p": 0.7,
93
+ },
94
+ "type": "openai"
95
+ },
96
+ {
97
+ "name": "Llama-3.1-405b",
98
+ "config": {
99
+ "apiKey": os.getenv("TOGETHER_API_KEY"),
100
+ "model": "meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo",
101
+ "maxTokens": MAX_TOKENS,
102
+ "temperature": 0.0,
103
+ "top_p": 0.7,
104
+ "stop": ["<|eot_id|>"]
105
+ },
106
+ "type": "together"
107
+ },
108
+ {
109
+ "name": "Llama3.1-70b",
110
+ "config": {
111
+ "apiKey": os.getenv("TOGETHER_API_KEY"),
112
+ "model": "meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
113
+ "maxTokens": MAX_TOKENS,
114
+ "temperature": 0.0,
115
+ "top_p": 0.7,
116
+ "stop": ["<|eot_id|>"]
117
+ },
118
+ "type": "together"
119
+ },
120
+ {
121
+ "name": "Qwen2-72B-Instruct",
122
+ "config": {
123
+ "apiKey": os.getenv("TOGETHER_API_KEY"),
124
+ "model": "Qwen/Qwen2-72B-Instruct",
125
+ "maxTokens": MAX_TOKENS,
126
+ "temperature": 0.0,
127
+ "top_p": 0.7,
128
+ "stop": ["<|im_start|>", "<|im_end|>"]
129
+ },
130
+ "type": "together"
131
+ },
132
+ # {
133
+ # "name": "Yi-34B-Chat",
134
+ # "config": {
135
+ # "apiKey": os.getenv("TOGETHER_API_KEY"),
136
+ # "model": "zero-one-ai/Yi-34B-Chat",
137
+ # "maxTokens": MAX_TOKENS,
138
+ # "temperature": 0.0,
139
+ # "top_p": 0.7,
140
+ # "stop": ["<|im_start|>", "<|im_end|>"]
141
+ # },
142
+ # "type": "together"
143
+ # },
144
+ # {
145
+ # "name": "Doubao-4k",
146
+ # "config": {
147
+ # "apiKey": os.getenv("DOUBAO_API_KEY"),
148
+ # "baseURL": "https://ark.cn-beijing.volces.com/api/v3",
149
+ # "model": "ep-20240802142948-6vvc7", # Replace with the actual endpoint ID if different
150
+ # "maxTokens": MAX_TOKENS,
151
+ # "temperature": 0.0,
152
+ # "top_p": 0.7
153
+ # },
154
+ # "type": "openai"
155
+ # },
156
+ {
157
+ "name": "Claude-3.5-Sonnet",
158
+ "config": {
159
+ "apiKey": os.getenv("ANTHROPIC_API_KEY"),
160
+ "model": "claude-3-5-sonnet-20240620",
161
+ "maxTokens": MAX_TOKENS,
162
+ "temperature": 0.0,
163
+ },
164
+ "type": "anthropic"
165
+ },
166
+ # {
167
+ # "name": "Claude-3-Opus",
168
+ # "config": {
169
+ # "apiKey": os.getenv("ANTHROPIC_API_KEY"),
170
+ # "model": "claude-3-opus-20240229",
171
+ # "maxTokens": MAX_TOKENS,
172
+ # "temperature": 0.0,
173
+ # },
174
+ # "type": "anthropic"
175
+ # },
176
+ {
177
+ "name": "Claude-3-Haiku",
178
+ "config": {
179
+ "apiKey": os.getenv("ANTHROPIC_API_KEY"),
180
+ "model": "claude-3-haiku-20240307",
181
+ "maxTokens": MAX_TOKENS,
182
+ "temperature": 0.0,
183
+ },
184
+ "type": "anthropic"
185
+ },
186
+ # {
187
+ # "name": "MiniMax-ABAB6.5s",
188
+ # "config": {
189
+ # "groupId": os.getenv("MINIMAX_GROUP_ID"),
190
+ # "apiKey": os.getenv("MINIMAX_API_KEY"),
191
+ # "model": "abab6.5s-chat",
192
+ # "maxTokens": MAX_TOKENS,
193
+ # "temperature": 0.01, # must be (0, 1]
194
+ # "top_p": 1
195
+ # },
196
+ # "type": "minimax"
197
+ # },
198
+ ]
199
+
200
+ # Load stories
201
+ with open("data/stories.json", "r", encoding="utf-8") as f:
202
+ stories = json.load(f)
203
+
204
+ def load_test_cases(filename):
205
+ with open(filename, "r", encoding="utf-8") as f:
206
+ _test_cases = []
207
+ for line in f:
208
+ parts = line.strip().split(" | ")
209
+ if len(parts) != 3:
210
+ print(f"Invalid test case: {line}")
211
+ continue
212
+ if parts[2] not in ["Correct", "Incorrect", "Unknown"]:
213
+ print(f"Skipping line with invalid ground truth: {line}")
214
+ continue
215
+ _test_cases.append(parts)
216
+
217
+ print("Total", len(_test_cases), "test cases loaded")
218
+ return _test_cases
219
+
220
+ def starts_with_answer(response, answer):
221
+ return response.strip().lower().startswith(answer)
222
+
223
+ def call_api(model, prompt, user_input):
224
+ try:
225
+ if model["type"] == "openai":
226
+ if model["name"] == "Doubao-4k":
227
+ client = OpenAI(
228
+ api_key=model["config"]["apiKey"],
229
+ base_url=model["config"]["baseURL"]
230
+ )
231
+
232
+ messages = [
233
+ {"role": "system", "content": prompt},
234
+ {"role": "user", "content": user_input}
235
+ ]
236
+
237
+ response = client.chat.completions.create(
238
+ model=model["config"]["model"],
239
+ messages=messages,
240
+ max_tokens=model["config"]["maxTokens"],
241
+ temperature=model["config"]["temperature"],
242
+ top_p=model["config"]["top_p"],
243
+ stream=False
244
+ )
245
+
246
+ return response.choices[0].message.content
247
+ else:
248
+ url = model["config"]["baseURL"] + "/chat/completions"
249
+ headers = {
250
+ "Content-Type": "application/json",
251
+ "Authorization": f"Bearer {model['config']['apiKey']}"
252
+ }
253
+ data = {
254
+ "model": model["config"]["model"],
255
+ "messages": [
256
+ {"role": "system", "content": prompt},
257
+ {"role": "user", "content": user_input}
258
+ ],
259
+ "max_tokens": model["config"]["maxTokens"],
260
+ "temperature": model["config"]["temperature"],
261
+ }
262
+
263
+ if "top_p" in model["config"]:
264
+ data["top_p"] = model["config"]["top_p"]
265
+
266
+ response = requests.post(url, headers=headers, json=data)
267
+ if response.status_code != 200:
268
+ raise Exception(f"API call failed with status {response.status_code}: {response.text}")
269
+ result = response.json()
270
+ return result["choices"][0]["message"]["content"]
271
+
272
+ elif model["type"] == "together":
273
+ client = Together(api_key=model["config"]["apiKey"])
274
+
275
+ messages = [
276
+ {"role": "system", "content": prompt},
277
+ {"role": "user", "content": user_input}
278
+ ]
279
+
280
+ response = client.chat.completions.create(
281
+ model=model["config"]["model"],
282
+ messages=messages,
283
+ max_tokens=model["config"]["maxTokens"],
284
+ temperature=model["config"]["temperature"],
285
+ top_p=model["config"]["top_p"],
286
+ stop=model["config"]["stop"],
287
+ stream=False
288
+ )
289
+
290
+ return response.choices[0].message.content
291
+
292
+ elif model["type"] == "anthropic":
293
+ client = Anthropic(api_key=model["config"]["apiKey"])
294
+
295
+ message = client.messages.create(
296
+ model=model["config"]["model"],
297
+ max_tokens=model["config"]["maxTokens"],
298
+ temperature=model["config"]["temperature"],
299
+ system=prompt,
300
+ messages=[
301
+ {
302
+ "role": "user",
303
+ "content": [
304
+ {
305
+ "type": "text",
306
+ "text": user_input
307
+ }
308
+ ]
309
+ }
310
+ ]
311
+ )
312
+
313
+ return message.content[0].text
314
+
315
+ elif model["type"] == "minimax":
316
+ url = f"https://api.minimax.chat/v1/text/chatcompletion_v2?GroupId={model['config']['groupId']}"
317
+ headers = {
318
+ "Authorization": f"Bearer {model['config']['apiKey']}",
319
+ "Content-Type": "application/json"
320
+ }
321
+
322
+ payload = {
323
+ "model": model["config"]["model"],
324
+ "messages": [
325
+ {
326
+ "role": "system",
327
+ "name": "MM智能助理",
328
+ "content": prompt
329
+ },
330
+ {
331
+ "role": "user",
332
+ "content": user_input
333
+ }
334
+ ],
335
+ "tools": [],
336
+ "tool_choice": "none",
337
+ "stream": False,
338
+ "max_tokens": model["config"]["maxTokens"],
339
+ "temperature": model["config"]["temperature"],
340
+ "top_p": model["config"]["top_p"]
341
+ }
342
+
343
+ response = requests.post(url, headers=headers, json=payload)
344
+ if response.status_code != 200:
345
+ raise Exception(f"API call failed with status {response.status_code}: {response.text}")
346
+
347
+ result = response.json()
348
+ return result["choices"][0]["message"]["content"]
349
+
350
+ elif model["type"] == "gemini":
351
+ import google.generativeai as genai
352
+
353
+ genai.configure(api_key=model["config"]["apiKey"])
354
+
355
+ generation_config = {
356
+ "temperature": model["config"]["temperature"],
357
+ "max_output_tokens": model["config"]["maxTokens"],
358
+ "top_p": 0.7,
359
+ # "top_k": 64,
360
+ }
361
+
362
+ gemini_model = genai.GenerativeModel(
363
+ model_name=model["config"]["model"],
364
+ generation_config=generation_config,
365
+ )
366
+
367
+ chat_session = gemini_model.start_chat(history=[])
368
+
369
+ # Combine prompt and user_input
370
+ full_prompt = f"{prompt}\n\nUser: {user_input}\nAssistant:"
371
+
372
+ response = chat_session.send_message(full_prompt)
373
+
374
+ return response.text
375
+
376
+ else:
377
+ raise ValueError(f"Unsupported model type: {model['type']}")
378
+ except Exception as e:
379
+ print(f"Error in call_api for model {model['name']}: {str(e)}")
380
+ return None
381
+
382
+ def call_api_with_timeout_and_timing(model, prompt, user_input, timeout=20):
383
+ start_time = time.time()
384
+ try:
385
+ result = call_api(model, prompt, user_input)
386
+ elapsed_time = time.time() - start_time
387
+ return result, elapsed_time
388
+ except Exception as e:
389
+ elapsed_time = time.time() - start_time
390
+ print(f"Error in call_api for model {model['name']}: {str(e)}")
391
+ return None, elapsed_time
392
+
393
+ def evaluate_models(models, test_cases, stories, shot_type):
394
+ results = {model['name']: {'correct': 0, 'total': 0} for model in models}
395
+ logs = {model['name']: [] for model in models}
396
+ challenging_cases = []
397
+ all_cases = []
398
+ time_logs = []
399
+
400
+ log_folder = f"logs_with_{shot_type}shots"
401
+ os.makedirs(log_folder, exist_ok=True)
402
+
403
+ # Find the last tested sample
404
+ last_tested = 0
405
+ for i in range(len(test_cases), 0, -1):
406
+ if os.path.exists(f"{log_folder}/all_cases_simple_prompt_{i}.json"):
407
+ with open(f"{log_folder}/all_cases_simple_prompt_{i}.json", "r", encoding="utf-8") as f:
408
+ all_cases = json.load(f)
409
+ last_tested = i
410
+ break
411
+
412
+ # Update results with previously tested samples
413
+ for case in all_cases:
414
+ for model_name, result in case['results'].items():
415
+ if result is not None:
416
+ results[model_name]['total'] += 1
417
+ if (case['ground_truth'] == "Correct" and result == "Correct") or \
418
+ ((case['ground_truth'] == "Incorrect" or case['ground_truth'] == "Unknown") and result != "Correct"):
419
+ results[model_name]['correct'] += 1
420
+
421
+ # Start from the next untested sample
422
+ start_index = len(all_cases)
423
+
424
+ for i, (user_input, story_title, ground_truth) in enumerate(tqdm(test_cases[start_index:]), start_index + 1):
425
+ try:
426
+ story = next((s for s in stories if s["title"] == story_title), None)
427
+ if not story:
428
+ print(f"Story not found: {story_title}")
429
+ continue
430
+
431
+ # Use the appropriate prompt based on shot_type
432
+ if shot_type == "2":
433
+ prompt_template = system_prompt_with_2shots
434
+ else:
435
+ prompt_template = simple_system_prompt
436
+
437
+ prompt = prompt_template.replace("{surface}", story["surface"]).replace("{bottom}", story["bottom"])
438
+ gt_map = {"correct": "correct", "incorrect": "incorrect", "unknown": "unknown"}
439
+
440
+ case_results = {}
441
+ all_responses_valid = True
442
+ time_usage = {}
443
+
444
+ # Use ThreadPoolExecutor for concurrent API calls
445
+ with concurrent.futures.ThreadPoolExecutor(max_workers=len(models)) as executor:
446
+ future_to_model = {executor.submit(partial(call_api_with_timeout_and_timing, timeout=20), model, prompt, user_input): model for model in models}
447
+ for future in concurrent.futures.as_completed(future_to_model):
448
+ model = future_to_model[future]
449
+ try:
450
+ response, elapsed_time = future.result()
451
+ time_usage[model['name']] = elapsed_time
452
+ if response is None:
453
+ all_responses_valid = False
454
+ print(f"Timeout or error for model {model['name']}")
455
+ else:
456
+ case_results[model['name']] = response
457
+ except Exception as exc:
458
+ print(f'{model["name"]} generated an exception: {exc}')
459
+ all_responses_valid = False
460
+
461
+ # If any model timed out or had an error, skip this entire test case
462
+ if not all_responses_valid:
463
+ print(f"Skipping test case {i} due to timeout or error")
464
+ continue
465
+
466
+ # Process all responses
467
+ for model in models:
468
+ if model['name'] not in case_results:
469
+ continue
470
+ response = case_results[model['name']].strip().lower()
471
+
472
+ if starts_with_answer(response, "correct") or starts_with_answer(response, "incorrect") or starts_with_answer(response, "unknown"):
473
+ results[model['name']]['total'] += 1
474
+
475
+ # Save the actual model output
476
+ if starts_with_answer(response, "correct"):
477
+ case_results[model['name']] = "Correct"
478
+ elif starts_with_answer(response, "incorrect"):
479
+ case_results[model['name']] = "Incorrect"
480
+ else:
481
+ case_results[model['name']] = "Unknown"
482
+
483
+ # Calculate accuracy (merging N and F)
484
+ if (ground_truth.lower() == "correct" and case_results[model['name']].lower() == "correct") or \
485
+ ((ground_truth.lower() == "incorrect" or ground_truth.lower() == "unknown") and case_results[model['name']].lower() != "correct"):
486
+ results[model['name']]['correct'] += 1
487
+ else:
488
+ # Print only wrong answers
489
+ print(f"Wrong Answer - Model: {model['name']}, Input: {user_input}, Response: {response}, GT: {ground_truth.lower()}, Model Output: {case_results[model['name']]}")
490
+ else:
491
+ # Handle invalid responses
492
+ case_results[model['name']] = "Invalid"
493
+ print(f"Invalid Response - Model: {model['name']}, Input: {user_input}, Response: {response}, GT: {ground_truth.lower()}, Model Output: {case_results[model['name']]}")
494
+
495
+ log_entry = {
496
+ "Input": user_input,
497
+ "Response": response,
498
+ "GT": ground_truth,
499
+ "Model_Output": case_results[model['name']],
500
+ "Accuracy": f"{results[model['name']]['correct']}/{results[model['name']]['total']} ({results[model['name']]['correct']/max(results[model['name']]['total'], 1):.2f})"
501
+ }
502
+ logs[model['name']].append(log_entry)
503
+
504
+ case = {
505
+ "input": user_input,
506
+ "story_title": story_title,
507
+ "ground_truth": ground_truth,
508
+ "results": case_results,
509
+ "time_usage": time_usage
510
+ }
511
+
512
+ all_cases.append(case)
513
+ time_logs.append({"sample": i, "time_usage": time_usage})
514
+
515
+ # Print time usage for this sample
516
+ print(f"\nTime usage for sample {i}:")
517
+ for model_name, elapsed_time in sorted(time_usage.items(), key=lambda x: x[1], reverse=True):
518
+ print(f"{model_name}: {elapsed_time:.2f} seconds")
519
+
520
+ # Save log and print accuracy ranking every 10 steps
521
+ if i % 10 == 0 or i == len(test_cases):
522
+ print(f"\nCurrent rankings after {i} items:")
523
+ current_results = [(name, res['correct'] / max(res['total'], 1), res['correct'], res['total'])
524
+ for name, res in results.items()]
525
+ current_results.sort(key=lambda x: x[1], reverse=True)
526
+
527
+ for rank, (name, accuracy, correct, total) in enumerate(current_results, 1):
528
+ print(f"{rank}. {name}: {accuracy:.2f} ({correct}/{total})")
529
+
530
+ # Update challenging cases file
531
+ with open(f"{log_folder}/challenging_cases_simple_prompt_{i}.json", "w", encoding="utf-8") as f:
532
+ json.dump(challenging_cases, f, ensure_ascii=False, indent=2)
533
+
534
+ # Update all cases file
535
+ with open(f"{log_folder}/all_cases_simple_prompt_{i}.json", "w", encoding="utf-8") as f:
536
+ json.dump(all_cases, f, ensure_ascii=False, indent=2)
537
+
538
+ # Save time logs
539
+ with open(f"{log_folder}/time_logs_{i}.json", "w", encoding="utf-8") as f:
540
+ json.dump(time_logs, f, ensure_ascii=False, indent=2)
541
+
542
+ except Exception as e:
543
+ print(f"Error processing test case {i}: {str(e)}")
544
+ continue
545
+
546
+ # Final update to challenging cases file
547
+ final_index = start_index + len(test_cases[start_index:])
548
+ with open(f"{log_folder}/challenging_cases_simple_prompt_{final_index}.json", "w", encoding="utf-8") as f:
549
+ json.dump(challenging_cases, f, ensure_ascii=False, indent=2)
550
+
551
+ # Final update to all cases file
552
+ with open(f"{log_folder}/all_cases_simple_prompt_{final_index}.json", "w", encoding="utf-8") as f:
553
+ json.dump(all_cases, f, ensure_ascii=False, indent=2)
554
+
555
+ return results, challenging_cases, all_cases, time_logs
556
+
557
+ def save_all_cases(all_cases, output_file):
558
+ with open(output_file, "w", encoding="utf-8") as f:
559
+ json.dump(all_cases, f, ensure_ascii=False, indent=2)
560
+
561
+ print(f"All cases have been saved to {output_file}")
562
+
563
+ def parse_challenging_cases(input_file, output_file):
564
+ with open(input_file, 'r', encoding='utf-8') as f:
565
+ challenging_cases = json.load(f)
566
+
567
+ with open(output_file, 'w', encoding='utf-8') as f:
568
+ for case in challenging_cases:
569
+ user_input = case['input']
570
+ story_title = case['story_title']
571
+ ground_truth = case['ground_truth']
572
+ f.write(f"{user_input}\t{story_title}\t{ground_truth}\n")
573
+
574
+ print(f"Parsed challenging cases have been written to {output_file}")
575
+
576
+ def main():
577
+ # Parse command line arguments
578
+ parser = argparse.ArgumentParser(description="Run story understanding evaluation")
579
+ parser.add_argument("--shot", choices=["0", "2"], default="2", help="Number of shots (0 or 2)")
580
+ args = parser.parse_args()
581
+
582
+ test_cases = load_test_cases("data/cases.list")
583
+ results, challenging_cases, all_cases, time_logs = evaluate_models(models, test_cases, stories, args.shot)
584
+
585
+ final_results = [(name, res['correct'] / max(res['total'], 1), res['correct'], res['total'])
586
+ for name, res in results.items()]
587
+ final_results.sort(key=lambda x: x[1], reverse=True)
588
+
589
+ print(f"\nFinal Rankings ({args.shot}-shot):")
590
+ for rank, (name, accuracy, correct, total) in enumerate(final_results, 1):
591
+ print(f"{rank}. {name}: {accuracy:.2f} ({correct}/{total})")
592
+
593
+ print(f"Evaluation complete. Logs have been saved in the 'logs_with_{args.shot}shots' directory.")
594
+
595
+ # Analyze and print overall time usage statistics
596
+ model_total_time = {model['name']: 0 for model in models}
597
+ model_call_count = {model['name']: 0 for model in models}
598
+
599
+ for log in time_logs:
600
+ for model_name, time_used in log['time_usage'].items():
601
+ model_total_time[model_name] += time_used
602
+ model_call_count[model_name] += 1
603
+
604
+ print("\nOverall Time Usage Statistics:")
605
+ for model_name in sorted(model_total_time, key=lambda x: model_total_time[x], reverse=True):
606
+ avg_time = model_total_time[model_name] / model_call_count[model_name] if model_call_count[model_name] > 0 else 0
607
+ print(f"{model_name}: Total time: {model_total_time[model_name]:.2f}s, Avg time per call: {avg_time:.2f}s")
608
+
609
+ # Save overall time usage statistics
610
+ log_folder = f"logs_with_{args.shot}shots"
611
+ with open(f"{log_folder}/overall_time_usage.json", "w", encoding="utf-8") as f:
612
+ json.dump({
613
+ "model_total_time": model_total_time,
614
+ "model_call_count": model_call_count,
615
+ "model_avg_time": {name: model_total_time[name] / count if count > 0 else 0
616
+ for name, count in model_call_count.items()}
617
+ }, f, ensure_ascii=False, indent=2)
618
+
619
+ if __name__ == "__main__":
620
+ main()
datasets/TurtleBenchmark/evaluation/english/prompt.py ADDED
@@ -0,0 +1,74 @@
1
+ simple_system_prompt = """
2
+ You are the referee of a game where players are shown a <Surface> and you are given the <Bottom>. You need to understand the entire story based on both the <Surface> and <Bottom>. Players will make guesses based on the <Surface>, and you need to judge whether their guesses are correct. Please strictly adhere to answering with only three specified responses: Correct, Incorrect, or Unknown.
3
+
4
+ ## Judging Rules
5
+ - If the player's guess is correct or the answer is affirmative: Please only answer "Correct" without any explanation.
6
+ - If the player's guess is wrong or the answer is negative: Please only answer "Incorrect" without any explanation.
7
+ - If the answer to the player's guess cannot be found in the <Surface> and <Bottom>, and cannot be deduced through reasoning: Please only answer "Unknown" without any explanation.
8
+
9
+ ## Important Notes
10
+ 1. Players can only see the <Surface>, so their guesses are based on it. Even if the <Bottom> contains additional information, you should judge based on the content in the <Surface>.
11
+ 2. If a conclusion cannot be drawn from the provided story or through reasonable inference, answer "Unknown".
12
+ 3. Strictly adhere to answering with only the three specified responses: Correct, Incorrect, or Unknown. Do not provide any additional explanations.
13
+
14
+ ## Question Content
15
+ ### <Surface>
16
+ {surface}
17
+
18
+ ### <Bottom>
19
+ {bottom}
20
+
21
+ Now, please judge the following player guesses:
22
+ """
23
+
24
+ system_prompt_with_2shots = """
25
+ You are the referee of a game where players are shown a <Surface> and you are given the <Bottom>. You need to understand the entire story based on both the <Surface> and <Bottom>. Players will make guesses based on the <Surface>, and you need to judge whether their guesses are correct. Please strictly adhere to answering with only three specified responses: Correct, Incorrect, or Unknown.
26
+
27
+ ## Judging Rules
28
+ - If the player's guess is correct or the answer is affirmative: Please only answer "Correct" without any explanation.
29
+ - If the player's guess is wrong or the answer is negative: Please only answer "Incorrect" without any explanation.
30
+ - If the answer to the player's guess cannot be found in the <Surface> and <Bottom>, and cannot be deduced through reasoning: Please only answer "Unknown" without any explanation.
31
+
32
+ ## Important Notes
33
+ 1. Fully understand the cause, process, and outcome of the entire story, and make logical inferences.
34
+ 2. If a conclusion cannot be drawn from the provided story or through reasonable inference, answer "Unknown".
35
+ 3. Strictly adhere to answering with only the three specified responses: Correct, Incorrect, or Unknown. Do not provide any additional explanations.
36
+
37
+ ## Examples
38
+
39
+ ### Example 1: The Hiccuping Man
40
+ <Surface>
41
+ A man walks into a bar and asks the bartender for a glass of water. The bartender suddenly pulls out a gun and points it at him. The man smiles and says, "Thank you!" then calmly leaves. What happened?
42
+
43
+ <Bottom>
44
+ The man had hiccups and wanted a glass of water to cure them. The bartender realized this and chose to scare him with a gun. The man's hiccups disappeared due to the sudden shock, so he sincerely thanked the bartender before leaving.
45
+
46
+ Possible guesses and corresponding answers:
47
+ Q: Does the man have a chronic illness? A: Unknown
48
+ Q: Was the man scared away? A: Incorrect
49
+ Q: Did the bartender want to kill the man? A: Incorrect
50
+ Q: Did the bartender intend to scare the man? A: Correct
51
+ Q: Did the man sincerely thank the bartender? A: Correct
52
+
53
+ ### Example 2: The Four-Year-Old Mother
54
+ <Surface>
55
+ A five-year-old kindergartener surprisingly claims that her mother is only four years old. Puzzled, I proposed a home visit. When I arrived at her house, I saw a horrifying scene...
56
+
57
+ <Bottom>
58
+ I saw several women chained up in her house, with a fierce-looking, ugly brute standing nearby. The kindergartener suddenly displayed an eerie smile uncharacteristic of her age... It turns out she's actually 25 years old but suffers from a condition that prevents her from growing. The brute is her brother, and she lures kindergarten teachers like us to their house to help her brother find women... Her "four-year-old mother" is actually a woman who was tricked and has been held captive for four years...
59
+
60
+ Possible guesses and corresponding answers:
61
+ Q: Is the child already dead? A: Incorrect
62
+ Q: Is the child actually an adult? A: Correct
63
+ Q: Does the child have schizophrenia? A: Unknown
64
+ Q: Am I in danger? A: Correct
65
+
66
+ ## Question Content
67
+ ### Surface
68
+ {surface}
69
+
70
+ ### Bottom
71
+ {bottom}
72
+
73
+ Now, please judge the following player guesses:
74
+ """
datasets/TurtleBenchmark/requirements.txt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:732c8664c5e7ed190afd1e22670bb1a9d031205d590207f8d6ad1ccbe7990fe8
3
+ size 54