请教LongBench-v2上评测时输出异常问题

by AyongZheng - opened 3 days ago

3 days ago

•

对于同一个输入：{"_id": "66f36490821e116aacb2cc22", ..., "question": "According to the report, how to promote the construction of smart courts?", ...}，
使用Qwen2.5-72B-Instruct推理时返回
"response": "The correct answer is (A) Through technology empowerment, change the way of working and improve office efficiency."
使用Qwen2.5-7B-LongPO-128K推理时返回
"response": "Thead---c---c-c-a\nde\ndbc-dee-0cb--a0cbad\n The-adad-dca-ca---c8-c-dede-0cd00a\n-S\n\n-\nd-d\n-\n0cb \n--\n \n \ndb the the following options: 1. The number of people who like chocolate ice cream is the most most 2 2 2: 2. 2 flavor 2 2. 2 is 2 2 2 flavors 2 2 options: 2. 2 2"
这其中固然存在模型参数量的差异，但是看Qwen2.5-7B-LongPO-128K的response像是出了问题，所以想请教下配置上有没有需要特别注意的地方（p.s.: 使用几十个token输入时模型回复是正常的）

AyongZheng

3 days ago

This comment has been hidden (marked as Spam)

AyongZheng changed discussion status to closed 3 days ago

AyongZheng changed discussion status to open 1 day ago

Guanzheng

Language Technology Lab at Alibaba DAMO Academy org about 12 hours ago

•

edited about 12 hours ago

Hi, 感谢关注。
应该没有什么特别需要设置的地方，注意rope theta load正确并且apply chat format就可以了，然后Longbench有个model2maxlen的map，注意follow原始的setting修改到120000。我按照这样的setting跑了Qwen-2.5-LongPO-128K, 在LongBench v2上的结果如下：

	Overall	Easy	Hard	Short	Medium	Long
w/o CoT	32.6	32.3	32.8	36.7	33.5	24.1
w/ CoT	35.4	44.3	29.9	41.7	33.5	28.7

应该是比较impressive的结果了，可供参考。注意longbench v2原始的eval代码由于使用了多线程可能会引入随机性导致每次运行结果不同 https://github.com/THUDM/LongBench/issues/94#issuecomment-2601354113

AyongZheng

about 11 hours ago

Hi, 感谢关注。
应该没有什么特别需要设置的地方，注意rope theta load正确并且apply chat format就可以了，然后Longbench有个model2maxlen的map，注意follow原始的setting修改到120000。我按照这样的setting跑了Qwen-2.5-LongPO-128K, 在LongBench v2上的结果如下：

Overall Easy Hard Short Medium Long

w/o CoT 32.6 32.3 32.8 36.7 33.5 24.1

w/ CoT 35.4 44.3 29.9 41.7 33.5 28.7

应该是比较impressive的结果了，可供参考。注意longbench v2原始的eval代码由于使用了多线程可能会引入随机性导致每次运行结果不同 https://github.com/THUDM/LongBench/issues/94#issuecomment-2601354113

请问“rope theta load正确”是需要设置什么吗？我没有对theta信息做修改，用的官方代码直接加载的模型，vllm serve + client.chat.completions.create 的方式应该不需要我们使用apply chat format来构造输入。

Guanzheng

Language Technology Lab at Alibaba DAMO Academy org about 11 hours ago

为了避免随机性我没有使用vllm而是直接hf.generate，但只要指定--trust-remote-code后应该不需要修改任何东西就可以使用vllm serving。如果你依然遇到问题可以post你运行的指令，我有空时候可以测一下。

Guanzheng

Language Technology Lab at Alibaba DAMO Academy org about 11 hours ago

以及可以提供一下vllm版本等环境信息。

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment