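# 4.2 Instruction Fine-Tuning with GPT-2

We return to the classification example from Chapter 2 and solve it again, this time by casting it as instruction fine-tuning.

There are two ways to use GPT-2 for text classification: **adding a classification head to GPT-2** and **recasting the classification task as instruction fine-tuning**. They differ clearly in approach, implementation, trade-offs, and suitable use cases. A detailed comparison follows: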

---

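### **1. Core Idea**

| **Method**                  | **GPT-2 with classification head**                          | **Recast as instruction fine-tuning**                 |
|-----------------------------|-------------------------------------------------------------|-------------------------------------------------------|
| **Basic concept**           | Add a classification head (usually a linear layer) on top of GPT-2 and predict the class label directly. | Express the classification task as a natural-language instruction; fine-tuning teaches the model to understand and complete instruction-style tasks. |
| **Implementation**          | Modify the GPT-2 model: add a classification head with `num_labels` outputs and define a classification loss. | Build instruction data (Instruction + Input + Output), then fine-tune the model. |
| **Data form**               | A direct mapping from text to class label.                  | The instruction turns the text into a generation task, e.g.<br>`Input`: article text<br>`Output`: classification result. |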

---

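### **2. Data Format**

| **Method**                  | **GPT-2 with classification head**                          | **Recast as instruction fine-tuning**                 |
|-----------------------------|-------------------------------------------------------------|-------------------------------------------------------|
| **Data format**             | - Input: text<br>- Label: discrete class label (e.g. 0, 1, 2). | - Instruction: a natural-language task description (e.g. "Classify the following text").<br>- Input: the text to classify.<br>- Output: the class, written as text. |
| **Example**                 | Input: `"This is a happy day!"`<br>Label: `1` (positive)    | `Instruction`: "Classify the sentiment of the following text"<br>`Input`: `"This is a happy day!"`<br>`Output`: `"positive"` |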

---

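### **3. Model Architecture**

| **Method**                  | **GPT-2 with classification head**                          | **Recast as instruction fine-tuning**                 |
|-----------------------------|-------------------------------------------------------------|-------------------------------------------------------|
| **Model structure**         | - GPT-2 + classification head (linear layer).               | - The original GPT-2 architecture; no extra classification head. |
| **Loss function**           | - Cross-entropy loss over the class labels.                 | - Autoregressive language-modeling loss (cross-entropy over next tokens). |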
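
The difference between the two loss functions can be made concrete with a small pure-Python sketch. It is only an illustration with toy numbers: the helper names (`softmax`, `classification_loss`, `language_modeling_loss`) are hypothetical and are not part of the notebook or of any library. The classification head scores a fixed set of classes once per sequence, while the language-modeling loss averages a next-token cross-entropy over every position.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classification_loss(class_logits, label_id):
    """Cross-entropy over a fixed set of classes (classification head)."""
    return -math.log(softmax(class_logits)[label_id])

def language_modeling_loss(step_logits, target_ids):
    """Mean next-token cross-entropy over a sequence (instruction tuning)."""
    losses = [
        -math.log(softmax(logits)[target])
        for logits, target in zip(step_logits, target_ids)
    ]
    return sum(losses) / len(losses)

# Toy numbers, purely illustrative: 2 classes vs. a 3-token vocabulary
head_loss = classification_loss([0.2, 1.5], label_id=1)
lm_loss = language_modeling_loss(
    step_logits=[[2.0, 0.1, 0.1], [0.1, 2.0, 0.1]],
    target_ids=[0, 1],
)
print(round(head_loss, 4), round(lm_loss, 4))
```

In the instruction-tuning setup the class label is simply a few of those target tokens, which is why no separate head is needed.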

---

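### **4. Training Process**

| **Method**                  | **GPT-2 with classification head**                          | **Recast as instruction fine-tuning**                 |
|-----------------------------|-------------------------------------------------------------|-------------------------------------------------------|
| **What is fine-tuned**      | Mainly the classification head's parameters (the GPT-2 backbone can optionally be frozen). | The whole GPT-2 model (or a parameter-efficient method such as LoRA). |
| **Label handling**          | Discrete labels (e.g. 0, 1, 2).                             | Labels written as natural language (e.g. "positive", "neutral", "negative"). |
| **Training difficulty**     | - Simple; a standard classification workflow.<br>- Needs less data; suits small-scale fine-tuning. | - More involved; requires a high-quality instruction dataset.<br>- Needs more data; suits multi-task settings. |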

---

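### **5. Pros and Cons**

| **Method**                  | **GPT-2 with classification head**                          | **Recast as instruction fine-tuning**                 |
|-----------------------------|-------------------------------------------------------------|-------------------------------------------------------|
| **Pros**                    | - Fast to train, with low compute requirements.<br>- Simple to implement; suits a single task. | - Generalizes well and extends to multiple tasks.<br>- Compatible with multi-task fine-tuning and open-ended generation. |
| **Cons**                    | - Handles only classification; hard to extend to other tasks.<br>- The classification head and loss function must be set up by hand. | - Data construction is complex and depends heavily on data quality.<br>- Needs more training resources and longer training time. |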

---

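### **6. Use Cases**

| **Method**                  | **GPT-2 with classification head**                          | **Recast as instruction fine-tuning**                 |
|-----------------------------|-------------------------------------------------------------|-------------------------------------------------------|
| **Use cases**               | - Single-task text classification such as sentiment analysis or spam detection. | - Multi-task settings that handle classification, translation, summarization, etc. in one model. |
| **Data scale**              | Suits small datasets; a few thousand to tens of thousands of examples can be enough for good results. | Suits large datasets, especially multi-task, multi-domain ones. |
| **Goal**                    | Maximize classification accuracy on a single task.          | Strengthen multi-task generalization while improving the user interaction experience. |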

---

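### **7. Summary Comparison**

| **Dimension**                 | **GPT-2 with classification head**                           | **Recast as instruction fine-tuning**                 |
|-------------------------------|--------------------------------------------------------------|-------------------------------------------------------|
| **Implementation complexity** | Low; add a classification head and follow the standard classification workflow. | Higher; requires building high-quality instruction data and adapting the training pipeline. |
| **Resource requirements**     | Lower; when the backbone is frozen only the classification head is tuned, so training time and GPU memory use are small. | Higher; the whole model is fine-tuned, with greater data and compute demands. |
| **Performance**               | Good on a single classification task, but weaker generalization. | Stronger in multi-task and varied classification scenarios, and extensible to other task types. |
| **Extensibility**             | Poor; tied to the current task and hard to transfer to others. | Strong; adapts to multi-task instructions and open-ended generation. |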

---

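### **Recommendations**

1. **GPT-2 with classification head**:
   - If the task is a single classification problem (e.g. sentiment analysis, spam detection) and data is limited, the classification head approach is recommended.
   - It suits fast implementation and deployment, with no complex preprocessing or instruction dataset to build.

2. **Recast as instruction fine-tuning**:
   - If the task is varied (classification + generation + translation, etc.) or needs better generalization to unseen tasks, instruction fine-tuning is recommended.
   - It suits multi-task, multi-scenario deployment, and is especially appropriate for ChatGPT-style applications.

Choosing the method according to task requirements, data scale, and available resources improves model performance and broadens applicability.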


The raw data looks like this:

| sequence                                               | label | label_name     |
|--------------------------------------------------------|-------|----------------|
| TATATTTTCTCAGCTGAGTTAATTAGTTTCACTAGTTAACTGAGAATAAAAGAA | 1     | promoter       |
| TGGGGAGGGTCCGGTGTTAGTTAGATACATCCCCAGACCCACACCCCGGATAGA | 0     | Non-promoter   |

Converted to instruction form, a record becomes:
```
{'instruction': 'Determine core promoter detection of following dna sequence, The result will be one of the following: Non-promoter, promoter.', 
'input': 'CATGCGGGTCG...', 
'output': 'Non-promoter'}
```

Each record is then rendered into the instruction fine-tuning template and trained on as ordinary text:
```
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Determine core promoter detection of following dna sequence, The result will be one of the following: Non-promoter, promoter.
### Input:
TCTTTCTCTTCTGTATCATTCTACTT...
### Response:
Non-promoter
```


In [1]:
import subprocess
import os
# Set proxy environment variables (for typical AutoDL regions)
result = subprocess.run('bash -c "source /etc/network_turbo && env | grep proxy"', shell=True, capture_output=True, text=True)
output = result.stdout
for line in output.splitlines():
    if '=' in line:
        var, value = line.split('=', 1)
        os.environ[var] = value

In [6]:
from datasets import load_dataset
# 1. Load the promoter prediction dataset (59,195 samples, see the output below)
dna_dataset = load_dataset("dnagpt/dna_promoter_300")
dna_dataset

DatasetDict({
    train: Dataset({
        features: ['sequence', 'label'],
        num_rows: 59195
    })
})

In [7]:
dna_dataset["train"][0]

{'sequence': 'TAAATACGGAAGTTTATTACTTGAGGAATAGATGGAATCGTCGGGCGTGAGAGATCATAATCGGCTGCTTCTGGGAGCCGCACGTGGGAAAGACTTATCCCCGACGGAGCTGGGACTGGGGCACAAACCGGAAGGAACACATCTGACCGAGAAAGAGACCAAGTGGCTCAGGTAGGACCAAAGCGAGCAAGGCTGCGGGTCCTGTTGCTCTCTGTCCTGTAAATTTAAACGTTACGCCACCTGGTAATGATACCCTCGTCCTCCGAGGCGACAAGTCAGAACTTCCACCAAGGGCATTAC',
 'label': 0}

In [8]:
def build_prompt(example):
    # Map the numeric label to its text form (1 -> promoter, otherwise Non-promoter)
    label = 'promoter' if int(example['label']) == 1 else 'Non-promoter'

    instruction = "Determine core promoter detection of following dna sequence, The result will be one of the following: Non-promoter, promoter."

    # One instruction-tuning record: instruction, DNA sequence as input, label text as output
    return {
        "instruction": instruction,
        "input": example["sequence"],
        "output": label,
    }

In [9]:
example = dna_dataset["train"][0]
print(build_prompt(example))

{'instruction': 'Determine core promoter detection of following dna sequence, The result will be one of the following: Non-promoter, promoter.', 'input': 'TAAATACGGAAGTTTATTACTTGAGGAATAGATGGAATCGTCGGGCGTGAGAGATCATAATCGGCTGCTTCTGGGAGCCGCACGTGGGAAAGACTTATCCCCGACGGAGCTGGGACTGGGGCACAAACCGGAAGGAACACATCTGACCGAGAAAGAGACCAAGTGGCTCAGGTAGGACCAAAGCGAGCAAGGCTGCGGGTCCTGTTGCTCTCTGTCCTGTAAATTTAAACGTTACGCCACCTGGTAATGATACCCTCGTCCTCCGAGGCGACAAGTCAGAACTTCCACCAAGGGCATTAC', 'output': 'Non-promoter'}


In [10]:
import json
import os

os.makedirs("data", exist_ok=True)  # make sure the output directory exists
ins_list = []
with open("data/dna_promoter_300.jsonl", "w") as ins_file:
    for ins in dna_dataset["train"]:
        # Skip a stray header row, if one is present
        if ins["sequence"] == "sequence":
            continue
        ins = build_prompt(ins)
        ins_file.write(json.dumps(ins) + "\n")
        ins_list.append(ins)

In [11]:
dna_ft_dataset = load_dataset("json", data_files='data/dna_promoter_300.jsonl')
dna_ft_dataset

Generating train split: 0 examples [00:00, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['instruction', 'input', 'output'],
        num_rows: 59195
    })
})

In [12]:
data = dna_ft_dataset["train"].train_test_split(train_size=0.9, seed=42)
data

DatasetDict({
    train: Dataset({
        features: ['instruction', 'input', 'output'],
        num_rows: 53275
    })
    test: Dataset({
        features: ['instruction', 'input', 'output'],
        num_rows: 5920
    })
})

In [13]:
# Initialize the tokenizer
from datasets import load_dataset
from transformers import AutoTokenizer, GPT2LMHeadModel, AutoConfig
from transformers import GPT2Tokenizer,GPT2Model,AutoModel
from transformers import DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments
from tokenizers import Tokenizer
from transformers import GPT2TokenizerFast

# Requires a multimodal model pretrained on biological sequences + English text
tokenizer = GPT2Tokenizer.from_pretrained("dnagpt/gene_eng_gpt2_v0")
tokenizer.pad_token = tokenizer.eos_token

In [14]:
# Build the prompt (instruction + input, without the response)
def format_input(entry):
    instruction_text = (
        f"Below is an instruction that describes a task. "
        f"Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )

    input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""

    return instruction_text + input_text + "\n\n### Response:\n"

# Build the full training text (prompt + expected response); this redefines the earlier build_prompt
def build_prompt(entry):

    input_data = format_input(entry)

    desired_response = entry['output']

    return input_data + desired_response

In [15]:
example = data["test"][0]
example

{'instruction': 'Determine core promoter detection of following dna sequence, The result will be one of the following: Non-promoter, promoter.',
 'input': 'CCAGGATGCGCTGACGACCCGGCTGGCAGGCGGGTCCTCGTGGGCGAGGCGAGGGAGGCGGCGAGAGAGGAGCAATAGTTTCCCACCGCTCCCTCTCAGGCGCAGGGTCTAGAGAAGCGCGAGGGGATCTAGAGAAGCCGGAGGGGAGGAAGCGCGAGTCCGCGGCCCGCCCCGTTGCGTCCCACCCACCGCGTCCCCTCCCCTCCCCTCCCGCTGCGGGAAAAGCGGCCGCGGGCGGCGGCGCCCACTGTGGGGCGGGCGGAGCGCCGCGGGAGGCGGACGAGATGCGAGCGCGGCCGC',
 'output': 'promoter'}

In [16]:
prompt = build_prompt(example)
print(prompt)

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Determine core promoter detection of following dna sequence, The result will be one of the following: Non-promoter, promoter.

### Input:
CCAGGATGCGCTGACGACCCGGCTGGCAGGCGGGTCCTCGTGGGCGAGGCGAGGGAGGCGGCGAGAGAGGAGCAATAGTTTCCCACCGCTCCCTCTCAGGCGCAGGGTCTAGAGAAGCGCGAGGGGATCTAGAGAAGCCGGAGGGGAGGAAGCGCGAGTCCGCGGCCCGCCCCGTTGCGTCCCACCCACCGCGTCCCCTCCCCTCCCCTCCCGCTGCGGGAAAAGCGGCCGCGGGCGGCGGCGCCCACTGTGGGGCGGGCGGAGCGCCGCGGGAGGCGGACGAGATGCGAGCGCGGCCGC

### Response:
promoter


In [17]:
print('tokens: ', ' '.join(tokenizer.tokenize(prompt)))

tokens:  Bel ow Ġ is Ġ an Ġ instruc tion Ġ th at Ġ describ es Ġ a Ġ t ask . ĠWrit e Ġ a Ġ respon se Ġ th at Ġ appropri at el y Ġ complet es Ġ the Ġ request . Ċ Ċ # # # ĠIn struc tion : Ċ D eter min e Ġ cor e Ġ promo ter Ġ det ec tion Ġ of Ġ follow ing Ġ d na Ġ sequenc e , ĠTh e Ġ resul t Ġ will Ġ be Ġ on e Ġ of Ġ the Ġ follow ing : ĠN on - promo ter , Ġ promo ter . Ċ Ċ # # # ĠIn put : Ċ CC AGGATGC GC TGACG ACCC GGCTGGC AGGC GGGTCC TCG TGGGCG AGGCG AGGGAGGC GGCG AGAGAGG AGCAATAG TTTCCC ACCGC TCCCTCTC AGGCGC AGGG TCTAG AGAAGC GCG AGGGG ATCTAG AGAAGCC GG AGGGG AGGAAGC GCG AGTCC GCGG CCCGCC CCG TTGCG TCCC ACCCACC GCG TCCCCTCCCC TCCCCTCCC GCTGC GGG AAAAGC GGCCGC GGGCGGC GGCGCCC ACTGTG GGGC GGGC GGAGC GCCGC GGGAGGC GGACG AGATGCG AGCGC GGCCGC Ċ Ċ # # # ĠR esp on se : Ċ promo ter


In [18]:
def tokenize_function(example):
    prompt =  build_prompt(example)
    # Pad or truncate every example to exactly 256 tokens
    result = tokenizer(prompt, padding='max_length', truncation=True, max_length=256)
    return result


# Use batched=False for simplicity
tokenized_datasets = data.map(
    tokenize_function, batched=False,remove_columns=['instruction', 'input', 'output']
)
tokenized_datasets

Map:   0%|          | 0/53275 [00:00<?, ? examples/s]

Map:   0%|          | 0/5920 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 53275
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 5920
    })
})

In [19]:
tokenized_datasets["train"]

Dataset({
    features: ['input_ids', 'attention_mask'],
    num_rows: 53275
})

In [20]:
# Create the data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,  # GPT-2 is autoregressive, so masked language modeling is not used
)
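
To see what this collator does to the padded examples, here is a minimal pure-Python sketch of the label construction. It rests on my understanding that, with `mlm=False`, `DataCollatorForLanguageModeling` copies `input_ids` into `labels` and replaces every pad position with `-100`, the value the loss ignores. The function name `make_causal_lm_labels` and the token ids are hypothetical; only the id `0` for the eos/pad token is taken from the generation warnings shown in this notebook.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips

def make_causal_lm_labels(input_ids, pad_token_id):
    """Copy input_ids to labels and mask every pad position."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]

# Hypothetical token ids; 0 plays the role of eos/pad as in this notebook
eos_id = 0
input_ids = [11, 12, 13, eos_id, eos_id, eos_id]  # prompt tokens followed by padding
labels = make_causal_lm_labels(input_ids, pad_token_id=eos_id)
print(labels)  # [11, 12, 13, -100, -100, -100]
```

One consequence worth keeping in mind: because `pad_token` is set to `eos_token` here, any end-of-sequence position is masked in the same way, so the loss never covers a stop token. That may be one reason a model trained this way keeps generating after the label.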

In [21]:
model = GPT2LMHeadModel.from_pretrained("dnagpt/gene_eng_gpt2_v0")

generation_config.json:   0%|          | 0.00/111 [00:00<?, ?B/s]

In [22]:
def inference(text, model, tokenizer, max_input_tokens=1000, max_output_tokens=1000):
  # Tokenize
  input_ids = tokenizer.encode(
          text,
          return_tensors="pt",
          truncation=True,
          max_length=max_input_tokens
          # return_attention_mask=True,
  )

  # Generate
  device = model.device
  generated_tokens_with_prompt = model.generate(
    input_ids=input_ids.to(device),
    #max_length=max_output_tokens,
    max_new_tokens=5,
  )

  # Decode
  #generated_text_with_prompt = tokenizer.batch_decode(generated_tokens_with_prompt, skip_special_tokens=True)
  # Strip the prompt
  #generated_text_answer = generated_text_with_prompt[0][len(text):]
    
  generated_text_with_prompt = tokenizer.decode(generated_tokens_with_prompt[0], skip_special_tokens=True)
  generated_text_answer = generated_text_with_prompt[len(text):]


  return generated_text_answer

# Optional extra cleanup
def clean_generated_text(text):
    # Replace the 'Ġ' marker with a space
    text = text.replace('Ġ', ' ')
    # Collapse repeated whitespace
    text = ' '.join(text.split())
    return text

In [23]:
input_text = format_input(data["test"][0])

print("input (test):", input_text)

print("--------------------------\n")

print("model's answer: \n")
print(inference(input_text, model, tokenizer))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


input (test): Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Determine core promoter detection of following dna sequence, The result will be one of the following: Non-promoter, promoter.

### Input:
CCAGGATGCGCTGACGACCCGGCTGGCAGGCGGGTCCTCGTGGGCGAGGCGAGGGAGGCGGCGAGAGAGGAGCAATAGTTTCCCACCGCTCCCTCTCAGGCGCAGGGTCTAGAGAAGCGCGAGGGGATCTAGAGAAGCCGGAGGGGAGGAAGCGCGAGTCCGCGGCCCGCCCCGTTGCGTCCCACCCACCGCGTCCCCTCCCCTCCCCTCCCGCTGCGGGAAAAGCGGCCGCGGGCGGCGGCGCCCACTGTGGGGCGGGCGGAGCGCCGCGGGAGGCGGACGAGATGCGAGCGCGGCCGC

### Response:

--------------------------

model's answer: 

TATAT


In [24]:
training_args = TrainingArguments(
        output_dir='./results_small',
        overwrite_output_dir=True,
        num_train_epochs=3,
        per_device_train_batch_size=8,
        save_steps=2000,
        save_total_limit=2,
        prediction_loss_only=True,
        fp16=True,  # mixed precision (author's note: could not be used on a V100)
    )

In [25]:
# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],
    data_collator=data_collator
)

In [26]:
# Start training
trainer.train()

| Step | Training Loss |
|------|---------------|
| 500  | 2.3357 |
| 1000 | 2.1841 |
| 1500 | 2.1785 |
| 2000 | 2.1724 |
| 2500 | 2.1714 |
| 3000 | 2.1719 |
| 3500 | 2.1631 |
| 4000 | 2.1593 |
| 4500 | 2.161  |
| 5000 | 2.1602 |


TrainOutput(global_step=19980, training_loss=2.1272921145021977, metrics={'train_runtime': 1315.5944, 'train_samples_per_second': 121.485, 'train_steps_per_second': 15.187, 'total_flos': 2.08804995072e+16, 'train_loss': 2.1272921145021977, 'epoch': 3.0})
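
The `global_step` and throughput figures in `TrainOutput` can be reproduced from the split size and the training arguments. The sketch below is plain arithmetic and assumes a single device with no gradient accumulation (neither is set in the `TrainingArguments` above); the variable names are mine.

```python
import math

train_samples = 53275      # rows in the training split
batch_size = 8             # per_device_train_batch_size
epochs = 3                 # num_train_epochs
runtime_s = 1315.5944      # train_runtime reported by TrainOutput

# The last batch of each epoch is partial, hence the ceiling
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs
samples_per_second = train_samples * epochs / runtime_s

print(steps_per_epoch, total_steps)   # 6660 19980
print(round(samples_per_second, 3))
```

The step count matches `global_step=19980`, and the computed throughput agrees with the reported `train_samples_per_second` to within rounding.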

In [27]:
save_dir = 'gpt_ft/final'
trainer.save_model(save_dir)
print("Saved model to:", save_dir)

Saved model to: gpt_ft/final


In [28]:
save_dir = 'gpt_ft/final'
finetuned_model = GPT2LMHeadModel.from_pretrained(save_dir, local_files_only=True)

In [29]:
finetuned_model

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(90000, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2SdpaAttention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=90000, bias=False)
)

In [30]:
print("input (test):", input_text)

print("--------------------------\n")

print("model's answer: \n")
print(inference(input_text, finetuned_model, tokenizer))

print("--------------------------\n")
print("real answer: \n")
print(data["test"][0]["output"])

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


input (test): Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Determine core promoter detection of following dna sequence, The result will be one of the following: Non-promoter, promoter.

### Input:
CCAGGATGCGCTGACGACCCGGCTGGCAGGCGGGTCCTCGTGGGCGAGGCGAGGGAGGCGGCGAGAGAGGAGCAATAGTTTCCCACCGCTCCCTCTCAGGCGCAGGGTCTAGAGAAGCGCGAGGGGATCTAGAGAAGCCGGAGGGGAGGAAGCGCGAGTCCGCGGCCCGCCCCGTTGCGTCCCACCCACCGCGTCCCCTCCCCTCCCCTCCCGCTGCGGGAAAAGCGGCCGCGGGCGGCGGCGCCCACTGTGGGGCGGGCGGAGCGCCGCGGGAGGCGGACGAGATGCGAGCGCGGCCGC

### Response:

--------------------------

model's answer: 

promoterpromoterpromo
--------------------------

real answer: 

promoter
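
The fine-tuned model produces the right label but keeps generating until `max_new_tokens=5` is exhausted, so the raw text reads `promoterpromoterpromo`. A small post-processing step can map the raw generation back to one of the two labels. The sketch below is one possible approach, not the notebook's own evaluation code; `LABELS` and `normalize_response` are hypothetical names.

```python
LABELS = ("Non-promoter", "promoter")  # the two label strings from the instruction

def normalize_response(raw_text):
    """Map raw generated text to one of the known labels, or None if neither matches."""
    text = raw_text.strip()
    for label in LABELS:
        # Prefix match: "promoterpromoterpromo" starts with "promoter"
        if text.startswith(label):
            return label
    return None

print(normalize_response("promoterpromoterpromo"))  # promoter
print(normalize_response("Non-promoter"))           # Non-promoter
print(normalize_response("TATAT"))                  # None
```

Matching on the prefix avoids the ambiguity that `promoter` is a substring of `Non-promoter`, and returning `None` makes off-format generations (such as the base model's `TATAT`) easy to count separately.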


In [31]:
test_data = data["test"].select(range(100))

data_list = []

for entry in test_data:
    # Prompt without the response; the model has to generate the label
    input_text = format_input(entry)
    response_text = inference(input_text, finetuned_model, tokenizer)
    # Use a separate name so the `data` DatasetDict is not overwritten
    record = {
        "instruction": entry["instruction"],
        "input": entry["input"],
        "output": entry["output"],
        "model_response": response_text,
    }

    data_list.append(record)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.

In [32]:
import json

# Output file path
output_file = 'gpt2-small3-1024.json'

# Write the collected predictions to a JSON file
with open(output_file, "w") as file:
    json.dump(data_list, file, indent=4)  # "indent" for pretty-printing


In [1]:
import json


output_file = 'gpt2-small3-1024.json'

with open(output_file, "r") as file:
    test_data = json.load(file)

all_num = len(test_data)
right_sum = 0
same_sum = 0
for item in test_data:
    output = item["output"]
    #output = " ".join(tokenizer.tokenize(output))
    model_response = item["model_response"]

    print(output,"||||||||||||", model_response)

    if model_response == output:  # exact string match
        same_sum = same_sum + 1

    if output.find("Non") == -1:  # expected label is "promoter"
        # Correct if the response contains "promoter" but not "Non"
        if model_response.find(output) != -1 and model_response.find("Non") == -1:
            right_sum = right_sum + 1
    else:  # expected label is "Non-promoter"
        if model_response.find(output) != -1:
            right_sum = right_sum + 1


print("Accuracy", right_sum/all_num, "same", same_sum/all_num)

promoter |||||||||||| promoterpromoterpromo
Non-promoter |||||||||||| Non-promoter
Non-promoter |||||||||||| Non-promoter
Non-promoter |||||||||||| Non-promoter
promoter |||||||||||| promoterpromoterpromo
Non-promoter |||||||||||| Non-promoter
promoter |||||||||||| promoterpromoterpromo
Non-promoter |||||||||||| Non-promoter
Non-promoter |||||||||||| Non-promoter
promoter |||||||||||| promoterpromoterpromo
Non-promoter |||||||||||| Non-promoter
Non-promoter |||||||||||| Non-promoter
promoter |||||||||||| promoterpromoterpromo
Non-promoter |||||||||||| Non-promoter
Non-promoter |||||||||||| Non-promoter
Non-promoter |||||||||||| Non-promoter
Non-promoter |||||||||||| promoterpromoterpromo
Non-promoter |||||||||||| Non-promoter
promoter |||||||||||| promoterpromoterpromo
Non-promoter |||||||||||| Non-promoter
Non-promoter |||||||||||| Non-promoter
promoter |||||||||||| promoterpromoterpromo
Non-promoter |||||||||||| Non-promoter
Non-promoter |||||||||||| Non-promoter
promoter |||||||||||