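- [1. Differences from knowlm-13b-zhixi](#1-differences-from-knowlm-13b-zhixi)
- [2. Information Extraction Templates](#2-information-extraction-templates)
- [3. Common Relation Types](#3-common-relation-types)
- [4. Conversion Scripts](#4-conversion-scripts)
- [5. Ready-made Datasets](#5-ready-made-datasets)
- [6. Usage](#6-usage)
- [7. Evaluation](#7-evaluation)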


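# 1. Differences from knowlm-13b-zhixi

Compared with zjunlp/knowlm-13b-zhixi, zjunlp/knowlm-13b-ie is somewhat more practical for information extraction, at the cost of reduced general-purpose applicability.

zjunlp/knowlm-13b-ie samples about 10% of the data from Chinese and English information extraction datasets and then applies negative sampling. For example, if dataset A contains the labels [a, b, c, d, e, f], we first sample 10% of A. A given sample s may contain only labels a and b. We then randomly add relations that s does not contain, such as c and d, drawn from the specified candidate relation list. When the model encounters these extra relations, it may output text such as 'NAN'. This gives the model some ability to produce 'NAN' outputs, which strengthens its information extraction ability but weakens its generalization.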
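
The negative sampling step described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual training pipeline: the function name `add_negative_relations`, the number of negatives, and the way `'NAN'` is attached to each negative relation are all hypothetical.

```python
import random


def add_negative_relations(sample_labels, all_labels, num_negatives, rng):
    """Return (candidates, negatives): the sample's own labels plus
    randomly chosen labels that the sample does NOT contain."""
    absent = [label for label in all_labels if label not in sample_labels]
    negatives = rng.sample(absent, min(num_negatives, len(absent)))
    candidates = list(sample_labels) + negatives
    rng.shuffle(candidates)  # avoid always listing the negatives last
    return candidates, negatives


rng = random.Random(0)
all_labels = ["a", "b", "c", "d", "e", "f"]  # labels of dataset A
sample_labels = ["a", "b"]                   # labels present in sample s

candidates, negatives = add_negative_relations(sample_labels, all_labels, 2, rng)

# Assumption: each added relation has nothing to extract, so its target is 'NAN'.
expected = {relation: "NAN" for relation in negatives}
print(candidates, expected)
```

The candidate list shown to the model then mixes relations that are present in the sample with relations that are not, which is what teaches it to answer 'NAN' for the latter.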



# 2. Information Extraction Templates
Relation extraction (RE) supports the following templates:

```python
relation_template_zh =  {
    0:'已知候选的关系列表：{s_schema}，请你根据关系列表，从以下输入中抽取出可能存在的头实体与尾实体，并给出对应的关系三元组。请按照{s_format}的格式回答。',
    1:'我将给你个输入，请根据关系列表：{s_schema}，从输入中抽取出可能包含的关系三元组，并以{s_format}的形式回答。',
    2:'我希望你根据关系列表从给定的输入中抽取可能的关系三元组，并以{s_format}的格式回答，关系列表={s_schema}。',
    3:'给定的关系列表是{s_schema}\n根据关系列表抽取关系三元组，在这个句子中可能包含哪些关系三元组？请以{s_format}的格式回答。',
} 

relation_int_out_format_zh = {
    0:['"(头实体,关系,尾实体)"', relation_convert_target0],
    1:['"头实体是\n关系是\n尾实体是\n\n"', relation_convert_target1],
    2:['"关系：头实体,尾实体\n"', relation_convert_target2],
    3:["JSON字符串[{'head':'', 'relation':'', 'tail':''}, ]", relation_convert_target3],
}

relation_template_en =  {
    0:'Identify the head entities (subjects) and tail entities (objects) in the following text and provide the corresponding relation triples from relation list {s_schema}. Please provide your answer as a list of relation triples in the form of {s_format}.',
    1:'From the given text, extract the possible head entities (subjects) and tail entities (objects) and give the corresponding relation triples. The relations are {s_schema}. Please format your answer as a list of relation triples in the form of {s_format}.', 
}

relation_int_out_format_en = {
    0:['(Subject, Relation, Object)', relation_convert_target0_en],
    1:["{'head':'', 'relation':'', 'tail':''}", relation_convert_target1_en],
}

```


The schema (`{s_schema}`) and output format (`{s_format}`) placeholders are embedded in these templates and must be filled in by the user.
For a fuller picture of the templates, see [ner_template.py](https://github.com/zjunlp/DeepKE/blob/main/example/llm/InstructKGC/kg2instruction/ner_template.py), [re_template.py](https://github.com/zjunlp/DeepKE/blob/main/example/llm/InstructKGC/kg2instruction/re_template.py), and [ee_template.py](https://github.com/zjunlp/DeepKE/blob/main/example/llm/InstructKGC/kg2instruction/ee_template.py).
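
As a minimal sketch of how the two placeholders might be filled, the snippet below formats the first English template with a candidate relation list. The relation names are made up for illustration, and serialising the list with `json.dumps` is an assumption; the conversion scripts may serialise the schema differently.

```python
import json

# Copied from relation_template_en[0] and relation_int_out_format_en[0].
template = (
    "Identify the head entities (subjects) and tail entities (objects) in the "
    "following text and provide the corresponding relation triples from relation "
    "list {s_schema}. Please provide your answer as a list of relation triples "
    "in the form of {s_format}."
)
output_format = "(Subject, Relation, Object)"

# Hypothetical candidate relations, used only for this example.
relations = ["founder", "located in"]

instruction = template.format(
    s_schema=json.dumps(relations, ensure_ascii=False),  # assumed serialisation
    s_format=output_format,
)
print(instruction)
```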



# 3. Common Relation Types

```python
{
    '组织': ['别名', '位于', '类型', '成立时间', '解散时间', '成员', '创始人', '事件', '子组织', '产品', '成就', '运营'], 
    '医学': ['别名', '病因', '症状', '可能后果', '包含', '发病部位'], 
    '事件': ['别名', '类型', '发生时间', '发生地点', '参与者', '主办方', '提名者', '获奖者', '赞助者', '获奖作品', '获胜者', '奖项'], 
    '运输': ['别名', '位于', '类型', '属于', '途径', '开通时间', '创建时间', '车站等级', '长度', '面积'], 
    '人造物件': ['别名', '类型', '受众', '成就', '品牌', '产地', '长度', '宽度', '高度', '重量', '价值', '制造商', '型号', '生产时间', '材料', '用途', '发现者或发明者'], 
    '生物': ['别名', '学名', '类型', '分布', '父级分类单元', '主要食物来源', '用途', '长度', '宽度', '高度', '重量', '特征'], 
    '建筑': ['别名', '类型', '位于', '临近', '名称由来', '长度', '宽度', '高度', '面积', '创建时间', '创建者', '成就', '事件'], 
    '自然科学': ['别名', '类型', '性质', '生成物', '用途', '组成', '产地', '发现者或发明者'], 
    '地理地区': ['别名', '类型', '所在行政领土', '接壤', '事件', '面积', '人口', '行政中心', '产业', '气候'], 
    '作品': ['别名', '类型', '受众', '产地', '成就', '导演', '编剧', '演员', '平台', '制作者', '改编自', '包含', '票房', '角色', '作曲者', '作词者', '表演者', '出版时间', '出版商', '作者'], 
    '人物': ['别名', '籍贯', '国籍', '民族', '朝代', '出生时间', '出生地点', '死亡时间', '死亡地点', '专业', '学历', '作品', '职业', '职务', '成就', '所属组织', '父母', '配偶', '兄弟姊妹', '亲属', '同事', '参与'], 
    '天文对象': ['别名', '类型', '坐标', '发现者', '发现时间', '名称由来', '属于', '直径', '质量', '公转周期', '绝对星等', '临近']
}
```

The [schema](https://github.com/zjunlp/DeepKE/blob/main/example/llm/InstructKGC/kg2instruction/schema.py) shown here covers 12 text topics and the common relation types under each topic.

# 4. Conversion Scripts

Two scripts, [convert.py](https://github.com/zjunlp/DeepKE/blob/main/example/llm/InstructKGC/kg2instruction/convert.py) and [convert_test.py](https://github.com/zjunlp/DeepKE/blob/main/example/llm/InstructKGC/kg2instruction/convert_test.py), convert data into instructions that can be fed directly to KnowLM. Before running convert.py, check the [data](https://github.com/zjunlp/DeepKE/tree/main/example/llm/InstructKGC/data) directory, which contains the expected data format for each task.

```bash
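# --language: the templates and conversion scripts differ by language
# --task:     one of ['RE', 'NER', 'EE']
# --sample:   -1 randomly samples one of the 4 instructions and 4 output formats;
#             otherwise the given instruction format is used (-1 <= sample <= 3)
# --all:      whether the extraction type list in the instruction is set to the full schema
python kg2instruction/convert.py \
  --src_path data/NER/sample.json \
  --tgt_path data/NER/processed.json \
  --schema_path data/NER/schema.json \
  --language zh \
  --task NER \
  --sample 0 \
  --all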
```

[convert_test.py](https://github.com/zjunlp/DeepKE/blob/main/example/llm/InstructKGC/kg2instruction/convert_test.py) does not require the data to have label fields (`entity`, `relation`, `event`). It only needs an `input` field and a `schema_path`, which makes it suitable for processing test data.

```bash
python kg2instruction/convert_test.py \
    --src_path data/NER/sample.json \
    --tgt_path data/NER/processed.json \
    --schema_path data/NER/schema.json \
    --language zh \
    --task NER \
    --sample 0
```


# 5. Ready-made Datasets

Some ready-made, already processed datasets are listed below:

| Name | Download | Size | Description |
| ---- | -------- | ---- | ----------- |
| KnowLM-IE.json | [Google drive](https://drive.google.com/file/d/1hY_R6aFgW4Ga7zo41VpOVOShbTgBqBbL/view?usp=sharing) <br/> [HuggingFace](https://huggingface.co/datasets/zjunlp/KnowLM-IE) | 281860 | The dataset described in [InstructIE](https://arxiv.org/abs/2305.11527) |
| KnowLM-ke | [HuggingFace](https://huggingface.co/datasets/zjunlp/knowlm-ke) | XXXX | All instruction data (general, IE, Code, COT, etc.) used to train [zjunlp/knowlm-13b-zhixi](https://huggingface.co/zjunlp/knowlm-13b-zhixi) |


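`KnowLM-IE.json`: contains the fields `'id'` (unique identifier), `'cate'` (text topic), `'instruction'` (extraction instruction), `'input'` (input text), `'output'` (output text), and `'relation'` (triples). You can freely build extraction instructions and outputs from `'relation'`. `'instruction'` comes in 16 formats (4 prompts × 4 output formats), and `'output'` is the text generated in the output format specified by `'instruction'`.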
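
As a rough sketch of building an output string from the `'relation'` field, the snippet below renders triples in the `"(头实体,关系,尾实体)"` style of output format 0. The record is hypothetical, the `head`/`relation`/`tail` keys inside each triple are an assumption, and joining triples with a comma is also an assumption; `triples_to_output` is an illustrative helper, not one of the `relation_convert_target*` functions.

```python
def triples_to_output(triples):
    """Render triples as "(head,relation,tail)" items joined by commas."""
    return ",".join(
        f"({t['head']},{t['relation']},{t['tail']})" for t in triples
    )


# Hypothetical record; field names follow the description of KnowLM-IE.json.
record = {
    "id": "hypothetical-0001",
    "cate": "人物",
    "input": "鲁迅出生于浙江绍兴。",
    "relation": [
        {"head": "鲁迅", "relation": "出生地点", "tail": "浙江绍兴"},
    ],
}

output = triples_to_output(record["relation"])
print(output)
```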


`KnowLM-ke`: contains only the `'instruction'`, `'input'`, and `'output'` fields. The files `ee-en.json`, `ee_train.json`, `ner-en.json`, `ner_train.json`, `re-en.json`, and `re_train.json` in its directory hold the Chinese and English IE instruction data.



# 6. Usage
We provide a script, [inference.py](https://github.com/zjunlp/DeepKE/blob/main/example/llm/InstructKGC/src/inference.py), for running inference directly with the `zjunlp/knowlm-13b-ie` model. See [README.md](https://github.com/zjunlp/DeepKE/blob/main/example/llm/InstructKGC/README.md) for environment setup and related configuration.

```bash
CUDA_VISIBLE_DEVICES="0" python src/inference.py \
    --model_name_or_path 'models/knowlm-13b-ie' \
    --model_name 'llama' \
    --input_file 'data/NER/processed.json' \
    --output_file 'results/ner_test.json' \
    --fp16 
```

If GPU memory is insufficient, use `--bits 8` or `--bits 4`.


# 7. Evaluation
We provide a script, [evaluate.py](https://github.com/zjunlp/DeepKE/blob/main/example/llm/InstructKGC/kg2instruction/evaluate.py), that converts the model's string output into a list and computes the F1 score.

```bash
python kg2instruction/evaluate.py \
  --standard_path data/NER/processed.json \
  --submit_path data/NER/processed.json \
  --task ner \
  --language zh
```