---
license: mit
datasets:
- Umean/B2NERD
language:
- en
- zh
library_name: peft
---

This is the B2NER LoRA adapter based on [InternLM2-20B](https://huggingface.co./internlm/internlm2-20b).

**See the [GitHub repo](https://github.com/UmeanNever/B2NER) for quick demo usage and more information about this work.**

## B2NER

We present B2NERD, a cohesive and efficient dataset that improves LLMs' generalization on the challenging Open NER task, refined from 54 existing English and Chinese datasets.
Our B2NER models, trained on B2NERD, outperform GPT-4 by 6.8-12.0 F1 points and surpass previous methods on 3 out-of-domain benchmarks spanning 15 datasets and 6 languages.

- 📖 Paper: [Beyond Boundaries: Learning a Universal Entity Taxonomy across Datasets and Languages for Open Named Entity Recognition](http://arxiv.org/abs/2406.11192)
- 🎮 Code Repo: We provide code for both training and inference at https://github.com/UmeanNever/B2NER.
- 📀 Data: See [B2NERD](https://huggingface.co./datasets/Umean/B2NERD).
- 💾 Model (LoRA Adapters): This repo hosts the B2NER LoRA adapter based on InternLM2-20B. See the [7B model](https://huggingface.co./Umean/B2NER-Internlm2.5-7B-LoRA) for an adapter based on InternLM2.5-7B.

## Sample Usage - Quick Demo

Here we show how to use the provided LoRA adapter for a quick demo with customized input. You can also refer to `src/demo.ipynb` in the GitHub repo to see our examples and reuse them for your own demo.

- Prepare/download our LoRA checkpoint and the corresponding backbone model (a download sketch follows the loading snippet below).
- Load the model & tokenizer.

```python
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model and tokenizer; use your own path/name
base_model_path = "/path/to/backbone_model"
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_path, trust_remote_code=True, torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(base_model_path, trust_remote_code=True)

# Load and apply the PEFT adapter; point the weight path to your own directory
# where an adapter_config.json is located
lora_weight_path = "/path/to/adapter"
config = PeftConfig.from_pretrained(lora_weight_path)
# Keep the adapter dtype consistent with the base model
model = PeftModel.from_pretrained(base_model, lora_weight_path, torch_dtype=torch.bfloat16)
model.eval()
```
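
If you still need the checkpoints, the sketch below shows one way to fetch both from the Hugging Face Hub with the `huggingface_hub` library. The backbone repo id matches the link above; the adapter repo id is a placeholder assumption, so substitute this repo's actual path.

```python
from huggingface_hub import snapshot_download

# Download the backbone and the LoRA adapter; the returned local paths can be
# used as base_model_path / lora_weight_path in the loading snippet above.
base_model_path = snapshot_download(repo_id="internlm/internlm2-20b")
lora_weight_path = snapshot_download(repo_id="Umean/B2NER-InternLM2-20B-LoRA")  # placeholder repo id
```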

- Set `text` and `labels` for your NER demo. Prepare the instruction and generate the answer. Below are an English example and a Chinese example based on our B2NER-InternLM2.5-7B (both examples use out-of-domain data).

```python
## English Example ##
# Input your own text and target entity labels. The model will extract
# entities belonging to the provided label set from the text.
text = "what is a good 1990 s romance movie starring kelsy grammer"
labels = ["movie genre", "year or time period", "movie title", "movie actor", "movie age rating"]

instruction_template_en = "Given the label set of entities, please recognize all the entities in the text. The answer format should be \"entity label: entity; entity label: entity\". \nLabel Set: {labels_str} \n\nText: {text} \nAnswer:"
labels_str = ", ".join(labels)
final_instruction = instruction_template_en.format(labels_str=labels_str, text=text)
inputs = tokenizer([final_instruction], return_tensors="pt")
# max_length counts prompt plus generated tokens; use max_new_tokens to bound only the answer
output = model.generate(**inputs, max_length=500)
generated_text = tokenizer.decode(output[0].tolist(), skip_special_tokens=True)
print(generated_text.split("Answer:")[-1])
# year or time period: 1990 s; movie genre: romance; movie actor: kelsy grammer


## Chinese Example ##
# Input your own text and target entity labels. The text, labels, and prompt
# template below are in Chinese, matching the model's Chinese instruction format.
text = "暴雪中国时隔多年之后再次举办了官方比赛,而Moon在星际争霸2中发挥不是很理想,对此Infi感觉Moon是哪里出了问题呢?"
labels = ["人名", "作品名->文字作品", "作品名->游戏作品", "作品名->影像作品", "组织机构名->政府机构", "组织机构名->公司", "组织机构名->其它", "地名"]

instruction_template_zh = "给定实体的标签范围,请识别文本中属于这些标签的所有实体。答案格式为 \"实体标签: 实体; 实体标签: 实体\"。\n标签范围: {labels_str}\n\n文本: {text} \n答案:"
labels_str = ", ".join(labels)
final_instruction = instruction_template_zh.format(labels_str=labels_str, text=text)
inputs = tokenizer([final_instruction], return_tensors="pt")
output = model.generate(**inputs, max_length=500)
generated_text = tokenizer.decode(output[0].tolist(), skip_special_tokens=True)
print(generated_text.split("答案:")[-1])
# 组织机构名->公司: 暴雪中国; 人名: Moon; 作品名->游戏作品: 星际争霸2; 人名: Infi
```
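
The answer is a flat string in the "entity label: entity; entity label: entity" format defined by the templates above. If you need structured output, a minimal parsing sketch follows; `parse_answer` is a hypothetical helper for illustration, not part of the released code. The Chinese template specifies the same separators, so it should cover both examples.

```python
def parse_answer(answer: str) -> list[tuple[str, str]]:
    """Split an 'entity label: entity; entity label: entity' answer into (label, entity) pairs."""
    pairs = []
    for chunk in answer.split(";"):
        if ":" in chunk:
            label, entity = chunk.split(":", 1)
            pairs.append((label.strip(), entity.strip()))
    return pairs

print(parse_answer("year or time period: 1990 s; movie genre: romance; movie actor: kelsy grammer"))
# [('year or time period', '1990 s'), ('movie genre', 'romance'), ('movie actor', 'kelsy grammer')]
```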

## Cite

```
@article{yang2024beyond,
  title={Beyond Boundaries: Learning a Universal Entity Taxonomy across Datasets and Languages for Open Named Entity Recognition},
  author={Yang, Yuming and Zhao, Wantong and Huang, Caishuang and Ye, Junjie and Wang, Xiao and Zheng, Huiyuan and Nan, Yang and Wang, Yuran and Xu, Xueying and Huang, Kaixin and others},
  journal={arXiv preprint arXiv:2406.11192},
  year={2024}
}
```