---
license: mit
datasets:
- Umean/B2NERD
language:
- en
- zh
library_name: peft
---

This is the B2NER LoRA adapter based on [InternLM2-20B](https://huggingface.co./internlm/internlm2-20b).

**See the [GitHub repo](https://github.com/UmeanNever/B2NER) for quick demo usage and more information about this work.**

## B2NER

We present B2NERD, a cohesive and efficient dataset that improves LLMs' generalization on the challenging Open NER task, refined from 54 existing English and Chinese datasets.
Our B2NER models, trained on B2NERD, outperform GPT-4 by 6.8-12.0 F1 points and surpass previous methods on 3 out-of-domain benchmarks spanning 15 datasets and 6 languages.

- 📖 Paper: [Beyond Boundaries: Learning a Universal Entity Taxonomy across Datasets and Languages for Open Named Entity Recognition](http://arxiv.org/abs/2406.11192)
- 🎮 Code Repo: We provide code for both training and inference at https://github.com/UmeanNever/B2NER.
- 📀 Data: See [B2NERD](https://huggingface.co./datasets/Umean/B2NERD).
- 💾 Model (LoRA Adapters): This repo hosts the B2NER LoRA adapter based on InternLM2-20B. See the [7B model](https://huggingface.co./Umean/B2NER-Internlm2.5-7B-LoRA) for an adapter based on InternLM2.5-7B.

## Sample Usage - Quick Demo

Here we show how to use the provided LoRA adapter for a quick demo with customized input. You can also refer to `src/demo.ipynb` in the GitHub repo to see our examples and reuse them for your own demo.

- Prepare/download our LoRA checkpoint and the corresponding backbone model (a download sketch follows the loading snippet below).
- Load the model & tokenizer.

```python
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model and tokenizer; use your own path/name
base_model_path = "/path/to/backbone_model"
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_path, trust_remote_code=True, torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(base_model_path, trust_remote_code=True)

# Load and apply the PEFT adapter; point the weight path to your own directory
# where an adapter_config.json is located
lora_weight_path = "/path/to/adapter"
config = PeftConfig.from_pretrained(lora_weight_path)
# Keep the adapter dtype consistent with the base model
model = PeftModel.from_pretrained(base_model, lora_weight_path, torch_dtype=torch.bfloat16)
model.eval()
```
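
If you still need the checkpoints, the sketch below shows one way to fetch both from the Hugging Face Hub with the `huggingface_hub` library. The backbone repo id matches the link above; the adapter repo id is a placeholder assumption, so substitute this repo's actual path.

```python
from huggingface_hub import snapshot_download

# Download the backbone and the LoRA adapter; the returned local paths can be
# used as base_model_path / lora_weight_path in the loading snippet above.
base_model_path = snapshot_download(repo_id="internlm/internlm2-20b")
lora_weight_path = snapshot_download(repo_id="Umean/B2NER-InternLM2-20B-LoRA")  # placeholder repo id
```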

- Set `text` and `labels` for your NER demo. Prepare the instruction and generate the answer. Below are an English example and a Chinese example based on our B2NER-InternLM2.5-7B (both examples use out-of-domain data).

```python
## English Example ##
# Input your own text and target entity labels. The model will extract
# entities belonging to the provided label set from the text.
text = "what is a good 1990 s romance movie starring kelsy grammer"
labels = ["movie genre", "year or time period", "movie title", "movie actor", "movie age rating"]

instruction_template_en = "Given the label set of entities, please recognize all the entities in the text. The answer format should be \"entity label: entity; entity label: entity\". \nLabel Set: {labels_str} \n\nText: {text} \nAnswer:"
labels_str = ", ".join(labels)
final_instruction = instruction_template_en.format(labels_str=labels_str, text=text)
inputs = tokenizer([final_instruction], return_tensors="pt")
# max_length counts prompt plus generated tokens; use max_new_tokens to bound only the answer
output = model.generate(**inputs, max_length=500)
generated_text = tokenizer.decode(output[0].tolist(), skip_special_tokens=True)
print(generated_text.split("Answer:")[-1])
# year or time period: 1990 s; movie genre: romance; movie actor: kelsy grammer


## Chinese Example ##
# Input your own text and target entity labels. The text, labels, and prompt
# template below are in Chinese, matching the model's Chinese instruction format.
text = "暴雪中国时隔多年之后再次举办了官方比赛,而Moon在星际争霸2中发挥不是很理想,对此Infi感觉Moon是哪里出了问题呢?"
labels = ["人名", "作品名->文字作品", "作品名->游戏作品", "作品名->影像作品", "组织机构名->政府机构", "组织机构名->公司", "组织机构名->其它", "地名"]

instruction_template_zh = "给定实体的标签范围,请识别文本中属于这些标签的所有实体。答案格式为 \"实体标签: 实体; 实体标签: 实体\"。\n标签范围: {labels_str}\n\n文本: {text} \n答案:"
labels_str = ", ".join(labels)
final_instruction = instruction_template_zh.format(labels_str=labels_str, text=text)
inputs = tokenizer([final_instruction], return_tensors="pt")
output = model.generate(**inputs, max_length=500)
generated_text = tokenizer.decode(output[0].tolist(), skip_special_tokens=True)
print(generated_text.split("答案:")[-1])
# 组织机构名->公司: 暴雪中国; 人名: Moon; 作品名->游戏作品: 星际争霸2; 人名: Infi
```
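
The answer is a flat string in the "entity label: entity; entity label: entity" format defined by the templates above. If you need structured output, a minimal parsing sketch follows; `parse_answer` is a hypothetical helper for illustration, not part of the released code. The Chinese template specifies the same separators, so it should cover both examples.

```python
def parse_answer(answer: str) -> list[tuple[str, str]]:
    """Split an 'entity label: entity; entity label: entity' answer into (label, entity) pairs."""
    pairs = []
    for chunk in answer.split(";"):
        if ":" in chunk:
            label, entity = chunk.split(":", 1)
            pairs.append((label.strip(), entity.strip()))
    return pairs

print(parse_answer("year or time period: 1990 s; movie genre: romance; movie actor: kelsy grammer"))
# [('year or time period', '1990 s'), ('movie genre', 'romance'), ('movie actor', 'kelsy grammer')]
```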

## Cite

```
@article{yang2024beyond,
  title={Beyond Boundaries: Learning a Universal Entity Taxonomy across Datasets and Languages for Open Named Entity Recognition},
  author={Yang, Yuming and Zhao, Wantong and Huang, Caishuang and Ye, Junjie and Wang, Xiao and Zheng, Huiyuan and Nan, Yang and Wang, Yuran and Xu, Xueying and Huang, Kaixin and others},
  journal={arXiv preprint arXiv:2406.11192},
  year={2024}
}
```