	---
	license: other
	tags:
	- text2text-generation
	pipeline_tag: text2text-generation
	language:
	- zh
	- en
	---

	Because of LLaMA's license constraints, this model is for research and learning only.
	Please strictly respect LLaMA's usage policy. We are not allowed to publish LLaMA weights, even finetuned ones, but we can publish the difference: a patch to apply to the original files.
	The encryption is a simple XOR between files, so only people who have access to the original weights (from completely legal sources, of course) can transform the patch into the finetuned weights.
	You can find the decryption code at https://github.com/LianjiaTech/BELLE/tree/main/models.
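
To illustrate why the XOR scheme restricts the finetuned weights to holders of the original LLaMA files, here is a minimal sketch. It is not the project's `decrypt.py`: the function name `xor_bytes`, the byte strings, and the choice to repeat the key when it is shorter than the data are all assumptions made for illustration.

```python
from itertools import cycle


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR each byte of `data` with `key`, repeating the key if it is shorter."""
    if not key:
        raise ValueError("key must not be empty")
    return bytes(d ^ k for d, k in zip(data, cycle(key)))


# Hypothetical stand-ins for file contents; real weight files are several GB.
original_llama = b"original LLaMA weight bytes"  # the key: only licensed users have it
finetuned = b"finetuned BELLE weight bytes!"

encrypted = xor_bytes(finetuned, original_llama)  # what gets published
recovered = xor_bytes(encrypted, original_llama)  # what a key holder obtains

# XOR is its own inverse, so applying the same key twice restores the input.
assert recovered == finetuned
```

Without the original bytes as the key, the published file cannot be turned back into the finetuned weights by this operation.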


	# Model Card for BELLE-LLAMA-13B-2M-enc

	## Welcome
	If you find this model helpful, please like this model and star us on https://github.com/LianjiaTech/BELLE !

	## Update
	A new checkpoint trained with a learning rate of 5e-6 has been uploaded.
	In our evaluation, LLaMA trained with the smaller learning rate achieved better performance.

	## Model description
	BELLE-LLAMA-13B-2M-enc is based on LLAMA 13B and finetuned on 2M Chinese data samples combined with 50,000 English samples from the open-source Stanford-Alpaca dataset, giving it good Chinese instruction understanding and response generation capabilities.

	The code for generating the Chinese data and other details can be found in our GitHub repository: https://github.com/LianjiaTech/BELLE.


	## Training hyper-parameters
	| Parameter | Value |
	| ------ | ------ |
	| Batch size | 16 |
	| Learning rate | 2e-5 |
	| Epochs | 3 |
	| Weight_decay | 0.0 |
	| Warmup_rate | 0.03 |
	| LR_scheduler | cosine |
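
The table names a warmup rate and a cosine scheduler but does not spell out the exact schedule. The sketch below shows one common interpretation, assumed here rather than confirmed by the source: linear warmup over the first 3% of steps, followed by cosine decay to zero. The function name `lr_at_step` and the total step count are hypothetical.

```python
import math


def lr_at_step(step: int, total_steps: int, peak_lr: float = 2e-5,
               warmup_rate: float = 0.03) -> float:
    """Learning rate at `step` under linear warmup followed by cosine decay to zero."""
    warmup_steps = round(total_steps * warmup_rate)
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total = 1000  # hypothetical number of optimizer steps
for s in (0, 15, 30, 515, 1000):
    print(s, f"{lr_at_step(s, total):.2e}")
```

Under these assumptions the rate peaks at the table's learning rate when warmup ends and falls to zero at the final step.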

	## Download, Convert & Check
	1. After you git clone this model, check the MD5 checksums of the encrypted files:
	```
	md5sum ./*
	029965adbff7a240f33d040dedca0a54 ./config.json.e366f0c901ee336cb921450f975b3e3c5e32874035d227f4263dbcb5d966b822.enc
	b1cc6321ba72757b82842cc44ffadbf3 ./generation_config.json.fd7ff399e5568cc21a0a8414f43df88ef7c424995b9b97a90563165d2cf79efd.enc
	0311f7aac77860f24e5d6379043a1c5e ./pytorch_model-00001-of-00003.bin.5abb160ecbd441c6a1fbe00a9eaa194ee0bd8cd75850c24f503336bd29f0dc45.enc
	e1f8ffc06377eaa516c72091d49af6ec ./pytorch_model-00002-of-00003.bin.46a0e748edff9f0f82aa5f3e721e80e0f342f3d03dc47d0ec6514ea78a585320.enc
	f1fd70e919041e63d7f8b104380dfcb1 ./pytorch_model-00003-of-00003.bin.ec6e4d45dc4c51f2b9abff5ea9840f06f633e065cdf574b71e96366c26a01578.enc
	bf19c5b8dc64bfb19400a4b7fb3bc5b6 ./pytorch_model.bin.index.json.72e91e29282dae48ea5562fcf4d6ca0d5a9c2a30ebc8d67174a19e192552a20b.enc
	1ab707fa9b0c4be294fd0b867d73e919 ./special_tokens_map.json.44136fa355b3678a1146ad16f7e8649e94fb4fc21fe77e8310c060f61caaff8a.enc
	cae7b4ee8d1ad4e4402632bb0600cc17 ./tokenizer_config.json.ef7ef410b9b909949e96f172b17cbf7c68b11761c632715fa05a6088c0c2b9ac.enc
	848005d07146c31e73a10020b3a3099a ./tokenizer.model.9e556afd44213b6bd1be2b850ebbbd98f5481437a8021afaf58ee7fb1818d347.enc
	```

	2. Decrypt the files using the scripts in https://github.com/LianjiaTech/BELLE/tree/main/models

	You can use the following command in Bash:
	```bash
	for f in "encrypted"/*; \
	do if [ -f "$f" ]; then \
	python3 decrypt.py "$f" "/path/to_original_llama_13B/consolidated.00.pth" "result/"; \
	fi; \
	done
	```
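
For environments without Bash, the loop above can be approximated in Python. This is a sketch under assumptions: it calls `decrypt.py` with the same three positional arguments as the Bash command, uses the current interpreter instead of `python3`, and creates the output directory up front, which the Bash version does not do. The helper names are hypothetical.

```python
import subprocess
import sys
from pathlib import Path


def build_decrypt_commands(encrypted_dir, llama_weights, output_dir, script="decrypt.py"):
    """Return one argv list per regular file in `encrypted_dir`, sorted by name."""
    return [
        [sys.executable, script, str(path), str(llama_weights), str(output_dir)]
        for path in sorted(Path(encrypted_dir).iterdir())
        if path.is_file()  # mirrors the `[ -f "$f" ]` test in the Bash loop
    ]


def decrypt_all(encrypted_dir, llama_weights, output_dir):
    """Run the decrypt script once per encrypted file, stopping at the first failure."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)  # assumption: safe to pre-create
    for cmd in build_decrypt_commands(encrypted_dir, llama_weights, output_dir):
        subprocess.run(cmd, check=True)


# Example call, using the same placeholder paths as the Bash command:
# decrypt_all("encrypted", "/path/to_original_llama_13B/consolidated.00.pth", "result/")
```

Separating command construction from execution makes it easy to print the commands for a dry run before decrypting several gigabytes of weights.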

	After running the command above, you will obtain the following files:

	```
	./config.json
	./generation_config.json
	./pytorch_model-00001-of-00003.bin
	./pytorch_model-00002-of-00003.bin
	./pytorch_model-00003-of-00003.bin
	./pytorch_model.bin.index.json
	./README.md
	./special_tokens_map.json
	./tokenizer_config.json
	./tokenizer.model
	```

	3. Check the MD5 checksums

	You can verify that the files were recovered completely by comparing their MD5 checksums with the values below:
	```
	md5sum ./*
	0fa6ff8379308d40f090878593f085a9 ./config.json
	2917a1cafb895cf57e746cfd7696bfe5 ./generation_config.json
	1710f2d139d883d7e1e9a3f3198ee581 ./pytorch_model-00001-of-00003.bin
	74b26646e31debd94c5c1092b3e39102 ./pytorch_model-00002-of-00003.bin
	1c123bee82a65a43b6005b7040e20618 ./pytorch_model-00003-of-00003.bin
	621720a147e0dd2a97580ab5dd0c5557 ./pytorch_model.bin.index.json
	d463d8a04501fbf1d71feaa8fc1be250 ./README.md
	99914b932bd37a50b983c5e7c90ae93b ./special_tokens_map.json
	5526ad31f4928acb5219e295e5ff81ce ./tokenizer_config.json
	eeec4125e9c7560836b4873b6f8e3025 ./tokenizer.model
	```
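
Where `md5sum` is not available, the same check can be done in Python. The sketch below is one way to do it, not part of the project's scripts: the helper names are hypothetical, and the `EXPECTED` dictionary copies only a subset of the checksums from the listing above.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(directory, expected):
    """Return {filename: actual_md5_or_None} for every file that fails the check."""
    bad = {}
    for name, want in expected.items():
        path = Path(directory) / name
        got = md5_of_file(path) if path.is_file() else None  # None marks a missing file
        if got != want:
            bad[name] = got
    return bad


# Checksums copied from the listing above (subset shown; extend as needed).
EXPECTED = {
    "config.json": "0fa6ff8379308d40f090878593f085a9",
    "generation_config.json": "2917a1cafb895cf57e746cfd7696bfe5",
    "tokenizer.model": "eeec4125e9c7560836b4873b6f8e3025",
}

# Example: print(find_mismatches("result", EXPECTED) or "all files match")
```

An empty result means every listed file exists and matches; a `None` value means the file is missing from the directory.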

	## Use model
	Please note that the input should be formatted as follows in both training and inference.
	``` python
	Human: {input} \n\nAssistant:
	```
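
The template above can be wrapped in small helpers so that every call uses the same format. This is a minimal sketch; `build_prompt` and `extract_response` are hypothetical names, and the template string is copied from the format shown above.

```python
PROMPT_TEMPLATE = "Human: {input} \n\nAssistant:"


def build_prompt(user_input: str) -> str:
    """Wrap a user instruction in the Human/Assistant format the model was trained on."""
    return PROMPT_TEMPLATE.format(input=user_input)


def extract_response(decoded: str, prompt: str) -> str:
    """Strip the echoed prompt from decoded model output, if present."""
    return decoded[len(prompt):] if decoded.startswith(prompt) else decoded


prompt = build_prompt("Hello")
assert prompt == "Human: Hello \n\nAssistant:"
assert extract_response(prompt + " Hi there!", prompt) == " Hi there!"
```

Checking `startswith` before slicing avoids cutting into the reply when the decoded text does not reproduce the prompt exactly, for example after tokenizer whitespace normalization.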

	To load BELLE-LLAMA-13B-2M-enc with Hugging Face transformers, please install transformers from the main branch, as the latest stable version doesn't support LLAMA (as of March 26, 2023).
	```bash
	pip install git+https://github.com/huggingface/transformers
	```

	After you decrypt the files, BELLE-LLAMA-13B-2M can be easily loaded with LlamaForCausalLM.
	``` python
	from transformers import LlamaForCausalLM, AutoTokenizer
	import torch

	ckpt = './result/'  # directory containing the decrypted files
	device = torch.device('cuda')
	# device_map='auto' places the weights on the available devices
	model = LlamaForCausalLM.from_pretrained(ckpt, device_map='auto', low_cpu_mem_usage=True)
	tokenizer = AutoTokenizer.from_pretrained(ckpt)

	# The prompt follows the "Human: ... \n\nAssistant:" format
	# (the instruction means "Write a Chinese song praising nature")
	prompt = "Human: 写一首中文歌曲，赞美大自然 \n\nAssistant: "
	input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
	generate_ids = model.generate(
	    input_ids, max_new_tokens=500, do_sample=True, top_k=30, top_p=0.85,
	    temperature=0.5, repetition_penalty=1., eos_token_id=2, bos_token_id=1, pad_token_id=0,
	)
	output = tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
	response = output[len(prompt):]  # drop the echoed prompt, keep only the reply
	```

	## Limitations
	The model, trained on the current base model and data, still has a few issues:

	1. The model might generate factual errors when asked to follow instructions related to facts.

	2. The model occasionally generates harmful responses, since it still struggles to identify potentially harmful instructions.

	3. The model needs improvement in reasoning and coding.

	Since the model still has these limitations, we require that developers use the open-source code, data, model, and any other artifacts generated by this project for research purposes only. Commercial use and other potentially harmful uses are not allowed.


	## Citation

	Please cite us when using our code, data or model.

	```
	@misc{BELLE,
	author = {Yunjie Ji and Yong Deng and Yan Gong and Yiping Peng and Qiang Niu and Baochang Ma and Xiangang Li},
	title = {BELLE: Be Everyone's Large Language model Engine},
	year = {2023},
	publisher = {GitHub},
	journal = {GitHub repository},
	howpublished = {\url{https://github.com/LianjiaTech/BELLE}},
	}
	```