---
library_name: peft
datasets:
- databricks/databricks-dolly-15k
language:
- en
pipeline_tag: text-generation
---
|
|
|
# ctrltokyo/llama-2-7b-hf-dolly-flash-attention
|
|
|
This model is a fine-tuned version of [NousResearch/Llama-2-7b-hf](https://huggingface.co./NousResearch/Llama-2-7b-hf) on the databricks/databricks-dolly-15k dataset, with all training performed using Flash Attention 2.
|
|
|
No further testing or optimisation has been performed.
|
|
|
## Model description
|
|
|
Just like [ctrltokyo/llm_prompt_mask_fill_model](https://huggingface.co./ctrltokyo/llm_prompt_mask_fill_model), this model could be used for live autocompletion of prompts, but it is designed primarily as a generalized chatbot (hence the use of the Dolly 15k dataset). Don't try it on code, because it won't work.

I plan to release a further fine-tuned version using the [code_instructions_120k](https://huggingface.co./datasets/sahil2801/code_instructions_120k) dataset.
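
A minimal inference sketch is shown below. It loads the base model in fp16 and attaches this adapter with PEFT; the Dolly-style `### Instruction:` / `### Response:` prompt format is an assumption and may need adjusting to your own prompting.

```python
# Minimal inference sketch (assumes a CUDA GPU with torch, transformers,
# peft, and accelerate installed).
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "NousResearch/Llama-2-7b-hf"
adapter_id = "ctrltokyo/llama-2-7b-hf-dolly-flash-attention"

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(
    base_id,
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, adapter_id)

# Dolly-style instruction prompt -- an assumption, adjust as needed.
prompt = "### Instruction:\nExplain what a chatbot is in one sentence.\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```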
|
|
|
## Intended uses & limitations
|
|
|
Intended for general instruction-following chat and prompt autocompletion in English. It is not suitable for code generation, and no evaluation or safety testing has been performed.
|
|
|
## Training and evaluation data
|
|
|
No evaluation was performed. Training was done on an NVIDIA A100; the raw (unquantized) model appears to use around 20 GB of VRAM during inference.
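
If you want to check the memory footprint on your own hardware, one quick way is to inspect PyTorch's peak allocation around a generation call. This sketch assumes `model` and `tokenizer` are already loaded as in the inference example above.

```python
# Rough VRAM check around a single generation call (CUDA only).
import torch

torch.cuda.reset_peak_memory_stats()
inputs = tokenizer("### Instruction:\nSay hello.\n\n### Response:\n", return_tensors="pt").to(model.device)
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=32)
print(f"Peak allocated VRAM: {torch.cuda.max_memory_allocated() / 1024**3:.1f} GiB")
```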
|
|
|
## Training procedure
|
|
|
The following `bitsandbytes` quantization config was used during training (a code sketch of this config follows the list):
|
- load_in_8bit: False
- load_in_4bit: True
- llm_int8_threshold: 6.0
- llm_int8_skip_modules: None
- llm_int8_enable_fp32_cpu_offload: False
- llm_int8_has_fp16_weight: False
- bnb_4bit_quant_type: fp4
- bnb_4bit_use_double_quant: False
- bnb_4bit_compute_dtype: float32
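
Expressed in code, the list above corresponds to roughly the following `BitsAndBytesConfig`. The surrounding loading call is a sketch rather than the original training script; the Flash Attention 2 flag assumes a recent `transformers` release with the `flash-attn` package installed.

```python
# Sketch of the quantization setup listed above, as a transformers
# BitsAndBytesConfig. This is not the original training script.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="fp4",
    bnb_4bit_use_double_quant=False,
    bnb_4bit_compute_dtype=torch.float32,
)

base = AutoModelForCausalLM.from_pretrained(
    "NousResearch/Llama-2-7b-hf",
    quantization_config=bnb_config,
    torch_dtype=torch.float16,                 # FA2 kernels need fp16/bf16 activations
    attn_implementation="flash_attention_2",   # needs a recent transformers + flash-attn
    device_map="auto",
)
```

From here, a LoRA adapter would typically be attached with `peft.get_peft_model` before training; the exact LoRA hyperparameters used are not recorded in this card.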
|
### Framework versions
|
|
|
|
|
- PEFT 0.4.0 |