	---
	license: cc-by-nc-4.0
	base_model: google/gemma-2b
	model-index:
	- name: Octopus-V2-2B
	  results: []
	tags:
	- function calling
	- on-device language model
	- android
	inference: false
	space: false
	spaces: false
	language:
	- en
	---

	# Quantized Octopus V2: On-device language model for super agent

	This repo provides two kinds of quantized models, GGUF and AWQ, for our Octopus V2 model at [NexaAIDev/Octopus-v2](https://huggingface.co./NexaAIDev/Octopus-v2).

	<p align="center" width="100%">
	<a><img src="Octopus-logo.jpeg" alt="nexa-octopus" style="width: 40%; min-width: 300px; display: block; margin: auto;"></a>
	</p>


	# GGUF Quantization
	Run with [Ollama](https://github.com/ollama/ollama):

	```bash
	ollama run NexaAIDev/octopus-v2-Q4_K_M
	```
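
A locally running Ollama server can also be queried over HTTP. The sketch below is a minimal example, assuming Ollama is serving on its default address `http://localhost:11434` and that the model tag from the command above has already been pulled. The prompt template is copied from the AWQ example in this README; `build_prompt`, `build_payload`, and `ask_octopus` are hypothetical helper names, not part of Ollama or of this repo.

```python
import json
import urllib.request

# Assumption: default address of a locally running Ollama server.
OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL_TAG = "NexaAIDev/octopus-v2-Q4_K_M"

# Prompt wording taken from the AWQ example in this README.
PROMPT_TEMPLATE = (
    "Below is the query from the users, please call the correct function "
    "and generate the parameters to call the function.\n\n"
    "Query: {query} \n\nResponse:"
)


def build_prompt(query):
    """Wrap a user query in the prompt format used in this README."""
    return PROMPT_TEMPLATE.format(query=query)


def build_payload(query, model=MODEL_TAG):
    """Build the JSON body for a non-streaming /api/generate request."""
    return {"model": model, "prompt": build_prompt(query), "stream": False}


def ask_octopus(query, url=OLLAMA_URL):
    """Send the query to Ollama and return the generated text."""
    data = json.dumps(build_payload(query)).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

With the server running, `print(ask_octopus("Can you take a photo using the back camera and save it to the default location?"))` should print the model's function call.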

	# AWQ Quantization
	Python example:

	```python
	from awq import AutoAWQForCausalLM
	from transformers import AutoTokenizer, GemmaForCausalLM
	import torch
	import time
	import numpy as np

	def inference(input_text):
	    """Generate a response and report latency (s) and throughput (tokens/s)."""
	    tokens = tokenizer(
	        input_text,
	        return_tensors='pt'
	    ).input_ids.cuda()

	    start_time = time.time()
	    generation_output = model.generate(
	        tokens,
	        do_sample=True,
	        temperature=0.7,
	        top_p=0.95,
	        top_k=40,
	        max_new_tokens=512
	    )
	    end_time = time.time()

	    # Decode the full sequence, then split on the prompt so that the
	    # last element holds only the generated text.
	    res = tokenizer.decode(generation_output[0])
	    res = res.split(input_text)
	    latency = end_time - start_time
	    # Count tokens in the generated text only (res is a list after split).
	    output_tokens = tokenizer.encode(res[-1])
	    num_output_tokens = len(output_tokens)
	    throughput = num_output_tokens / latency

	    return {"output": res[-1], "latency": latency, "throughput": throughput}


	# Load the AWQ-quantized model and its tokenizer from a local directory.
	model_id = "path/to/Octopus-v2-AWQ"
	model = AutoAWQForCausalLM.from_quantized(model_id, fuse_layers=True,
	                                          trust_remote_code=False, safetensors=True)
	tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=False)

	prompts = ["Below is the query from the users, please call the correct function and generate the parameters to call the function.\n\nQuery: Can you take a photo using the back camera and save it to the default location? \n\nResponse:"]

	avg_throughput = []
	for prompt in prompts:
	    out = inference(prompt)
	    avg_throughput.append(out["throughput"])
	    print("nexa model result:\n", out["output"])

	print("avg throughput:", np.mean(avg_throughput))
	```
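
The example above measures throughput by re-encoding the decoded text. As an alternative sketch (not the method used for the benchmark below), the generated tokens can be counted directly: for decoder-only models, `generate` returns the prompt ids followed by the new ids, so slicing off the prompt length gives the new tokens. `generated_token_ids` and `tokens_per_second` are hypothetical helpers that work on plain lists of ids, so they can be tried without a GPU.

```python
def generated_token_ids(prompt_ids, output_ids):
    """Return the ids that follow the prompt in a generate() output sequence."""
    prompt_ids = list(prompt_ids)
    output_ids = list(output_ids)
    # The output of a decoder-only model is expected to echo the prompt first.
    if output_ids[:len(prompt_ids)] != prompt_ids:
        raise ValueError("output sequence does not start with the prompt ids")
    return output_ids[len(prompt_ids):]


def tokens_per_second(num_tokens, latency_seconds):
    """Throughput as generated tokens divided by wall-clock latency."""
    if latency_seconds <= 0:
        raise ValueError("latency must be positive")
    return num_tokens / latency_seconds
```

Inside `inference`, this could be called as `generated_token_ids(tokens[0].tolist(), generation_output[0].tolist())`, with the length of the result passed to `tokens_per_second` together with `latency`.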

	# Quantized GGUF & AWQ Models Benchmark

	| Name | Quant method | Bits | Size | Response (tokens/s) | Use Cases |
	| ---------------------- | ------------ | ---- | -------- | ------------------- | ----------------------------------- |
	| Octopus-v2-AWQ | AWQ | 4 | 3.00 GB | 63.83 | fast, high quality, recommended |
	| Octopus-v2-Q2_K.gguf | Q2_K | 2 | 1.16 GB | 57.81 | fast but high loss, not recommended |
	| Octopus-v2-Q3_K.gguf | Q3_K | 3 | 1.38 GB | 57.81 | extremely not recommended |
	| Octopus-v2-Q3_K_S.gguf | Q3_K_S | 3 | 1.19 GB | 52.13 | extremely not recommended |
	| Octopus-v2-Q3_K_M.gguf | Q3_K_M | 3 | 1.38 GB | 58.67 | moderate loss, not very recommended |
	| Octopus-v2-Q3_K_L.gguf | Q3_K_L | 3 | 1.47 GB | 56.92 | not very recommended |
	| Octopus-v2-Q4_0.gguf | Q4_0 | 4 | 1.55 GB | 68.80 | moderate speed, recommended |
	| Octopus-v2-Q4_1.gguf | Q4_1 | 4 | 1.68 GB | 68.09 | moderate speed, recommended |
	| Octopus-v2-Q4_K.gguf | Q4_K | 4 | 1.63 GB | 64.70 | moderate speed, recommended |
	| Octopus-v2-Q4_K_S.gguf | Q4_K_S | 4 | 1.56 GB | 62.16 | fast and accurate, very recommended |
	| Octopus-v2-Q4_K_M.gguf | Q4_K_M | 4 | 1.63 GB | 64.74 | fast, recommended |
	| Octopus-v2-Q5_0.gguf | Q5_0 | 5 | 1.80 GB | 64.80 | fast, recommended |
	| Octopus-v2-Q5_1.gguf | Q5_1 | 5 | 1.92 GB | 63.42 | very big, prefer Q4 |
	| Octopus-v2-Q5_K.gguf | Q5_K | 5 | 1.84 GB | 61.28 | big, recommended |
	| Octopus-v2-Q5_K_S.gguf | Q5_K_S | 5 | 1.80 GB | 62.16 | big, recommended |
	| Octopus-v2-Q5_K_M.gguf | Q5_K_M | 5 | 1.71 GB | 61.54 | big, recommended |
	| Octopus-v2-Q6_K.gguf | Q6_K | 6 | 2.06 GB | 55.94 | very big, not very recommended |
	| Octopus-v2-Q8_0.gguf | Q8_0 | 8 | 2.67 GB | 56.35 | very big, not very recommended |
	| Octopus-v2-f16.gguf | f16 | 16 | 5.02 GB | 36.27 | extremely big |
	| Octopus-v2.gguf | | | 10.00 GB | | |

	_Quantized with llama.cpp_
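
To pick a variant for a given storage budget, the benchmark numbers can be filtered programmatically. The sketch below copies the size and throughput columns from the table above and returns the fastest variant that fits; `fastest_within` and its `min_bits` parameter are hypothetical, and the selection looks only at size and speed, so the quality notes in the "Use Cases" column still need to be checked by hand.

```python
# (quant method, bits, size in GB, throughput in tokens/s), copied from the table.
BENCHMARK = [
    ("AWQ", 4, 3.00, 63.83),
    ("Q2_K", 2, 1.16, 57.81),
    ("Q3_K", 3, 1.38, 57.81),
    ("Q3_K_S", 3, 1.19, 52.13),
    ("Q3_K_M", 3, 1.38, 58.67),
    ("Q3_K_L", 3, 1.47, 56.92),
    ("Q4_0", 4, 1.55, 68.80),
    ("Q4_1", 4, 1.68, 68.09),
    ("Q4_K", 4, 1.63, 64.70),
    ("Q4_K_S", 4, 1.56, 62.16),
    ("Q4_K_M", 4, 1.63, 64.74),
    ("Q5_0", 5, 1.80, 64.80),
    ("Q5_1", 5, 1.92, 63.42),
    ("Q5_K", 5, 1.84, 61.28),
    ("Q5_K_S", 5, 1.80, 62.16),
    ("Q5_K_M", 5, 1.71, 61.54),
    ("Q6_K", 6, 2.06, 55.94),
    ("Q8_0", 8, 2.67, 56.35),
    ("f16", 16, 5.02, 36.27),
]


def fastest_within(max_size_gb, min_bits=0):
    """Return the highest-throughput row that fits the size budget, or None."""
    candidates = [
        row for row in BENCHMARK
        if row[2] <= max_size_gb and row[1] >= min_bits
    ]
    if not candidates:
        return None
    # Rank by throughput (fourth field of each row).
    return max(candidates, key=lambda row: row[3])
```

For example, `fastest_within(1.6)` selects the `Q4_0` row, and `fastest_within(3.0, min_bits=5)` selects `Q5_0`.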


	Acknowledgement:
	We sincerely thank our community members, [Mingyuan](https://huggingface.co./ThunderBeee), [Zoey](https://huggingface.co./ZY6), [Brian](https://huggingface.co./JoyboyBrian), [Perry](https://huggingface.co./PerryCheng614), [Qi](https://huggingface.co./qiqiWav), [David](https://huggingface.co./Davidqian123) for their extraordinary contributions to this quantization effort.