vLLM cannot run inference on this model (other 70B GPTQ models work fine).
ExLlama can run it, but ExLlama is not very stable.
vLLM would be the ideal option. Why doesn't it work with this model?
I am using this model in TGI without any issue. I used the latest AutoGPTQ to quantize it. https://github.com/huggingface/text-generation-inference
@MaziyarPanahi Thanks for quantizing the model and sharing it. How much VRAM do I need to load this version for inference?
TGI also runs out of VRAM for me:
docker run --gpus all --shm-size 1g -p 8001:8001 -v /home/tutu/models/miqu-1-70b-sf-GPTQ:/model ghcr.io/huggingface/text-generation-inference:1.4 --model-id /model --quantize gptq --hostname 0.0.0.0 --port 8001
TGI works fine with other GPTQ models.
So strange.
Is tokenizer_config.json correct? For example, "model_max_length"?
This model has a max length of 8k (8192). If you are short on VRAM, I would bring the max length down to 4k and also set cuda_memory_fraction to 0.95 so TGI can use nearly all the available GPU memory. (This model needs more memory than other 70B GPTQ models because it has double the context length.)
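To put a rough number on the context-length point, here is a minimal sketch of the KV-cache size per sequence. The architecture values are an assumption (Llama-2-70B-style: 80 layers, 8 KV heads, head dimension 128, fp16 cache) and are not confirmed anywhere in this thread, so check the model's config.json before relying on them.

```python
# Rough KV-cache estimate per sequence.
# ASSUMED architecture (Llama-2-70B-style), not taken from this thread:
# 80 layers, 8 KV heads (GQA), head dim 128, 2 bytes per value (fp16).

GIB = 1024 ** 3


def kv_cache_bytes(tokens: int, layers: int = 80, kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    # The factor 2 covers the separate key and value tensors in every layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


for max_len in (4096, 8192):
    size_gib = kv_cache_bytes(max_len) / GIB
    print(f"{max_len} tokens: {size_gib:.2f} GiB of KV cache per sequence")
```

Under these assumptions the cache grows linearly with the max length, so halving it from 8k to 4k halves the per-sequence cache. The weights themselves are unaffected.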
I am seeing the same on vLLM. I wonder if this is the watermarking?
So this is my TGI, and it's pretty fast!
{ model_id: "MaziyarPanahi/miqu-1-70b-sf-GPTQ", revision: None, validation_workers: 2, sharded: Some(true), num_shard: Some(4), quantize: Some(Gptq), speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_length: 7100, max_total_tokens: 8192, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 8192, max_batch_total_tokens: Some(1044000), max_waiting_tokens: 20, hostname: "b869416c7485", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 0.9, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, env: false }
2024-02-08T18:44:08.167914Z INFO text_generation_router: router/src/main.rs:420: Serving revision 010afdb6478a25946fc381a327c82b83a86e99b0 of model MaziyarPanahi/miqu-1-70b-sf-GPTQ
2024-02-08T18:44:08.167938Z INFO text_generation_router: router/src/main.rs:237: Using the Hugging Face API to retrieve tokenizer config
2024-02-08T18:44:08.174077Z INFO text_generation_router: router/src/main.rs:280: Warming up model
2024-02-08T18:44:18.394725Z WARN text_generation_router: router/src/main.rs:301: `--max-batch-total-tokens` is deprecated for Flash Attention models.
2024-02-08T18:44:18.394748Z WARN text_generation_router: router/src/main.rs:305: Inferred max batch total tokens: 419728
2024-02-08T18:44:18.394752Z INFO text_generation_router: router/src/main.rs:316: Setting max batch total tokens to 419728
2024-02-08T18:44:18.394754Z INFO text_generation_router: router/src/main.rs:317: Connected
2024-02-08T18:44:18.394758Z WARN text_generation_router: router/src/main.rs:322: Invalid hostname, defaulting to 0.0.0.0
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100 80GB PCIe On | 00000000:08:00.0 Off | 0 |
| N/A 35C P0 65W / 300W| 47394MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100 80GB PCIe On | 00000000:48:00.0 Off | 0 |
| N/A 33C P0 62W / 300W| 47402MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A100 80GB PCIe On | 00000000:88:00.0 Off | 0 |
| N/A 33C P0 63W / 300W| 47402MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A100 80GB PCIe On | 00000000:C8:00.0 Off | 0 |
| N/A 33C P0 62W / 300W| 51522MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 313848 C /opt/conda/bin/python3.10 47392MiB |
| 1 N/A N/A 313849 C /opt/conda/bin/python3.10 47400MiB |
| 2 N/A N/A 313850 C /opt/conda/bin/python3.10 47400MiB |
| 3 N/A N/A 313853 C /opt/conda/bin/python3.10 51520MiB |
+---------------------------------------------------------------------------------------+
Test:
What's Large Language Model? answer in 3 bullet points
Response:
1. A large language model is a type of artificial intelligence model that has been trained on a vast amount of text data to generate human-like text.
2. These models use machine learning algorithms to analyze patterns in the data and learn how to produce coherent and contextually relevant responses to a wide range of prompts.
3. Large language models can be used for a variety of natural language processing tasks, such as text generation, translation, summarization, and question answering, and are often used in virtual assistants, chatbots, and other conversational AI applications.
| Parameter | Value |
|--------------------|----------------------------------|
| Model | MaziyarPanahi/miqu-1-70b-sf-GPTQ |
| Sequence Length | 10 |
| Decode Length | 8 |
| Top N Tokens | None |
| N Runs | 10 |
| Warmups | 10 |
| Temperature | None |
| Top K | None |
| Top P | None |
| Typical P | None |
| Repetition Penalty | None |
| Watermark | false |
| Do Sample | false |
| Step | Batch Size | Average | Lowest | Highest | p50 | p90 | p99 |
|----------------|------------|-----------|-----------|-----------|-----------|-----------|-----------|
| Prefill | 1 | 45.41 ms | 45.25 ms | 45.93 ms | 45.35 ms | 45.93 ms | 45.93 ms |
| | 2 | 59.71 ms | 58.65 ms | 66.89 ms | 58.96 ms | 66.89 ms | 66.89 ms |
| | 4 | 84.12 ms | 83.24 ms | 85.01 ms | 84.30 ms | 85.01 ms | 85.01 ms |
| | 8 | 104.74 ms | 102.54 ms | 114.02 ms | 102.91 ms | 114.02 ms | 114.02 ms |
| | 16 | 139.02 ms | 136.19 ms | 147.54 ms | 137.76 ms | 147.54 ms | 147.54 ms |
| | 32 | 207.08 ms | 203.38 ms | 213.40 ms | 205.43 ms | 213.40 ms | 213.40 ms |
| | 64 | 342.57 ms | 342.08 ms | 343.25 ms | 342.64 ms | 343.25 ms | 343.25 ms |
| | 128 | 629.88 ms | 629.04 ms | 630.63 ms | 630.18 ms | 630.63 ms | 630.63 ms |
| Decode (token) | 1 | 37.28 ms | 35.48 ms | 39.80 ms | 37.70 ms | 35.82 ms | 35.82 ms |
| | 2 | 38.19 ms | 36.31 ms | 40.41 ms | 38.23 ms | 38.17 ms | 38.17 ms |
| | 4 | 37.38 ms | 36.12 ms | 38.88 ms | 37.67 ms | 36.38 ms | 36.38 ms |
| | 8 | 38.35 ms | 36.94 ms | 41.21 ms | 38.19 ms | 39.34 ms | 39.34 ms |
| | 16 | 48.95 ms | 47.28 ms | 51.23 ms | 49.03 ms | 49.63 ms | 49.63 ms |
| | 32 | 73.37 ms | 72.74 ms | 74.33 ms | 73.37 ms | 72.94 ms | 72.94 ms |
| | 64 | 102.43 ms | 102.29 ms | 102.62 ms | 102.45 ms | 102.30 ms | 102.30 ms |
| | 128 | 131.91 ms | 131.74 ms | 131.99 ms | 131.93 ms | 131.99 ms | 131.99 ms |
| Decode (total) | 1 | 260.95 ms | 248.35 ms | 278.60 ms | 263.92 ms | 250.75 ms | 250.75 ms |
| | 2 | 267.34 ms | 254.16 ms | 282.88 ms | 267.59 ms | 267.23 ms | 267.23 ms |
| | 4 | 261.67 ms | 252.85 ms | 272.19 ms | 263.67 ms | 254.64 ms | 254.64 ms |
| | 8 | 268.42 ms | 258.59 ms | 288.47 ms | 267.33 ms | 275.39 ms | 275.39 ms |
| | 16 | 342.65 ms | 330.96 ms | 358.62 ms | 343.24 ms | 347.44 ms | 347.44 ms |
| | 32 | 513.62 ms | 509.20 ms | 520.30 ms | 513.58 ms | 510.60 ms | 510.60 ms |
| | 64 | 717.02 ms | 716.04 ms | 718.36 ms | 717.18 ms | 716.07 ms | 716.07 ms |
| | 128 | 923.36 ms | 922.17 ms | 923.91 ms | 923.54 ms | 923.91 ms | 923.91 ms |
| Step | Batch Size | Average | Lowest | Highest |
|---------|------------|--------------------|--------------------|--------------------|
| Prefill | 1 | 22.02 tokens/secs | 21.77 tokens/secs | 22.10 tokens/secs |
| | 2 | 33.55 tokens/secs | 29.90 tokens/secs | 34.10 tokens/secs |
| | 4 | 47.55 tokens/secs | 47.05 tokens/secs | 48.05 tokens/secs |
| | 8 | 76.46 tokens/secs | 70.16 tokens/secs | 78.02 tokens/secs |
| | 16 | 115.17 tokens/secs | 108.45 tokens/secs | 117.48 tokens/secs |
| | 32 | 154.58 tokens/secs | 149.95 tokens/secs | 157.34 tokens/secs |
| | 64 | 186.82 tokens/secs | 186.46 tokens/secs | 187.09 tokens/secs |
| | 128 | 203.21 tokens/secs | 202.97 tokens/secs | 203.48 tokens/secs |
| Decode | 1 | 26.87 tokens/secs | 25.13 tokens/secs | 28.19 tokens/secs |
| | 2 | 52.43 tokens/secs | 49.49 tokens/secs | 55.08 tokens/secs |
| | 4 | 107.09 tokens/secs | 102.87 tokens/secs | 110.74 tokens/secs |
| | 8 | 208.98 tokens/secs | 194.13 tokens/secs | 216.56 tokens/secs |
| | 16 | 327.11 tokens/secs | 312.31 tokens/secs | 338.41 tokens/secs |
| | 32 | 436.14 tokens/secs | 430.52 tokens/secs | 439.90 tokens/secs |
| | 64 | 624.81 tokens/secs | 623.65 tokens/secs | 625.66 tokens/secs |
| | 128 | 970.37 tokens/secs | 969.79 tokens/secs | 971.62 tokens/secs |
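The throughput table appears to follow from the latency table as batch size divided by average latency. A small sketch to check that, using a few averages copied from the tables above (the function name is just for illustration):

```python
# (batch size -> average latency in ms), copied from the latency table above.
decode_per_token_ms = {1: 37.28, 8: 38.35, 128: 131.91}
prefill_ms = {1: 45.41, 128: 629.88}


def tokens_per_second(batch_size: int, latency_ms: float) -> float:
    # One token per sequence in the batch is produced every `latency_ms`.
    return batch_size / (latency_ms / 1000.0)


for name, table in (("decode", decode_per_token_ms), ("prefill", prefill_ms)):
    for batch_size, latency in table.items():
        rate = tokens_per_second(batch_size, latency)
        print(f"{name:8s} batch {batch_size:3d}: {rate:7.2f} tokens/sec")
```

The results land close to the reported averages. Small differences are expected, since the benchmark presumably averages per-run rates rather than dividing by the average latency.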
So it needs 480 GB of VRAM?
I only have 422 GB of VRAM...
Add --num-shard 4
and then TGI works.
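For reference, a sketch of how the docker command from earlier in the thread looks with sharding added. The helper name and its defaults are hypothetical; the flags are the TGI launcher options that show up in the args dump above (num_shard, max_input_length, max_total_tokens), and the image tag is the one already used in the thread.

```python
import shlex
from typing import List, Optional


def tgi_docker_command(model_dir: str, port: int = 8001, num_shard: int = 4,
                       max_input_length: Optional[int] = None,
                       max_total_tokens: Optional[int] = None) -> List[str]:
    """Build the docker argv for TGI (hypothetical helper, for illustration)."""
    cmd = [
        "docker", "run", "--gpus", "all", "--shm-size", "1g",
        "-p", f"{port}:{port}",
        "-v", f"{model_dir}:/model",
        "ghcr.io/huggingface/text-generation-inference:1.4",
        "--model-id", "/model",
        "--quantize", "gptq",
        "--num-shard", str(num_shard),  # split the model across this many GPUs
        "--hostname", "0.0.0.0",
        "--port", str(port),
    ]
    # Optional limits, only added when the caller asks for them.
    if max_input_length is not None:
        cmd += ["--max-input-length", str(max_input_length)]
    if max_total_tokens is not None:
        cmd += ["--max-total-tokens", str(max_total_tokens)]
    return cmd


print(shlex.join(tgi_docker_command("/home/tutu/models/miqu-1-70b-sf-GPTQ")))
```

Passing the list to `subprocess.run(cmd, check=True)` would launch it; printing it first makes it easy to compare against the command that ran out of VRAM.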
It needs less than 200 GB of VRAM to load the model. If you need larger batches or longer sequences, TGI can use the rest of the memory, provided you allow it via cuda_memory_fraction.
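As a quick sanity check, the per-GPU usage from the nvidia-smi output above can be added up. Note that this is everything the TGI processes hold after warmup, which presumably includes some pre-allocated cache and not only the weights.

```python
# Memory in use per GPU (MiB), copied from the nvidia-smi output above.
used_mib = [47394, 47402, 47402, 51522]

total_mib = sum(used_mib)
print(f"{total_mib} MiB in use across {len(used_mib)} GPUs "
      f"= {total_mib / 1024:.1f} GiB")
```

That comes to roughly 189 GiB across the four A100s, far below the 480 GB asked about.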
vLLM recently added support for 2-bit GPTQ quantization. Any chance this will run in 24 GB of VRAM at 2 bits? AFAIK GGUF and EXL can fit, but they are slow.
@ceoofcapybaras
Can the 4-bit GPTQ be automatically converted to 2-bit in vLLM, or do I have to quantize to 2-bit with GPTQ myself? (I've never tried it in AutoGPTQ, to be honest; it must be new.)
Hello, tutu
Late reply, but @ceoofcapybaras, if you need 2-bit, try https://huggingface.co/AlexWortega/miqu-1-70b-AQLM-2Bit-1x16-hf with --quantization aqlm in vLLM. It works well in my personal evals and easily fits on a single 3090/4090. It runs at about 8 tokens per second for a simple prompt like "write a story about X" (i.e. no prefill, batch size 1).
It seems to live up to its "SoTA 2-bit quantization" claim, at least relative to exl2, which is unusable (quality-wise) at 2 bits.
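A minimal sketch of what that launch could look like through vLLM's OpenAI-compatible server entrypoint. The helper name is hypothetical, and whether `aqlm` is accepted depends on the installed vLLM version, so treat this as an assumption to verify against your install.

```python
import shlex
from typing import List


def vllm_server_command(model: str, quantization: str = "aqlm") -> List[str]:
    """Build the argv for vLLM's OpenAI-compatible server (illustrative helper)."""
    return [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model,
        "--quantization", quantization,  # ASSUMES this vLLM build supports it
    ]


cmd = vllm_server_command("AlexWortega/miqu-1-70b-AQLM-2Bit-1x16-hf")
print(shlex.join(cmd))
```

Running the printed command on a machine with vLLM installed should start the server, assuming that version still ships AQLM support.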