MLC-LLM / README.md
Arun Kumar Tiwary
Update README.md
681be52 verified
# Setup MLC-LLM on CPU on UBUNTU 22.04 LTS
```sh
sudo apt update
sudo apt install ocl-icd-opencl-dev clinfo vulkan-tools
python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly mlc-ai-nightly
mlc_llm chat HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC
___________________________________________________________________________________________________________________________________________
$ mlc_llm --help
usage: MLC LLM Command Line Interface. [-h] {compile,convert_weight,gen_config,chat,serve,bench,package}
positional arguments:
{compile,convert_weight,gen_config,chat,serve,bench,package}
Subcommand to to run. (choices: compile, convert_weight, gen_config, chat, serve, bench, package)
options:
-h, --help show this help message and exit
____________________________________________________________________________________________________________________________________________
$ mlc_llm chat --help
usage: MLC LLM Chat CLI [-h] [--opt OPT] [--device DEVICE] [--overrides OVERRIDES] [--model-lib MODEL_LIB] model
positional arguments:
model A path to ``mlc-chat-config.json``, or an MLC model directory that contains `mlc-chat-config.json`.
It can also be a link to a HF repository pointing to an MLC compiled model. (required)
options:
-h, --help show this help message and exit
--opt OPT Optimization flags. MLC LLM maintains a predefined set of optimization flags, denoted as O0, O1, O2,
O3, where O0 means no optimization, O2 means majority of them, and O3 represents extreme
optimization that could potentially break the system. Meanwhile, optimization flags could be
explicitly specified via details knobs, e.g. --opt="cublas_gemm=1;cudagraph=0". (default: "O2")
--device DEVICE The device used to deploy the model such as "cuda" or "cuda:0". Will detect from local available
GPUs if not specified. (default: "auto")
--overrides OVERRIDES
Chat configuration override. Configurations to override ChatConfig. Supports `conv_template`,
`context_window_size`, `prefill_chunk_size`, `sliding_window_size`, `attention_sink_size`,
`max_batch_size` and `tensor_parallel_shards`. Meanwhile, model chat could be explicitly specified
via details knobs, e.g. --overrides "context_window_size=1024;prefill_chunk_size=128". (default: "")
--model-lib MODEL_LIB
The full path to the model library file to use (e.g. a ``.so`` file). If unspecified, we will use
the provided ``model`` to search over possible paths. It the model lib is not found, it will be
compiled in a JIT manner. (default: "None")
------------------------------------------------------------------------------------------------------------------------------------------
$ mlc_llm compile --help
usage: mlc_llm compile [-h]
[--quantization {q0f16,q0f32,q3f16_0,q3f16_1,q4f16_0,q4f16_1,q4f32_1,q4f16_2,q4f16_autoawq,q4f16_ft,e5m2_e5m2_f16,e4m3_e4m3_f16}]
[--model-type {auto,llama,mistral,gemma,gpt2,mixtral,gpt_neox,gpt_bigcode,phi-msft,phi,phi3,qwen,qwen2,stablelm,baichuan,internlm,rwkv5,orion,llava,rwkv6,chatglm,eagle,bert,medusa}]
[--device DEVICE] [--host HOST] [--opt OPT] [--system-lib-prefix SYSTEM_LIB_PREFIX] --output OUTPUT
[--overrides OVERRIDES] [--debug-dump DEBUG_DUMP]
model
positional arguments:
model A path to ``mlc-chat-config.json``, or an MLC model directory that contains `mlc-chat-config.json`.
It can also be a link to a HF repository pointing to an MLC compiled model. (required)
options:
-h, --help show this help message and exit
--quantization {q0f16,q0f32,q3f16_0,q3f16_1,q4f16_0,q4f16_1,q4f32_1,q4f16_2,q4f16_autoawq,q4f16_ft,e5m2_e5m2_f16,e4m3_e4m3_f16}
The quantization mode we use to compile. If unprovided, will infer from `model`. (default: look up
mlc-chat-config.json, choices: q0f16, q0f32, q3f16_0, q3f16_1, q4f16_0, q4f16_1, q4f32_1, q4f16_2,
q4f16_autoawq, q4f16_ft, e5m2_e5m2_f16, e4m3_e4m3_f16)
--model-type {auto,llama,mistral,gemma,gpt2,mixtral,gpt_neox,gpt_bigcode,phi-msft,phi,phi3,qwen,qwen2,stablelm,baichuan,internlm,rwkv5,orion,llava,rwkv6,chatglm,eagle,bert,medusa}
Model architecture such as "llama". If not set, it is inferred from `mlc-chat-config.json`.
(default: "auto")
--device DEVICE The GPU device to compile the model to. If not set, it is inferred from GPUs available locally.
(default: "auto")
--host HOST The host LLVM triple to compile the model to. If not set, it is inferred from the local CPU and OS.
Examples of the LLVM triple: 1) iPhones: arm64-apple-ios; 2) ARM64 Android phones: aarch64-linux-
android; 3) WebAssembly: wasm32-unknown-unknown-wasm; 4) Windows: x86_64-pc-windows-msvc; 5) ARM
macOS: arm64-apple-darwin. (default: "auto")
--opt OPT Optimization flags. MLC LLM maintains a predefined set of optimization flags, denoted as O0, O1, O2,
O3, where O0 means no optimization, O2 means majority of them, and O3 represents extreme
optimization that could potentially break the system. Meanwhile, optimization flags could be
explicitly specified via details knobs, e.g. --opt="cublas_gemm=1;cudagraph=0". (default: "O2")
--system-lib-prefix SYSTEM_LIB_PREFIX
Adding a prefix to all symbols exported. Similar to "objcopy --prefix-symbols". This is useful when
compiling multiple models into a single library to avoid symbol conflicts. Different from objcopy,
this takes no effect for shared library. (default: "auto")
--output OUTPUT, -o OUTPUT
The path to the output file. The suffix determines if the output file is a shared library or
objects. Available suffixes: 1) Linux: .so (shared), .tar (objects); 2) macOS: .dylib (shared), .tar
(objects); 3) Windows: .dll (shared), .tar (objects); 4) Android, iOS: .tar (objects); 5) Web: .wasm
(web assembly). (required)
--overrides OVERRIDES
Model configuration override. Configurations to override `mlc-chat-config.json`. Supports
`context_window_size`, `prefill_chunk_size`, `sliding_window_size`, `attention_sink_size`,
`max_batch_size` and `tensor_parallel_shards`. Meanwhile, model config could be explicitly specified
via details knobs, e.g. --overrides "context_window_size=1024;prefill_chunk_size=128". (default: "")
--debug-dump DEBUG_DUMP
Specifies the directory where the compiler will store its IRs for debugging purposes during various
phases of compilation. By default, this is set to `None`, indicating that debug dumping is disabled.
(default: None)
____________________________________________________________________________________________________________________________________________
$ mlc_llm convert_weight --help
usage: MLC AutoLLM Quantization Framework [-h] --quantization
{q0f16,q0f32,q3f16_0,q3f16_1,q4f16_0,q4f16_1,q4f32_1,q4f16_2,q4f16_autoawq,q4f16_ft,e5m2_e5m2_f16,e4m3_e4m3_f16}
[--model-type {auto,llama,mistral,gemma,gpt2,mixtral,gpt_neox,gpt_bigcode,phi-msft,phi,phi3,qwen,qwen2,stablelm,baichuan,internlm,rwkv5,orion,llava,rwkv6,chatglm,eagle,bert,medusa}]
[--device DEVICE] [--source SOURCE]
[--source-format {auto,huggingface-torch,huggingface-safetensor,awq}] --output
OUTPUT
config
positional arguments:
config 1) Path to a HuggingFace model directory that contains a `config.json` or 2) Path to `config.json`
in HuggingFace format, or 3) The name of a pre-defined model architecture. A `config.json` file in
HuggingFace format defines the model architecture, including the vocabulary size, the number of
layers, the hidden size, number of attention heads, etc. Example:
https://huggingface.co./codellama/CodeLlama-7b-hf/blob/main/config.json. A HuggingFace directory
often contains a `config.json` which defines the model architecture, the non-quantized model weights
in PyTorch or SafeTensor format, tokenizer configurations, as well as an optional
`generation_config.json` provides additional default configuration for text generation. Example:
https://huggingface.co./codellama/CodeLlama-7b-hf/tree/main. (required)
options:
-h, --help show this help message and exit
--quantization {q0f16,q0f32,q3f16_0,q3f16_1,q4f16_0,q4f16_1,q4f32_1,q4f16_2,q4f16_autoawq,q4f16_ft,e5m2_e5m2_f16,e4m3_e4m3_f16}
The quantization mode we use to compile. If unprovided, will infer from `model`. (required, choices:
q0f16, q0f32, q3f16_0, q3f16_1, q4f16_0, q4f16_1, q4f32_1, q4f16_2, q4f16_autoawq, q4f16_ft,
e5m2_e5m2_f16, e4m3_e4m3_f16)
--model-type {auto,llama,mistral,gemma,gpt2,mixtral,gpt_neox,gpt_bigcode,phi-msft,phi,phi3,qwen,qwen2,stablelm,baichuan,internlm,rwkv5,orion,llava,rwkv6,chatglm,eagle,bert,medusa}
Model architecture such as "llama". If not set, it is inferred from `mlc-chat-config.json`.
(default: "auto")
--device DEVICE The device used to do quantization such as "cuda" or "cuda:0". Will detect from local available GPUs
if not specified. (default: "auto")
--source SOURCE The path to original model weight, infer from `config` if missing. (default: "auto")
--source-format {auto,huggingface-torch,huggingface-safetensor,awq}
The format of source model weight, infer from `config` if missing. (default: "auto", choices: auto,
huggingface-torch, huggingface-safetensor, awq")
--output OUTPUT, -o OUTPUT
The output directory to save the quantized model weight. Will create `params_shard_*.bin` and
`ndarray-cache.json` in this directory. (required)
--------------------------------------------------------------------------------------------------------------------------------
$mlc_llm serve --help
usage: MLC LLM Serve CLI [-h] [--device DEVICE] [--model-lib MODEL_LIB] [--mode {local,interactive,server}]
[--additional-models [ADDITIONAL_MODELS ...]] [--max-batch-size MAX_BATCH_SIZE]
[--max-total-seq-length MAX_TOTAL_SEQ_LENGTH] [--prefill-chunk-size PREFILL_CHUNK_SIZE]
[--max-history-size MAX_HISTORY_SIZE] [--gpu-memory-utilization GPU_MEMORY_UTILIZATION]
[--speculative-mode {disable,small_draft,eagle,medusa}] [--spec-draft-length SPEC_DRAFT_LENGTH]
[--enable-tracing] [--host HOST] [--port PORT] [--allow-credentials]
[--allow-origins ALLOW_ORIGINS] [--allow-methods ALLOW_METHODS] [--allow-headers ALLOW_HEADERS]
model
positional arguments:
model A path to ``mlc-chat-config.json``, or an MLC model directory that contains `mlc-chat-config.json`.
It can also be a link to a HF repository pointing to an MLC compiled model. (required)
options:
-h, --help show this help message and exit
--device DEVICE The device used to deploy the model such as "cuda" or "cuda:0". Will detect from local available
GPUs if not specified. (default: "auto")
--model-lib MODEL_LIB
The full path to the model library file to use (e.g. a ``.so`` file). If unspecified, we will use
the provided ``model`` to search over possible paths. It the model lib is not found, it will be
compiled in a JIT manner. (default: "None")
--mode {local,interactive,server}
The engine mode in MLC LLM. We provide three preset modes: "local", "interactive" and "server". The
default mode is "local". The choice of mode decides the values of "--max-batch-size", "--max-total-
seq-length" and "--prefill-chunk-size" when they are not explicitly specified. 1. Mode "local"
refers to the local server deployment which has low request concurrency. So the max batch size will
be set to 4, and max total sequence length and prefill chunk size are set to the context window size
(or sliding window size) of the model. 2. Mode "interactive" refers to the interactive use of
server, which has at most 1 concurrent request. So the max batch size will be set to 1, and max
total sequence length and prefill chunk size are set to the context window size (or sliding window
size) of the model. 3. Mode "server" refers to the large server use case which may handle many
concurrent request and want to use GPU memory as much as possible. In this mode, we will
automatically infer the largest possible max batch size and max total sequence length. You can
manually specify arguments "--max-batch-size", "--max-total-seq-length" and "--prefill-chunk-size"
to override the automatic inferred values. (default: "local")
--additional-models [ADDITIONAL_MODELS ...]
The model paths and (optional) model library paths of additional models (other than the main model).
When engine is enabled with speculative decoding, additional models are needed. The way of
specifying additional models is: "--additional-models model_path_1 model_path_2 ..." or "--
additional-models model_path_1:model_lib_1 model_path_2 ...". When the model lib of a model is not
given, JIT model compilation will be activated to compile the model automatically.
--max-batch-size MAX_BATCH_SIZE
The maximum allowed batch size set for the KV cache to concurrently support.
--max-total-seq-length MAX_TOTAL_SEQ_LENGTH
The KV cache total token capacity, i.e., the maximum total number of tokens that the KV cache
support. This decides the GPU memory size that the KV cache consumes. If not specified, system will
automatically estimate the maximum capacity based on the vRAM size on GPU.
--prefill-chunk-size PREFILL_CHUNK_SIZE
The maximum number of tokens the model passes for prefill each time. It should not exceed the
prefill chunk size in model config. If not specified, this defaults to the prefill chunk size in
model config.
--max-history-size MAX_HISTORY_SIZE
The maximum history length for rolling back the RNN state. If unspecified, the default value is 1.
KV cache does not need this.
--gpu-memory-utilization GPU_MEMORY_UTILIZATION
A number in (0, 1) denoting the fraction of GPU memory used by the server in total. It is used to
infer to maximum possible KV cache capacity. When it is unspecified, it defaults to 0.85. Under mode
"local" or "interactive", the actual memory usage may be significantly smaller than this number.
Under mode "server", the actual memory usage may be slightly larger than this number.
--speculative-mode {disable,small_draft,eagle,medusa}
The speculative decoding mode. Right now three options are supported: - "disable", where speculative
decoding is not enabled, - "small_draft", denoting the normal speculative decoding (small draft)
style, - "eagle", denoting the eagle-style speculative decoding. The default mode is "disable".
(default: "disable")
--spec-draft-length SPEC_DRAFT_LENGTH
The number of draft tokens to generate in speculative proposal. The default values is 4.
--enable-tracing Enable Chrome Tracing for the server. After enabling, you can send POST request to the
"debug/dump_event_trace" entrypoint to get the Chrome Trace. For example, "curl -X POST
http://127.0.0.1:8000/debug/dump_event_trace -H "Content-Type: application/json" -d '{"model":
"dist/llama"}'"
--host HOST host name (default: "127.0.0.1")
--port PORT port (default: "8000")
--allow-credentials allow credentials
--allow-origins ALLOW_ORIGINS
allowed origins (default: "['*']")
--allow-methods ALLOW_METHODS
allowed methods (default: "['*']")
--allow-headers ALLOW_HEADERS
allowed headers (default: "['*']")
_________________________________________________________________________________________________________________________________________
$ mlc_llm gen_config --help
usage: MLC LLM Configuration Generator [-h] --quantization
{q0f16,q0f32,q3f16_0,q3f16_1,q4f16_0,q4f16_1,q4f32_1,q4f16_2,q4f16_autoawq,q4f16_ft,e5m2_e5m2_f16,e4m3_e4m3_f16}
[--model-type {auto,llama,mistral,gemma,gpt2,mixtral,gpt_neox,gpt_bigcode,phi-msft,phi,phi3,qwen,qwen2,stablelm,baichuan,internlm,rwkv5,orion,llava,rwkv6,chatglm,eagle,bert,medusa}]
--conv-template
{llama-3,custom,open_hermes_mistral,vicuna_v1.1,gorilla,gorilla-openfunctions-v2,llava,gpt2,minigpt,stablecode_completion,conv_one_shot,llama-2,stablelm-3b,guanaco,LM,rwkv_world,gpt_bigcode,codellama_instruct,phi-2,phi-3,wizardlm_7b,stablelm-2,mistral_default,redpajama_chat,oasst,stablelm,llama_default,moss,gemma_instruction,neural_hermes_mistral,rwkv,stablecode_instruct,codellama_completion,wizard_coder_or_math,chatml,orion,glm,dolly}
[--context-window-size CONTEXT_WINDOW_SIZE]
[--sliding-window-size SLIDING_WINDOW_SIZE] [--prefill-chunk-size PREFILL_CHUNK_SIZE]
[--attention-sink-size ATTENTION_SINK_SIZE]
[--tensor-parallel-shards TENSOR_PARALLEL_SHARDS] [--max-batch-size MAX_BATCH_SIZE]
--output OUTPUT
config
positional arguments:
config 1) Path to a HuggingFace model directory that contains a `config.json` or 2) Path to `config.json`
in HuggingFace format, or 3) The name of a pre-defined model architecture. A `config.json` file in
HuggingFace format defines the model architecture, including the vocabulary size, the number of
layers, the hidden size, number of attention heads, etc. Example:
https://huggingface.co./codellama/CodeLlama-7b-hf/blob/main/config.json. A HuggingFace directory
often contains a `config.json` which defines the model architecture, the non-quantized model weights
in PyTorch or SafeTensor format, tokenizer configurations, as well as an optional
`generation_config.json` provides additional default configuration for text generation. Example:
https://huggingface.co./codellama/CodeLlama-7b-hf/tree/main. (required)
options:
-h, --help show this help message and exit
--quantization {q0f16,q0f32,q3f16_0,q3f16_1,q4f16_0,q4f16_1,q4f32_1,q4f16_2,q4f16_autoawq,q4f16_ft,e5m2_e5m2_f16,e4m3_e4m3_f16}
The quantization mode we use to compile. If unprovided, will infer from `model`. (required, choices:
q0f16, q0f32, q3f16_0, q3f16_1, q4f16_0, q4f16_1, q4f32_1, q4f16_2, q4f16_autoawq, q4f16_ft,
e5m2_e5m2_f16, e4m3_e4m3_f16)
--model-type {auto,llama,mistral,gemma,gpt2,mixtral,gpt_neox,gpt_bigcode,phi-msft,phi,phi3,qwen,qwen2,stablelm,baichuan,internlm,rwkv5,orion,llava,rwkv6,chatglm,eagle,bert,medusa}
Model architecture such as "llama". If not set, it is inferred from `mlc-chat-config.json`.
(default: "auto", choices: auto, llama, mistral, gemma, gpt2, mixtral, gpt_neox, gpt_bigcode, phi-
msft, phi, phi3, qwen, qwen2, stablelm, baichuan, internlm, rwkv5, orion, llava, rwkv6, chatglm,
eagle, bert, medusa)
--conv-template {llama-3,custom,open_hermes_mistral,vicuna_v1.1,gorilla,gorilla-openfunctions-v2,llava,gpt2,minigpt,stablecode_completion,conv_one_shot,llama-2,stablelm-3b,guanaco,LM,rwkv_world,gpt_bigcode,codellama_instruct,phi-2,phi-3,wizardlm_7b,stablelm-2,mistral_default,redpajama_chat,oasst,stablelm,llama_default,moss,gemma_instruction,neural_hermes_mistral,rwkv,stablecode_instruct,codellama_completion,wizard_coder_or_math,chatml,orion,glm,dolly}
Conversation template. It depends on how the model is tuned. Use "LM" for vanilla base model
(required, choices: llama-3, custom, open_hermes_mistral, vicuna_v1.1, gorilla, gorilla-
openfunctions-v2, llava, gpt2, minigpt, stablecode_completion, conv_one_shot, llama-2, stablelm-3b,
guanaco, LM, rwkv_world, gpt_bigcode, codellama_instruct, phi-2, phi-3, wizardlm_7b, stablelm-2,
mistral_default, redpajama_chat, oasst, stablelm, llama_default, moss, gemma_instruction,
neural_hermes_mistral, rwkv, stablecode_instruct, codellama_completion, wizard_coder_or_math,
chatml, orion, glm, dolly)
--context-window-size CONTEXT_WINDOW_SIZE
Option to provide the maximum sequence length supported by the model. This is usually explicitly
shown as context length or context window in the model card. If this option is not set explicitly,
by default, it will be determined by `context_window_size` or `max_position_embeddings` in
`config.json`, and the latter is usually inaccurate for some models. (default: "None")
--sliding-window-size SLIDING_WINDOW_SIZE
(Experimental) The sliding window size in sliding window attention (SWA). This optional field
overrides the `sliding_window_size` in config.json for those models that use SWA. Currently only
useful when compiling Mistral. This flag subjects to future refactoring. (default: "None")
--prefill-chunk-size PREFILL_CHUNK_SIZE
(Experimental) The chunk size during prefilling. By default, the chunk size is the same as sliding
window or max sequence length. This flag subjects to future refactoring. (default: "None")
--attention-sink-size ATTENTION_SINK_SIZE
(Experimental) The number of stored sinks. Only supported on Mistral yet. By default, the number of
sinks is 4. This flag subjects to future refactoring. (default: "None")
--tensor-parallel-shards TENSOR_PARALLEL_SHARDS
Number of shards to split the model into in tensor parallelism multi-gpu inference. (default:
"None")
--max-batch-size MAX_BATCH_SIZE
The maximum allowed batch size set for the KV cache to concurrently support. (default: "80")
--output OUTPUT, -o OUTPUT
The output directory for generated configurations, including `mlc-chat-config.json` and tokenizer
configuration. (required)
________________________________________________________________________________________________________________________________________
$ mlc_llm bench --help
usage: MLC LLM Chat CLI [-h] [--prompt PROMPT] [--opt OPT] [--device DEVICE] [--overrides OVERRIDES]
[--generate-length GENERATE_LENGTH] [--model-lib MODEL_LIB]
model
positional arguments:
model A path to ``mlc-chat-config.json``, or an MLC model directory that contains `mlc-chat-config.json`.
It can also be a link to a HF repository pointing to an MLC compiled model. (required)
options:
-h, --help show this help message and exit
--prompt PROMPT The prompt of the text generation. (default: "What is the meaning of life?")
--opt OPT Optimization flags. MLC LLM maintains a predefined set of optimization flags, denoted as O0, O1, O2,
O3, where O0 means no optimization, O2 means majority of them, and O3 represents extreme
optimization that could potentially break the system. Meanwhile, optimization flags could be
explicitly specified via details knobs, e.g. --opt="cublas_gemm=1;cudagraph=0". (default: "O2")
--device DEVICE The device used to deploy the model such as "cuda" or "cuda:0". Will detect from local available
GPUs if not specified. (default: "auto")
--overrides OVERRIDES
Chat configuration override. Configurations to override ChatConfig. Supports `conv_template`,
`context_window_size`, `prefill_chunk_size`, `sliding_window_size`, `attention_sink_size`,
`max_batch_size` and `tensor_parallel_shards`. Meanwhile, model chat could be explicitly specified
via details knobs, e.g. --overrides "context_window_size=1024;prefill_chunk_size=128". (default: "")
--generate-length GENERATE_LENGTH
The target length of the text generation. (default: "256")
--model-lib MODEL_LIB
The full path to the model library file to use (e.g. a ``.so`` file). If unspecified, we will use
the provided ``model`` to search over possible paths. It the model lib is not found, it will be
compiled in a JIT manner. (default: "None")
__________________________________________________________________________________________________________________________________________
$ mlc_llm package --help
usage: MLC LLM Package CLI [-h] [--package-config PACKAGE_CONFIG] [--mlc-llm-home MLC_LLM_HOME] [--output OUTPUT]
options:
-h, --help show this help message and exit
--package-config PACKAGE_CONFIG
The path to "mlc-package-config.json" which is used for package build. See "https://github.com/mlc-
ai/mlc-llm/blob/main/ios/MLCChat/mlc-package-config.json" as an example. (default: "mlc-package-
config.json")
--mlc-llm-home MLC_LLM_HOME
The source code path to MLC LLM. (default: the $MLC_LLM_HOME environment variable)
--output OUTPUT, -o OUTPUT
The path of output directory for the package build outputs. (default: "dist")