# Setup MLC-LLM on CPU on Ubuntu 22.04 LTS

```sh
sudo apt update
sudo apt install ocl-icd-opencl-dev clinfo vulkan-tools
python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly mlc-ai-nightly
mlc_llm chat HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC
____________________________________________________________________________________________________
$ mlc_llm --help
usage: MLC LLM Command Line Interface. [-h] {compile,convert_weight,gen_config,chat,serve,bench,package}

positional arguments:
  {compile,convert_weight,gen_config,chat,serve,bench,package}
                        Subcommand to to run. (choices: compile, convert_weight, gen_config, chat, serve,
                        bench, package)

options:
  -h, --help            show this help message and exit
____________________________________________________________________________________________________
$ mlc_llm chat --help
usage: MLC LLM Chat CLI [-h] [--opt OPT] [--device DEVICE] [--overrides OVERRIDES] [--model-lib MODEL_LIB] model

positional arguments:
  model                 A path to ``mlc-chat-config.json``, or an MLC model directory that contains
                        `mlc-chat-config.json`. It can also be a link to a HF repository pointing to an MLC
                        compiled model. (required)

options:
  -h, --help            show this help message and exit
  --opt OPT             Optimization flags. MLC LLM maintains a predefined set of optimization flags, denoted
                        as O0, O1, O2, O3, where O0 means no optimization, O2 means majority of them, and O3
                        represents extreme optimization that could potentially break the system. Meanwhile,
                        optimization flags could be explicitly specified via details knobs, e.g.
                        --opt="cublas_gemm=1;cudagraph=0". (default: "O2")
  --device DEVICE       The device used to deploy the model such as "cuda" or "cuda:0". Will detect from
                        local available GPUs if not specified. (default: "auto")
  --overrides OVERRIDES
                        Chat configuration override. Configurations to override ChatConfig. Supports
                        `conv_template`, `context_window_size`, `prefill_chunk_size`, `sliding_window_size`,
                        `attention_sink_size`, `max_batch_size` and `tensor_parallel_shards`. Meanwhile, model
                        chat could be explicitly specified via details knobs, e.g. --overrides
                        "context_window_size=1024;prefill_chunk_size=128". (default: "")
  --model-lib MODEL_LIB
                        The full path to the model library file to use (e.g. a ``.so`` file). If unspecified,
                        we will use the provided ``model`` to search over possible paths. It the model lib is
                        not found, it will be compiled in a JIT manner. (default: "None")
____________________________________________________________________________________________________
$ mlc_llm compile --help
usage: mlc_llm compile [-h]
                       [--quantization {q0f16,q0f32,q3f16_0,q3f16_1,q4f16_0,q4f16_1,q4f32_1,q4f16_2,q4f16_autoawq,q4f16_ft,e5m2_e5m2_f16,e4m3_e4m3_f16}]
                       [--model-type {auto,llama,mistral,gemma,gpt2,mixtral,gpt_neox,gpt_bigcode,phi-msft,phi,phi3,qwen,qwen2,stablelm,baichuan,internlm,rwkv5,orion,llava,rwkv6,chatglm,eagle,bert,medusa}]
                       [--device DEVICE] [--host HOST] [--opt OPT] [--system-lib-prefix SYSTEM_LIB_PREFIX]
                       --output OUTPUT [--overrides OVERRIDES] [--debug-dump DEBUG_DUMP]
                       model

positional arguments:
  model                 A path to ``mlc-chat-config.json``, or an MLC model directory that contains
                        `mlc-chat-config.json`. It can also be a link to a HF repository pointing to an MLC
                        compiled model. (required)

options:
  -h, --help            show this help message and exit
  --quantization {q0f16,q0f32,q3f16_0,q3f16_1,q4f16_0,q4f16_1,q4f32_1,q4f16_2,q4f16_autoawq,q4f16_ft,e5m2_e5m2_f16,e4m3_e4m3_f16}
                        The quantization mode we use to compile. If unprovided, will infer from `model`.
                        (default: look up mlc-chat-config.json, choices: q0f16, q0f32, q3f16_0, q3f16_1,
                        q4f16_0, q4f16_1, q4f32_1, q4f16_2, q4f16_autoawq, q4f16_ft, e5m2_e5m2_f16,
                        e4m3_e4m3_f16)
  --model-type {auto,llama,mistral,gemma,gpt2,mixtral,gpt_neox,gpt_bigcode,phi-msft,phi,phi3,qwen,qwen2,stablelm,baichuan,internlm,rwkv5,orion,llava,rwkv6,chatglm,eagle,bert,medusa}
                        Model architecture such as "llama". If not set, it is inferred from
                        `mlc-chat-config.json`. (default: "auto")
  --device DEVICE       The GPU device to compile the model to. If not set, it is inferred from GPUs
                        available locally. (default: "auto")
  --host HOST           The host LLVM triple to compile the model to. If not set, it is inferred from the
                        local CPU and OS. Examples of the LLVM triple: 1) iPhones: arm64-apple-ios; 2) ARM64
                        Android phones: aarch64-linux-android; 3) WebAssembly: wasm32-unknown-unknown-wasm;
                        4) Windows: x86_64-pc-windows-msvc; 5) ARM macOS: arm64-apple-darwin. (default:
                        "auto")
  --opt OPT             Optimization flags. MLC LLM maintains a predefined set of optimization flags, denoted
                        as O0, O1, O2, O3, where O0 means no optimization, O2 means majority of them, and O3
                        represents extreme optimization that could potentially break the system. Meanwhile,
                        optimization flags could be explicitly specified via details knobs, e.g.
                        --opt="cublas_gemm=1;cudagraph=0". (default: "O2")
  --system-lib-prefix SYSTEM_LIB_PREFIX
                        Adding a prefix to all symbols exported. Similar to "objcopy --prefix-symbols". This
                        is useful when compiling multiple models into a single library to avoid symbol
                        conflicts. Different from objcopy, this takes no effect for shared library. (default:
                        "auto")
  --output OUTPUT, -o OUTPUT
                        The path to the output file. The suffix determines if the output file is a shared
                        library or objects. Available suffixes: 1) Linux: .so (shared), .tar (objects); 2)
                        macOS: .dylib (shared), .tar (objects); 3) Windows: .dll (shared), .tar (objects); 4)
                        Android, iOS: .tar (objects); 5) Web: .wasm (web assembly). (required)
  --overrides OVERRIDES
                        Model configuration override. Configurations to override `mlc-chat-config.json`.
                        Supports `context_window_size`, `prefill_chunk_size`, `sliding_window_size`,
                        `attention_sink_size`, `max_batch_size` and `tensor_parallel_shards`. Meanwhile, model
                        config could be explicitly specified via details knobs, e.g. --overrides
                        "context_window_size=1024;prefill_chunk_size=128". (default: "")
  --debug-dump DEBUG_DUMP
                        Specifies the directory where the compiler will store its IRs for debugging purposes
                        during various phases of compilation. By default, this is set to `None`, indicating
                        that debug dumping is disabled. (default: None)
____________________________________________________________________________________________________
$ mlc_llm convert_weight --help
usage: MLC AutoLLM Quantization Framework [-h]
                                          --quantization {q0f16,q0f32,q3f16_0,q3f16_1,q4f16_0,q4f16_1,q4f32_1,q4f16_2,q4f16_autoawq,q4f16_ft,e5m2_e5m2_f16,e4m3_e4m3_f16}
                                          [--model-type {auto,llama,mistral,gemma,gpt2,mixtral,gpt_neox,gpt_bigcode,phi-msft,phi,phi3,qwen,qwen2,stablelm,baichuan,internlm,rwkv5,orion,llava,rwkv6,chatglm,eagle,bert,medusa}]
                                          [--device DEVICE] [--source SOURCE]
                                          [--source-format {auto,huggingface-torch,huggingface-safetensor,awq}]
                                          --output OUTPUT
                                          config

positional arguments:
  config                1) Path to a HuggingFace model directory that contains a `config.json` or 2) Path to
                        `config.json` in HuggingFace format, or 3) The name of a pre-defined model
                        architecture. A `config.json` file in HuggingFace format defines the model
                        architecture, including the vocabulary size, the number of layers, the hidden size,
                        number of attention heads, etc. Example:
                        https://huggingface.co./codellama/CodeLlama-7b-hf/blob/main/config.json. A
                        HuggingFace directory often contains a `config.json` which defines the model
                        architecture, the non-quantized model weights in PyTorch or SafeTensor format,
                        tokenizer configurations, as well as an optional `generation_config.json` provides
                        additional default configuration for text generation. Example:
                        https://huggingface.co./codellama/CodeLlama-7b-hf/tree/main. (required)

options:
  -h, --help            show this help message and exit
  --quantization {q0f16,q0f32,q3f16_0,q3f16_1,q4f16_0,q4f16_1,q4f32_1,q4f16_2,q4f16_autoawq,q4f16_ft,e5m2_e5m2_f16,e4m3_e4m3_f16}
                        The quantization mode we use to compile. If unprovided, will infer from `model`.
                        (required, choices: q0f16, q0f32, q3f16_0, q3f16_1, q4f16_0, q4f16_1, q4f32_1,
                        q4f16_2, q4f16_autoawq, q4f16_ft, e5m2_e5m2_f16, e4m3_e4m3_f16)
  --model-type {auto,llama,mistral,gemma,gpt2,mixtral,gpt_neox,gpt_bigcode,phi-msft,phi,phi3,qwen,qwen2,stablelm,baichuan,internlm,rwkv5,orion,llava,rwkv6,chatglm,eagle,bert,medusa}
                        Model architecture such as "llama". If not set, it is inferred from
                        `mlc-chat-config.json`. (default: "auto")
  --device DEVICE       The device used to do quantization such as "cuda" or "cuda:0". Will detect from local
                        available GPUs if not specified. (default: "auto")
  --source SOURCE       The path to original model weight, infer from `config` if missing. (default: "auto")
  --source-format {auto,huggingface-torch,huggingface-safetensor,awq}
                        The format of source model weight, infer from `config` if missing. (default: "auto",
                        choices: auto, huggingface-torch, huggingface-safetensor, awq")
  --output OUTPUT, -o OUTPUT
                        The output directory to save the quantized model weight. Will create
                        `params_shard_*.bin` and `ndarray-cache.json` in this directory. (required)
____________________________________________________________________________________________________
$ mlc_llm serve --help
usage: MLC LLM Serve CLI [-h] [--device DEVICE] [--model-lib MODEL_LIB] [--mode {local,interactive,server}]
                         [--additional-models [ADDITIONAL_MODELS ...]] [--max-batch-size MAX_BATCH_SIZE]
                         [--max-total-seq-length MAX_TOTAL_SEQ_LENGTH]
                         [--prefill-chunk-size PREFILL_CHUNK_SIZE] [--max-history-size MAX_HISTORY_SIZE]
                         [--gpu-memory-utilization GPU_MEMORY_UTILIZATION]
                         [--speculative-mode {disable,small_draft,eagle,medusa}]
                         [--spec-draft-length SPEC_DRAFT_LENGTH] [--enable-tracing] [--host HOST]
                         [--port PORT] [--allow-credentials] [--allow-origins ALLOW_ORIGINS]
                         [--allow-methods ALLOW_METHODS] [--allow-headers ALLOW_HEADERS]
                         model

positional arguments:
  model                 A path to ``mlc-chat-config.json``, or an MLC model directory that contains
                        `mlc-chat-config.json`. It can also be a link to a HF repository pointing to an MLC
                        compiled model. (required)

options:
  -h, --help            show this help message and exit
  --device DEVICE       The device used to deploy the model such as "cuda" or "cuda:0". Will detect from
                        local available GPUs if not specified. (default: "auto")
  --model-lib MODEL_LIB
                        The full path to the model library file to use (e.g. a ``.so`` file). If unspecified,
                        we will use the provided ``model`` to search over possible paths. It the model lib is
                        not found, it will be compiled in a JIT manner. (default: "None")
  --mode {local,interactive,server}
                        The engine mode in MLC LLM. We provide three preset modes: "local", "interactive" and
                        "server". The default mode is "local". The choice of mode decides the values of
                        "--max-batch-size", "--max-total-seq-length" and "--prefill-chunk-size" when they are
                        not explicitly specified. 1. Mode "local" refers to the local server deployment which
                        has low request concurrency. So the max batch size will be set to 4, and max total
                        sequence length and prefill chunk size are set to the context window size (or sliding
                        window size) of the model. 2. Mode "interactive" refers to the interactive use of
                        server, which has at most 1 concurrent request. So the max batch size will be set to
                        1, and max total sequence length and prefill chunk size are set to the context window
                        size (or sliding window size) of the model. 3. Mode "server" refers to the large
                        server use case which may handle many concurrent request and want to use GPU memory
                        as much as possible. In this mode, we will automatically infer the largest possible
                        max batch size and max total sequence length. You can manually specify arguments
                        "--max-batch-size", "--max-total-seq-length" and "--prefill-chunk-size" to override
                        the automatic inferred values. (default: "local")
  --additional-models [ADDITIONAL_MODELS ...]
                        The model paths and (optional) model library paths of additional models (other than
                        the main model). When engine is enabled with speculative decoding, additional models
                        are needed. The way of specifying additional models is: "--additional-models
                        model_path_1 model_path_2 ..." or "--additional-models model_path_1:model_lib_1
                        model_path_2 ...". When the model lib of a model is not given, JIT model compilation
                        will be activated to compile the model automatically.
  --max-batch-size MAX_BATCH_SIZE
                        The maximum allowed batch size set for the KV cache to concurrently support.
  --max-total-seq-length MAX_TOTAL_SEQ_LENGTH
                        The KV cache total token capacity, i.e., the maximum total number of tokens that the
                        KV cache support. This decides the GPU memory size that the KV cache consumes. If not
                        specified, system will automatically estimate the maximum capacity based on the vRAM
                        size on GPU.
  --prefill-chunk-size PREFILL_CHUNK_SIZE
                        The maximum number of tokens the model passes for prefill each time. It should not
                        exceed the prefill chunk size in model config. If not specified, this defaults to the
                        prefill chunk size in model config.
  --max-history-size MAX_HISTORY_SIZE
                        The maximum history length for rolling back the RNN state. If unspecified, the
                        default value is 1. KV cache does not need this.
  --gpu-memory-utilization GPU_MEMORY_UTILIZATION
                        A number in (0, 1) denoting the fraction of GPU memory used by the server in total.
                        It is used to infer to maximum possible KV cache capacity. When it is unspecified, it
                        defaults to 0.85. Under mode "local" or "interactive", the actual memory usage may be
                        significantly smaller than this number. Under mode "server", the actual memory usage
                        may be slightly larger than this number.
  --speculative-mode {disable,small_draft,eagle,medusa}
                        The speculative decoding mode. Right now three options are supported: - "disable",
                        where speculative decoding is not enabled, - "small_draft", denoting the normal
                        speculative decoding (small draft) style, - "eagle", denoting the eagle-style
                        speculative decoding. The default mode is "disable". (default: "disable")
  --spec-draft-length SPEC_DRAFT_LENGTH
                        The number of draft tokens to generate in speculative proposal. The default values is
                        4.
  --enable-tracing      Enable Chrome Tracing for the server. After enabling, you can send POST request to
                        the "debug/dump_event_trace" entrypoint to get the Chrome Trace. For example, "curl
                        -X POST http://127.0.0.1:8000/debug/dump_event_trace -H "Content-Type:
                        application/json" -d '{"model": "dist/llama"}'"
  --host HOST           host name (default: "127.0.0.1")
  --port PORT           port (default: "8000")
  --allow-credentials   allow credentials
  --allow-origins ALLOW_ORIGINS
                        allowed origins (default: "['*']")
  --allow-methods ALLOW_METHODS
                        allowed methods (default: "['*']")
  --allow-headers ALLOW_HEADERS
                        allowed headers (default: "['*']")
____________________________________________________________________________________________________
$ mlc_llm gen_config --help
usage: MLC LLM Configuration Generator [-h]
                                       --quantization {q0f16,q0f32,q3f16_0,q3f16_1,q4f16_0,q4f16_1,q4f32_1,q4f16_2,q4f16_autoawq,q4f16_ft,e5m2_e5m2_f16,e4m3_e4m3_f16}
                                       [--model-type {auto,llama,mistral,gemma,gpt2,mixtral,gpt_neox,gpt_bigcode,phi-msft,phi,phi3,qwen,qwen2,stablelm,baichuan,internlm,rwkv5,orion,llava,rwkv6,chatglm,eagle,bert,medusa}]
                                       --conv-template {llama-3,custom,open_hermes_mistral,vicuna_v1.1,gorilla,gorilla-openfunctions-v2,llava,gpt2,minigpt,stablecode_completion,conv_one_shot,llama-2,stablelm-3b,guanaco,LM,rwkv_world,gpt_bigcode,codellama_instruct,phi-2,phi-3,wizardlm_7b,stablelm-2,mistral_default,redpajama_chat,oasst,stablelm,llama_default,moss,gemma_instruction,neural_hermes_mistral,rwkv,stablecode_instruct,codellama_completion,wizard_coder_or_math,chatml,orion,glm,dolly}
                                       [--context-window-size CONTEXT_WINDOW_SIZE]
                                       [--sliding-window-size SLIDING_WINDOW_SIZE]
                                       [--prefill-chunk-size PREFILL_CHUNK_SIZE]
                                       [--attention-sink-size ATTENTION_SINK_SIZE]
                                       [--tensor-parallel-shards TENSOR_PARALLEL_SHARDS]
                                       [--max-batch-size MAX_BATCH_SIZE]
                                       --output OUTPUT
                                       config

positional arguments:
  config                1) Path to a HuggingFace model directory that contains a `config.json` or 2) Path to
                        `config.json` in HuggingFace format, or 3) The name of a pre-defined model
                        architecture. A `config.json` file in HuggingFace format defines the model
                        architecture, including the vocabulary size, the number of layers, the hidden size,
                        number of attention heads, etc. Example:
                        https://huggingface.co./codellama/CodeLlama-7b-hf/blob/main/config.json. A
                        HuggingFace directory often contains a `config.json` which defines the model
                        architecture, the non-quantized model weights in PyTorch or SafeTensor format,
                        tokenizer configurations, as well as an optional `generation_config.json` provides
                        additional default configuration for text generation. Example:
                        https://huggingface.co./codellama/CodeLlama-7b-hf/tree/main. (required)

options:
  -h, --help            show this help message and exit
  --quantization {q0f16,q0f32,q3f16_0,q3f16_1,q4f16_0,q4f16_1,q4f32_1,q4f16_2,q4f16_autoawq,q4f16_ft,e5m2_e5m2_f16,e4m3_e4m3_f16}
                        The quantization mode we use to compile. If unprovided, will infer from `model`.
                        (required, choices: q0f16, q0f32, q3f16_0, q3f16_1, q4f16_0, q4f16_1, q4f32_1,
                        q4f16_2, q4f16_autoawq, q4f16_ft, e5m2_e5m2_f16, e4m3_e4m3_f16)
  --model-type {auto,llama,mistral,gemma,gpt2,mixtral,gpt_neox,gpt_bigcode,phi-msft,phi,phi3,qwen,qwen2,stablelm,baichuan,internlm,rwkv5,orion,llava,rwkv6,chatglm,eagle,bert,medusa}
                        Model architecture such as "llama". If not set, it is inferred from
                        `mlc-chat-config.json`. (default: "auto", choices: auto, llama, mistral, gemma, gpt2,
                        mixtral, gpt_neox, gpt_bigcode, phi-msft, phi, phi3, qwen, qwen2, stablelm, baichuan,
                        internlm, rwkv5, orion, llava, rwkv6, chatglm, eagle, bert, medusa)
  --conv-template {llama-3,custom,open_hermes_mistral,vicuna_v1.1,gorilla,gorilla-openfunctions-v2,llava,gpt2,minigpt,stablecode_completion,conv_one_shot,llama-2,stablelm-3b,guanaco,LM,rwkv_world,gpt_bigcode,codellama_instruct,phi-2,phi-3,wizardlm_7b,stablelm-2,mistral_default,redpajama_chat,oasst,stablelm,llama_default,moss,gemma_instruction,neural_hermes_mistral,rwkv,stablecode_instruct,codellama_completion,wizard_coder_or_math,chatml,orion,glm,dolly}
                        Conversation template. It depends on how the model is tuned. Use "LM" for vanilla
                        base model (required, choices: llama-3, custom, open_hermes_mistral, vicuna_v1.1,
                        gorilla, gorilla-openfunctions-v2, llava, gpt2, minigpt, stablecode_completion,
                        conv_one_shot, llama-2, stablelm-3b, guanaco, LM, rwkv_world, gpt_bigcode,
                        codellama_instruct, phi-2, phi-3, wizardlm_7b, stablelm-2, mistral_default,
                        redpajama_chat, oasst, stablelm, llama_default, moss, gemma_instruction,
                        neural_hermes_mistral, rwkv, stablecode_instruct, codellama_completion,
                        wizard_coder_or_math, chatml, orion, glm, dolly)
  --context-window-size CONTEXT_WINDOW_SIZE
                        Option to provide the maximum sequence length supported by the model. This is usually
                        explicitly shown as context length or context window in the model card. If this
                        option is not set explicitly, by default, it will be determined by
                        `context_window_size` or `max_position_embeddings` in `config.json`, and the latter
                        is usually inaccurate for some models. (default: "None")
  --sliding-window-size SLIDING_WINDOW_SIZE
                        (Experimental) The sliding window size in sliding window attention (SWA). This
                        optional field overrides the `sliding_window_size` in config.json for those models
                        that use SWA. Currently only useful when compiling Mistral. This flag subjects to
                        future refactoring. (default: "None")
  --prefill-chunk-size PREFILL_CHUNK_SIZE
                        (Experimental) The chunk size during prefilling. By default, the chunk size is the
                        same as sliding window or max sequence length. This flag subjects to future
                        refactoring. (default: "None")
  --attention-sink-size ATTENTION_SINK_SIZE
                        (Experimental) The number of stored sinks. Only supported on Mistral yet. By default,
                        the number of sinks is 4. This flag subjects to future refactoring. (default: "None")
  --tensor-parallel-shards TENSOR_PARALLEL_SHARDS
                        Number of shards to split the model into in tensor parallelism multi-gpu inference.
                        (default: "None")
  --max-batch-size MAX_BATCH_SIZE
                        The maximum allowed batch size set for the KV cache to concurrently support.
                        (default: "80")
  --output OUTPUT, -o OUTPUT
                        The output directory for generated configurations, including `mlc-chat-config.json`
                        and tokenizer configuration. (required)
____________________________________________________________________________________________________
$ mlc_llm bench --help
usage: MLC LLM Chat CLI [-h] [--prompt PROMPT] [--opt OPT] [--device DEVICE] [--overrides OVERRIDES]
                        [--generate-length GENERATE_LENGTH] [--model-lib MODEL_LIB]
                        model

positional arguments:
  model                 A path to ``mlc-chat-config.json``, or an MLC model directory that contains
                        `mlc-chat-config.json`. It can also be a link to a HF repository pointing to an MLC
                        compiled model. (required)

options:
  -h, --help            show this help message and exit
  --prompt PROMPT       The prompt of the text generation. (default: "What is the meaning of life?")
  --opt OPT             Optimization flags. MLC LLM maintains a predefined set of optimization flags, denoted
                        as O0, O1, O2, O3, where O0 means no optimization, O2 means majority of them, and O3
                        represents extreme optimization that could potentially break the system. Meanwhile,
                        optimization flags could be explicitly specified via details knobs, e.g.
                        --opt="cublas_gemm=1;cudagraph=0". (default: "O2")
  --device DEVICE       The device used to deploy the model such as "cuda" or "cuda:0". Will detect from
                        local available GPUs if not specified. (default: "auto")
  --overrides OVERRIDES
                        Chat configuration override. Configurations to override ChatConfig. Supports
                        `conv_template`, `context_window_size`, `prefill_chunk_size`, `sliding_window_size`,
                        `attention_sink_size`, `max_batch_size` and `tensor_parallel_shards`. Meanwhile, model
                        chat could be explicitly specified via details knobs, e.g. --overrides
                        "context_window_size=1024;prefill_chunk_size=128". (default: "")
  --generate-length GENERATE_LENGTH
                        The target length of the text generation. (default: "256")
  --model-lib MODEL_LIB
                        The full path to the model library file to use (e.g. a ``.so`` file). If unspecified,
                        we will use the provided ``model`` to search over possible paths. It the model lib is
                        not found, it will be compiled in a JIT manner. (default: "None")
____________________________________________________________________________________________________
$ mlc_llm package --help
usage: MLC LLM Package CLI [-h] [--package-config PACKAGE_CONFIG] [--mlc-llm-home MLC_LLM_HOME]
                           [--output OUTPUT]

options:
  -h, --help            show this help message and exit
  --package-config PACKAGE_CONFIG
                        The path to "mlc-package-config.json" which is used for package build. See
                        "https://github.com/mlc-ai/mlc-llm/blob/main/ios/MLCChat/mlc-package-config.json" as
                        an example. (default: "mlc-package-config.json")
  --mlc-llm-home MLC_LLM_HOME
                        The source code path to MLC LLM. (default: the $MLC_LLM_HOME environment variable)
  --output OUTPUT, -o OUTPUT
                        The path of output directory for the package build outputs. (default: "dist")
```
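
The `--overrides` help text above describes a semicolon-separated list of `key=value` pairs, and the semicolons must be quoted so the shell does not treat them as command separators. The following is only a minimal sketch of one way to build that string: `join_overrides` and the `RUN` environment variable are hypothetical helpers introduced here, not part of `mlc_llm`. The model URL and the two override keys are the ones shown earlier in this document.

```sh
#!/bin/sh
# join_overrides (hypothetical helper): join key=value arguments with ';',
# the separator shown in the --overrides help text.
join_overrides() {
    out=""
    for kv in "$@"; do
        if [ -z "$out" ]; then
            out="$kv"
        else
            out="$out;$kv"
        fi
    done
    printf '%s\n' "$out"
}

MODEL="HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
OVERRIDES=$(join_overrides context_window_size=1024 prefill_chunk_size=128)

# RUN is an assumed switch for this sketch: by default the command is only
# printed; set RUN=1 to actually start the chat.
if [ "${RUN:-0}" = "1" ]; then
    mlc_llm chat "$MODEL" --overrides "$OVERRIDES"
else
    printf 'mlc_llm chat %s --overrides "%s"\n' "$MODEL" "$OVERRIDES"
fi
```

According to the help output, `mlc_llm bench` and `mlc_llm compile` also take `--overrides`, so the same string should be reusable there. `compile` does not list `conv_template` among its supported keys.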