Arun Kumar Tiwary

	# Set up MLC-LLM on CPU on Ubuntu 22.04 LTS

	```sh
	sudo apt update
	sudo apt install ocl-icd-opencl-dev clinfo vulkan-tools
	python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly mlc-ai-nightly
	mlc_llm chat HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC

	___________________________________________________________________________________________________________________________________________
	$ mlc_llm --help
	usage: MLC LLM Command Line Interface. [-h] {compile,convert_weight,gen_config,chat,serve,bench,package}

	positional arguments:
	{compile,convert_weight,gen_config,chat,serve,bench,package}
	Subcommand to to run. (choices: compile, convert_weight, gen_config, chat, serve, bench, package)

	options:
	-h, --help show this help message and exit


	____________________________________________________________________________________________________________________________________________
	$ mlc_llm chat --help
	usage: MLC LLM Chat CLI [-h] [--opt OPT] [--device DEVICE] [--overrides OVERRIDES] [--model-lib MODEL_LIB] model

	positional arguments:
	model A path to ``mlc-chat-config.json``, or an MLC model directory that contains `mlc-chat-config.json`.
	It can also be a link to a HF repository pointing to an MLC compiled model. (required)

	options:
	-h, --help show this help message and exit
	--opt OPT Optimization flags. MLC LLM maintains a predefined set of optimization flags, denoted as O0, O1, O2,
	O3, where O0 means no optimization, O2 means majority of them, and O3 represents extreme
	optimization that could potentially break the system. Meanwhile, optimization flags could be
	explicitly specified via details knobs, e.g. --opt="cublas_gemm=1;cudagraph=0". (default: "O2")
	--device DEVICE The device used to deploy the model such as "cuda" or "cuda:0". Will detect from local available
	GPUs if not specified. (default: "auto")
	--overrides OVERRIDES
	Chat configuration override. Configurations to override ChatConfig. Supports `conv_template`,
	`context_window_size`, `prefill_chunk_size`, `sliding_window_size`, `attention_sink_size`,
	`max_batch_size` and `tensor_parallel_shards`. Meanwhile, model chat could be explicitly specified
	via details knobs, e.g. --overrides "context_window_size=1024;prefill_chunk_size=128". (default: "")
	--model-lib MODEL_LIB
	The full path to the model library file to use (e.g. a ``.so`` file). If unspecified, we will use
	the provided ``model`` to search over possible paths. It the model lib is not found, it will be
	compiled in a JIT manner. (default: "None")
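
	# Example (a sketch, not part of the help output): pass chat overrides in the
	# documented "key=value;key=value" form. The values mirror the example in the
	# --overrides help text above and are only a starting point, not tuned settings.
	MODEL="HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
	OVERRIDES="context_window_size=1024;prefill_chunk_size=128"
	# Dry run: print the command for review. Remove the leading `echo` to start the chat.
	echo mlc_llm chat "$MODEL" --device auto --opt O2 --overrides "$OVERRIDES"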


	------------------------------------------------------------------------------------------------------------------------------------------
	$ mlc_llm compile --help
	usage: mlc_llm compile [-h]
	[--quantization {q0f16,q0f32,q3f16_0,q3f16_1,q4f16_0,q4f16_1,q4f32_1,q4f16_2,q4f16_autoawq,q4f16_ft,e5m2_e5m2_f16,e4m3_e4m3_f16}]
	[--model-type {auto,llama,mistral,gemma,gpt2,mixtral,gpt_neox,gpt_bigcode,phi-msft,phi,phi3,qwen,qwen2,stablelm,baichuan,internlm,rwkv5,orion,llava,rwkv6,chatglm,eagle,bert,medusa}]
	[--device DEVICE] [--host HOST] [--opt OPT] [--system-lib-prefix SYSTEM_LIB_PREFIX] --output OUTPUT
	[--overrides OVERRIDES] [--debug-dump DEBUG_DUMP]
	model

	positional arguments:
	model A path to ``mlc-chat-config.json``, or an MLC model directory that contains `mlc-chat-config.json`.
	It can also be a link to a HF repository pointing to an MLC compiled model. (required)

	options:
	-h, --help show this help message and exit
	--quantization {q0f16,q0f32,q3f16_0,q3f16_1,q4f16_0,q4f16_1,q4f32_1,q4f16_2,q4f16_autoawq,q4f16_ft,e5m2_e5m2_f16,e4m3_e4m3_f16}
	The quantization mode we use to compile. If unprovided, will infer from `model`. (default: look up
	mlc-chat-config.json, choices: q0f16, q0f32, q3f16_0, q3f16_1, q4f16_0, q4f16_1, q4f32_1, q4f16_2,
	q4f16_autoawq, q4f16_ft, e5m2_e5m2_f16, e4m3_e4m3_f16)
	--model-type {auto,llama,mistral,gemma,gpt2,mixtral,gpt_neox,gpt_bigcode,phi-msft,phi,phi3,qwen,qwen2,stablelm,baichuan,internlm,rwkv5,orion,llava,rwkv6,chatglm,eagle,bert,medusa}
	Model architecture such as "llama". If not set, it is inferred from `mlc-chat-config.json`.
	(default: "auto")
	--device DEVICE The GPU device to compile the model to. If not set, it is inferred from GPUs available locally.
	(default: "auto")
	--host HOST The host LLVM triple to compile the model to. If not set, it is inferred from the local CPU and OS.
	Examples of the LLVM triple: 1) iPhones: arm64-apple-ios; 2) ARM64 Android phones: aarch64-linux-
	android; 3) WebAssembly: wasm32-unknown-unknown-wasm; 4) Windows: x86_64-pc-windows-msvc; 5) ARM
	macOS: arm64-apple-darwin. (default: "auto")
	--opt OPT Optimization flags. MLC LLM maintains a predefined set of optimization flags, denoted as O0, O1, O2,
	O3, where O0 means no optimization, O2 means majority of them, and O3 represents extreme
	optimization that could potentially break the system. Meanwhile, optimization flags could be
	explicitly specified via details knobs, e.g. --opt="cublas_gemm=1;cudagraph=0". (default: "O2")
	--system-lib-prefix SYSTEM_LIB_PREFIX
	Adding a prefix to all symbols exported. Similar to "objcopy --prefix-symbols". This is useful when
	compiling multiple models into a single library to avoid symbol conflicts. Different from objcopy,
	this takes no effect for shared library. (default: "auto")
	--output OUTPUT, -o OUTPUT
	The path to the output file. The suffix determines if the output file is a shared library or
	objects. Available suffixes: 1) Linux: .so (shared), .tar (objects); 2) macOS: .dylib (shared), .tar
	(objects); 3) Windows: .dll (shared), .tar (objects); 4) Android, iOS: .tar (objects); 5) Web: .wasm
	(web assembly). (required)
	--overrides OVERRIDES
	Model configuration override. Configurations to override `mlc-chat-config.json`. Supports
	`context_window_size`, `prefill_chunk_size`, `sliding_window_size`, `attention_sink_size`,
	`max_batch_size` and `tensor_parallel_shards`. Meanwhile, model config could be explicitly specified
	via details knobs, e.g. --overrides "context_window_size=1024;prefill_chunk_size=128". (default: "")
	--debug-dump DEBUG_DUMP
	Specifies the directory where the compiler will store its IRs for debugging purposes during various
	phases of compilation. By default, this is set to `None`, indicating that debug dumping is disabled.
	(default: None)
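
	# Example (a sketch, not part of the help output): compile a model library on Linux.
	# MODEL_DIR and LIB_PATH are hypothetical paths chosen for illustration; the .so
	# suffix selects a shared library, as described under --output above.
	MODEL_DIR="dist/Llama-3-8B-Instruct-q4f16_1-MLC"
	LIB_PATH="dist/libs/Llama-3-8B-Instruct-q4f16_1.so"
	# Dry run: print the command for review. Remove the leading `echo` to compile.
	echo mlc_llm compile "$MODEL_DIR" --device auto --opt O2 -o "$LIB_PATH"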

	____________________________________________________________________________________________________________________________________________
	$ mlc_llm convert_weight --help
	usage: MLC AutoLLM Quantization Framework [-h] --quantization
	{q0f16,q0f32,q3f16_0,q3f16_1,q4f16_0,q4f16_1,q4f32_1,q4f16_2,q4f16_autoawq,q4f16_ft,e5m2_e5m2_f16,e4m3_e4m3_f16}
	[--model-type {auto,llama,mistral,gemma,gpt2,mixtral,gpt_neox,gpt_bigcode,phi-msft,phi,phi3,qwen,qwen2,stablelm,baichuan,internlm,rwkv5,orion,llava,rwkv6,chatglm,eagle,bert,medusa}]
	[--device DEVICE] [--source SOURCE]
	[--source-format {auto,huggingface-torch,huggingface-safetensor,awq}] --output
	OUTPUT
	config

	positional arguments:
	config 1) Path to a HuggingFace model directory that contains a `config.json` or 2) Path to `config.json`
	in HuggingFace format, or 3) The name of a pre-defined model architecture. A `config.json` file in
	HuggingFace format defines the model architecture, including the vocabulary size, the number of
	layers, the hidden size, number of attention heads, etc. Example:
	https://huggingface.co/codellama/CodeLlama-7b-hf/blob/main/config.json. A HuggingFace directory
	often contains a `config.json` which defines the model architecture, the non-quantized model weights
	in PyTorch or SafeTensor format, tokenizer configurations, as well as an optional
	`generation_config.json` provides additional default configuration for text generation. Example:
	https://huggingface.co/codellama/CodeLlama-7b-hf/tree/main. (required)

	options:
	-h, --help show this help message and exit
	--quantization {q0f16,q0f32,q3f16_0,q3f16_1,q4f16_0,q4f16_1,q4f32_1,q4f16_2,q4f16_autoawq,q4f16_ft,e5m2_e5m2_f16,e4m3_e4m3_f16}
	The quantization mode we use to compile. If unprovided, will infer from `model`. (required, choices:
	q0f16, q0f32, q3f16_0, q3f16_1, q4f16_0, q4f16_1, q4f32_1, q4f16_2, q4f16_autoawq, q4f16_ft,
	e5m2_e5m2_f16, e4m3_e4m3_f16)
	--model-type {auto,llama,mistral,gemma,gpt2,mixtral,gpt_neox,gpt_bigcode,phi-msft,phi,phi3,qwen,qwen2,stablelm,baichuan,internlm,rwkv5,orion,llava,rwkv6,chatglm,eagle,bert,medusa}
	Model architecture such as "llama". If not set, it is inferred from `mlc-chat-config.json`.
	(default: "auto")
	--device DEVICE The device used to do quantization such as "cuda" or "cuda:0". Will detect from local available GPUs
	if not specified. (default: "auto")
	--source SOURCE The path to original model weight, infer from `config` if missing. (default: "auto")
	--source-format {auto,huggingface-torch,huggingface-safetensor,awq}
	The format of source model weight, infer from `config` if missing. (default: "auto", choices: auto,
	huggingface-torch, huggingface-safetensor, awq")
	--output OUTPUT, -o OUTPUT
	The output directory to save the quantized model weight. Will create `params_shard_*.bin` and
	`ndarray-cache.json` in this directory. (required)
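
	# Example (a sketch, not part of the help output): quantize Hugging Face weights
	# with the q4f16_1 mode listed above. HF_DIR and OUT_DIR are hypothetical paths;
	# HF_DIR is assumed to hold a config.json plus the original weights.
	HF_DIR="models/Llama-3-8B-Instruct"
	OUT_DIR="dist/Llama-3-8B-Instruct-q4f16_1-MLC"
	# Dry run: print the command for review. Remove the leading `echo` to convert.
	echo mlc_llm convert_weight "$HF_DIR" --quantization q4f16_1 --source-format auto -o "$OUT_DIR"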

	--------------------------------------------------------------------------------------------------------------------------------
	$ mlc_llm serve --help
	usage: MLC LLM Serve CLI [-h] [--device DEVICE] [--model-lib MODEL_LIB] [--mode {local,interactive,server}]
	[--additional-models [ADDITIONAL_MODELS ...]] [--max-batch-size MAX_BATCH_SIZE]
	[--max-total-seq-length MAX_TOTAL_SEQ_LENGTH] [--prefill-chunk-size PREFILL_CHUNK_SIZE]
	[--max-history-size MAX_HISTORY_SIZE] [--gpu-memory-utilization GPU_MEMORY_UTILIZATION]
	[--speculative-mode {disable,small_draft,eagle,medusa}] [--spec-draft-length SPEC_DRAFT_LENGTH]
	[--enable-tracing] [--host HOST] [--port PORT] [--allow-credentials]
	[--allow-origins ALLOW_ORIGINS] [--allow-methods ALLOW_METHODS] [--allow-headers ALLOW_HEADERS]
	model

	positional arguments:
	model A path to ``mlc-chat-config.json``, or an MLC model directory that contains `mlc-chat-config.json`.
	It can also be a link to a HF repository pointing to an MLC compiled model. (required)

	options:
	-h, --help show this help message and exit
	--device DEVICE The device used to deploy the model such as "cuda" or "cuda:0". Will detect from local available
	GPUs if not specified. (default: "auto")
	--model-lib MODEL_LIB
	The full path to the model library file to use (e.g. a ``.so`` file). If unspecified, we will use
	the provided ``model`` to search over possible paths. It the model lib is not found, it will be
	compiled in a JIT manner. (default: "None")
	--mode {local,interactive,server}
	The engine mode in MLC LLM. We provide three preset modes: "local", "interactive" and "server". The
	default mode is "local". The choice of mode decides the values of "--max-batch-size", "--max-total-
	seq-length" and "--prefill-chunk-size" when they are not explicitly specified. 1. Mode "local"
	refers to the local server deployment which has low request concurrency. So the max batch size will
	be set to 4, and max total sequence length and prefill chunk size are set to the context window size
	(or sliding window size) of the model. 2. Mode "interactive" refers to the interactive use of
	server, which has at most 1 concurrent request. So the max batch size will be set to 1, and max
	total sequence length and prefill chunk size are set to the context window size (or sliding window
	size) of the model. 3. Mode "server" refers to the large server use case which may handle many
	concurrent request and want to use GPU memory as much as possible. In this mode, we will
	automatically infer the largest possible max batch size and max total sequence length. You can
	manually specify arguments "--max-batch-size", "--max-total-seq-length" and "--prefill-chunk-size"
	to override the automatic inferred values. (default: "local")
	--additional-models [ADDITIONAL_MODELS ...]
	The model paths and (optional) model library paths of additional models (other than the main model).
	When engine is enabled with speculative decoding, additional models are needed. The way of
	specifying additional models is: "--additional-models model_path_1 model_path_2 ..." or "--
	additional-models model_path_1:model_lib_1 model_path_2 ...". When the model lib of a model is not
	given, JIT model compilation will be activated to compile the model automatically.
	--max-batch-size MAX_BATCH_SIZE
	The maximum allowed batch size set for the KV cache to concurrently support.
	--max-total-seq-length MAX_TOTAL_SEQ_LENGTH
	The KV cache total token capacity, i.e., the maximum total number of tokens that the KV cache
	support. This decides the GPU memory size that the KV cache consumes. If not specified, system will
	automatically estimate the maximum capacity based on the vRAM size on GPU.
	--prefill-chunk-size PREFILL_CHUNK_SIZE
	The maximum number of tokens the model passes for prefill each time. It should not exceed the
	prefill chunk size in model config. If not specified, this defaults to the prefill chunk size in
	model config.
	--max-history-size MAX_HISTORY_SIZE
	The maximum history length for rolling back the RNN state. If unspecified, the default value is 1.
	KV cache does not need this.
	--gpu-memory-utilization GPU_MEMORY_UTILIZATION
	A number in (0, 1) denoting the fraction of GPU memory used by the server in total. It is used to
	infer to maximum possible KV cache capacity. When it is unspecified, it defaults to 0.85. Under mode
	"local" or "interactive", the actual memory usage may be significantly smaller than this number.
	Under mode "server", the actual memory usage may be slightly larger than this number.
	--speculative-mode {disable,small_draft,eagle,medusa}
	The speculative decoding mode. Right now three options are supported: - "disable", where speculative
	decoding is not enabled, - "small_draft", denoting the normal speculative decoding (small draft)
	style, - "eagle", denoting the eagle-style speculative decoding. The default mode is "disable".
	(default: "disable")
	--spec-draft-length SPEC_DRAFT_LENGTH
	The number of draft tokens to generate in speculative proposal. The default values is 4.
	--enable-tracing Enable Chrome Tracing for the server. After enabling, you can send POST request to the
	"debug/dump_event_trace" entrypoint to get the Chrome Trace. For example, "curl -X POST
	http://127.0.0.1:8000/debug/dump_event_trace -H "Content-Type: application/json" -d '{"model":
	"dist/llama"}'"
	--host HOST host name (default: "127.0.0.1")
	--port PORT port (default: "8000")
	--allow-credentials allow credentials
	--allow-origins ALLOW_ORIGINS
	allowed origins (default: "['*']")
	--allow-methods ALLOW_METHODS
	allowed methods (default: "['*']")
	--allow-headers ALLOW_HEADERS
	allowed headers (default: "['*']")
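
	# Example (a sketch, not part of the help output): serve one model for single-user
	# use with the "interactive" preset and the default host and port shown above.
	MODEL="HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
	HOST="127.0.0.1"
	PORT="8000"
	# Dry run: print the command for review. Remove the leading `echo` to start the server.
	echo mlc_llm serve "$MODEL" --mode interactive --host "$HOST" --port "$PORT" --enable-tracing
	# With --enable-tracing, the help text documents this endpoint for dumping a Chrome trace.
	# The "dist/llama" model name is copied from that help example, not from a real setup.
	TRACE_URL="http://$HOST:$PORT/debug/dump_event_trace"
	echo curl -X POST "$TRACE_URL" -H "Content-Type: application/json" -d '{"model": "dist/llama"}'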

	_________________________________________________________________________________________________________________________________________
	$ mlc_llm gen_config --help
	usage: MLC LLM Configuration Generator [-h] --quantization
	{q0f16,q0f32,q3f16_0,q3f16_1,q4f16_0,q4f16_1,q4f32_1,q4f16_2,q4f16_autoawq,q4f16_ft,e5m2_e5m2_f16,e4m3_e4m3_f16}
	[--model-type {auto,llama,mistral,gemma,gpt2,mixtral,gpt_neox,gpt_bigcode,phi-msft,phi,phi3,qwen,qwen2,stablelm,baichuan,internlm,rwkv5,orion,llava,rwkv6,chatglm,eagle,bert,medusa}]
	--conv-template
	{llama-3,custom,open_hermes_mistral,vicuna_v1.1,gorilla,gorilla-openfunctions-v2,llava,gpt2,minigpt,stablecode_completion,conv_one_shot,llama-2,stablelm-3b,guanaco,LM,rwkv_world,gpt_bigcode,codellama_instruct,phi-2,phi-3,wizardlm_7b,stablelm-2,mistral_default,redpajama_chat,oasst,stablelm,llama_default,moss,gemma_instruction,neural_hermes_mistral,rwkv,stablecode_instruct,codellama_completion,wizard_coder_or_math,chatml,orion,glm,dolly}
	[--context-window-size CONTEXT_WINDOW_SIZE]
	[--sliding-window-size SLIDING_WINDOW_SIZE] [--prefill-chunk-size PREFILL_CHUNK_SIZE]
	[--attention-sink-size ATTENTION_SINK_SIZE]
	[--tensor-parallel-shards TENSOR_PARALLEL_SHARDS] [--max-batch-size MAX_BATCH_SIZE]
	--output OUTPUT
	config

	positional arguments:
	config 1) Path to a HuggingFace model directory that contains a `config.json` or 2) Path to `config.json`
	in HuggingFace format, or 3) The name of a pre-defined model architecture. A `config.json` file in
	HuggingFace format defines the model architecture, including the vocabulary size, the number of
	layers, the hidden size, number of attention heads, etc. Example:
	https://huggingface.co/codellama/CodeLlama-7b-hf/blob/main/config.json. A HuggingFace directory
	often contains a `config.json` which defines the model architecture, the non-quantized model weights
	in PyTorch or SafeTensor format, tokenizer configurations, as well as an optional
	`generation_config.json` provides additional default configuration for text generation. Example:
	https://huggingface.co/codellama/CodeLlama-7b-hf/tree/main. (required)

	options:
	-h, --help show this help message and exit
	--quantization {q0f16,q0f32,q3f16_0,q3f16_1,q4f16_0,q4f16_1,q4f32_1,q4f16_2,q4f16_autoawq,q4f16_ft,e5m2_e5m2_f16,e4m3_e4m3_f16}
	The quantization mode we use to compile. If unprovided, will infer from `model`. (required, choices:
	q0f16, q0f32, q3f16_0, q3f16_1, q4f16_0, q4f16_1, q4f32_1, q4f16_2, q4f16_autoawq, q4f16_ft,
	e5m2_e5m2_f16, e4m3_e4m3_f16)
	--model-type {auto,llama,mistral,gemma,gpt2,mixtral,gpt_neox,gpt_bigcode,phi-msft,phi,phi3,qwen,qwen2,stablelm,baichuan,internlm,rwkv5,orion,llava,rwkv6,chatglm,eagle,bert,medusa}
	Model architecture such as "llama". If not set, it is inferred from `mlc-chat-config.json`.
	(default: "auto", choices: auto, llama, mistral, gemma, gpt2, mixtral, gpt_neox, gpt_bigcode, phi-
	msft, phi, phi3, qwen, qwen2, stablelm, baichuan, internlm, rwkv5, orion, llava, rwkv6, chatglm,
	eagle, bert, medusa)
	--conv-template {llama-3,custom,open_hermes_mistral,vicuna_v1.1,gorilla,gorilla-openfunctions-v2,llava,gpt2,minigpt,stablecode_completion,conv_one_shot,llama-2,stablelm-3b,guanaco,LM,rwkv_world,gpt_bigcode,codellama_instruct,phi-2,phi-3,wizardlm_7b,stablelm-2,mistral_default,redpajama_chat,oasst,stablelm,llama_default,moss,gemma_instruction,neural_hermes_mistral,rwkv,stablecode_instruct,codellama_completion,wizard_coder_or_math,chatml,orion,glm,dolly}
	Conversation template. It depends on how the model is tuned. Use "LM" for vanilla base model
	(required, choices: llama-3, custom, open_hermes_mistral, vicuna_v1.1, gorilla, gorilla-
	openfunctions-v2, llava, gpt2, minigpt, stablecode_completion, conv_one_shot, llama-2, stablelm-3b,
	guanaco, LM, rwkv_world, gpt_bigcode, codellama_instruct, phi-2, phi-3, wizardlm_7b, stablelm-2,
	mistral_default, redpajama_chat, oasst, stablelm, llama_default, moss, gemma_instruction,
	neural_hermes_mistral, rwkv, stablecode_instruct, codellama_completion, wizard_coder_or_math,
	chatml, orion, glm, dolly)
	--context-window-size CONTEXT_WINDOW_SIZE
	Option to provide the maximum sequence length supported by the model. This is usually explicitly
	shown as context length or context window in the model card. If this option is not set explicitly,
	by default, it will be determined by `context_window_size` or `max_position_embeddings` in
	`config.json`, and the latter is usually inaccurate for some models. (default: "None")
	--sliding-window-size SLIDING_WINDOW_SIZE
	(Experimental) The sliding window size in sliding window attention (SWA). This optional field
	overrides the `sliding_window_size` in config.json for those models that use SWA. Currently only
	useful when compiling Mistral. This flag subjects to future refactoring. (default: "None")
	--prefill-chunk-size PREFILL_CHUNK_SIZE
	(Experimental) The chunk size during prefilling. By default, the chunk size is the same as sliding
	window or max sequence length. This flag subjects to future refactoring. (default: "None")
	--attention-sink-size ATTENTION_SINK_SIZE
	(Experimental) The number of stored sinks. Only supported on Mistral yet. By default, the number of
	sinks is 4. This flag subjects to future refactoring. (default: "None")
	--tensor-parallel-shards TENSOR_PARALLEL_SHARDS
	Number of shards to split the model into in tensor parallelism multi-gpu inference. (default:
	"None")
	--max-batch-size MAX_BATCH_SIZE
	The maximum allowed batch size set for the KV cache to concurrently support. (default: "80")
	--output OUTPUT, -o OUTPUT
	The output directory for generated configurations, including `mlc-chat-config.json` and tokenizer
	configuration. (required)
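
	# Example (a sketch, not part of the help output): generate mlc-chat-config.json for a
	# Llama 3 style model. HF_DIR and OUT_DIR are hypothetical paths, and the context
	# window of 1024 is an illustrative value, not a recommendation.
	HF_DIR="models/Llama-3-8B-Instruct"
	OUT_DIR="dist/Llama-3-8B-Instruct-q4f16_1-MLC"
	# Dry run: print the command for review. Remove the leading `echo` to generate the config.
	echo mlc_llm gen_config "$HF_DIR" --quantization q4f16_1 --conv-template llama-3 --context-window-size 1024 -o "$OUT_DIR"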
	________________________________________________________________________________________________________________________________________
	$ mlc_llm bench --help
	usage: MLC LLM Chat CLI [-h] [--prompt PROMPT] [--opt OPT] [--device DEVICE] [--overrides OVERRIDES]
	[--generate-length GENERATE_LENGTH] [--model-lib MODEL_LIB]
	model

	positional arguments:
	model A path to ``mlc-chat-config.json``, or an MLC model directory that contains `mlc-chat-config.json`.
	It can also be a link to a HF repository pointing to an MLC compiled model. (required)

	options:
	-h, --help show this help message and exit
	--prompt PROMPT The prompt of the text generation. (default: "What is the meaning of life?")
	--opt OPT Optimization flags. MLC LLM maintains a predefined set of optimization flags, denoted as O0, O1, O2,
	O3, where O0 means no optimization, O2 means majority of them, and O3 represents extreme
	optimization that could potentially break the system. Meanwhile, optimization flags could be
	explicitly specified via details knobs, e.g. --opt="cublas_gemm=1;cudagraph=0". (default: "O2")
	--device DEVICE The device used to deploy the model such as "cuda" or "cuda:0". Will detect from local available
	GPUs if not specified. (default: "auto")
	--overrides OVERRIDES
	Chat configuration override. Configurations to override ChatConfig. Supports `conv_template`,
	`context_window_size`, `prefill_chunk_size`, `sliding_window_size`, `attention_sink_size`,
	`max_batch_size` and `tensor_parallel_shards`. Meanwhile, model chat could be explicitly specified
	via details knobs, e.g. --overrides "context_window_size=1024;prefill_chunk_size=128". (default: "")
	--generate-length GENERATE_LENGTH
	The target length of the text generation. (default: "256")
	--model-lib MODEL_LIB
	The full path to the model library file to use (e.g. a ``.so`` file). If unspecified, we will use
	the provided ``model`` to search over possible paths. It the model lib is not found, it will be
	compiled in a JIT manner. (default: "None")
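
	# Example (a sketch, not part of the help output): benchmark generation using the
	# default prompt and length listed above, written out explicitly so they are easy to change.
	MODEL="HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
	PROMPT="What is the meaning of life?"
	GEN_LEN="256"
	# Dry run: print the command for review. Remove the leading `echo` to run the benchmark.
	echo mlc_llm bench "$MODEL" --prompt "$PROMPT" --generate-length "$GEN_LEN"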

	__________________________________________________________________________________________________________________________________________

	$ mlc_llm package --help
	usage: MLC LLM Package CLI [-h] [--package-config PACKAGE_CONFIG] [--mlc-llm-home MLC_LLM_HOME] [--output OUTPUT]

	options:
	-h, --help show this help message and exit
	--package-config PACKAGE_CONFIG
	The path to "mlc-package-config.json" which is used for package build. See "https://github.com/mlc-
	ai/mlc-llm/blob/main/ios/MLCChat/mlc-package-config.json" as an example. (default: "mlc-package-
	config.json")
	--mlc-llm-home MLC_LLM_HOME
	The source code path to MLC LLM. (default: the $MLC_LLM_HOME environment variable)
	--output OUTPUT, -o OUTPUT
	The path of output directory for the package build outputs. (default: "dist")
	```