# Set up MLC-LLM on CPU on Ubuntu 22.04 LTS

```sh
sudo apt update
sudo apt install ocl-icd-opencl-dev clinfo vulkan-tools
python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly mlc-ai-nightly
mlc_llm chat HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC
```
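
Before chatting, it may be worth confirming that the OpenCL and Vulkan runtimes installed above actually expose a device. `clinfo` and `vulkaninfo` come from the packages installed in the first step; the `--device vulkan` value in the last line is an assumption, so substitute whichever backend your machine reports.

```sh
# List the compute devices the installed runtimes can see (output varies by machine).
clinfo | grep -i "device name"
vulkaninfo | grep -i "devicename"

# Optionally pin the backend instead of relying on auto-detection
# ("vulkan" is an assumption; use the backend your system exposes).
mlc_llm chat HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC --device vulkan
```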
___________________________________________________________________________________________________________________________________________
```sh
$ mlc_llm  --help
usage: MLC LLM Command Line Interface. [-h] {compile,convert_weight,gen_config,chat,serve,bench,package}

positional arguments:
  {compile,convert_weight,gen_config,chat,serve,bench,package}
                        Subcommand to to run. (choices: compile, convert_weight, gen_config, chat, serve, bench, package)

options:
  -h, --help            show this help message and exit

```
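
The subcommands above form a pipeline: for a model that is not already published in MLC format, the usual order is `convert_weight` → `gen_config` → `compile`, after which `chat`, `serve`, or `bench` can consume the result. A hedged end-to-end sketch with placeholder local paths:

```sh
# Placeholder paths: ./Llama-3-8B-Instruct is a local HuggingFace checkout.
mlc_llm convert_weight ./Llama-3-8B-Instruct --quantization q4f16_1 -o ./dist/Llama-3-8B-q4f16_1-MLC
mlc_llm gen_config ./Llama-3-8B-Instruct --quantization q4f16_1 --conv-template llama-3 -o ./dist/Llama-3-8B-q4f16_1-MLC
mlc_llm compile ./dist/Llama-3-8B-q4f16_1-MLC/mlc-chat-config.json -o ./dist/libs/Llama-3-8B-q4f16_1.so
mlc_llm chat ./dist/Llama-3-8B-q4f16_1-MLC --model-lib ./dist/libs/Llama-3-8B-q4f16_1.so
```

Each step is detailed in the per-command help sections below.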
____________________________________________________________________________________________________________________________________________
```sh
$ mlc_llm chat --help
usage: MLC LLM Chat CLI [-h] [--opt OPT] [--device DEVICE] [--overrides OVERRIDES] [--model-lib MODEL_LIB] model

positional arguments:
  model                 A path to ``mlc-chat-config.json``, or an MLC model directory that contains `mlc-chat-config.json`.
                        It can also be a link to a HF repository pointing to an MLC compiled model. (required)

options:
  -h, --help            show this help message and exit
  --opt OPT             Optimization flags. MLC LLM maintains a predefined set of optimization flags, denoted as O0, O1, O2,
                        O3, where O0 means no optimization, O2 means majority of them, and O3 represents extreme
                        optimization that could potentially break the system. Meanwhile, optimization flags could be
                        explicitly specified via details knobs, e.g. --opt="cublas_gemm=1;cudagraph=0". (default: "O2")
  --device DEVICE       The device used to deploy the model such as "cuda" or "cuda:0". Will detect from local available
                        GPUs if not specified. (default: "auto")
  --overrides OVERRIDES
                        Chat configuration override. Configurations to override ChatConfig. Supports `conv_template`,
                        `context_window_size`, `prefill_chunk_size`, `sliding_window_size`, `attention_sink_size`,
                        `max_batch_size` and `tensor_parallel_shards`. Meanwhile, model chat could be explicitly specified
                        via details knobs, e.g. --overrides "context_window_size=1024;prefill_chunk_size=128". (default: "")
  --model-lib MODEL_LIB
                        The full path to the model library file to use (e.g. a ``.so`` file). If unspecified, we will use
                        the provided ``model`` to search over possible paths. It the model lib is not found, it will be
                        compiled in a JIT manner. (default: "None")

```
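
A couple of the chat flags in practice; the override values are illustrative only.

```sh
# Pin the device and shrink the context window to reduce memory use (example values).
mlc_llm chat HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC \
    --device vulkan \
    --overrides "context_window_size=1024;prefill_chunk_size=128"
```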
------------------------------------------------------------------------------------------------------------------------------------------
```sh
$ mlc_llm compile --help
usage: mlc_llm compile [-h]
                       [--quantization {q0f16,q0f32,q3f16_0,q3f16_1,q4f16_0,q4f16_1,q4f32_1,q4f16_2,q4f16_autoawq,q4f16_ft,e5m2_e5m2_f16,e4m3_e4m3_f16}]
                       [--model-type {auto,llama,mistral,gemma,gpt2,mixtral,gpt_neox,gpt_bigcode,phi-msft,phi,phi3,qwen,qwen2,stablelm,baichuan,internlm,rwkv5,orion,llava,rwkv6,chatglm,eagle,bert,medusa}]
                       [--device DEVICE] [--host HOST] [--opt OPT] [--system-lib-prefix SYSTEM_LIB_PREFIX] --output OUTPUT
                       [--overrides OVERRIDES] [--debug-dump DEBUG_DUMP]
                       model

positional arguments:
  model                 A path to ``mlc-chat-config.json``, or an MLC model directory that contains `mlc-chat-config.json`.
                        It can also be a link to a HF repository pointing to an MLC compiled model. (required)

options:
  -h, --help            show this help message and exit
  --quantization {q0f16,q0f32,q3f16_0,q3f16_1,q4f16_0,q4f16_1,q4f32_1,q4f16_2,q4f16_autoawq,q4f16_ft,e5m2_e5m2_f16,e4m3_e4m3_f16}
                        The quantization mode we use to compile. If unprovided, will infer from `model`. (default: look up
                        mlc-chat-config.json, choices: q0f16, q0f32, q3f16_0, q3f16_1, q4f16_0, q4f16_1, q4f32_1, q4f16_2,
                        q4f16_autoawq, q4f16_ft, e5m2_e5m2_f16, e4m3_e4m3_f16)
  --model-type {auto,llama,mistral,gemma,gpt2,mixtral,gpt_neox,gpt_bigcode,phi-msft,phi,phi3,qwen,qwen2,stablelm,baichuan,internlm,rwkv5,orion,llava,rwkv6,chatglm,eagle,bert,medusa}
                        Model architecture such as "llama". If not set, it is inferred from `mlc-chat-config.json`.
                        (default: "auto")
  --device DEVICE       The GPU device to compile the model to. If not set, it is inferred from GPUs available locally.
                        (default: "auto")
  --host HOST           The host LLVM triple to compile the model to. If not set, it is inferred from the local CPU and OS.
                        Examples of the LLVM triple: 1) iPhones: arm64-apple-ios; 2) ARM64 Android phones: aarch64-linux-
                        android; 3) WebAssembly: wasm32-unknown-unknown-wasm; 4) Windows: x86_64-pc-windows-msvc; 5) ARM
                        macOS: arm64-apple-darwin. (default: "auto")
  --opt OPT             Optimization flags. MLC LLM maintains a predefined set of optimization flags, denoted as O0, O1, O2,
                        O3, where O0 means no optimization, O2 means majority of them, and O3 represents extreme
                        optimization that could potentially break the system. Meanwhile, optimization flags could be
                        explicitly specified via details knobs, e.g. --opt="cublas_gemm=1;cudagraph=0". (default: "O2")
  --system-lib-prefix SYSTEM_LIB_PREFIX
                        Adding a prefix to all symbols exported. Similar to "objcopy --prefix-symbols". This is useful when
                        compiling multiple models into a single library to avoid symbol conflicts. Different from objcopy,
                        this takes no effect for shared library. (default: "auto")
  --output OUTPUT, -o OUTPUT
                        The path to the output file. The suffix determines if the output file is a shared library or
                        objects. Available suffixes: 1) Linux: .so (shared), .tar (objects); 2) macOS: .dylib (shared), .tar
                        (objects); 3) Windows: .dll (shared), .tar (objects); 4) Android, iOS: .tar (objects); 5) Web: .wasm
                        (web assembly). (required)
  --overrides OVERRIDES
                        Model configuration override. Configurations to override `mlc-chat-config.json`. Supports
                        `context_window_size`, `prefill_chunk_size`, `sliding_window_size`, `attention_sink_size`,
                        `max_batch_size` and `tensor_parallel_shards`. Meanwhile, model config could be explicitly specified
                        via details knobs, e.g. --overrides "context_window_size=1024;prefill_chunk_size=128". (default: "")
  --debug-dump DEBUG_DUMP
                        Specifies the directory where the compiler will store its IRs for debugging purposes during various
                        phases of compilation. By default, this is set to `None`, indicating that debug dumping is disabled.
                        (default: None)
```
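
A minimal compile invocation, assuming the weights and `mlc-chat-config.json` already exist under the placeholder `./dist/...` paths; the `.so` suffix selects a Linux shared library as described above.

```sh
# Build the model library for the local machine (paths and device are placeholders).
mlc_llm compile ./dist/Llama-3-8B-q4f16_1-MLC/mlc-chat-config.json \
    --device vulkan \
    -o ./dist/libs/Llama-3-8B-q4f16_1-vulkan.so
```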
____________________________________________________________________________________________________________________________________________
```sh
$ mlc_llm convert_weight --help
usage: MLC AutoLLM Quantization Framework [-h] --quantization
                                          {q0f16,q0f32,q3f16_0,q3f16_1,q4f16_0,q4f16_1,q4f32_1,q4f16_2,q4f16_autoawq,q4f16_ft,e5m2_e5m2_f16,e4m3_e4m3_f16}
                                          [--model-type {auto,llama,mistral,gemma,gpt2,mixtral,gpt_neox,gpt_bigcode,phi-msft,phi,phi3,qwen,qwen2,stablelm,baichuan,internlm,rwkv5,orion,llava,rwkv6,chatglm,eagle,bert,medusa}]
                                          [--device DEVICE] [--source SOURCE]
                                          [--source-format {auto,huggingface-torch,huggingface-safetensor,awq}] --output
                                          OUTPUT
                                          config

positional arguments:
  config                1) Path to a HuggingFace model directory that contains a `config.json` or 2) Path to `config.json`
                        in HuggingFace format, or 3) The name of a pre-defined model architecture. A `config.json` file in
                        HuggingFace format defines the model architecture, including the vocabulary size, the number of
                        layers, the hidden size, number of attention heads, etc. Example:
                        https://huggingface.co./codellama/CodeLlama-7b-hf/blob/main/config.json. A HuggingFace directory
                        often contains a `config.json` which defines the model architecture, the non-quantized model weights
                        in PyTorch or SafeTensor format, tokenizer configurations, as well as an optional
                        `generation_config.json` provides additional default configuration for text generation. Example:
                        https://huggingface.co./codellama/CodeLlama-7b-hf/tree/main. (required)

options:
  -h, --help            show this help message and exit
  --quantization {q0f16,q0f32,q3f16_0,q3f16_1,q4f16_0,q4f16_1,q4f32_1,q4f16_2,q4f16_autoawq,q4f16_ft,e5m2_e5m2_f16,e4m3_e4m3_f16}
                        The quantization mode we use to compile. If unprovided, will infer from `model`. (required, choices:
                        q0f16, q0f32, q3f16_0, q3f16_1, q4f16_0, q4f16_1, q4f32_1, q4f16_2, q4f16_autoawq, q4f16_ft,
                        e5m2_e5m2_f16, e4m3_e4m3_f16)
  --model-type {auto,llama,mistral,gemma,gpt2,mixtral,gpt_neox,gpt_bigcode,phi-msft,phi,phi3,qwen,qwen2,stablelm,baichuan,internlm,rwkv5,orion,llava,rwkv6,chatglm,eagle,bert,medusa}
                        Model architecture such as "llama". If not set, it is inferred from `mlc-chat-config.json`.
                        (default: "auto")
  --device DEVICE       The device used to do quantization such as "cuda" or "cuda:0". Will detect from local available GPUs
                        if not specified. (default: "auto")
  --source SOURCE       The path to original model weight, infer from `config` if missing. (default: "auto")
  --source-format {auto,huggingface-torch,huggingface-safetensor,awq}
                        The format of source model weight, infer from `config` if missing. (default: "auto", choices: auto,
                        huggingface-torch, huggingface-safetensor, awq")
  --output OUTPUT, -o OUTPUT
                        The output directory to save the quantized model weight. Will create `params_shard_*.bin` and
                        `ndarray-cache.json` in this directory. (required)
```
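
An example conversion from a local HuggingFace checkout (placeholder path); `--source-format` is left at its `auto` default so safetensors or PyTorch weights are detected automatically.

```sh
# Quantize local HF weights into MLC's sharded ndarray-cache format (path is a placeholder).
mlc_llm convert_weight ./models/Llama-3-8B-Instruct \
    --quantization q4f16_1 \
    -o ./dist/Llama-3-8B-q4f16_1-MLC
```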
--------------------------------------------------------------------------------------------------------------------------------
```sh
$ mlc_llm serve --help
usage: MLC LLM Serve CLI [-h] [--device DEVICE] [--model-lib MODEL_LIB] [--mode {local,interactive,server}]
                         [--additional-models [ADDITIONAL_MODELS ...]] [--max-batch-size MAX_BATCH_SIZE]
                         [--max-total-seq-length MAX_TOTAL_SEQ_LENGTH] [--prefill-chunk-size PREFILL_CHUNK_SIZE]
                         [--max-history-size MAX_HISTORY_SIZE] [--gpu-memory-utilization GPU_MEMORY_UTILIZATION]
                         [--speculative-mode {disable,small_draft,eagle,medusa}] [--spec-draft-length SPEC_DRAFT_LENGTH]
                         [--enable-tracing] [--host HOST] [--port PORT] [--allow-credentials]
                         [--allow-origins ALLOW_ORIGINS] [--allow-methods ALLOW_METHODS] [--allow-headers ALLOW_HEADERS]
                         model

positional arguments:
  model                 A path to ``mlc-chat-config.json``, or an MLC model directory that contains `mlc-chat-config.json`.
                        It can also be a link to a HF repository pointing to an MLC compiled model. (required)

options:
  -h, --help            show this help message and exit
  --device DEVICE       The device used to deploy the model such as "cuda" or "cuda:0". Will detect from local available
                        GPUs if not specified. (default: "auto")
  --model-lib MODEL_LIB
                        The full path to the model library file to use (e.g. a ``.so`` file). If unspecified, we will use
                        the provided ``model`` to search over possible paths. It the model lib is not found, it will be
                        compiled in a JIT manner. (default: "None")
  --mode {local,interactive,server}
                        The engine mode in MLC LLM. We provide three preset modes: "local", "interactive" and "server". The
                        default mode is "local". The choice of mode decides the values of "--max-batch-size", "--max-total-
                        seq-length" and "--prefill-chunk-size" when they are not explicitly specified. 1. Mode "local"
                        refers to the local server deployment which has low request concurrency. So the max batch size will
                        be set to 4, and max total sequence length and prefill chunk size are set to the context window size
                        (or sliding window size) of the model. 2. Mode "interactive" refers to the interactive use of
                        server, which has at most 1 concurrent request. So the max batch size will be set to 1, and max
                        total sequence length and prefill chunk size are set to the context window size (or sliding window
                        size) of the model. 3. Mode "server" refers to the large server use case which may handle many
                        concurrent request and want to use GPU memory as much as possible. In this mode, we will
                        automatically infer the largest possible max batch size and max total sequence length. You can
                        manually specify arguments "--max-batch-size", "--max-total-seq-length" and "--prefill-chunk-size"
                        to override the automatic inferred values. (default: "local")
  --additional-models [ADDITIONAL_MODELS ...]
                        The model paths and (optional) model library paths of additional models (other than the main model).
                        When engine is enabled with speculative decoding, additional models are needed. The way of
                        specifying additional models is: "--additional-models model_path_1 model_path_2 ..." or "--
                        additional-models model_path_1:model_lib_1 model_path_2 ...". When the model lib of a model is not
                        given, JIT model compilation will be activated to compile the model automatically.
  --max-batch-size MAX_BATCH_SIZE
                        The maximum allowed batch size set for the KV cache to concurrently support.
  --max-total-seq-length MAX_TOTAL_SEQ_LENGTH
                        The KV cache total token capacity, i.e., the maximum total number of tokens that the KV cache
                        support. This decides the GPU memory size that the KV cache consumes. If not specified, system will
                        automatically estimate the maximum capacity based on the vRAM size on GPU.
  --prefill-chunk-size PREFILL_CHUNK_SIZE
                        The maximum number of tokens the model passes for prefill each time. It should not exceed the
                        prefill chunk size in model config. If not specified, this defaults to the prefill chunk size in
                        model config.
  --max-history-size MAX_HISTORY_SIZE
                        The maximum history length for rolling back the RNN state. If unspecified, the default value is 1.
                        KV cache does not need this.
  --gpu-memory-utilization GPU_MEMORY_UTILIZATION
                        A number in (0, 1) denoting the fraction of GPU memory used by the server in total. It is used to
                        infer to maximum possible KV cache capacity. When it is unspecified, it defaults to 0.85. Under mode
                        "local" or "interactive", the actual memory usage may be significantly smaller than this number.
                        Under mode "server", the actual memory usage may be slightly larger than this number.
  --speculative-mode {disable,small_draft,eagle,medusa}
                        The speculative decoding mode. Right now three options are supported: - "disable", where speculative
                        decoding is not enabled, - "small_draft", denoting the normal speculative decoding (small draft)
                        style, - "eagle", denoting the eagle-style speculative decoding. The default mode is "disable".
                        (default: "disable")
  --spec-draft-length SPEC_DRAFT_LENGTH
                        The number of draft tokens to generate in speculative proposal. The default values is 4.
  --enable-tracing      Enable Chrome Tracing for the server. After enabling, you can send POST request to the
                        "debug/dump_event_trace" entrypoint to get the Chrome Trace. For example, "curl -X POST
                        http://127.0.0.1:8000/debug/dump_event_trace -H "Content-Type: application/json" -d '{"model":
                        "dist/llama"}'"
  --host HOST           host name (default: "127.0.0.1")
  --port PORT           port (default: "8000")
  --allow-credentials   allow credentials
  --allow-origins ALLOW_ORIGINS
                        allowed origins (default: "['*']")
  --allow-methods ALLOW_METHODS
                        allowed methods (default: "['*']")
  --allow-headers ALLOW_HEADERS
                        allowed headers (default: "['*']")
```
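
A hedged serving example: `mlc_llm serve` exposes an OpenAI-compatible REST API on the host/port above, so a `/v1/chat/completions` request should work once the model has finished loading. The `model` field in the request body is an assumption here; it should match the id of the model being served.

```sh
# Terminal 1: start the server in low-concurrency "local" mode (defaults to 127.0.0.1:8000).
mlc_llm serve HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC --mode local

# Terminal 2: query the OpenAI-compatible endpoint after the server reports it is ready.
curl -X POST http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}]
      }'
```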
_________________________________________________________________________________________________________________________________________
```sh
$ mlc_llm  gen_config --help
usage: MLC LLM Configuration Generator [-h] --quantization
                                       {q0f16,q0f32,q3f16_0,q3f16_1,q4f16_0,q4f16_1,q4f32_1,q4f16_2,q4f16_autoawq,q4f16_ft,e5m2_e5m2_f16,e4m3_e4m3_f16}
                                       [--model-type {auto,llama,mistral,gemma,gpt2,mixtral,gpt_neox,gpt_bigcode,phi-msft,phi,phi3,qwen,qwen2,stablelm,baichuan,internlm,rwkv5,orion,llava,rwkv6,chatglm,eagle,bert,medusa}]
                                       --conv-template
                                       {llama-3,custom,open_hermes_mistral,vicuna_v1.1,gorilla,gorilla-openfunctions-v2,llava,gpt2,minigpt,stablecode_completion,conv_one_shot,llama-2,stablelm-3b,guanaco,LM,rwkv_world,gpt_bigcode,codellama_instruct,phi-2,phi-3,wizardlm_7b,stablelm-2,mistral_default,redpajama_chat,oasst,stablelm,llama_default,moss,gemma_instruction,neural_hermes_mistral,rwkv,stablecode_instruct,codellama_completion,wizard_coder_or_math,chatml,orion,glm,dolly}
                                       [--context-window-size CONTEXT_WINDOW_SIZE]
                                       [--sliding-window-size SLIDING_WINDOW_SIZE] [--prefill-chunk-size PREFILL_CHUNK_SIZE]
                                       [--attention-sink-size ATTENTION_SINK_SIZE]
                                       [--tensor-parallel-shards TENSOR_PARALLEL_SHARDS] [--max-batch-size MAX_BATCH_SIZE]
                                       --output OUTPUT
                                       config

positional arguments:
  config                1) Path to a HuggingFace model directory that contains a `config.json` or 2) Path to `config.json`
                        in HuggingFace format, or 3) The name of a pre-defined model architecture. A `config.json` file in
                        HuggingFace format defines the model architecture, including the vocabulary size, the number of
                        layers, the hidden size, number of attention heads, etc. Example:
                        https://huggingface.co./codellama/CodeLlama-7b-hf/blob/main/config.json. A HuggingFace directory
                        often contains a `config.json` which defines the model architecture, the non-quantized model weights
                        in PyTorch or SafeTensor format, tokenizer configurations, as well as an optional
                        `generation_config.json` provides additional default configuration for text generation. Example:
                        https://huggingface.co./codellama/CodeLlama-7b-hf/tree/main. (required)

options:
  -h, --help            show this help message and exit
  --quantization {q0f16,q0f32,q3f16_0,q3f16_1,q4f16_0,q4f16_1,q4f32_1,q4f16_2,q4f16_autoawq,q4f16_ft,e5m2_e5m2_f16,e4m3_e4m3_f16}
                        The quantization mode we use to compile. If unprovided, will infer from `model`. (required, choices:
                        q0f16, q0f32, q3f16_0, q3f16_1, q4f16_0, q4f16_1, q4f32_1, q4f16_2, q4f16_autoawq, q4f16_ft,
                        e5m2_e5m2_f16, e4m3_e4m3_f16)
  --model-type {auto,llama,mistral,gemma,gpt2,mixtral,gpt_neox,gpt_bigcode,phi-msft,phi,phi3,qwen,qwen2,stablelm,baichuan,internlm,rwkv5,orion,llava,rwkv6,chatglm,eagle,bert,medusa}
                        Model architecture such as "llama". If not set, it is inferred from `mlc-chat-config.json`.
                        (default: "auto", choices: auto, llama, mistral, gemma, gpt2, mixtral, gpt_neox, gpt_bigcode, phi-
                        msft, phi, phi3, qwen, qwen2, stablelm, baichuan, internlm, rwkv5, orion, llava, rwkv6, chatglm,
                        eagle, bert, medusa)
  --conv-template {llama-3,custom,open_hermes_mistral,vicuna_v1.1,gorilla,gorilla-openfunctions-v2,llava,gpt2,minigpt,stablecode_completion,conv_one_shot,llama-2,stablelm-3b,guanaco,LM,rwkv_world,gpt_bigcode,codellama_instruct,phi-2,phi-3,wizardlm_7b,stablelm-2,mistral_default,redpajama_chat,oasst,stablelm,llama_default,moss,gemma_instruction,neural_hermes_mistral,rwkv,stablecode_instruct,codellama_completion,wizard_coder_or_math,chatml,orion,glm,dolly}
                        Conversation template. It depends on how the model is tuned. Use "LM" for vanilla base model
                        (required, choices: llama-3, custom, open_hermes_mistral, vicuna_v1.1, gorilla, gorilla-
                        openfunctions-v2, llava, gpt2, minigpt, stablecode_completion, conv_one_shot, llama-2, stablelm-3b,
                        guanaco, LM, rwkv_world, gpt_bigcode, codellama_instruct, phi-2, phi-3, wizardlm_7b, stablelm-2,
                        mistral_default, redpajama_chat, oasst, stablelm, llama_default, moss, gemma_instruction,
                        neural_hermes_mistral, rwkv, stablecode_instruct, codellama_completion, wizard_coder_or_math,
                        chatml, orion, glm, dolly)
  --context-window-size CONTEXT_WINDOW_SIZE
                        Option to provide the maximum sequence length supported by the model. This is usually explicitly
                        shown as context length or context window in the model card. If this option is not set explicitly,
                        by default, it will be determined by `context_window_size` or `max_position_embeddings` in
                        `config.json`, and the latter is usually inaccurate for some models. (default: "None")
  --sliding-window-size SLIDING_WINDOW_SIZE
                        (Experimental) The sliding window size in sliding window attention (SWA). This optional field
                        overrides the `sliding_window_size` in config.json for those models that use SWA. Currently only
                        useful when compiling Mistral. This flag subjects to future refactoring. (default: "None")
  --prefill-chunk-size PREFILL_CHUNK_SIZE
                        (Experimental) The chunk size during prefilling. By default, the chunk size is the same as sliding
                        window or max sequence length. This flag subjects to future refactoring. (default: "None")
  --attention-sink-size ATTENTION_SINK_SIZE
                        (Experimental) The number of stored sinks. Only supported on Mistral yet. By default, the number of
                        sinks is 4. This flag subjects to future refactoring. (default: "None")
  --tensor-parallel-shards TENSOR_PARALLEL_SHARDS
                        Number of shards to split the model into in tensor parallelism multi-gpu inference. (default:
                        "None")
  --max-batch-size MAX_BATCH_SIZE
                        The maximum allowed batch size set for the KV cache to concurrently support. (default: "80")
  --output OUTPUT, -o OUTPUT
                        The output directory for generated configurations, including `mlc-chat-config.json` and tokenizer
                        configuration. (required)
```
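
A sketch of generating the chat configuration for the same hypothetical checkpoint; `--conv-template llama-3` matches the instruction-tuned Llama 3 chat format from the list above, and 8192 is that model's published context length.

```sh
# Write mlc-chat-config.json plus tokenizer files into the output directory (placeholder paths).
mlc_llm gen_config ./models/Llama-3-8B-Instruct \
    --quantization q4f16_1 \
    --conv-template llama-3 \
    --context-window-size 8192 \
    -o ./dist/Llama-3-8B-q4f16_1-MLC
```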
________________________________________________________________________________________________________________________________________

```sh
$ mlc_llm  bench --help
usage: MLC LLM Chat CLI [-h] [--prompt PROMPT] [--opt OPT] [--device DEVICE] [--overrides OVERRIDES]
                        [--generate-length GENERATE_LENGTH] [--model-lib MODEL_LIB]
                        model

positional arguments:
  model                 A path to ``mlc-chat-config.json``, or an MLC model directory that contains `mlc-chat-config.json`.
                        It can also be a link to a HF repository pointing to an MLC compiled model. (required)

options:
  -h, --help            show this help message and exit
  --prompt PROMPT       The prompt of the text generation. (default: "What is the meaning of life?")
  --opt OPT             Optimization flags. MLC LLM maintains a predefined set of optimization flags, denoted as O0, O1, O2,
                        O3, where O0 means no optimization, O2 means majority of them, and O3 represents extreme
                        optimization that could potentially break the system. Meanwhile, optimization flags could be
                        explicitly specified via details knobs, e.g. --opt="cublas_gemm=1;cudagraph=0". (default: "O2")
  --device DEVICE       The device used to deploy the model such as "cuda" or "cuda:0". Will detect from local available
                        GPUs if not specified. (default: "auto")
  --overrides OVERRIDES
                        Chat configuration override. Configurations to override ChatConfig. Supports `conv_template`,
                        `context_window_size`, `prefill_chunk_size`, `sliding_window_size`, `attention_sink_size`,
                        `max_batch_size` and `tensor_parallel_shards`. Meanwhile, model chat could be explicitly specified
                        via details knobs, e.g. --overrides "context_window_size=1024;prefill_chunk_size=128". (default: "")
  --generate-length GENERATE_LENGTH
                        The target length of the text generation. (default: "256")
  --model-lib MODEL_LIB
                        The full path to the model library file to use (e.g. a ``.so`` file). If unspecified, we will use
                        the provided ``model`` to search over possible paths. It the model lib is not found, it will be
                        compiled in a JIT manner. (default: "None")
```
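
`bench` reuses the chat flags but runs a single fixed prompt and reports generation statistics; the prompt and length below are illustrative.

```sh
# Quick generation check with a custom prompt and a shorter completion.
mlc_llm bench HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC \
    --prompt "Explain KV caches in one paragraph." \
    --generate-length 128
```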
__________________________________________________________________________________________________________________________________________

```sh
$ mlc_llm  package --help
usage: MLC LLM Package CLI [-h] [--package-config PACKAGE_CONFIG] [--mlc-llm-home MLC_LLM_HOME] [--output OUTPUT]

options:
  -h, --help            show this help message and exit
  --package-config PACKAGE_CONFIG
                        The path to "mlc-package-config.json" which is used for package build. See "https://github.com/mlc-
                        ai/mlc-llm/blob/main/ios/MLCChat/mlc-package-config.json" as an example. (default: "mlc-package-
                        config.json")
  --mlc-llm-home MLC_LLM_HOME
                        The source code path to MLC LLM. (default: the $MLC_LLM_HOME environment variable)
  --output OUTPUT, -o OUTPUT
                        The path of output directory for the package build outputs. (default: "dist")
```
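
`package` is mainly for bundling compiled model libraries and weights into app distributions (see the iOS example config linked above). A minimal, hedged invocation, assuming a local mlc-llm checkout and an `mlc-package-config.json` in the current directory:

```sh
# Bundle libraries/weights as described by mlc-package-config.json (paths are placeholders).
export MLC_LLM_HOME=$HOME/src/mlc-llm
mlc_llm package --package-config ./mlc-package-config.json -o ./dist
```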