Commit cca0560 (parent: 55e0718) by Arun Kumar Tiwary: Update README.md

README.md CHANGED
@@ -5,3 +5,78 @@ sudo apt update
 sudo apt install ocl-icd-opencl-dev clinfo vulkan-tools
 python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly mlc-ai-nightly
 mlc_llm chat HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC
+
+$ mlc_llm chat --help
+usage: MLC LLM Chat CLI [-h] [--opt OPT] [--device DEVICE] [--overrides OVERRIDES] [--model-lib MODEL_LIB] model
+
+positional arguments:
+  model                 A path to ``mlc-chat-config.json``, or an MLC model directory that contains `mlc-chat-config.json`.
+                        It can also be a link to a HF repository pointing to an MLC compiled model. (required)
+
+options:
+  -h, --help            show this help message and exit
+  --opt OPT             Optimization flags. MLC LLM maintains a predefined set of optimization flags, denoted as O0, O1, O2,
+                        O3, where O0 means no optimization, O2 means majority of them, and O3 represents extreme
+                        optimization that could potentially break the system. Meanwhile, optimization flags could be
+                        explicitly specified via details knobs, e.g. --opt="cublas_gemm=1;cudagraph=0". (default: "O2")
+  --device DEVICE       The device used to deploy the model such as "cuda" or "cuda:0". Will detect from local available
+                        GPUs if not specified. (default: "auto")
+  --overrides OVERRIDES
+                        Chat configuration override. Configurations to override ChatConfig. Supports `conv_template`,
+                        `context_window_size`, `prefill_chunk_size`, `sliding_window_size`, `attention_sink_size`,
+                        `max_batch_size` and `tensor_parallel_shards`. Meanwhile, model chat could be explicitly specified
+                        via details knobs, e.g. --overrides "context_window_size=1024;prefill_chunk_size=128". (default: "")
+  --model-lib MODEL_LIB
+                        The full path to the model library file to use (e.g. a ``.so`` file). If unspecified, we will use
+                        the provided ``model`` to search over possible paths. If the model lib is not found, it will be
+                        compiled in a JIT manner. (default: "None")
+(env) amd@volcano-9b20-os:~/workspace/Arun/data_dir/llamaCpp/mlc_LLM$ mlc_llm compile --help
+usage: mlc_llm compile [-h]
+                       [--quantization {q0f16,q0f32,q3f16_0,q3f16_1,q4f16_0,q4f16_1,q4f32_1,q4f16_2,q4f16_autoawq,q4f16_ft,e5m2_e5m2_f16,e4m3_e4m3_f16}]
+                       [--model-type {auto,llama,mistral,gemma,gpt2,mixtral,gpt_neox,gpt_bigcode,phi-msft,phi,phi3,qwen,qwen2,stablelm,baichuan,internlm,rwkv5,orion,llava,rwkv6,chatglm,eagle,bert,medusa}]
+                       [--device DEVICE] [--host HOST] [--opt OPT] [--system-lib-prefix SYSTEM_LIB_PREFIX] --output OUTPUT
+                       [--overrides OVERRIDES] [--debug-dump DEBUG_DUMP]
+                       model
+
+positional arguments:
+  model                 A path to ``mlc-chat-config.json``, or an MLC model directory that contains `mlc-chat-config.json`.
+                        It can also be a link to a HF repository pointing to an MLC compiled model. (required)
+
+options:
+  -h, --help            show this help message and exit
+  --quantization {q0f16,q0f32,q3f16_0,q3f16_1,q4f16_0,q4f16_1,q4f32_1,q4f16_2,q4f16_autoawq,q4f16_ft,e5m2_e5m2_f16,e4m3_e4m3_f16}
+                        The quantization mode we use to compile. If unprovided, will infer from `model`. (default: look up
+                        mlc-chat-config.json, choices: q0f16, q0f32, q3f16_0, q3f16_1, q4f16_0, q4f16_1, q4f32_1, q4f16_2,
+                        q4f16_autoawq, q4f16_ft, e5m2_e5m2_f16, e4m3_e4m3_f16)
+  --model-type {auto,llama,mistral,gemma,gpt2,mixtral,gpt_neox,gpt_bigcode,phi-msft,phi,phi3,qwen,qwen2,stablelm,baichuan,internlm,rwkv5,orion,llava,rwkv6,chatglm,eagle,bert,medusa}
+                        Model architecture such as "llama". If not set, it is inferred from `mlc-chat-config.json`.
+                        (default: "auto")
+  --device DEVICE       The GPU device to compile the model to. If not set, it is inferred from GPUs available locally.
+                        (default: "auto")
+  --host HOST           The host LLVM triple to compile the model to. If not set, it is inferred from the local CPU and OS.
+                        Examples of the LLVM triple: 1) iPhones: arm64-apple-ios; 2) ARM64 Android phones: aarch64-linux-
+                        android; 3) WebAssembly: wasm32-unknown-unknown-wasm; 4) Windows: x86_64-pc-windows-msvc; 5) ARM
+                        macOS: arm64-apple-darwin. (default: "auto")
+  --opt OPT             Optimization flags. MLC LLM maintains a predefined set of optimization flags, denoted as O0, O1, O2,
+                        O3, where O0 means no optimization, O2 means majority of them, and O3 represents extreme
+                        optimization that could potentially break the system. Meanwhile, optimization flags could be
+                        explicitly specified via details knobs, e.g. --opt="cublas_gemm=1;cudagraph=0". (default: "O2")
+  --system-lib-prefix SYSTEM_LIB_PREFIX
+                        Adding a prefix to all symbols exported. Similar to "objcopy --prefix-symbols". This is useful when
+                        compiling multiple models into a single library to avoid symbol conflicts. Different from objcopy,
+                        this takes no effect for shared library. (default: "auto")
+  --output OUTPUT, -o OUTPUT
+                        The path to the output file. The suffix determines if the output file is a shared library or
+                        objects. Available suffixes: 1) Linux: .so (shared), .tar (objects); 2) macOS: .dylib (shared), .tar
+                        (objects); 3) Windows: .dll (shared), .tar (objects); 4) Android, iOS: .tar (objects); 5) Web: .wasm
+                        (web assembly). (required)
+  --overrides OVERRIDES
+                        Model configuration override. Configurations to override `mlc-chat-config.json`. Supports
+                        `context_window_size`, `prefill_chunk_size`, `sliding_window_size`, `attention_sink_size`,
+                        `max_batch_size` and `tensor_parallel_shards`. Meanwhile, model config could be explicitly specified
+                        via details knobs, e.g. --overrides "context_window_size=1024;prefill_chunk_size=128". (default: "")
+  --debug-dump DEBUG_DUMP
+                        Specifies the directory where the compiler will store its IRs for debugging purposes during various
+                        phases of compilation. By default, this is set to `None`, indicating that debug dumping is disabled.
+                        (default: None)
+
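
The README installs `clinfo` and `vulkan-tools` alongside the OpenCL ICD loader, so a quick device query is a reasonable sanity check before launching `mlc_llm chat`. This is a minimal sketch under the assumption that those packages and a working GPU driver are present; the exact output depends on the machine.

```sh
# Confirm that an OpenCL / Vulkan device is visible before running mlc_llm chat.
clinfo | grep -i "device name"   # OpenCL devices seen by the ICD loader
vulkaninfo --summary             # Vulkan devices; on older vulkan-tools, run vulkaninfo without --summary
```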
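The `mlc_llm chat --help` output added in this commit documents `--device` and `--overrides`. A hedged example combining them with the model used earlier in the README is sketched below; the `vulkan:0` device string and the override values are illustrative assumptions, not part of the commit.

```sh
# Illustrative only: run the same model on an explicit device with a smaller
# context window, using flags documented in the help text above.
# "vulkan:0" is an assumed device string; substitute cuda:0, rocm:0, etc. as appropriate.
mlc_llm chat HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC \
  --device vulkan:0 \
  --overrides "context_window_size=1024;prefill_chunk_size=128"
```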
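Similarly, the `mlc_llm compile --help` output describes the `model` argument, `--device`, `--opt`, and the required `--output`, whose suffix selects the artifact type. The sketch below is an assumed invocation: the local model directory and output path are hypothetical placeholders.

```sh
# Hypothetical compile invocation based on the flags above; paths are placeholders.
# Per the --output description, a .so suffix on Linux produces a shared library.
mlc_llm compile ./dist/Llama-3-8B-Instruct-q4f16_1-MLC/mlc-chat-config.json \
  --device vulkan \
  --opt O2 \
  -o ./dist/libs/Llama-3-8B-Instruct-q4f16_1-vulkan.so
```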