Arun Kumar Tiwary committed
Commit cca0560
Parent(s): 55e0718

Update README.md

Files changed (1)
  1. README.md +75 -0
README.md CHANGED
@@ -5,3 +5,78 @@ sudo apt update
  sudo apt install ocl-icd-opencl-dev clinfo vulkan-tools
  python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly mlc-ai-nightly
  mlc_llm chat HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC
+
+ $ mlc_llm chat --help
+ usage: MLC LLM Chat CLI [-h] [--opt OPT] [--device DEVICE] [--overrides OVERRIDES] [--model-lib MODEL_LIB] model
+
+ positional arguments:
+   model                 A path to ``mlc-chat-config.json``, or an MLC model directory that contains `mlc-chat-config.json`.
+                         It can also be a link to a HF repository pointing to an MLC compiled model. (required)
+
+ options:
+   -h, --help            show this help message and exit
+   --opt OPT             Optimization flags. MLC LLM maintains a predefined set of optimization flags, denoted as O0, O1, O2,
+                         O3, where O0 means no optimization, O2 means majority of them, and O3 represents extreme
+                         optimization that could potentially break the system. Meanwhile, optimization flags could be
+                         explicitly specified via details knobs, e.g. --opt="cublas_gemm=1;cudagraph=0". (default: "O2")
+   --device DEVICE       The device used to deploy the model such as "cuda" or "cuda:0". Will detect from local available
+                         GPUs if not specified. (default: "auto")
+   --overrides OVERRIDES
+                         Chat configuration override. Configurations to override ChatConfig. Supports `conv_template`,
+                         `context_window_size`, `prefill_chunk_size`, `sliding_window_size`, `attention_sink_size`,
+                         `max_batch_size` and `tensor_parallel_shards`. Meanwhile, model chat could be explicitly specified
+                         via details knobs, e.g. --overrides "context_window_size=1024;prefill_chunk_size=128". (default: "")
+   --model-lib MODEL_LIB
+                         The full path to the model library file to use (e.g. a ``.so`` file). If unspecified, we will use
+                         the provided ``model`` to search over possible paths. If the model lib is not found, it will be
+                         compiled in a JIT manner. (default: "None")
+ (env) amd@volcano-9b20-os:~/workspace/Arun/data_dir/llamaCpp/mlc_LLM$ mlc_llm compile --help
+ usage: mlc_llm compile [-h]
+                        [--quantization {q0f16,q0f32,q3f16_0,q3f16_1,q4f16_0,q4f16_1,q4f32_1,q4f16_2,q4f16_autoawq,q4f16_ft,e5m2_e5m2_f16,e4m3_e4m3_f16}]
+                        [--model-type {auto,llama,mistral,gemma,gpt2,mixtral,gpt_neox,gpt_bigcode,phi-msft,phi,phi3,qwen,qwen2,stablelm,baichuan,internlm,rwkv5,orion,llava,rwkv6,chatglm,eagle,bert,medusa}]
+                        [--device DEVICE] [--host HOST] [--opt OPT] [--system-lib-prefix SYSTEM_LIB_PREFIX] --output OUTPUT
+                        [--overrides OVERRIDES] [--debug-dump DEBUG_DUMP]
+                        model
+
+ positional arguments:
+   model                 A path to ``mlc-chat-config.json``, or an MLC model directory that contains `mlc-chat-config.json`.
+                         It can also be a link to a HF repository pointing to an MLC compiled model. (required)
+
+ options:
+   -h, --help            show this help message and exit
+   --quantization {q0f16,q0f32,q3f16_0,q3f16_1,q4f16_0,q4f16_1,q4f32_1,q4f16_2,q4f16_autoawq,q4f16_ft,e5m2_e5m2_f16,e4m3_e4m3_f16}
+                         The quantization mode we use to compile. If unprovided, will infer from `model`. (default: look up
+                         mlc-chat-config.json, choices: q0f16, q0f32, q3f16_0, q3f16_1, q4f16_0, q4f16_1, q4f32_1, q4f16_2,
+                         q4f16_autoawq, q4f16_ft, e5m2_e5m2_f16, e4m3_e4m3_f16)
+   --model-type {auto,llama,mistral,gemma,gpt2,mixtral,gpt_neox,gpt_bigcode,phi-msft,phi,phi3,qwen,qwen2,stablelm,baichuan,internlm,rwkv5,orion,llava,rwkv6,chatglm,eagle,bert,medusa}
+                         Model architecture such as "llama". If not set, it is inferred from `mlc-chat-config.json`.
+                         (default: "auto")
+   --device DEVICE       The GPU device to compile the model to. If not set, it is inferred from GPUs available locally.
+                         (default: "auto")
+   --host HOST           The host LLVM triple to compile the model to. If not set, it is inferred from the local CPU and OS.
+                         Examples of the LLVM triple: 1) iPhones: arm64-apple-ios; 2) ARM64 Android phones: aarch64-linux-
+                         android; 3) WebAssembly: wasm32-unknown-unknown-wasm; 4) Windows: x86_64-pc-windows-msvc; 5) ARM
+                         macOS: arm64-apple-darwin. (default: "auto")
+   --opt OPT             Optimization flags. MLC LLM maintains a predefined set of optimization flags, denoted as O0, O1, O2,
+                         O3, where O0 means no optimization, O2 means majority of them, and O3 represents extreme
+                         optimization that could potentially break the system. Meanwhile, optimization flags could be
+                         explicitly specified via details knobs, e.g. --opt="cublas_gemm=1;cudagraph=0". (default: "O2")
+   --system-lib-prefix SYSTEM_LIB_PREFIX
+                         Adding a prefix to all symbols exported. Similar to "objcopy --prefix-symbols". This is useful when
+                         compiling multiple models into a single library to avoid symbol conflicts. Different from objcopy,
+                         this takes no effect for shared library. (default: "auto")
+   --output OUTPUT, -o OUTPUT
+                         The path to the output file. The suffix determines if the output file is a shared library or
+                         objects. Available suffixes: 1) Linux: .so (shared), .tar (objects); 2) macOS: .dylib (shared), .tar
+                         (objects); 3) Windows: .dll (shared), .tar (objects); 4) Android, iOS: .tar (objects); 5) Web: .wasm
+                         (web assembly). (required)
+   --overrides OVERRIDES
+                         Model configuration override. Configurations to override `mlc-chat-config.json`. Supports
+                         `context_window_size`, `prefill_chunk_size`, `sliding_window_size`, `attention_sink_size`,
+                         `max_batch_size` and `tensor_parallel_shards`. Meanwhile, model config could be explicitly specified
+                         via details knobs, e.g. --overrides "context_window_size=1024;prefill_chunk_size=128". (default: "")
+   --debug-dump DEBUG_DUMP
+                         Specifies the directory where the compiler will store its IRs for debugging purposes during various
+                         phases of compilation. By default, this is set to `None`, indicating that debug dumping is disabled.
+                         (default: None)
+
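
Putting the two help screens together, a minimal end-to-end sketch might look like the commands below. Only flags documented in the help output above are used; the local `./dist/...` paths and the `vulkan` device value are placeholders chosen for illustration (the install step above pulls in Vulkan and OpenCL tooling), not paths produced by this commit.

```sh
# Hypothetical paths and device value; flags come from the
# `mlc_llm chat --help` and `mlc_llm compile --help` output above.

# Chat with the HF-hosted MLC model, overriding the chat configuration
# (the override string is the example given in the help text).
mlc_llm chat HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC \
    --device vulkan \
    --overrides "context_window_size=1024;prefill_chunk_size=128"

# Compile a locally converted model directory (placeholder path) into a
# shared library; on Linux the .so suffix selects shared-library output.
mlc_llm compile ./dist/Llama-3-8B-Instruct-q4f16_1-MLC/mlc-chat-config.json \
    --device vulkan \
    --quantization q4f16_1 \
    -o ./dist/libs/Llama-3-8B-Instruct-q4f16_1-vulkan.so

# The compiled library can then be passed back to chat via --model-lib
# instead of relying on JIT compilation.
mlc_llm chat ./dist/Llama-3-8B-Instruct-q4f16_1-MLC \
    --model-lib ./dist/libs/Llama-3-8B-Instruct-q4f16_1-vulkan.so
```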