Arun Kumar Tiwary committed
Commit cca0560
Parent(s): 55e0718

Update README.md

Files changed (1)
  1. README.md +75 -0
README.md CHANGED
@@ -5,3 +5,78 @@ sudo apt update
  sudo apt install ocl-icd-opencl-dev clinfo vulkan-tools
  python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly mlc-ai-nightly
  mlc_llm chat HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC
+
+ $ mlc_llm chat --help
+ usage: MLC LLM Chat CLI [-h] [--opt OPT] [--device DEVICE] [--overrides OVERRIDES] [--model-lib MODEL_LIB] model
+
+ positional arguments:
+   model                 A path to ``mlc-chat-config.json``, or an MLC model directory that contains `mlc-chat-config.json`.
+                         It can also be a link to a HF repository pointing to an MLC compiled model. (required)
+
+ options:
+   -h, --help            show this help message and exit
+   --opt OPT             Optimization flags. MLC LLM maintains a predefined set of optimization flags, denoted as O0, O1, O2,
+                         O3, where O0 means no optimization, O2 means majority of them, and O3 represents extreme
+                         optimization that could potentially break the system. Meanwhile, optimization flags could be
+                         explicitly specified via details knobs, e.g. --opt="cublas_gemm=1;cudagraph=0". (default: "O2")
+   --device DEVICE       The device used to deploy the model such as "cuda" or "cuda:0". Will detect from local available
+                         GPUs if not specified. (default: "auto")
+   --overrides OVERRIDES
+                         Chat configuration override. Configurations to override ChatConfig. Supports `conv_template`,
+                         `context_window_size`, `prefill_chunk_size`, `sliding_window_size`, `attention_sink_size`,
+                         `max_batch_size` and `tensor_parallel_shards`. Meanwhile, model chat could be explicitly specified
+                         via details knobs, e.g. --overrides "context_window_size=1024;prefill_chunk_size=128". (default: "")
+   --model-lib MODEL_LIB
+                         The full path to the model library file to use (e.g. a ``.so`` file). If unspecified, we will use
+                         the provided ``model`` to search over possible paths. If the model lib is not found, it will be
+                         compiled in a JIT manner. (default: "None")
+ (env) amd@volcano-9b20-os:~/workspace/Arun/data_dir/llamaCpp/mlc_LLM$ mlc_llm compile --help
+ usage: mlc_llm compile [-h]
+                        [--quantization {q0f16,q0f32,q3f16_0,q3f16_1,q4f16_0,q4f16_1,q4f32_1,q4f16_2,q4f16_autoawq,q4f16_ft,e5m2_e5m2_f16,e4m3_e4m3_f16}]
+                        [--model-type {auto,llama,mistral,gemma,gpt2,mixtral,gpt_neox,gpt_bigcode,phi-msft,phi,phi3,qwen,qwen2,stablelm,baichuan,internlm,rwkv5,orion,llava,rwkv6,chatglm,eagle,bert,medusa}]
+                        [--device DEVICE] [--host HOST] [--opt OPT] [--system-lib-prefix SYSTEM_LIB_PREFIX] --output OUTPUT
+                        [--overrides OVERRIDES] [--debug-dump DEBUG_DUMP]
+                        model
+
+ positional arguments:
+   model                 A path to ``mlc-chat-config.json``, or an MLC model directory that contains `mlc-chat-config.json`.
+                         It can also be a link to a HF repository pointing to an MLC compiled model. (required)
+
+ options:
+   -h, --help            show this help message and exit
+   --quantization {q0f16,q0f32,q3f16_0,q3f16_1,q4f16_0,q4f16_1,q4f32_1,q4f16_2,q4f16_autoawq,q4f16_ft,e5m2_e5m2_f16,e4m3_e4m3_f16}
+                         The quantization mode we use to compile. If unprovided, will infer from `model`. (default: look up
+                         mlc-chat-config.json, choices: q0f16, q0f32, q3f16_0, q3f16_1, q4f16_0, q4f16_1, q4f32_1, q4f16_2,
+                         q4f16_autoawq, q4f16_ft, e5m2_e5m2_f16, e4m3_e4m3_f16)
+   --model-type {auto,llama,mistral,gemma,gpt2,mixtral,gpt_neox,gpt_bigcode,phi-msft,phi,phi3,qwen,qwen2,stablelm,baichuan,internlm,rwkv5,orion,llava,rwkv6,chatglm,eagle,bert,medusa}
+                         Model architecture such as "llama". If not set, it is inferred from `mlc-chat-config.json`.
+                         (default: "auto")
+   --device DEVICE       The GPU device to compile the model to. If not set, it is inferred from GPUs available locally.
+                         (default: "auto")
+   --host HOST           The host LLVM triple to compile the model to. If not set, it is inferred from the local CPU and OS.
+                         Examples of the LLVM triple: 1) iPhones: arm64-apple-ios; 2) ARM64 Android phones: aarch64-linux-
+                         android; 3) WebAssembly: wasm32-unknown-unknown-wasm; 4) Windows: x86_64-pc-windows-msvc; 5) ARM
+                         macOS: arm64-apple-darwin. (default: "auto")
+   --opt OPT             Optimization flags. MLC LLM maintains a predefined set of optimization flags, denoted as O0, O1, O2,
+                         O3, where O0 means no optimization, O2 means majority of them, and O3 represents extreme
+                         optimization that could potentially break the system. Meanwhile, optimization flags could be
+                         explicitly specified via details knobs, e.g. --opt="cublas_gemm=1;cudagraph=0". (default: "O2")
+   --system-lib-prefix SYSTEM_LIB_PREFIX
+                         Adding a prefix to all symbols exported. Similar to "objcopy --prefix-symbols". This is useful when
+                         compiling multiple models into a single library to avoid symbol conflicts. Different from objcopy,
+                         this takes no effect for shared library. (default: "auto")
+   --output OUTPUT, -o OUTPUT
+                         The path to the output file. The suffix determines if the output file is a shared library or
+                         objects. Available suffixes: 1) Linux: .so (shared), .tar (objects); 2) macOS: .dylib (shared), .tar
+                         (objects); 3) Windows: .dll (shared), .tar (objects); 4) Android, iOS: .tar (objects); 5) Web: .wasm
+                         (web assembly). (required)
+   --overrides OVERRIDES
+                         Model configuration override. Configurations to override `mlc-chat-config.json`. Supports
+                         `context_window_size`, `prefill_chunk_size`, `sliding_window_size`, `attention_sink_size`,
+                         `max_batch_size` and `tensor_parallel_shards`. Meanwhile, model config could be explicitly specified
+                         via details knobs, e.g. --overrides "context_window_size=1024;prefill_chunk_size=128". (default: "")
+   --debug-dump DEBUG_DUMP
+                         Specifies the directory where the compiler will store its IRs for debugging purposes during various
+                         phases of compilation. By default, this is set to `None`, indicating that debug dumping is disabled.
+                         (default: None)
+
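
Putting the two help screens together, a minimal end-to-end sketch might look like the commands below. Only flags documented in the help output above are used; the local `./dist/...` paths and the `vulkan` device value are placeholders chosen for illustration (the install step above pulls in Vulkan and OpenCL tooling), not paths produced by this commit.

```sh
# Hypothetical paths and device value; flags come from the
# `mlc_llm chat --help` and `mlc_llm compile --help` output above.

# Chat with the HF-hosted MLC model, overriding the chat configuration
# (the override string is the example given in the help text).
mlc_llm chat HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC \
    --device vulkan \
    --overrides "context_window_size=1024;prefill_chunk_size=128"

# Compile a locally converted model directory (placeholder path) into a
# shared library; on Linux the .so suffix selects shared-library output.
mlc_llm compile ./dist/Llama-3-8B-Instruct-q4f16_1-MLC/mlc-chat-config.json \
    --device vulkan \
    --quantization q4f16_1 \
    -o ./dist/libs/Llama-3-8B-Instruct-q4f16_1-vulkan.so

# The compiled library can then be passed back to chat via --model-lib
# instead of relying on JIT compilation.
mlc_llm chat ./dist/Llama-3-8B-Instruct-q4f16_1-MLC \
    --model-lib ./dist/libs/Llama-3-8B-Instruct-q4f16_1-vulkan.so
```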