---
language:
- zh
- en
tags:
- chatglm
- glm
- onnx
- onnxruntime
---

# ChatGLM-6B + ONNX

This model is exported from [ChatGLM-6b](https://huggingface.co./THUDM/chatglm-6b) with int8 quantization and optimized for [ONNXRuntime](https://onnxruntime.ai/) inference. The export code is available in [this repo](https://github.com/K024/chatglm-q).
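
The actual export pipeline lives in that repo, but the underlying technique is standard ONNX dynamic quantization: weights are stored as signed int8 (`s8`) and activations are quantized to unsigned int8 (`u8`) at runtime. A minimal sketch using the public `onnxruntime.quantization` API (the file names below are placeholders, not artifacts of this repo):

```python
# Minimal dynamic-quantization sketch; the real export code is in
# https://github.com/K024/chatglm-q. File names here are hypothetical.
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="chatglm-6b-fp32.onnx",   # hypothetical float32 export
    model_output="chatglm-6b-u8s8.onnx",  # hypothetical quantized model
    weight_type=QuantType.QInt8,          # s8 weights; activations become u8 at runtime
)
```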

Inference code for ONNXRuntime is uploaded with the model. Install the requirements and run `streamlit run web-ui.py` to start chatting. The `MatMulInteger` (u8s8) and `DynamicQuantizeLinear` operators currently have CPU kernels only, so inference runs on CPU; Arm64 machines with Neon support (Apple M1/M2) should be reasonably fast.
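
Given that operator constraint, it helps to pin the session to the CPU execution provider explicitly. A short sketch with the public `onnxruntime` API (the model path is a placeholder; check the repo for the actual file name):

```python
import onnxruntime as ort

# List the execution providers this onnxruntime build supports.
print(ort.get_available_providers())

# MatMulInteger (u8s8) and DynamicQuantizeLinear only have CPU kernels,
# so request the CPU execution provider explicitly.
session = ort.InferenceSession(
    "chatglm-6b-u8s8.onnx",  # placeholder path; use the .onnx file shipped in this repo
    providers=["CPUExecutionProvider"],
)

# Inspect the input names and shapes the exported graph expects.
print([(i.name, i.shape) for i in session.get_inputs()])
```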

## Usage

Clone with [git-lfs](https://git-lfs.com/):

```sh
git clone https://huggingface.co./K024/ChatGLM-6b-onnx-u8s8  # `git lfs clone` is deprecated; plain clone fetches LFS files
cd ChatGLM-6b-onnx-u8s8
pip install -r requirements.txt
streamlit run web-ui.py
```

Or use the `huggingface_hub` [Python client library](https://huggingface.co./docs/huggingface_hub/guides/download#download-files-to-local-folder) to download a snapshot of the repo:

```python
from huggingface_hub import snapshot_download
# Download every file in the repo into ./ChatGLM-6b-onnx-u8s8.
snapshot_download(repo_id="K024/ChatGLM-6b-onnx-u8s8", local_dir="./ChatGLM-6b-onnx-u8s8")
```
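
`snapshot_download` caches files and resumes interrupted downloads. If you only need a subset of the repo, its `allow_patterns` parameter can filter what gets fetched; the patterns below are illustrative, not a guaranteed file list for this repo:

```python
from huggingface_hub import snapshot_download

# Fetch only files matching these glob patterns (illustrative selection).
snapshot_download(
    repo_id="K024/ChatGLM-6b-onnx-u8s8",
    local_dir="./ChatGLM-6b-onnx-u8s8",
    allow_patterns=["*.onnx", "*.py", "*.txt", "*.json"],
)
```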

Code is released under the MIT license.

Model weights are released under the same license as ChatGLM-6b; see the [MODEL LICENSE](https://huggingface.co./THUDM/chatglm-6b/blob/main/MODEL_LICENSE).