Adding in disclaimer and cleaning up GPU startup
README.md CHANGED
@@ -82,14 +82,9 @@ model = AutoModelForCausalLM.from_pretrained("sambanovasystems/BLOOMChat-176B-v1
 
 ### Tutorial on using the model for text generation
 
-[
+[This tutorial](https://github.com/huggingface/transformers-bloom-inference) from Huggingface will be the base layer for running our model. The tutorial is intended for BLOOM; however, since our model is based on BLOOM we can repurpose it.
 
-
-
-Running command:
-```
-python -m inference_server.cli --model_name sambanovasystems/BLOOMChat-176B-v1 --model_class AutoModelForCausalLM --dtype int8 --deployment_framework hf_accelerate --generate_kwargs '{"do_sample": false, "temperature": 0.8, "repetition_penalty": 1.2, "top_p": 0.9, "max_new_tokens": 512}'
-```
+For setup instructions follow the Huggingface tutorial.
 
 NOTE: Things that we had to modify in order for BLOOMChat to work:
 - Install transformers version 4.27.0
@@ -113,7 +108,15 @@ class HFAccelerateModel(Model):
 kwargs["max_memory"] = reduce_max_memory_dict
 ```
 
-
+Running command for int8 (suboptimal performance, but fast inference time):
+```
+python -m inference_server.cli --model_name sambanovasystems/BLOOMChat-176B-v1 --model_class AutoModelForCausalLM --dtype int8 --deployment_framework hf_accelerate --generate_kwargs '{"do_sample": false, "temperature": 0.8, "repetition_penalty": 1.2, "top_p": 0.9, "max_new_tokens": 512}'
+```
+Running command for bf16:
+```
+python -m inference_server.cli --model_name sambanovasystems/BLOOMChat-176B-v1 --model_class AutoModelForCausalLM --dtype bf16 --deployment_framework hf_accelerate --generate_kwargs '{"do_sample": false, "temperature": 0.8, "repetition_penalty": 1.2, "top_p": 0.9, "max_new_tokens": 512}'
+```
+**DISCLAIMER:** When using int8, the results will be subpar compared to bf16 as the model is being [quantized](https://huggingface.co/blog/hf-bitsandbytes-integration#introduction-to-model-quantization).
 
 ### Suggested Inference Parameters
 - Temperature: 0.8
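As a hedged illustration of the int8/bf16 trade-off called out in the disclaimer above, the sketch below loads BLOOMChat directly with transformers and accelerate rather than through `inference_server.cli`. It is a minimal sketch, not part of the commit: it assumes transformers 4.27.0 with accelerate and bitsandbytes installed and enough GPU memory for a 176B-parameter model, and the `USE_INT8` flag is purely illustrative.

```
# Minimal sketch (assumption, not from the commit): loading BLOOMChat in
# bf16 or int8 with plain transformers/accelerate instead of the CLI tool.
# Assumes transformers 4.27.0, accelerate, and bitsandbytes are installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "sambanovasystems/BLOOMChat-176B-v1"
USE_INT8 = False  # illustrative flag: True trades output quality for memory

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

if USE_INT8:
    # int8: weights quantized via bitsandbytes (~1 byte per parameter),
    # which is what the disclaimer warns gives subpar results vs bf16.
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME, device_map="auto", load_in_8bit=True
    )
else:
    # bf16: unquantized weights (~2 bytes per parameter), better quality,
    # but roughly double the GPU memory requirement.
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME, device_map="auto", torch_dtype=torch.bfloat16
    )
```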
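Similarly, the `--generate_kwargs` JSON in the commands above mirrors keyword arguments of transformers' `generate()`. A rough sketch of applying the same settings directly, assuming `model` and `tokenizer` from the previous snippet and BLOOMChat's `<human>:`/`<bot>:` prompt tags:

```
# Rough sketch: same generation settings as the CLI example, applied via
# transformers' generate(). Assumes `model` and `tokenizer` from the
# loading sketch above; the prompt text itself is only an example.
prompt = "<human>: What is the capital of France?\n<bot>:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    do_sample=False,          # greedy decoding, as in the CLI example
    temperature=0.8,
    repetition_penalty=1.2,
    top_p=0.9,
    max_new_tokens=512,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```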