jayr014 committed on
Commit
3107f34
·
1 Parent(s): 444e998

adding in disclaimer and cleaning up gpu start up

Files changed (1)
  1. README.md +11 -8
README.md CHANGED
@@ -82,14 +82,9 @@ model = AutoModelForCausalLM.from_pretrained("sambanovasystems/BLOOMChat-176B-v1
82
 
83
  ### Tutorial on using the model for text generation
84
 
85
- [Transformers BLOOM Inference](https://github.com/huggingface/transformers-bloom-inference)
86
 
87
- Specifically we tested BLOOM inference via command-line in this repository.
88
-
89
- Running command:
90
- ```
91
- python -m inference_server.cli --model_name sambanovasystems/BLOOMChat-176B-v1 --model_class AutoModelForCausalLM --dtype int8 --deployment_framework hf_accelerate --generate_kwargs '{"do_sample": false, "temperature": 0.8, "repetition_penalty": 1.2, "top_p": 0.9, "max_new_tokens": 512}'
92
- ```
93
 
94
  NOTE: Things that we had to modify in order for BLOOMChat to work:
95
  - Install transformers version 4.27.0
@@ -113,7 +108,15 @@ class HFAccelerateModel(Model):
113
  kwargs["max_memory"] = reduce_max_memory_dict
114
  ```
115
 
116
-
117
 
118
  ### Suggested Inference Parameters
119
  - Temperature: 0.8
 
82
 
83
  ### Tutorial on using the model for text generation
84
 
85
+ [This tutorial](https://github.com/huggingface/transformers-bloom-inference) from Hugging Face is the base layer for running our model. The tutorial is intended for BLOOM; however, since our model is based on BLOOM, we can repurpose it.
86
 
87
+ For setup instructions, follow the Hugging Face tutorial.
88
 
89
  NOTE: Things that we had to modify in order for BLOOMChat to work:
90
  - Install transformers version 4.27.0
 
108
  kwargs["max_memory"] = reduce_max_memory_dict
109
  ```
110
 
111
+ Running command for int8 (suboptimal output quality, but fast inference time):
112
+ ```
113
+ python -m inference_server.cli --model_name sambanovasystems/BLOOMChat-176B-v1 --model_class AutoModelForCausalLM --dtype int8 --deployment_framework hf_accelerate --generate_kwargs '{"do_sample": false, "temperature": 0.8, "repetition_penalty": 1.2, "top_p": 0.9, "max_new_tokens": 512}'
114
+ ```
115
+ Running command for bf16:
116
+ ```
117
+ python -m inference_server.cli --model_name sambanovasystems/BLOOMChat-176B-v1 --model_class AutoModelForCausalLM --dtype bf16 --deployment_framework hf_accelerate --generate_kwargs '{"do_sample": false, "temperature": 0.8, "repetition_penalty": 1.2, "top_p": 0.9, "max_new_tokens": 512}'
118
+ ```
119
+ **DISCLAIMER:** When using int8, results will be worse than with bf16 because the model is [quantized](https://huggingface.co/blog/hf-bitsandbytes-integration#introduction-to-model-quantization).
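
The int8 and bf16 commands above differ only in the `--dtype` value, so they can be generated from one template. The following is a minimal sketch, not part of the README or of the transformers-bloom-inference repository: the helper name `build_cli_command` and the restriction to the two dtypes shown here are assumptions made for illustration. It only builds the command string and does not launch the server.

```python
import json
import shlex

# Generation settings copied from the two README commands.
GENERATE_KWARGS = {
    "do_sample": False,
    "temperature": 0.8,
    "repetition_penalty": 1.2,
    "top_p": 0.9,
    "max_new_tokens": 512,
}

# Hypothetical restriction: only the two dtypes this README documents.
DOCUMENTED_DTYPES = ("int8", "bf16")


def build_cli_command(dtype: str) -> str:
    """Return the inference_server.cli command line for the given dtype.

    Hypothetical helper for illustration; it formats the command
    and never runs it.
    """
    if dtype not in DOCUMENTED_DTYPES:
        raise ValueError(f"dtype must be one of {DOCUMENTED_DTYPES}, got {dtype!r}")
    args = [
        "python", "-m", "inference_server.cli",
        "--model_name", "sambanovasystems/BLOOMChat-176B-v1",
        "--model_class", "AutoModelForCausalLM",
        "--dtype", dtype,
        "--deployment_framework", "hf_accelerate",
        # json.dumps renders Python False as JSON false, as the CLI expects.
        "--generate_kwargs", json.dumps(GENERATE_KWARGS),
    ]
    # shlex.join quotes the JSON argument so a POSIX shell passes it intact.
    return shlex.join(args)


if __name__ == "__main__":
    for dtype in DOCUMENTED_DTYPES:
        print(build_cli_command(dtype))
```

Keeping the generation settings in one dictionary means the int8 and bf16 runs stay comparable: only the precision changes between them.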
120
 
121
  ### Suggested Inference Parameters
122
  - Temperature: 0.8