RonanMcGovern committed on
Commit 0b6cac6 · 1 Parent(s): 5344658

update notes on inference

Files changed (1)
  1. README.md +35 -1
README.md CHANGED
@@ -10,14 +10,48 @@ tags:
  - llama
  - llama-2
  - hosted inference
+ - 8 bit
+ - 8bit
+ - 8-bit
  ---
  # Llama 2 - hosted inference
  
  This is simply an 8-bit version of the Llama-2-7B model.
  - 8-bits allows the model to be below 10 GB
  - This allows for hosted inference of the model on the model's home page
+ - Note that inference may be slow unless you have a HuggingFace Pro plan.
  
- ~
+ If you want to run inference yourself (e.g. in a Colab notebook) you can try:
+ ```
+ !pip install -q -U git+https://github.com/huggingface/accelerate.git
+ !pip install -q -U bitsandbytes
+ !pip install -q -U git+https://github.com/huggingface/transformers.git
+
+ model_id = 'Trelis/Llama-2-7b-chat-hf-hosted-inference-8bit'
+
+ import transformers
+ from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, pipeline, TextStreamer
+
+ model = AutoModelForCausalLM.from_pretrained(model_id, device_map='auto')
+
+ #Llama 2 Inference
+ def stream(user_prompt):
+     system_prompt = 'You are a helpful assistant that provides accurate and concise responses'
+
+     B_INST, E_INST = "[INST]", "[/INST]"
+     B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
+
+     prompt = f"{B_INST} {B_SYS}{system_prompt.strip()}{E_SYS}{user_prompt.strip()} {E_INST}\n\n"
+
+     inputs = tokenizer([prompt], return_tensors="pt").to("cuda:0")
+
+     streamer = TextStreamer(tokenizer)
+
+     # Despite returning the usual output, the streamer will also print the generated text to stdout.
+     _ = model.generate(**inputs, streamer=streamer, max_new_tokens=500)
+
+ stream('Count to ten')
+ ```
  
  Below follows information on the original Llama 2 model...
  
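Review note: the snippet added to the README imports `AutoTokenizer` but never creates `tokenizer`, so `stream()` would raise a `NameError` as written. Below is a minimal sketch of one way to close that gap. It assumes the tokenizer can be loaded from the same repo with `AutoTokenizer.from_pretrained(model_id)`, and it moves inputs to `model.device` instead of the hard-coded `"cuda:0"`. The helper names (`build_prompt`, `approx_weight_gb`, `load_model_and_tokenizer`) are hypothetical and not part of the commit. `approx_weight_gb` is included only to illustrate the "below 10 GB" remark, assuming roughly 7 billion parameters and counting weights only.

```python
# Sketch only: helper names here are illustrative and do not come from the commit.

DEFAULT_SYSTEM_PROMPT = (
    "You are a helpful assistant that provides accurate and concise responses"
)

# Llama 2 chat delimiters, copied from the README snippet.
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"


def build_prompt(user_prompt, system_prompt=DEFAULT_SYSTEM_PROMPT):
    """Assemble the same prompt string that the README snippet builds inline."""
    return (
        f"{B_INST} {B_SYS}{system_prompt.strip()}{E_SYS}"
        f"{user_prompt.strip()} {E_INST}\n\n"
    )


def approx_weight_gb(n_params, bits_per_param):
    """Rough weight-only size in GB (1 GB = 1e9 bytes); ignores activations and overhead."""
    return n_params * bits_per_param / 8 / 1e9


def load_model_and_tokenizer(model_id):
    """Load the model and the tokenizer that the README snippet leaves undefined."""
    # Deferred import so the pure-Python helpers above work without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Assumption: the tokenizer files are published in the same repo as the model.
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
    return model, tokenizer


def stream(model, tokenizer, user_prompt, max_new_tokens=500):
    """Generate a reply and print tokens to stdout as they are produced."""
    from transformers import TextStreamer

    inputs = tokenizer([build_prompt(user_prompt)], return_tensors="pt").to(model.device)
    streamer = TextStreamer(tokenizer)
    _ = model.generate(**inputs, streamer=streamer, max_new_tokens=max_new_tokens)


# Usage (downloads the weights, so it is left commented out):
# model, tokenizer = load_model_and_tokenizer("Trelis/Llama-2-7b-chat-hf-hosted-inference-8bit")
# stream(model, tokenizer, "Count to ten")
```

Passing `model` and `tokenizer` explicitly, rather than relying on globals as the README does, makes the missing object obvious at the call site. Under the stated assumptions, `approx_weight_gb(7_000_000_000, 8)` gives about 7 GB, while 16-bit weights would come to about 14 GB.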