sahilsuneja committed on
Commit
5a5b6fb
·
verified ·
1 Parent(s): 7bb719b

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +28 -0
README.md CHANGED
@@ -124,3 +124,31 @@ curl 127.0.0.1:8080/generate_stream \
124
  -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
125
  -H 'Content-Type: application/json'
126
  ```
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
124
  -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
125
  -H 'Content-Type: application/json'
126
  ```
127
+
128
+ ### Use in vLLM
129
+ ```python
+ from vllm import LLM, SamplingParams
130
+
131
+ # Sample prompts.
132
+ prompts = [
133
+ "The president of the United States is",
134
+ ]
135
+ # Create a sampling params object.
136
+ sampling_params = SamplingParams(temperature=0.0)
137
+
138
+ # Create an LLM.
139
+ llm = LLM(
140
+ model="/path/to/Meta-Llama-3-70B-Instruct",
141
+ tensor_parallel_size=4,
142
+ speculative_model="/path/to/llama3-70b-accelerator",
143
+ speculative_draft_tensor_parallel_size=1,
144
+ use_v2_block_manager=True,
145
+ )
146
+ # Generate texts from the prompts. The output is a list of RequestOutput objects
147
+ # that contain the prompt, generated text, and other information.
148
+ outputs = llm.generate(prompts, sampling_params)
149
+ # Print the outputs.
150
+ for output in outputs:
151
+ prompt = output.prompt
152
+ generated_text = output.outputs[0].text
153
+ print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
154
+ ```