Models for inf2

#33
by AC2132 - opened

Is it possible for me to run 7B models on an inf2 device? I got the cached version of Zephyr 7B beta working, but that had a sequence length of only 256. For the other models that would be useful to me, the aws-neuron repo either does not have the pytorch_model.bin files, or it gives me an error about missing NEFF files. Has anyone been able to run a 7B model on inf2? If so, please help!

AWS Inferentia and Trainium org

Several 7B models are available in the cache, and a snippet to deploy each of them on SageMaker is available in the model card (Deploy > Amazon SageMaker > AWS Inferentia & Trainium).

Here is, for instance, the snippet to deploy zephyr-7b-beta:

import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

# Hub Model configuration. https://huggingface.co./models
hub = {
    "HF_MODEL_ID": "HuggingFaceH4/zephyr-7b-beta",
    "HF_NUM_CORES": "8",
    "HF_BATCH_SIZE": "1",
    "HF_SEQUENCE_LENGTH": "4096",
    "HF_AUTO_CAST_TYPE": "bf16",  
    "MAX_BATCH_SIZE": "1",
    "MAX_INPUT_LENGTH": "3686",
    "MAX_TOTAL_TOKENS": "4096",
}


# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface-neuronx", version="0.0.20"),
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.inf2.24xlarge",
    container_startup_health_check_timeout=1800,
    volume_size=512,
)

# send request
predictor.predict(
    {
        "inputs": "What is is the capital of France?",
        "parameters": {
            "do_sample": True,
            "max_new_tokens": 128,
            "temperature": 0.7,
            "top_k": 50,
            "top_p": 0.95,
        }
    }
)
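The predict call returns the generated text as JSON. Assuming the usual TGI-style list-of-dicts response shape (an assumption; the exact payload depends on the container version), it can be read like this:

```python
# Hypothetical response in the TGI-style list-of-dicts shape (assumption,
# not captured from a live endpoint).
response = [{"generated_text": "The capital of France is Paris."}]

# The generated text is under the "generated_text" key of the first item.
text = response[0]["generated_text"]
print(text)
```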

Alternatively, you can export them locally in your EC2 environment, following the instructions here and using the same configuration parameters as the cached version.

Works for me, thank you. Don't forget to upgrade the SageMaker SDK with 'pip install sagemaker --upgrade'. For the record, I used 2.214.0.

Where do you see the available models for AWS inf2? I'm looking for Llama 2 7B chat for "ml.inf2.xlarge".

AWS Inferentia and Trainium org

You can see the list of cached models here:

https://huggingface.co./aws-neuron/optimum-neuron-cache/tree/main/inference-cache-config

Alternatively, you can use the optimum-cli neuron cache lookup command to look up a specific model and see its cached configurations.

Since you want to deploy on an ml.inf2.xlarge, you need to select a configuration with 2 cores.

The following configuration is available:

batch_size: 1
sequence_length: 4096
num_cores: 2
auto_cast_type: fp16

You can adapt the snippet from the model card (Deploy > Amazon SageMaker > AWS Inferentia & Trainium).
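For illustration, the hub environment from the zephyr-7b-beta snippet above could be adapted to that 2-core configuration like this (the model id is an assumption for a Llama 2 7B chat deployment; the other values mirror the cached configuration listed above):

```python
# Hypothetical env for deploying a 7B chat model on ml.inf2.xlarge
# (2 Neuron cores). The model id below is an assumption; the remaining
# values match the cached configuration quoted above.
hub = {
    "HF_MODEL_ID": "meta-llama/Llama-2-7b-chat-hf",  # assumed model id
    "HF_NUM_CORES": "2",           # ml.inf2.xlarge exposes 2 Neuron cores
    "HF_BATCH_SIZE": "1",
    "HF_SEQUENCE_LENGTH": "4096",
    "HF_AUTO_CAST_TYPE": "fp16",   # the cached configuration uses fp16
    "MAX_BATCH_SIZE": "1",
    "MAX_INPUT_LENGTH": "3686",
    "MAX_TOTAL_TOKENS": "4096",
}
print(hub["HF_NUM_CORES"], hub["HF_AUTO_CAST_TYPE"])
```

The rest of the snippet stays the same apart from instance_type="ml.inf2.xlarge" in the deploy call.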

Hi @dacorvo - can I use that model cache with DJL Serving too?
See: https://docs.djl.ai/docs/demos/aws/sagemaker/large-model-inference/sample-llm/tnx_rollingbatch_deploy_llama_7b_int8.html

If not, what steps do I need to take in that case?
