Deploy with SageMaker LMI

by josete89 - opened

I've tried to deploy this with SageMaker LMI, but it isn't working. It seems the model should follow this layout:

  • compiled: NEFF files
  • checkpoint: compiled PyTorch weights
  • tokenizer...

Is it possible to get something like that? Or at least a code snippet showing how to deploy this as an endpoint? I've tried, but still no luck.
AWS Inferentia and Trainium org

Hey @josete89. What you are describing is the layout for the Optimum library. This example was originally built with Transformers because Optimum didn't have support for Mistral yet, but I saw a PR adding it go through last week. We should be able to update the example to work. Reach out to me.

AWS Inferentia and Trainium org

Models compatible with optimum-neuron >= 0.0.17 have been added for several configurations.
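For reference, exporting a compatible model yourself could look something like the sketch below; the model ID and the compilation parameters are illustrative assumptions, not the exact configurations that were published:

from optimum.neuron import NeuronModelForCausalLM

# Compile the model for Neuron devices; all parameter values are assumptions.
model = NeuronModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.1",  # assumed model ID
    export=True,               # trigger Neuron compilation while loading
    batch_size=1,
    sequence_length=2048,
    num_cores=2,
    auto_cast_type="fp16",
)
model.save_pretrained("./mistral-neuron")  # write the compiled artifacts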

I was able to compile the model seamlessly, but when I tried to deploy it:

from sagemaker.huggingface.model import HuggingFaceModel

# create Hugging Face Model Class
model = HuggingFaceModel(
    model_data=s3_model_uri,        # path to your model.tar.gz on S3
    role=role,                      # IAM role with permissions to create an endpoint
    transformers_version="4.34.1",  # transformers version used
    pytorch_version="1.13.1",       # pytorch version used
    py_version="py310",             # python version used
    model_server_workers=1,         # number of workers for the model server
)
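
For completeness, the endpoint itself is then created with a deploy call along these lines (the instance type is an assumption):

predictor = model.deploy(
    initial_instance_count=1,        # number of endpoint instances
    instance_type="ml.inf2.xlarge",  # assumed Inferentia2 instance type
)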

I got the following message when I sent a request:
"Pretrained model is compiled with neuronx-cc(2.12.54.0+f631c2365) newer than current compiler (2.11.0.34+c5231f848), which may cause runtime".

I guess the base image needs to be updated.
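
If you want to confirm the mismatch, here is a quick sketch for checking the compiler version inside the serving container, assuming the neuronx-cc CLI is on the PATH:

import subprocess

# Print the installed Neuron compiler version (e.g. 2.11.0.34+c5231f848).
result = subprocess.run(["neuronx-cc", "--version"], capture_output=True, text=True)
print(result.stdout or result.stderr)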

AWS Inferentia and Trainium org

@josete89 Yes, Mistral requires the new Neuron SDK 2.16. Not all of the images have been updated yet.

As of today, you would need to use 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.26.0-neuronx-sdk2.16.0

That may require you to repackage your model depending on what image you were using previously. Watch for updates at https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers

What image are you using to deploy now? You may be able to update that and deploy it as a custom image.

Right now I'm using "763104351884.dkr.ecr.eu-west-1.amazonaws.com/huggingface-pytorch-inference-neuronx:1.13.1-transformers4.34.1-neuronx-py310-sdk2.15.0-ubuntu20.04". I guess that's the problem :) How can I deploy it as a custom image then?

AWS Inferentia and Trainium org

You can specify a custom image by passing
image_uri="763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-tgi-inference:1.13.1-optimum0.0.16-neuronx-py310-ubuntu22.04-v1.0"
instead of the version settings; those are only used to find the right image for you automatically.
You can try the image above. It is an updated Hugging Face Text Generation Inference (TGI) image, but I am not sure which SDK version it ships.
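
A minimal sketch of what that looks like in practice, keeping the rest of the model definition from earlier in the thread:

from sagemaker.huggingface.model import HuggingFaceModel

# Pin an explicit container image instead of letting the SDK resolve one
# from transformers_version / pytorch_version / py_version.
model = HuggingFaceModel(
    model_data=s3_model_uri,  # path to your model.tar.gz on S3
    role=role,                # IAM role with permissions to create an endpoint
    image_uri="763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-tgi-inference:1.13.1-optimum0.0.16-neuronx-py310-ubuntu22.04-v1.0",
)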

You can also create a SageMaker-compatible image and upload it to your private ECR repository:

git clone https://github.com/huggingface/optimum-neuron
cd optimum-neuron
make neuronx-tgi-sagemaker

It takes extra steps, but you can specify the exact version of the SDK in the Dockerfile:
https://github.com/huggingface/optimum-neuron/blob/main/text-generation-inference/Dockerfile

It is fewer steps if you can wait for the SageMaker team to release an updated image.

AWS Inferentia and Trainium org

The new image with AWS Neuron SDK 2.16 and optimum-neuron 0.0.17 has been released: https://github.com/aws/deep-learning-containers/releases/tag/v1.0-hf-tgi-0.0.17-pt-1.13.1-inf-neuronx-py310

Thanks a lot @jburtoft @dacorvo ! I will give it a try :)

AWS Inferentia and Trainium org

@josete89 Make sure you check out the new blog post from HF that walks you through it. No image updates needed!

https://huggingface.co./blog/text-generation-inference-on-inferentia2
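
For anyone landing here later, the blog post's flow boils down to roughly the following sketch; the model ID, instance type, and the environment variable values are illustrative assumptions based on that post:

from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

# Resolve the TGI Neuronx container released for optimum-neuron 0.0.17.
image_uri = get_huggingface_llm_image_uri("huggingface-neuronx", version="0.0.17")

# TGI / Neuron configuration; every value below is an illustrative assumption.
env = {
    "HF_MODEL_ID": "mistralai/Mistral-7B-Instruct-v0.1",
    "HF_NUM_CORES": "2",
    "HF_BATCH_SIZE": "1",
    "HF_SEQUENCE_LENGTH": "4096",
    "HF_AUTO_CAST_TYPE": "fp16",
    "MAX_BATCH_SIZE": "1",
    "MAX_INPUT_LENGTH": "3686",
    "MAX_TOTAL_TOKENS": "4096",
}

model = HuggingFaceModel(role=role, image_uri=image_uri, env=env)
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.inf2.xlarge",  # assumed instance type
)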
