Failing to deploy in AWS SageMaker
Hello,
I have been trying to deploy the medalpaca-13b model through SageMaker Notebook, however I keep getting the following error:
UnexpectedStatusException: Error hosting endpoint hugging-face-medalpaca-13b-20230901-0: Failed. Reason: The primary container for production variant AllTraffic did not pass the ping health check. Please check CloudWatch logs for this endpoint..
I was getting the same error for medalpaca-7b, but I fixed it by adding the 'MAX_BATCH_TOTAL_TOKENS' config. Here is the reference code I am using:
# reference zero shot deployment
import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

# Hub Model configuration. https://huggingface.co./models
hub = {
    'HF_MODEL_ID': 'medalpaca/medalpaca-13b',
    'SM_NUM_GPUS': json.dumps(1),
    'MAX_INPUT_LENGTH': json.dumps(1024),        # Max length of input text
    'MAX_TOTAL_TOKENS': json.dumps(2048),        # Max length of the generation (including input text)
    'MAX_BATCH_TOTAL_TOKENS': json.dumps(4096),  # Limits the number of tokens that can be processed in parallel during generation
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface", version="0.9.3"),
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",
    endpoint_name="hugging-face-medalpaca-13b-20230901-0",
    # container_startup_health_check_timeout=3000,
)

# send request
predictor.predict({
    "inputs": "My name is Julien and I like to",
})
Any help on this would be really appreciated.
I have no experience with SageMaker, so I can't help you directly, I am afraid. Some general thoughts:
Maybe the model is too large? You had to limit the number of tokens processed in parallel for the 7b model, which reduces the memory requirements. The ml.g5.12xlarge instance has 4 GPUs with 24 GB each; maybe this is not enough to load the 13b model. You could try to load a quantized version of the model (I believe someone has uploaded converted weights here on Hugging Face).
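If you want to try quantization without switching to different weights, one option might be asking the container to quantize on load. This is a sketch of the config change only, assuming the Hugging Face LLM (TGI) container honors the 'HF_MODEL_QUANTIZE' environment variable with 'bitsandbytes' as a value — please verify against the container version you deploy:

```python
import json

# Same hub config as in the question, with one added key (assumption:
# HF_MODEL_QUANTIZE is supported by the TGI container being used)
hub = {
    'HF_MODEL_ID': 'medalpaca/medalpaca-13b',
    'SM_NUM_GPUS': json.dumps(1),
    'MAX_INPUT_LENGTH': json.dumps(1024),
    'MAX_TOTAL_TOKENS': json.dumps(2048),
    'MAX_BATCH_TOTAL_TOKENS': json.dumps(4096),
    'HF_MODEL_QUANTIZE': 'bitsandbytes',  # load weights in 8-bit to cut memory roughly in half
}
```

The rest of the deployment code would stay the same; only the `env=hub` dict changes.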
Sometimes people run into errors when trying to use models trained with LoRA. There, only the adapters are provided and you still need to load the base LLaMA model. Do you have access to any more detailed error logs from SageMaker?
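To pull those logs programmatically, a sketch along these lines might help (untested against a live account; assumes default AWS credentials/region and the endpoint name from the deploy call above — the `tail_endpoint_logs` helper name is mine):

```python
def endpoint_log_group(endpoint_name: str) -> str:
    # SageMaker writes each endpoint's container logs to this CloudWatch log group
    return f"/aws/sagemaker/Endpoints/{endpoint_name}"


def tail_endpoint_logs(endpoint_name: str, limit: int = 50) -> None:
    # Print the most recent container log events for the endpoint.
    import boto3  # imported here so the path helper above has no dependencies

    logs = boto3.client("logs")
    group = endpoint_log_group(endpoint_name)
    for stream in logs.describe_log_streams(logGroupName=group)["logStreams"]:
        events = logs.get_log_events(
            logGroupName=group,
            logStreamName=stream["logStreamName"],
            limit=limit,
        )
        for event in events["events"]:
            print(event["message"])


# tail_endpoint_logs("hugging-face-medalpaca-13b-20230901-0")
```

The startup logs usually show whether the container ran out of GPU memory while loading the weights, which would explain the failed ping health check.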