---
license: llama2
language:
- en
pipeline_tag: text-generation
inference: false
tags:
- facebook
- meta
- pytorch
- llama
- llama-2
- inferentia2
- neuron
---

# Neuronx model for [codellama/CodeLlama-7b-hf](https://huggingface.co/codellama/CodeLlama-7b-hf)

This repository contains [**AWS Inferentia2**](https://aws.amazon.com/ec2/instance-types/inf2/) and [`neuronx`](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/) compatible checkpoints for [codellama/CodeLlama-7b-hf](https://huggingface.co/codellama/CodeLlama-7b-hf).
You can find detailed information about the base model on its [Model Card](https://huggingface.co/codellama/CodeLlama-7b-hf).

This model has been exported to the `neuron` format using the specific `input_shapes` and `compiler` parameters detailed in the sections below.
Please refer to the 🤗 `optimum-neuron` [documentation](https://huggingface.co/docs/optimum-neuron/main/en/guides/models#configuring-the-export-of-a-generative-model) for an explanation of these parameters.

It was compiled to run on an inf2.8xlarge instance on AWS. It also runs on an inf2.xlarge (the smallest Inferentia2 instance), but it uses nearly all of the available RAM there, so test thoroughly before relying on the smaller instance in production.

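As a rough illustration of why the inf2.xlarge is tight, the sketch below estimates the size of the fp16 weights alone and compares it with host memory. It assumes that "RAM" refers to host memory, that the model has roughly 7 billion parameters (taken from the model name), and that fp16 uses 2 bytes per parameter. The host memory figures are the ones AWS lists for these instance types, and the function name is hypothetical. The estimate ignores the runtime, the tokenizer, and any temporary copies made while loading, so treat it as a lower bound rather than a measurement.

```python
# Back-of-the-envelope estimate; every figure here is an assumption, not a measurement.
GIB = 2**30


def fp16_weight_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Return the approximate size of the model weights in GiB."""
    return num_params * bytes_per_param / GIB


APPROX_PARAMS = 7e9           # assumption: the "7b" in the model name
INF2_XLARGE_HOST_GIB = 16     # host memory AWS lists for inf2.xlarge
INF2_8XLARGE_HOST_GIB = 128   # host memory AWS lists for inf2.8xlarge

weights_gib = fp16_weight_gib(APPROX_PARAMS)
print(f"fp16 weights: ~{weights_gib:.1f} GiB")
print(f"headroom on inf2.xlarge: ~{INF2_XLARGE_HOST_GIB - weights_gib:.1f} GiB")
print(f"headroom on inf2.8xlarge: ~{INF2_8XLARGE_HOST_GIB - weights_gib:.1f} GiB")
```

Under these assumptions the weights alone take up most of the smaller instance's host memory, which is consistent with the advice to test before deploying there.
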
## Usage on Amazon SageMaker

_coming soon_

## Usage with 🤗 `optimum-neuron`

```python
from optimum.neuron import pipeline

# Load the precompiled Neuron checkpoints from the Hub.
p = pipeline('text-generation', 'aws-neuron/CodeLlama-7b-hf-neuron-8xlarge')

# Complete a code prompt with low-temperature sampling.
p("import socket\n\ndef ping_exponential_backoff(host: str):",
    do_sample=True,
    top_k=10,
    temperature=0.1,
    top_p=0.95,
    num_return_sequences=1,
    max_length=200,
)
```

```
[{'generated_text': 'import socket\n\ndef ping_exponential_backoff(host: str):\n """\n Ping a host with exponential backoff.\n\n :param host: Host to ping\n :return: True if host is reachable, False otherwise\n """\n for i in range(1, 10):\n try:\n socket.create_connection((host, 80), 1).close()\n return True\n except OSError:\n time.sleep(2 ** i)\n return False\n\n\ndef ping_exponential_backoff_with_timeout(host: str, timeout: int):\n """\n Ping a host with exponential backoff and timeout.\n\n :param host: Host to ping\n :param timeout: Timeout in seconds\n :return: True if host is reachable, False otherwise\n """\n for'}]
```

This repository contains tags specific to versions of `neuronx`. When using it with 🤗 `optimum-neuron`, load the repo revision that matches the version of `neuronx` you are running, so that the right serialized checkpoints are loaded.

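One way to pin a revision is to pass it to `from_pretrained`. The following is a minimal sketch, not the only way to do it: the helper names are hypothetical, the tag name is deliberately left as an argument because it depends on your installed `neuronx` version (check the repository's tag list), and the code assumes the tokenizer files are stored in the same repository. Actually loading the model requires an Inferentia2 instance with `optimum-neuron` installed.

```python
MODEL_ID = "aws-neuron/CodeLlama-7b-hf-neuron-8xlarge"


def load_for_revision(revision: str):
    """Load the precompiled checkpoints and tokenizer from one repo revision."""
    # Imported lazily so this module can be imported on a machine without Neuron.
    from optimum.neuron import NeuronModelForCausalLM
    from transformers import AutoTokenizer

    model = NeuronModelForCausalLM.from_pretrained(MODEL_ID, revision=revision)
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, revision=revision)
    return model, tokenizer


def complete(prompt: str, revision: str, max_length: int = 200) -> str:
    """Generate a completion for `prompt` using the checkpoints at `revision`."""
    model, tokenizer = load_for_revision(revision)
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_length=max_length)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `complete("def fibonacci(n: int) -> int:", revision=...)` with the tag that matches your environment would then use the checkpoints serialized for that `neuronx` version.
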
## Arguments passed during export

**input_shapes**

```json
{
  "batch_size": 1,
  "sequence_length": 2048
}
```

**compiler_args**

```json
{
  "auto_cast_type": "fp16",
  "num_cores": 2
}
```
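
For reference, the arguments above could be passed to an `optimum-neuron` export roughly as follows. This is a sketch of the documented `export=True` pattern, not necessarily the exact command used to produce this repository: the function name and `save_dir` parameter are hypothetical, and compilation requires a Neuron-enabled instance plus access to the base model.

```python
BASE_MODEL_ID = "codellama/CodeLlama-7b-hf"

# Values copied from the export arguments listed in this model card.
input_shapes = {"batch_size": 1, "sequence_length": 2048}
compiler_args = {"auto_cast_type": "fp16", "num_cores": 2}


def export_model(save_dir: str):
    """Compile the base model for Neuron and save the result to `save_dir`."""
    # Imported lazily so this module can be imported on a machine without Neuron.
    from optimum.neuron import NeuronModelForCausalLM

    model = NeuronModelForCausalLM.from_pretrained(
        BASE_MODEL_ID,
        export=True,       # compile instead of loading precompiled checkpoints
        **compiler_args,
        **input_shapes,
    )
    model.save_pretrained(save_dir)
    return model
```

Because the shapes are fixed at compile time, a model exported this way handles one sequence at a time with a combined prompt and generation length of up to 2048 tokens.
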