# Mistral-7B-Instruct-v0.2 performance on AWS Inferentia2 (Latency & Throughput)
How fast is Mistral-7B-Instruct-v0.2 on Inferentia2? Let's find out!
For this benchmark we will use the following configurations:
| Model type      | batch_size | sequence_length |
|-----------------|------------|-----------------|
| Mistral 7B BS1  | 1          | 4096            |
| Mistral 7B BS4  | 4          | 4096            |
| Mistral 7B BS8  | 8          | 4096            |
| Mistral 7B BS16 | 16         | 4096            |
| Mistral 7B BS32 | 32         | 4096            |
Note: all models are compiled to use 4 devices, corresponding to 8 cores on the `inf2.48xlarge` instance.

Note: please refer to the Inferentia2 product page for details on the available instances.
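As a rough sketch, here is how one of these configurations (the BS1 row above) could be exported with `optimum-neuron`; the argument values are illustrative and mirror the table:

```python
from optimum.neuron import NeuronModelForCausalLM

# Compile Mistral-7B-Instruct-v0.2 for Inferentia2.
# batch_size=1 corresponds to the "Mistral 7B BS1" configuration above.
model = NeuronModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    export=True,              # trigger neuron compilation
    batch_size=1,
    sequence_length=4096,
    num_cores=8,              # 4 devices = 8 NeuronCores on inf2.48xlarge
    auto_cast_type="fp16",
)
model.save_pretrained("mistral-7b-bs1-neuron")  # illustrative output path
```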
## Time to first token
The time to first token is the time required to process the input tokens and generate the first output token. It is a very important metric, as it corresponds to the latency directly perceived by the user when streaming generated tokens.
We test the time to first token for increasing context sizes, from typical Q/A usage to heavy Retrieval-Augmented Generation (RAG) use cases.
Time to first token is expressed in seconds.
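As an illustration, time to first token can be measured by timestamping the first chunk emitted during streaming generation. This is a minimal sketch using the standard `transformers` streaming API, assuming `model` is the compiled Neuron model from the sketch above and that its `generate()` accepts the usual `streamer` argument:

```python
import time
from threading import Thread

from transformers import AutoTokenizer, TextIteratorStreamer

# `model` is assumed to be the compiled Neuron model from the earlier sketch.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
inputs = tokenizer("What is the capital of France?", return_tensors="pt")

streamer = TextIteratorStreamer(tokenizer, skip_prompt=True)
generation = Thread(
    target=model.generate,
    kwargs={**inputs, "streamer": streamer, "max_new_tokens": 128},
)
start = time.perf_counter()
generation.start()

next(iter(streamer))                  # blocks until the first chunk arrives
ttft = time.perf_counter() - start    # time to first token, in seconds
print(f"Time to first token: {ttft:.2f} s")
```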
## Inter-token latency
The inter-token latency corresponds to the average time elapsed between two generated tokens.
It is expressed in milliseconds.
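Continuing the sketch above, inter-token latency can be estimated by timestamping the arrival of each subsequent chunk and averaging the gaps. Note that this approximates token boundaries by the streamer's text chunks, which may occasionally group several tokens:

```python
import time

# Arrival time of the first token, from the previous sketch.
timestamps = [start + ttft]
for _ in streamer:                    # timestamp every remaining chunk
    timestamps.append(time.perf_counter())
generation.join()

# Average gap between consecutive tokens, converted to milliseconds.
deltas = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
print(f"Inter-token latency: {1000 * sum(deltas) / len(deltas):.1f} ms")
```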
## Throughput
Unlike some other benchmarks, we evaluate throughput using generated tokens only: we divide their number by the end-to-end latency.
Throughput is expressed in tokens/second.
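A hedged sketch of the computation, reusing the `model` and `inputs` names from the earlier sketches: prompt tokens are excluded from the count, and the count covers the whole batch:

```python
import time

# Time the full end-to-end generation for one batch.
start = time.perf_counter()
outputs = model.generate(**inputs, max_new_tokens=512)
end_to_end_latency = time.perf_counter() - start

# Count only the newly generated tokens, summed across the batch.
new_tokens_per_sequence = outputs.shape[1] - inputs["input_ids"].shape[1]
num_generated = new_tokens_per_sequence * outputs.shape[0]
print(f"Throughput: {num_generated / end_to_end_latency:.1f} tokens/s")
```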