Pipeline

The Pipeline is a simple but powerful inference API that is readily available for a variety of machine learning tasks with any model from the Hugging Face Hub.

Tailor the Pipeline to your task with task-specific parameters, such as adding timestamps to an automatic speech recognition (ASR) pipeline for transcribing meeting notes. Pipeline supports GPUs, Apple Silicon, and half-precision weights to accelerate inference and save memory.

Transformers has two pipeline classes, a generic Pipeline and many individual task-specific pipelines like TextGenerationPipeline or VisualQuestionAnsweringPipeline. Load these individual pipelines by setting the task identifier in the task parameter in Pipeline. You can find the task identifier for each pipeline in their API documentation.
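
You can also instantiate a task-specific pipeline class directly if you prefer to load the model and preprocessor yourself. A minimal sketch (the pipeline() function shown below is usually more convenient):

from transformers import AutoModelForCausalLM, AutoTokenizer, TextGenerationPipeline

# load the model and tokenizer explicitly, then hand them to the pipeline class
model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b")
generator = TextGenerationPipeline(model=model, tokenizer=tokenizer)
generator("the secret to baking a really good cake is ")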

Each task is configured to use a default pretrained model and preprocessor, but this can be overridden with the model parameter if you want to use a different model.

For example, to use the TextGenerationPipeline with Gemma 2, set task="text-generation" and model="google/gemma-2-2b".

from transformers import pipeline

pipeline = pipeline(task="text-generation", model="google/gemma-2-2b")
pipeline("the secret to baking a really good cake is ")
[{'generated_text': 'the secret to baking a really good cake is 1. the right ingredients 2. the'}]

When you have more than one input, pass them as a list.

from transformers import pipeline

pipeline = pipeline(task="text-generation", model="google/gemma-2-2b", device="cuda")
pipeline(["the secret to baking a really good cake is ", "a baguette is "])
[[{'generated_text': 'the secret to baking a really good cake is 1. the right ingredients 2. the'}],
 [{'generated_text': 'a baguette is 100% bread.\n\na baguette is 100%'}]]

This guide will introduce you to the Pipeline, demonstrate its features, and show how to configure its various parameters.

Tasks

Pipeline is compatible with many machine learning tasks across different modalities. Pass an appropriate input to the pipeline and it will handle the rest.

The example below demonstrates summarization. The same approach applies to other tasks and modalities, such as automatic speech recognition, image classification, and visual question answering.

from transformers import pipeline

pipeline = pipeline(task="summarization", model="google/pegasus-billsum")
pipeline("Section was formerly set out as section 44 of this title. As originally enacted, this section contained two further provisions that 'nothing in this act shall be construed as in any wise affecting the grant of lands made to the State of California by virtue of the act entitled 'An act authorizing a grant to the State of California of the Yosemite Valley, and of the land' embracing the Mariposa Big-Tree Grove, approved June thirtieth, eighteen hundred and sixty-four; or as affecting any bona-fide entry of land made within the limits above described under any law of the United States prior to the approval of this act.' The first quoted provision was omitted from the Code because the land, granted to the state of California pursuant to the Act cite, was receded to the United States. Resolution June 11, 1906, No. 27, accepted the recession.")
[{'summary_text': 'Instructs the Secretary of the Interior to convey to the State of California all right, title, and interest of the United States in and to specified lands which are located within the Yosemite and Mariposa National Forests, California.'}]

Parameters

At a minimum, Pipeline only requires a task identifier, a model, and the appropriate input. But there are many more parameters available to configure the pipeline, from task-specific options to performance optimizations.

This section introduces you to some of the more important parameters.

Device

Pipeline is compatible with many hardware types, including GPUs, CPUs, Apple Silicon, and more. Configure the hardware type with the device parameter. By default, Pipeline runs on the CPU (device=-1).


To run Pipeline on a GPU, set device to the associated CUDA device id. For example, device=0 runs on the first GPU.

from transformers import pipeline

pipeline = pipeline(task="text-generation", model="google/gemma-2-2b", device=0)
pipeline("the secret to baking a really good cake is ")

You could also let Accelerate, a library for distributed training, automatically choose how to load and store the model weights on the appropriate device. This is especially useful if you have multiple devices. Accelerate loads and stores the model weights on the fastest device first, and then moves the weights to other devices (CPU, hard drive) as needed. Set device_map="auto" to let Accelerate choose the device.

Make sure Accelerate is installed.

!pip install -U accelerate
from transformers import pipeline

pipeline = pipeline(task="text-generation", model="google/gemma-2-2b", device_map="auto")
pipeline("the secret to baking a really good cake is ")

Batch inference

Pipeline can also process batches of inputs with the batch_size parameter. Batch inference may improve speed, especially on a GPU, but it isn’t guaranteed. Other variables such as hardware, data, and the model itself can affect whether batch inference improves speed. For this reason, batch inference is disabled by default.

In the example below, when there are 4 inputs and batch_size is set to 2, Pipeline passes a batch of 2 inputs to the model at a time.

from transformers import pipeline

pipeline = pipeline(task="text-generation", model="google/gemma-2-2b", device="cuda", batch_size=2)
pipeline(["the secret to baking a really good cake is", "a baguette is", "paris is the", "hotdogs are"])
[[{'generated_text': 'the secret to baking a really good cake is to use a good cake mix.\n\ni’'}],
 [{'generated_text': 'a baguette is'}],
 [{'generated_text': 'paris is the most beautiful city in the world.\n\ni’ve been to paris 3'}],
 [{'generated_text': 'hotdogs are a staple of the american diet. they are a great source of protein and can'}]]

Another good use case for batch inference is streaming data in Pipeline.

from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset
import datasets

# KeyDataset is a utility that yields the value for the given key from each dataset example
dataset = datasets.load_dataset("imdb", name="plain_text", split="unsupervised")
pipeline = pipeline(task="text-classification", model="distilbert/distilbert-base-uncased-finetuned-sst-2-english", device="cuda")
for out in pipeline(KeyDataset(dataset, "text"), batch_size=8, truncation="only_first"):
    print(out)

Keep the following general rules of thumb in mind for determining whether batch inference can help improve performance.

  1. The only way to know for sure is to measure performance on your model, data, and hardware.
  2. Don't use batch inference if you're constrained by latency (a live inference product, for example).
  3. Don't use batch inference if you're using a CPU.
  4. Don't use batch inference if you don't know the sequence_length of your data. Measure performance, iteratively add to the sequence_length, and include out-of-memory (OOM) checks to recover from failures.
  5. Do use batch inference if your sequence_length is regular, and keep pushing it until you reach an OOM error. The larger the GPU, the more helpful batch inference is.
  6. Do make sure you can handle OOM errors if you decide to use batch inference (see the benchmarking sketch after this list).
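
For rules 1 and 6, a minimal benchmarking sketch is shown below. It measures throughput at increasing batch sizes and stops at the first OOM error; the model, inputs, and batch sizes are arbitrary placeholders, and catching torch.cuda.OutOfMemoryError requires PyTorch 1.13 or later.

import time

import torch
from transformers import pipeline

pipeline = pipeline(task="text-generation", model="google/gemma-2-2b", device=0)
inputs = ["the secret to baking a really good cake is "] * 64

for batch_size in [1, 2, 4, 8, 16]:
    try:
        start = time.perf_counter()
        pipeline(inputs, batch_size=batch_size, max_new_tokens=16)
        elapsed = time.perf_counter() - start
        print(f"batch_size={batch_size}: {len(inputs) / elapsed:.1f} inputs/sec")
    except torch.cuda.OutOfMemoryError:
        # recover from the failure instead of crashing (rule 6)
        print(f"batch_size={batch_size}: OOM, stopping here")
        break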

Task-specific parameters

Pipeline accepts any parameters that are supported by each individual task pipeline. Make sure to check out each individual task pipeline to see what type of parameters are available. If you can’t find a parameter that is useful for your use case, please feel free to open a GitHub issue to request it!

The examples below demonstrate some of the task-specific parameters available.

For automatic speech recognition, pass the return_timestamps="word" parameter to Pipeline to return when each word was spoken.

from transformers import pipeline

pipeline = pipeline(task="automatic-speech-recognition", model="openai/whisper-large-v3")
pipeline(audio="https://huggingface.co./datasets/Narsil/asr_dummy/resolve/main/mlk.flac", return_timestamps="word")
{'text': ' I have a dream that one day this nation will rise up and live out the true meaning of its creed.',
 'chunks': [{'text': ' I', 'timestamp': (0.0, 1.1)},
  {'text': ' have', 'timestamp': (1.1, 1.44)},
  {'text': ' a', 'timestamp': (1.44, 1.62)},
  {'text': ' dream', 'timestamp': (1.62, 1.92)},
  {'text': ' that', 'timestamp': (1.92, 3.7)},
  {'text': ' one', 'timestamp': (3.7, 3.88)},
  {'text': ' day', 'timestamp': (3.88, 4.24)},
  {'text': ' this', 'timestamp': (4.24, 5.82)},
  {'text': ' nation', 'timestamp': (5.82, 6.78)},
  {'text': ' will', 'timestamp': (6.78, 7.36)},
  {'text': ' rise', 'timestamp': (7.36, 7.88)},
  {'text': ' up', 'timestamp': (7.88, 8.46)},
  {'text': ' and', 'timestamp': (8.46, 9.2)},
  {'text': ' live', 'timestamp': (9.2, 10.34)},
  {'text': ' out', 'timestamp': (10.34, 10.58)},
  {'text': ' the', 'timestamp': (10.58, 10.8)},
  {'text': ' true', 'timestamp': (10.8, 11.04)},
  {'text': ' meaning', 'timestamp': (11.04, 11.4)},
  {'text': ' of', 'timestamp': (11.4, 11.64)},
  {'text': ' its', 'timestamp': (11.64, 11.8)},
  {'text': ' creed.', 'timestamp': (11.8, 12.3)}]}
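
Text generation has task-specific parameters of its own. For example, return_full_text=False returns only the newly generated text instead of the prompt plus completion (a sketch; the max_new_tokens value is an arbitrary choice):

from transformers import pipeline

pipeline = pipeline(task="text-generation", model="google/gemma-2-2b")
# return only the completion and cap the number of newly generated tokens
pipeline("the secret to baking a really good cake is ", return_full_text=False, max_new_tokens=20)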

Chunk batching

There are some instances where you need to process data in chunks.

  • for some data types, a single input (for example, a really long audio file) may need to be chunked into multiple parts before it can be processed
  • for some tasks, like zero-shot classification or question answering, a single input may need multiple forward passes which can cause issues with the batch_size parameter

The ChunkPipeline class is designed to handle these use cases. Both pipeline classes are used in the same way, but since ChunkPipeline can automatically handle batching, you don’t need to worry about the number of forward passes your inputs trigger. Instead, you can optimize batch_size independently of the inputs.

The example below shows how it differs from Pipeline.

# ChunkPipeline
all_model_outputs = []
for preprocessed in pipeline.preprocess(inputs):
    model_outputs = pipeline.forward(preprocessed)
    all_model_outputs.append(model_outputs)
outputs = pipeline.postprocess(all_model_outputs)

# Pipeline
preprocessed = pipeline.preprocess(inputs)
model_outputs = pipeline.forward(preprocessed)
outputs = pipeline.postprocess(model_outputs)
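
For example, zero-shot classification is implemented as a ChunkPipeline: each candidate label triggers its own forward pass, and batch_size groups those passes internally. A sketch with an illustrative model and labels:

from transformers import pipeline

pipeline = pipeline(task="zero-shot-classification", model="facebook/bart-large-mnli", device=0)
# one forward pass per candidate label, grouped into batches of 2 internally
pipeline(
    "I have a problem with my iphone that needs to be resolved asap!",
    candidate_labels=["urgent", "not urgent", "phone", "tablet", "computer"],
    batch_size=2,
)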

Large datasets

For inference with large datasets, you can iterate directly over the dataset itself. This avoids immediately allocating memory for the entire dataset, and you don’t need to worry about creating batches yourself. Try Batch inference with the batch_size parameter to see if it improves performance.

from transformers.pipelines.pt_utils import KeyDataset
from transformers import pipeline
from datasets import load_dataset

dataset = load_dataset("imdb", name="plain_text", split="unsupervised")
pipeline = pipeline(task="text-classification", model="distilbert/distilbert-base-uncased-finetuned-sst-2-english", device="cuda")
for out in pipeline(KeyDataset(dataset, "text"), batch_size=8, truncation="only_first"):
    print(out)

Other ways to run inference on large datasets with Pipeline include using an iterator or generator.

from transformers import pipeline

def data():
    for i in range(1000):
        yield f"My example {i}"

pipeline = pipeline(model="openai-community/gpt2", device=0)
generated_characters = 0
for out in pipeline(data()):
    generated_characters += len(out[0]["generated_text"])

Large models

Accelerate enables a couple of optimizations for running large models with Pipeline. Make sure Accelerate is installed first.

!pip install -U accelerate

The device_map="auto" setting is useful for automatically distributing the model across the fastest devices (GPUs) first before dispatching to other slower devices if available (CPU, hard drive).

Pipeline supports half-precision weights (torch.float16), which can be significantly faster and save memory. Performance loss is negligible for most models, especially for larger ones. If your hardware supports it, you can enable torch.bfloat16 instead for more range.

Inputs are internally converted to torch.float16, and this only works for models with a PyTorch backend.
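
For example, set the torch_dtype parameter to load the weights in half precision (a minimal sketch reusing the Gemma 2 model from earlier):

import torch
from transformers import pipeline

# load the weights in torch.float16 to cut memory use and speed up inference
pipeline = pipeline(task="text-generation", model="google/gemma-2-2b", torch_dtype=torch.float16, device=0)
pipeline("the secret to baking a really good cake is ")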

Lastly, Pipeline also accepts quantized models to reduce memory usage even further. Make sure you have the bitsandbytes library installed first, and then pass a BitsAndBytesConfig with load_in_8bit=True through model_kwargs in the pipeline.

import torch
from transformers import pipeline, BitsAndBytesConfig

pipeline = pipeline(model="google/gemma-7b", torch_dtype=torch.bfloat16, device_map="auto", model_kwargs={"quantization_config": BitsAndBytesConfig(load_in_8bit=True)})
pipeline("the secret to baking a good cake is ")
[{'generated_text': 'the secret to baking a good cake is 1. the right ingredients 2. the right'}]