Training scripts
Transformers provides many example training scripts for deep learning frameworks (PyTorch, TensorFlow, Flax) and tasks in transformers/examples. There are additional scripts in transformers/research projects and transformers/legacy, but these aren’t actively maintained and require a specific version of Transformers.
Example scripts are only examples, and you may need to adapt a script to your use case. To help you with this, most scripts are very transparent in how the data is preprocessed, allowing you to edit it as necessary.
For any feature you’d like to implement in an example script, please discuss it on the forum or in an issue before submitting a pull request. While we welcome contributions, it is unlikely that a pull request which adds functionality at the cost of readability will be merged.
This guide will show you how to run an example summarization training script in PyTorch and TensorFlow.
Setup
Install Transformers from source in a new virtual environment to run the latest version of the example script.
git clone https://github.com/huggingface/transformers
cd transformers
pip install .
Run the command below to check out a script from a specific or older version of Transformers.
git checkout tags/v3.5.1
After you’ve set up the correct version, navigate to the example folder of your choice and install the example-specific requirements.
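For example, the requirements for the PyTorch summarization script used in this guide live in examples/pytorch/summarization.
cd examples/pytorch/summarization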
pip install -r requirements.txt
Run a script
Start with a smaller dataset by including the max_train_samples, max_eval_samples, and max_predict_samples parameters to truncate the dataset to a maximum number of samples. This helps ensure training works as expected before committing to the entire dataset, which can take hours to complete.
Not all example scripts support the max_predict_samples parameter. Run the command below to check whether a script supports it.
python examples/pytorch/summarization/run_summarization.py -h
The example below fine-tunes T5-small on the CNN/DailyMail dataset. T5 requires an additional source_prefix parameter to prompt it to summarize.
The example script downloads and preprocesses a dataset, and then fine-tunes a supported model architecture on it with Trainer.
Resuming training from a checkpoint is very useful if training is interrupted because you don’t have to start over again. There are two ways to resume training from a checkpoint.
- --output_dir previous_output_dir resumes training from the latest checkpoint stored in output_dir. Remove the --overwrite_output_dir parameter if you’re using this method.
- --resume_from_checkpoint path_to_specific_checkpoint resumes training from a specific checkpoint folder, as in the sketch after this list.
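For example, a minimal sketch of the second method is shown below; checkpoint-1500 is a hypothetical folder name, so point it at a real checkpoint inside your output directory.
python examples/pytorch/summarization/run_summarization.py \
--model_name_or_path google-t5/t5-small \
--do_train \
--dataset_name cnn_dailymail \
--dataset_config "3.0.0" \
--source_prefix "summarize: " \
--output_dir /tmp/tst-summarization \
--resume_from_checkpoint /tmp/tst-summarization/checkpoint-1500  # hypothetical checkpoint folder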
Share your model on the Hub with the --push_to_hub parameter. It creates a repository and uploads the model to the folder name specified in --output_dir. You could also use the --push_to_hub_model_id parameter to specify the repository name.
python examples/pytorch/summarization/run_summarization.py \
--model_name_or_path google-t5/t5-small \
# remove the `max_train_samples`, `max_eval_samples` and `max_predict_samples` if everything works
--max_train_samples 50 \
--max_eval_samples 50 \
--max_predict_samples 50 \
--do_train \
--do_eval \
--dataset_name cnn_dailymail \
--dataset_config "3.0.0" \
--source_prefix "summarize: " \
--output_dir /tmp/tst-summarization \
--per_device_train_batch_size=4 \
--per_device_eval_batch_size=4 \
--push_to_hub \
--push_to_hub_model_id finetuned-t5-cnn_dailymail \
--overwrite_output_dir \
# to resume from the latest checkpoint instead, remove `--overwrite_output_dir` and use `--output_dir previous_output_dir`
# --output_dir previous_output_dir \
# or resume from a specific checkpoint folder
# --resume_from_checkpoint path_to_specific_checkpoint \
--predict_with_generate
For mixed precision and distributed training, include the following parameters and launch training with torchrun.
- Add the fp16 or bf16 parameter to enable mixed precision training. XPU devices only support bf16.
- Add the nproc_per_node parameter to set the number of GPUs to train with.
torchrun \
--nproc_per_node 8 pytorch/summarization/run_summarization.py \
--fp16 \
...
PyTorch supports TPUs, hardware designed to accelerate performance, through the PyTorch/XLA package. Launch the xla_spawn.py script and use the num_cores parameter to set the number of TPU cores to train with.
python xla_spawn.py --num_cores 8 pytorch/summarization/run_summarization.py \
--model_name_or_path google-t5/t5-small \
...
Accelerate
Accelerate is designed to simplify distributed training while offering complete visibility into the PyTorch training loop. If you’re planning on training with Accelerate, use the _no_trainer.py version of the script.
Install Accelerate from source to ensure you have the latest version.
pip install git+https://github.com/huggingface/accelerate
Run the accelerate config command to answer a few questions about your training setup. This creates and saves a config file for your system.
accelerate config
You can use accelerate test to ensure your system is properly configured.
accelerate test
Run accelerate launch to start training.
accelerate launch run_summarization_no_trainer.py \
--model_name_or_path google-t5/t5-small \
--dataset_name cnn_dailymail \
--dataset_config "3.0.0" \
--source_prefix "summarize: " \
--output_dir ~/tmp/tst-summarization
Custom dataset
The summarization script supports custom datasets as long as they are CSV or JSONL files. When using your own dataset, you need to specify the following additional parameters.
- train_file and validation_file specify the paths to your training and validation files.
- text_column is the column containing the input text to summarize.
- summary_column is the column containing the target text to output (a sketch of a matching JSONL file is shown below).
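For instance, a JSONL (JSON Lines) training file could look like the minimal sketch below. The column names text and summary are placeholders; use whatever names you pass to --text_column and --summary_column.
# `text` and `summary` are placeholder column names for this sketch
cat > train.jsonl << 'EOF'
{"text": "The full article body to summarize goes here.", "summary": "A short reference summary."}
{"text": "Another article body.", "summary": "Another reference summary."}
EOF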
An example command for summarizing a custom dataset is shown below.
python examples/pytorch/summarization/run_summarization.py \
--model_name_or_path google-t5/t5-small \
--do_train \
--do_eval \
--train_file path_to_csv_or_jsonlines_file \
--validation_file path_to_csv_or_jsonlines_file \
--text_column text_column_name \
--summary_column summary_column_name \
--source_prefix "summarize: " \
--output_dir /tmp/tst-summarization \
--overwrite_output_dir \
--per_device_train_batch_size=4 \
--per_device_eval_batch_size=4 \
--predict_with_generate