# Databricks & Hugging Face ML Quickstart: Model Training

This notebook gives a quick overview of training machine learning models on Databricks with Hugging Face Transformers. It also uses MLflow to track the trained models.

This tutorial covers:
- Part 1: Training a text classification transformer model with MLflow tracking

### Requirements
- Cluster running Databricks Runtime 7.5 ML or above
- A GPU attached to the cluster (without one, training is too slow to be practical)
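
One way to confirm the GPU requirement before training is to ask PyTorch whether it can see a CUDA device. The following is a minimal sketch rather than part of the original notebook: the helper name `pick_device` is hypothetical, and it assumes that falling back to `"cpu"` is acceptable when PyTorch is not installed yet.

```python
import importlib.util


def pick_device() -> str:
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    # PyTorch may not be installed yet at this point in the notebook
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


print(f"Training would run on: {pick_device()}")
```

If this reports `cpu` on a cluster that should have a GPU, check the cluster's instance type before starting a training run.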

### Libraries
Install and import the necessary libraries. Databricks Runtime for Machine Learning ([AWS](https://docs.databricks.com/runtime/mlruntime.html)|[Azure](https://docs.microsoft.com/azure/databricks/runtime/mlruntime)|[GCP](https://docs.gcp.databricks.com/runtime/mlruntime.html)) clusters ship with many machine learning libraries preinstalled and tuned for compatibility and performance; the `%pip` cell below installs the packages this notebook needs.

In [0]:
%pip install transformers datasets mlflow torch

Python interpreter will be restarted.
Collecting transformers
  Downloading transformers-4.20.1-py3-none-any.whl (4.4 MB)
Collecting datasets
  Downloading datasets-2.3.2-py3-none-any.whl (362 kB)
Collecting mlflow
  Downloading mlflow-1.27.0-py3-none-any.whl (17.9 MB)
Collecting torch
  Downloading torch-1.12.0-cp38-cp38-manylinux1_x86_64.whl (776.3 MB)
Collecting xxhash
  Downloading xxhash-3.0.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting aiohttp
  Downloading aiohttp-3.8.1-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.3 MB)
Collecting fsspec[http]>=2021.05.0
  Downloading fsspec-2022.5.0-py3-none-any.whl (140 kB)
Collecting dill<0.3.6
  Downloading dill-0.3.5.1-py2.py3-none-any.whl (95 kB)
Collecting multiprocess
  Downloading multiprocess-0.70.13-py38-none-any.whl (131 kB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting huggingface-hub<1.0.0,>=0.1.0
  

### Install Git LFS

In [0]:
%sh
curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
sudo apt-get install git-lfs

Detected operating system as Ubuntu/focal.
Checking for curl...
Detected curl...
Checking for gpg...
Detected gpg...
Running apt-get update... done.
Installing apt-transport-https... done.
Installing /etc/apt/sources.list.d/github_git-lfs.list...done.
Importing packagecloud gpg key... done.
Running apt-get update... done.

The repository is setup! You can now install packages.
Reading package lists...
Building dependency tree...
Reading state information...
The following NEW packages will be installed:
  git-lfs
0 upgraded, 1 newly installed, 0 to remove and 68 not upgraded.
Need to get 7,168 kB of archives.
After this operation, 15.6 MB of additional disk space will be used.
Get:1 https://packagecloud.io/github/git-lfs/ubuntu focal/main amd64 git-lfs amd64 3.2.0 [7,168 kB]
debconf: delaying package configuration, since apt-utils is not installed
Fetched 7,168 kB in 0s (15.4 MB/s)
Selecting previously unselected package git-lfs.
(Reading database ... 
(Reading database ... 5%
(Reading 

In [0]:
import mlflow
import numpy as np  # needed by compute_metrics (np.argmax) later in the notebook
import torch
#from hyperopt import fmin, tpe, hp, SparkTrials, Trials, STATUS_OK
#from hyperopt.pyll import scope
from datasets import load_dataset, load_metric
from huggingface_hub import notebook_login
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

### Log into Hugging Face Hub

This uses the login helper behind the `huggingface-cli` command to log in to the Hugging Face Hub. If you use a private Hugging Face Hub, specify its location with the `HF_ENDPOINT` environment variable.
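
To avoid hard-coding the token and endpoint in a notebook cell, both can be read from the environment. The sketch below is illustrative only: the helper `resolve_hub_settings` and the `HF_TOKEN` variable name are assumptions, not something this notebook defines, and `HF_ENDPOINT` typically has to be set before `huggingface_hub` is imported for it to take effect.

```python
import os

DEFAULT_ENDPOINT = "https://huggingface.co"


def resolve_hub_settings(env=None):
    """Return (endpoint, token) read from environment-style settings."""
    env = os.environ if env is None else env
    # HF_ENDPOINT points the client at a private Hub; default to the public one
    endpoint = env.get("HF_ENDPOINT", DEFAULT_ENDPOINT).rstrip("/")
    # HF_TOKEN is an assumed variable name used here for illustration
    token = env.get("HF_TOKEN")
    if not token:
        raise RuntimeError("No Hugging Face token found in the environment")
    return endpoint, token


# Example with explicit settings instead of the real environment
print(resolve_hub_settings({"HF_ENDPOINT": "https://hub.example.com/", "HF_TOKEN": "hf_xxx"}))
```

The returned token could then be passed to the login call in place of the literal string.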

In [0]:
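# Note: `_login` is a private helper and may change between huggingface_hub releases
from huggingface_hub.commands.user import _login
from huggingface_hub import HfApi

api = HfApi()
_login(hf_api=api, token="API Token")  # replace "API Token" with your own token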

Login successful
Your token has been saved to /root/.huggingface/token
[1m[31mAuthenticated through git-credential store but this isn't the helper defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub. Run the following command in your terminal in case you want to set this credential helper as the default

git config --global credential.helper store[0m


In [0]:
#Verify Login
!huggingface-cli whoami

rajistics
orgs:  huggingface,spaces-explorers,demo-org,HF-test-lab,qualitydatalab


### Load data
The tutorial uses the IMDB dataset of movie reviews. The complete [dataset card](https://huggingface.co./datasets/imdb), with details on the dataset, can be found at Hugging Face.

The goal is to classify reviews as positive or negative. 

The dataset is loaded using the Hugging Face datasets package.

In [0]:
# Load and preprocess data
train_dataset, test_dataset = load_dataset("imdb", split=["train", "test"])

Downloading builder script:   0%|          | 0.00/1.79k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/1.05k [00:00<?, ?B/s]

Downloading and preparing dataset imdb/plain_text (download: 80.23 MiB, generated: 127.02 MiB, post-processed: Unknown size, total: 207.25 MiB) to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1...


Downloading data:   0%|          | 0.00/84.1M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating unsupervised split:   0%|          | 0/50000 [00:00<?, ? examples/s]

Dataset imdb downloaded and prepared to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

## Part 1. Train a classification model

### MLflow Tracking
[MLflow tracking](https://www.mlflow.org/docs/latest/tracking.html) allows you to organize your machine learning training code, parameters, and models. 

You can enable automatic MLflow tracking by using [*autologging*](https://www.mlflow.org/docs/latest/tracking.html#automatic-logging).

In [0]:
# Enable MLflow autologging for this notebook
mlflow.autolog()

2022/07/20 20:53:02 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.
'JavaPackage' object is not callable
'JavaPackage' object is not callable
'JavaPackage' object is not callable
2022/07/20 20:53:02 INFO mlflow.tracking.fluent: Autologging successfully enabled for pyspark.ml.


Next, train a classifier within the context of an MLflow run, which automatically logs the trained model and many associated metrics and parameters. You can supplement the logging with additional metrics such as the model's AUC score on the test dataset.

If the model is private, pass an API token that has access to it through the `use_auth_token` parameter of `from_pretrained`.
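
As an illustration of supplementing the logged metrics, the sketch below computes an AUC score and logs it to the active MLflow run. It is a hedged example, not the notebook's own code: the helper names and the metric name `test_auc` are hypothetical, and the pairwise AUC is quadratic in the number of examples, so a library routine such as scikit-learn's `roc_auc_score` would be the practical choice on the full test set.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    # Count how often a positive example outscores a negative one
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def log_test_auc(labels, scores):
    """Log the AUC to the active MLflow run and return it."""
    import mlflow  # imported lazily so roc_auc stays usable without MLflow

    auc = roc_auc(labels, scores)
    mlflow.log_metric("test_auc", auc)  # "test_auc" is an illustrative name
    return auc


# Toy example: scores are predicted probabilities of the positive class
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # prints 0.75
```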

In [0]:
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-cased", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")

Downloading:   0%|          | 0.00/411 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/251M [00:00<?, ?B/s]

Some weights of the model checkpoint at distilbert-base-cased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'classifier.b

Downloading:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/208k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/426k [00:00<?, ?B/s]

In [0]:
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)


train_dataset = train_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)

metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)



  0%|          | 0/25 [00:00<?, ?ba/s]

  0%|          | 0/25 [00:00<?, ?ba/s]

Downloading builder script:   0%|          | 0.00/1.65k [00:00<?, ?B/s]
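
The `compute_metrics` function above picks the highest-scoring class for each example and compares it with the reference label. The dependency-free sketch below reproduces that logic on made-up logits so the calculation is easy to follow; the helper names are hypothetical. Note that `Trainer` only calls `compute_metrics` when an evaluation is run, for example through `trainer.evaluate()`.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_from_logits(logits, labels):
    """Mirror compute_metrics: argmax over logits, then fraction correct."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Made-up logits for three reviews with two classes each
logits = [[0.1, 0.9], [2.0, -1.0], [0.3, 0.2]]
labels = [1, 0, 1]
print(accuracy_from_logits(logits, labels))  # two of three predictions match
```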

In [0]:
training_args = TrainingArguments(
    hub_model_id="rajistics/distilbert-imdb-mlflow",  # matches the repository cloned and loaded elsewhere in this notebook
    num_train_epochs=1,
    output_dir="./output",
    logging_steps=500,
    save_strategy="epoch",
    push_to_hub=True,
)

In [0]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics,
)

Cloning https://huggingface.co./rajistics/distilbert-imdb-mlflow into local empty directory.


In [0]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 25000
  Num Epochs = 1
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 3125


  0%|          | 0/3125 [00:00<?, ?it/s]
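
The "Total optimization steps" figure in the log follows from the dataset size and batch settings. The sketch below is an approximation of that bookkeeping under the settings shown in the log (one device, no gradient accumulation); the function name is hypothetical.

```python
import math


def total_optimization_steps(
    num_examples,
    per_device_batch_size,
    num_epochs=1,
    num_devices=1,
    gradient_accumulation_steps=1,
):
    """Estimate the number of optimizer updates for a training run."""
    # Batches per epoch; the last, smaller batch still counts
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    # One optimizer update per group of accumulated batches
    updates_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
    return math.ceil(num_epochs * updates_per_epoch)


# 25,000 IMDB training reviews, batch size 8, one epoch
print(total_optimization_steps(25000, 8))  # prints 3125
```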

In [0]:
mlflow.end_run()
trainer.push_to_hub()

Saving model checkpoint to ./output
Configuration saved in ./output/config.json
Model weights saved in ./output/pytorch_model.bin


Upload file pytorch_model.bin:   0%|          | 32.0k/251M [00:00<?, ?B/s]

Upload file training_args.bin: 100%|##########| 3.23k/3.23k [00:00<?, ?B/s]

To https://huggingface.co./rajistics/distilbert-imdb-mlflow
   11f9d35..565ce9d  main -> main

Dropping the following result as it does not have all the necessary fields:
{'task': {'name': 'Text Classification', 'type': 'text-classification'}, 'dataset': {'name': 'imdb', 'type': 'imdb', 'args': 'plain_text'}}
To https://huggingface.co./rajistics/distilbert-imdb-mlflow
   565ce9d..2e139c6  main -> main

Out[12]: 'https://huggingface.co./rajistics/distilbert-imdb-mlflow/commit/565ce9de2a3bf303432d5ca277711f8237b8097c'

In [0]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load the base tokenizer and the fine-tuned model from the Hub
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained("rajistics/distilbert-imdb-mlflow")
moviereview = pipeline("text-classification", model=model, tokenizer=tokenizer)

loading configuration file https://huggingface.co./distilbert-base-cased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/ebe1ea24d11aa664488b8de5b21e33989008ca78f207d4e30ec6350b693f073f.302bfd1b5e031cc1b17796e0b6e5b242ba2045d31d00f97589e12b458ebff27a
Model config DistilBertConfig {
  "_name_or_path": "distilbert-base-cased",
  "activation": "gelu",
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.20.1",
  "vocab_size": 28996
}

loading file https://huggingface.co./distilbert-base-cased/resolve/main/vocab.txt from cache at /root/.cache/huggingface/transformers/ba377304984dc63e3ede0e23a938bbbf04d5c3835b66d5bb48343aecca188429

Downloading:   0%|          | 0.00/251M [00:00<?, ?B/s]

storing https://huggingface.co./rajistics/distilbert-imdb-mlflow/resolve/main/pytorch_model.bin in cache at /root/.cache/huggingface/transformers/8c86731c548ff3ed880ff6d4a39acb090934b1be168769182cafe112afea3621.5aa75fe74f6e4ab0031e967ed46c152a9e096a093827e37659fb4bc253522c5a
creating metadata file for /root/.cache/huggingface/transformers/8c86731c548ff3ed880ff6d4a39acb090934b1be168769182cafe112afea3621.5aa75fe74f6e4ab0031e967ed46c152a9e096a093827e37659fb4bc253522c5a
loading weights file https://huggingface.co./rajistics/distilbert-imdb-mlflow/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/8c86731c548ff3ed880ff6d4a39acb090934b1be168769182cafe112afea3621.5aa75fe74f6e4ab0031e967ed46c152a9e096a093827e37659fb4bc253522c5a
All model checkpoint weights were used when initializing DistilBertForSequenceClassification.

All the weights of DistilBertForSequenceClassification were initialized from the model checkpoint at rajistics/distilbert-imdb-mlflow.
If your 

In [0]:
moviereview("This move was a bit crazy, but I liked it")

Out[25]: [{'label': 'LABEL_0', 'score': 0.5706899762153625}]
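
The pipeline reports generic names such as `LABEL_0` because the model configuration does not define label names. The sketch below maps them to readable names, assuming the IMDB convention of 0 for negative and 1 for positive; the helper `readable` is hypothetical. Passing an `id2label` mapping when loading the model is another common route.

```python
# Assumed IMDB label convention: 0 = negative, 1 = positive
IMDB_LABEL_NAMES = {0: "negative", 1: "positive"}


def readable(predictions, names=IMDB_LABEL_NAMES):
    """Replace generic LABEL_<n> names with human-readable class names."""
    result = []
    for prediction in predictions:
        # "LABEL_0" -> 0
        index = int(prediction["label"].rsplit("_", 1)[1])
        result.append({"label": names[index], "score": prediction["score"]})
    return result


print(readable([{"label": "LABEL_0", "score": 0.5706899762153625}]))
```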

### View MLflow runs
To view the logged training runs, click the **Experiment** icon at the upper right of the notebook to display the experiment sidebar. If necessary, click the refresh icon to fetch and monitor the latest runs. 

<img width="350" src="https://docs.databricks.com/_static/images/mlflow/quickstart/experiment-sidebar-icons.png"/>

You can then click the experiment page icon to display the more detailed MLflow experiment page ([AWS](https://docs.databricks.com/applications/mlflow/tracking.html#notebook-experiments)|[Azure](https://docs.microsoft.com/azure/databricks/applications/mlflow/tracking#notebook-experiments)|[GCP](https://docs.gcp.databricks.com/applications/mlflow/tracking.html#notebook-experiments)). This page allows you to compare runs and view details for specific runs.

<img width="800" src="https://docs.databricks.com/_static/images/mlflow/quickstart/compare-runs.png"/>

In [0]:
# Fetch the runs of this experiment as a pandas DataFrame
runs = mlflow.search_runs(experiment_ids=["3759898664210413"])

In [0]:
# `runs` is a pandas DataFrame; save it next to the model files so it can be pushed to the Hub
runs.to_csv("output/mlflow_runs.csv")

fatal: not a git repository (or any of the parent directories): .git
fatal: not a git repository (or any of the parent directories): .git
fatal: not a git repository (or any of the parent directories): .git
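
Once exported, the CSV can be inspected without MLflow. The sketch below picks the run with the highest value of a metric column using only the standard library. The helper `best_run`, the sample rows, and the `metrics.eval_accuracy` column are made up for illustration; `mlflow.search_runs` names metric columns `metrics.<name>`, so the actual column depends on what was logged.

```python
import csv
import io


def best_run(csv_file, metric_column):
    """Return (run_id, value) for the run with the highest metric value."""
    best = None
    for row in csv.DictReader(csv_file):
        raw = row.get(metric_column)
        if not raw:
            continue  # the metric was not logged for this run
        value = float(raw)
        if best is None or value > best[1]:
            best = (row["run_id"], value)
    return best


# Made-up rows in the shape of an exported runs table
sample = io.StringIO(
    "run_id,status,metrics.eval_accuracy\n"
    "aaa111,FINISHED,0.88\n"
    "bbb222,FINISHED,0.91\n"
    "ccc333,FAILED,\n"
)
print(best_run(sample, "metrics.eval_accuracy"))
```

With the real file, the same helper could be called on `open("output/mlflow_runs.csv", newline="")`.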


In [0]:
%sh
cd output
git add mlflow_runs.csv
git commit -m "Add MLFlow results"
git push

[main 7d3e3d7] Add MLFlow results
 1 file changed, 4 insertions(+)
 create mode 100644 mlflow_runs.csv
To https://huggingface.co./rajistics/distilbert-imdb-mlflow
   2e139c6..7d3e3d7  main -> main
