{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Convert & Optimize model with Optimum\n", "\n", "\n", "Steps:\n", "1. Convert model to ONNX\n", "2. Optimize & quantize model with Optimum\n", "3. Create Custom Handler for Inference Endpoints\n", "4. Test Custom Handler Locally\n", "5. Push to repository and create Inference Endpoint\n", "\n", "Helpful links:\n", "* [Accelerate Transformers with Hugging Face Optimum](https://huggingface.co/blog/optimum-inference)\n", "* [Optimizing Transformers for GPUs with Optimum](https://www.philschmid.de/optimizing-transformers-with-optimum-gpu)\n", "* [Optimum Documentation](https://huggingface.co/docs/optimum/onnxruntime/modeling_ort)\n", "* [Create Custom Handler Endpoints](https://link-to-docs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup & Installation" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Writing requirements.txt\n" ] } ], "source": [ "%%writefile requirements.txt\n", "optimum[onnxruntime]==1.4.0\n", "mkl-include\n", "mkl" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install -r requirements.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 0. Baseline Performance\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from transformers import pipeline\n", "\n", "qa = pipeline(\"question-answering\", model=\"deepset/roberta-base-squad2\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's measure the baseline latency of the vanilla model for an input with a sequence length of 128." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "context=\"Hello, my name is Philipp and I live in Nuremberg, Germany. 
Currently I am working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science. In the past I designed and implemented cloud-native machine learning architectures for fin-tech and insurance companies. I found my passion for cloud concepts and machine learning 5 years ago. Since then I never stopped learning. Currently, I am focusing myself in the area NLP and how to leverage models like BERT, Roberta, T5, ViT, and GPT2 to generate business value.\" \n", "question=\"As what is Philipp working?\" \n", "\n", "payload = {\"inputs\": {\"question\": question, \"context\": context}}" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Vanilla model Average latency (ms) - 64.15 +/- 2.44\n" ] } ], "source": [ "from time import perf_counter\n", "import numpy as np\n", "\n", "def measure_latency(pipe, payload):\n", "    latencies = []\n", "    # warm up so one-time initialization costs are not timed\n", "    for _ in range(10):\n", "        _ = pipe(question=payload[\"inputs\"][\"question\"], context=payload[\"inputs\"][\"context\"])\n", "    # timed run\n", "    for _ in range(50):\n", "        start_time = perf_counter()\n", "        _ = pipe(question=payload[\"inputs\"][\"question\"], context=payload[\"inputs\"][\"context\"])\n", "        latency = perf_counter() - start_time\n", "        latencies.append(latency)\n", "    # compute mean and standard deviation in milliseconds\n", "    time_avg_ms = 1000 * np.mean(latencies)\n", "    time_std_ms = 1000 * np.std(latencies)\n", "    return f\"Average latency (ms) - {time_avg_ms:.2f} +/- {time_std_ms:.2f}\"\n", "\n", "print(f\"Vanilla model {measure_latency(qa, payload)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. 
Convert model to ONNX" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "df00c03d67b546bf8a3d1a327b9380f5", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Downloading: 0%| | 0.00/571 [00:00 List[List[Dict[str, float]]]:\n", " \"\"\"\n", " Args:\n", " data (:obj:):\n", " includes the input data and the parameters for the inference.\n", " Return:\n", " A :obj:`list`:. The list contains the answer and scores of the inference inputs\n", " \"\"\"\n", " inputs = data.get(\"inputs\", data)\n", " # run the model\n", " prediction = self.pipeline(**inputs)\n", " # return prediction\n", " return prediction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Test Custom Handler Locally\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'score': 0.4749588668346405,\n", " 'start': 88,\n", " 'end': 102,\n", " 'answer': 'Technical Lead'}" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from handler import EndpointHandler\n", "\n", "# init handler\n", "my_handler = EndpointHandler(path=\".\")\n", "\n", "# prepare sample payload\n", "context=\"Hello, my name is Philipp and I live in Nuremberg, Germany. Currently I am working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science. In the past I designed and implemented cloud-native machine learning architectures for fin-tech and insurance companies. I found my passion for cloud concepts and machine learning 5 years ago. Since then I never stopped learning. 
Currently, I am focusing myself in the area NLP and how to leverage models like BERT, Roberta, T5, ViT, and GPT2 to generate business value.\" \n", "question=\"As what is Philipp working?\" \n", "\n", "payload = {\"inputs\": {\"question\": question, \"context\": context}}\n", "\n", "# test the handler\n", "my_handler(payload)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Optimized & Quantized model Average latency (ms) - 29.90 +/- 0.53\n" ] } ], "source": [ "from time import perf_counter\n", "import numpy as np\n", "\n", "def measure_latency(handler, payload):\n", "    latencies = []\n", "    # warm up so one-time initialization costs are not timed\n", "    for _ in range(10):\n", "        _ = handler(payload)\n", "    # timed run\n", "    for _ in range(50):\n", "        start_time = perf_counter()\n", "        _ = handler(payload)\n", "        latency = perf_counter() - start_time\n", "        latencies.append(latency)\n", "    # compute mean and standard deviation in milliseconds\n", "    time_avg_ms = 1000 * np.mean(latencies)\n", "    time_std_ms = 1000 * np.std(latencies)\n", "    return f\"Average latency (ms) - {time_avg_ms:.2f} +/- {time_std_ms:.2f}\"\n", "\n", "print(f\"Optimized & Quantized model {measure_latency(my_handler, payload)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For comparison, the vanilla model measured `Average latency (ms) - 64.15 +/- 2.44`, so the optimized and quantized model is roughly 2.1x faster (29.90 ms vs. 64.15 ms)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. 
Push to repository and create Inference Endpoint\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# add all our new files\n", "!git add * \n", "# commit our files\n", "!git commit -m \"add custom handler\"\n", "# push the files to the hub\n", "!git push" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.9.12 ('az': conda)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" }, "orig_nbformat": 4, "vscode": { "interpreter": { "hash": "bddb99ecda5b40a820d97bf37f3ff3a89fb9dbcf726ae84d28624ac628a665b4" } } }, "nbformat": 4, "nbformat_minor": 2 }