{
  "cells": [
    {
      "cell_type": "markdown",
      "id": "p1dCsisKaCPx",
      "metadata": {
        "id": "p1dCsisKaCPx"
      },
      "source": [
        "Install prerequisite packages."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "KtsIcBFOZ4Wc",
      "metadata": {
        "id": "KtsIcBFOZ4Wc"
      },
      "outputs": [],
      "source": [
        "!pip install -U datasets sentence-transformers pinecone-client tqdm"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "7bb8e8fd",
      "metadata": {
        "id": "7bb8e8fd"
      },
      "source": [
        "# YouTube Indexing and Queries\n",
        "\n",
        "In this notebook we will work through an example of indexing and querying the YouTube video transcriptions data. We start by loading the dataset."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "e43af099",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "e43af099",
        "outputId": "cda841cc-95be-4bb2-bc9b-7990a7748918"
      },
      "outputs": [
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "Using custom data configuration pinecone--yt-transcriptions-b2ddba2bc158a89e\n",
            "Reusing dataset json (/Users/jamesbriggs/.cache/huggingface/datasets/json/pinecone--yt-transcriptions-b2ddba2bc158a89e/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b)\n"
          ]
        },
        {
          "data": {
            "text/plain": [
              "Dataset({\n",
              "    features: ['video_id', 'text', 'start_second', 'end_second', 'url', 'title', 'thumbnail'],\n",
              "    num_rows: 11298\n",
              "})"
            ]
          },
          "execution_count": 3,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "from datasets import load_dataset\n",
        "\n",
        "ytt = load_dataset(\n",
        "    \"pinecone/yt-transcriptions\",\n",
        "    split=\"train\",\n",
        "    revision=\"926a45\"\n",
        ")\n",
        "ytt"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "3db35007",
      "metadata": {
        "id": "3db35007"
      },
      "source": [
        "Each sample includes video-level information (ID, title, url and thumbnail) and snippet-level information (text, start_second, end_second)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "e6e73d96",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "e6e73d96",
        "outputId": "1a807672-e5da-4c52-e10a-6066326a0b42"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "{'video_id': 'ZPewmEu7644', 'text': \" hi this is Jeff Dean welcome to applications of deep neural networks of Washington University in this video we're going to look at how we can use ganz to generate additional training data for the latest on my a I course and projects click subscribe in the bell next to it to be notified of every new video Dan's have a wide array of uses beyond just the face generation that you\", 'start_second': 0, 'end_second': 20, 'url': 'https://www.youtube.com/watch?v=ZPewmEu7644&t=0s', 'title': 'GANS for Semi-Supervised Learning in Keras (7.4)', 'thumbnail': 'https://i.ytimg.com/vi/ZPewmEu7644/maxresdefault.jpg'}\n"
          ]
        }
      ],
      "source": [
        "for x in ytt:\n",
        "    print(x)\n",
        "    break"
      ]
    },
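    {
      "cell_type": "markdown",
      "id": "d4a7c001",
      "metadata": {
        "id": "d4a7c001"
      },
      "source": [
        "Each video is split into many short snippets, so the `11298` rows cover a much smaller number of unique videos. As a quick optional sanity check, we can count them (a minimal sketch using `datasets` column access):"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "d4a7c002",
      "metadata": {
        "id": "d4a7c002"
      },
      "outputs": [],
      "source": [
        "# optional sanity check: how many unique videos do the snippets cover?\n",
        "unique_videos = set(ytt['video_id'])\n",
        "print(f\"{len(ytt)} snippets across {len(unique_videos)} videos\")"
      ]
    },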
    {
      "cell_type": "markdown",
      "id": "b38004c9",
      "metadata": {
        "id": "b38004c9"
      },
      "source": [
        "# Uploading Documents to Pinecone Index\n",
        "\n",
        "The next step is indexing this dataset in Pinecone, for this we need a sentence transformer model (to encode the text into embeddings), and a Pinecone index.\n",
        "\n",
        "We will initialize the sentence transformer first."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "7f6b8e87",
      "metadata": {
        "id": "7f6b8e87",
        "outputId": "67e7f141-84aa-416c-beee-ccfeefd433f8"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "SentenceTransformer(\n",
              "  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: MPNetModel \n",
              "  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})\n",
              "  (2): Normalize()\n",
              ")"
            ]
          },
          "execution_count": 3,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "from sentence_transformers import SentenceTransformer\n",
        "\n",
        "retriever = SentenceTransformer('flax-sentence-embeddings/all_datasets_v3_mpnet-base')\n",
        "retriever"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "a02ff222",
      "metadata": {
        "id": "a02ff222"
      },
      "source": [
        "We can see the embedding dimension of `768` above, we will need this when creating our Pinecone index."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "69fa3db9",
      "metadata": {
        "id": "69fa3db9",
        "outputId": "1b00b6d9-7cb4-409d-f97d-6085ff52d69c"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "768"
            ]
          },
          "execution_count": 4,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "embed_dim = retriever.get_sentence_embedding_dimension()\n",
        "embed_dim"
      ]
    },
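    {
      "cell_type": "markdown",
      "id": "b7e4d101",
      "metadata": {
        "id": "b7e4d101"
      },
      "source": [
        "Note that the model ends with a `Normalize()` module (visible in the architecture printout above), so its embeddings are unit length. This is why cosine similarity is a natural metric for our index. A quick sketch to confirm, using an arbitrary example sentence:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "b7e4d102",
      "metadata": {
        "id": "b7e4d102"
      },
      "outputs": [],
      "source": [
        "import numpy as np\n",
        "\n",
        "# encode one sentence and check the embedding's shape and norm\n",
        "v = retriever.encode(['an arbitrary example sentence'])\n",
        "print(v.shape)               # (1, 768)\n",
        "print(np.linalg.norm(v[0]))  # ~1.0, thanks to the Normalize() module"
      ]
    },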
    {
      "cell_type": "markdown",
      "id": "bcea76e1",
      "metadata": {
        "id": "bcea76e1"
      },
      "source": [
        "Now we can initialize our index."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "fc94564a",
      "metadata": {
        "id": "fc94564a"
      },
      "outputs": [],
      "source": [
        "import pinecone\n",
        "\n",
        "# get api key from app.pinecone.io\n",
        "pinecone.init(\n",
        "    api_key=\"<<YOUR_API_KEY>>\",\n",
        "    environment=\"us-west1-gcp\"\n",
        ")\n",
        "\n",
        "# create index\n",
        "pinecone.create_index(\n",
        "    \"youtube-search\",\n",
        "    dimension=embed_dim,\n",
        "    metric=\"cosine\"\n",
        ")\n",
        "\n",
        "# connect to new index\n",
        "index = pinecone.Index(\"youtube-search\")"
      ]
    },
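    {
      "cell_type": "markdown",
      "id": "c9f2a201",
      "metadata": {
        "id": "c9f2a201"
      },
      "source": [
        "Optionally, we can confirm the index exists before upserting anything:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "c9f2a202",
      "metadata": {
        "id": "c9f2a202"
      },
      "outputs": [],
      "source": [
        "# list all indexes in the project; 'youtube-search' should appear\n",
        "pinecone.list_indexes()"
      ]
    },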
    {
      "cell_type": "markdown",
      "id": "7b936d2b",
      "metadata": {
        "id": "7b936d2b"
      },
      "source": [
        "We will index our data in batches of `64`, the data we insert into our index will contain records (eg *documents*) containing a unique document/snippet ID, embedding, and metadata in the format:\n",
        "\n",
        "```json\n",
        "{\n",
        "    'doc-id',\n",
        "    [0.0, 0.3, 0.1, ...],\n",
        "    {'title': '???', 'start_seconds': 12, ...}\n",
        "}\n",
        "```"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "a71fca5f",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "referenced_widgets": [
            "428965e30ec84e56b9e8b7be96be8320"
          ]
        },
        "id": "a71fca5f",
        "outputId": "11b97ee1-ee4e-4ad9-fdde-9a2e2a13e048"
      },
      "outputs": [
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "428965e30ec84e56b9e8b7be96be8320",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "  0%|          | 0/177 [00:00<?, ?it/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        }
      ],
      "source": [
        "from tqdm.auto import tqdm\n",
        "\n",
        "docs = []  # this will store IDs, embeddings, and metadata\n",
        "\n",
        "batch_size = 64\n",
        "\n",
        "for i in tqdm(range(0, len(ytt), batch_size)):\n",
        "    i_end = min(i+batch_size, len(ytt))\n",
        "    # extract batch from YT transactions data\n",
        "    batch = ytt[i:i_end]\n",
        "    # encode batch of text\n",
        "    embeds = retriever.encode(batch['text']).tolist()\n",
        "    # each snippet needs a unique ID\n",
        "    # we will merge video ID and start_seconds for this\n",
        "    ids = [f\"{x[0]}-{x[1]}\" for x in zip(batch['video_id'], batch['start_second'])]\n",
        "    # create metadata records\n",
        "    meta = [{\n",
        "        'video_id': x[0],\n",
        "        'title': x[1],\n",
        "        'text': x[2],\n",
        "        'start_second': x[3],\n",
        "        'end_second': x[4],\n",
        "        'url': x[5],\n",
        "        'thumbnail': x[6]\n",
        "    } for x in zip(\n",
        "        batch['video_id'],\n",
        "        batch['title'],\n",
        "        batch['text'],\n",
        "        batch['start_second'],\n",
        "        batch['end_second'],\n",
        "        batch['url'],\n",
        "        batch['thumbnail']\n",
        "    )]\n",
        "    # create list of (IDs, vectors, metadata) to upsert\n",
        "    to_upsert = list(zip(ids, embeds, meta))\n",
        "    # add to pinecone\n",
        "    index.upsert(vectors=to_upsert)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "82c936e3",
      "metadata": {
        "id": "82c936e3",
        "outputId": "ebf73926-9755-4746-b752-4851376686ad"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "{'dimension': 768,\n",
              " 'index_fullness': 0.01,\n",
              " 'namespaces': {'': {'vector_count': 11298}}}"
            ]
          },
          "execution_count": 13,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "index.describe_index_stats()"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "317f712a",
      "metadata": {
        "id": "317f712a"
      },
      "source": [
        "# Querying\n",
        "\n",
        "When query we encode our text with the same retriever model and pass it to the Pinecone `query` endpoint."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "15668c1b",
      "metadata": {
        "id": "15668c1b"
      },
      "outputs": [],
      "source": [
        "query = \"What is deep learning?\"\n",
        "\n",
        "xq = retriever.encode([query]).tolist()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "c98c5714",
      "metadata": {
        "id": "c98c5714",
        "outputId": "df71be47-eea6-497c-fa51-9b0160a73f2d"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            " terms of optimization but what's the algorithm for updating the parameters or updating whatever the state of the network is and then the the last part is the the data set like how do you actually represent the world as it comes into your machine learning system so I think of deep learning as telling us something about what does the model look like and basically to qualify as deep I\n",
            "---\n",
            " any theoretical components any theoretical things that you need to understand about deep learning can be sick later for that link again just watched the word doc file again in that I mentioned the link also the second channel is my channel because deep learning might be complete deep learning playlist that I have created is completely in order okay to the other\n",
            "---\n",
            " under a rock for the last few years you have heard of the deep networks and how they have revolutionised computer vision and kind of the standard classic way of doing this is it's basically a classic supervised learning problem you are giving a network which you can think of as a big black box a pairs of input images and output labels XY pairs okay and this big black box essentially you\n",
            "---\n",
            " do the task at hand. Now deep learning is just a subset of machine learning which takes this idea even a step further and says how can we automatically extract the useful pieces of information needed to inform those future predictions or make a decision And that's what this class is all about teaching algorithms how to learn a task directly from raw data. We want to\n",
            "---\n",
            " algorithm and yelled at everybody in a good way that nobody was answering it correctly everybody knew what the alkyl it was graduate course everybody knew what an algorithm was but they weren't able to answer it well let me ask you in that same spirit what is deep learning I would say deep learning is any kind of machine learning that involves learning parameters of more than one consecutive\n",
            "---\n"
          ]
        }
      ],
      "source": [
        "xc = index.query(xq, top_k=5,\n",
        "                 include_metadata=True)\n",
        "for context in xc['matches']:\n",
        "    print(context['metadata']['text'], end=\"\\n---\\n\")"
      ]
    },
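    {
      "cell_type": "markdown",
      "id": "e5d8b301",
      "metadata": {
        "id": "e5d8b301"
      },
      "source": [
        "Because we stored video metadata alongside each vector, every match can be turned into a link that jumps straight to the relevant moment in the video (the `url` field already includes the snippet's start time). A minimal sketch, reusing the `xc` response from above:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "e5d8b302",
      "metadata": {
        "id": "e5d8b302"
      },
      "outputs": [],
      "source": [
        "# print each match's similarity score, video title, and timestamped URL\n",
        "for match in xc['matches']:\n",
        "    meta = match['metadata']\n",
        "    print(f\"{match['score']:.2f} | {meta['title']}\")\n",
        "    print(meta['url'], end='\\n---\\n')"
      ]
    }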
  ],
  "metadata": {
    "accelerator": "GPU",
    "colab": {
      "collapsed_sections": [],
      "name": "01_pinecone_yt_search.ipynb",
      "provenance": []
    },
    "interpreter": {
      "hash": "e81a84c338879f0412495ea98350e80595740634d3ce0fba8d30f35c60f1a4c3"
    },
    "kernelspec": {
      "display_name": "Python 3.8.12 ('stoic')",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.8.12"
    },
    "widgets": {
      "application/vnd.jupyter.widget-state+json": {}
    }
  },
  "nbformat": 4,
  "nbformat_minor": 5
}