AlaFalaki committed on
Commit
8d12c87
·
1 Parent(s): 223673a

Created using Colaboratory

Files changed (1)
  1. notebooks/02-Basic_RAG.ipynb +727 -0
notebooks/02-Basic_RAG.ipynb ADDED
@@ -0,0 +1,727 @@
1
+ {
2
+ "nbformat": 4,
3
+ "nbformat_minor": 0,
4
+ "metadata": {
5
+ "colab": {
6
+ "provenance": [],
7
+ "authorship_tag": "ABX9TyND3B7W2jhm8NF7/+wlqCAN",
8
+ "include_colab_link": true
9
+ },
10
+ "kernelspec": {
11
+ "name": "python3",
12
+ "display_name": "Python 3"
13
+ },
14
+ "language_info": {
15
+ "name": "python"
16
+ }
17
+ },
18
+ "cells": [
19
+ {
20
+ "cell_type": "markdown",
21
+ "metadata": {
22
+ "id": "view-in-github",
23
+ "colab_type": "text"
24
+ },
25
+ "source": [
26
+ "<a href=\"https://colab.research.google.com/github/towardsai/ai-tutor-rag-system/blob/main/notebooks/02-Basic_RAG.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
27
+ ]
28
+ },
29
+ {
30
+ "cell_type": "code",
31
+ "execution_count": 1,
32
+ "metadata": {
33
+ "colab": {
34
+ "base_uri": "https://localhost:8080/"
35
+ },
36
+ "id": "HaB4G9zr0BYm",
37
+ "outputId": "03c7161a-e3e2-4bf6-e148-ef94ddca73a4"
38
+ },
39
+ "outputs": [
40
+ {
41
+ "output_type": "stream",
42
+ "name": "stdout",
43
+ "text": [
44
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m225.4/225.4 kB\u001b[0m \u001b[31m978.9 kB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
45
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m51.7/51.7 kB\u001b[0m \u001b[31m963.7 kB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
46
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m2.0/2.0 MB\u001b[0m \u001b[31m15.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
47
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m75.9/75.9 kB\u001b[0m \u001b[31m2.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
48
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m3.1/3.1 MB\u001b[0m \u001b[31m29.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
49
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m76.9/76.9 kB\u001b[0m \u001b[31m5.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
50
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m58.3/58.3 kB\u001b[0m \u001b[31m4.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
51
+ "\u001b[?25h\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\n",
52
+ "tensorflow-probability 0.22.0 requires typing-extensions<4.6.0, but you have typing-extensions 4.9.0 which is incompatible.\u001b[0m\u001b[31m\n",
53
+ "\u001b[0m"
54
+ ]
55
+ }
56
+ ],
57
+ "source": [
58
+ "!pip install -q openai==1.6.0 cohere==4.39 tiktoken==0.5.2"
59
+ ]
60
+ },
61
+ {
62
+ "cell_type": "code",
63
+ "source": [
64
+ "import os\n",
65
+ "\n",
66
+ "os.environ[\"OPENAI_API_KEY\"] = \"<YOUR_OPENAI_KEY>\""
67
+ ],
68
+ "metadata": {
69
+ "id": "MYvUA6CF2Le6"
70
+ },
71
+ "execution_count": 2,
72
+ "outputs": []
73
+ },
74
+ {
75
+ "cell_type": "code",
76
+ "source": [
77
+ "embedding_file = \"./index_with_embedding.json\""
78
+ ],
79
+ "metadata": {
80
+ "id": "0ViVXXIqXBai"
81
+ },
82
+ "execution_count": 116,
83
+ "outputs": []
84
+ },
85
+ {
86
+ "cell_type": "markdown",
87
+ "source": [
88
+ "# Load Dataset"
89
+ ],
90
+ "metadata": {
91
+ "id": "D8Nzx-cN_bDz"
92
+ }
93
+ },
94
+ {
95
+ "cell_type": "markdown",
96
+ "source": [
97
+ "### Download Paul Graham Essay (JSON)"
98
+ ],
99
+ "metadata": {
100
+ "id": "5JpI7GiZ--Gw"
101
+ }
102
+ },
103
+ {
104
+ "cell_type": "code",
105
+ "source": [
106
+ "!wget https://raw.githubusercontent.com/run-llama/llama_index/main/examples/paul_graham_essay/index.json"
107
+ ],
108
+ "metadata": {
109
+ "colab": {
110
+ "base_uri": "https://localhost:8080/"
111
+ },
112
+ "id": "p6NEJT9S2OoH",
113
+ "outputId": "b7b4eb4c-77e2-439f-ed10-d987dbce77f0"
114
+ },
115
+ "execution_count": 6,
116
+ "outputs": [
117
+ {
118
+ "output_type": "stream",
119
+ "name": "stdout",
120
+ "text": [
121
+ "--2023-12-20 16:11:36-- https://raw.githubusercontent.com/run-llama/llama_index/main/examples/paul_graham_essay/index.json\n",
122
+ "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...\n",
123
+ "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.\n",
124
+ "HTTP request sent, awaiting response... 200 OK\n",
125
+ "Length: 176588 (172K) [text/plain]\n",
126
+ "Saving to: ‘index.json’\n",
127
+ "\n",
128
+ "\rindex.json 0%[ ] 0 --.-KB/s \rindex.json 100%[===================>] 172.45K --.-KB/s in 0.02s \n",
129
+ "\n",
130
+ "2023-12-20 16:11:36 (7.10 MB/s) - ‘index.json’ saved [176588/176588]\n",
131
+ "\n"
132
+ ]
133
+ }
134
+ ]
135
+ },
136
+ {
137
+ "cell_type": "markdown",
138
+ "source": [
139
+ "### Read JSON"
140
+ ],
141
+ "metadata": {
142
+ "id": "oYDd03Qn_clh"
143
+ }
144
+ },
145
+ {
146
+ "cell_type": "code",
147
+ "source": [
148
+ "import json\n",
149
+ "\n",
150
+ "with open('./index.json', 'r') as file:\n",
151
+ "    data = json.load(file)"
152
+ ],
153
+ "metadata": {
154
+ "id": "XjW10Cnz_DVZ"
155
+ },
156
+ "execution_count": 69,
157
+ "outputs": []
158
+ },
159
+ {
160
+ "cell_type": "code",
161
+ "source": [
162
+ "data = data['index_struct']['all_nodes']"
163
+ ],
164
+ "metadata": {
165
+ "id": "A2U5PnB2CnUO"
166
+ },
167
+ "execution_count": 70,
168
+ "outputs": []
169
+ },
170
+ {
171
+ "cell_type": "code",
172
+ "source": [
173
+ "len( data )"
174
+ ],
175
+ "metadata": {
176
+ "colab": {
177
+ "base_uri": "https://localhost:8080/"
178
+ },
179
+ "id": "UcQ7Ge_XCuXa",
180
+ "outputId": "669be133-a912-4d74-83c2-3ee109a47ed7"
181
+ },
182
+ "execution_count": 71,
183
+ "outputs": [
184
+ {
185
+ "output_type": "execute_result",
186
+ "data": {
187
+ "text/plain": [
188
+ "63"
189
+ ]
190
+ },
191
+ "metadata": {},
192
+ "execution_count": 71
193
+ }
194
+ ]
195
+ },
196
+ {
197
+ "cell_type": "code",
198
+ "source": [
199
+ "import pandas as pd\n",
200
+ "\n",
201
+ "df = pd.DataFrame(data).transpose()"
202
+ ],
203
+ "metadata": {
204
+ "id": "YNk1hxGVDqTu"
205
+ },
206
+ "execution_count": 103,
207
+ "outputs": []
208
+ },
209
+ {
210
+ "cell_type": "code",
211
+ "source": [
212
+ "df.keys()"
213
+ ],
214
+ "metadata": {
215
+ "colab": {
216
+ "base_uri": "https://localhost:8080/"
217
+ },
218
+ "id": "JKdFSOb0NXjx",
219
+ "outputId": "0b16e381-fea1-4c04-f656-03488705f770"
220
+ },
221
+ "execution_count": 104,
222
+ "outputs": [
223
+ {
224
+ "output_type": "execute_result",
225
+ "data": {
226
+ "text/plain": [
227
+ "Index(['text', 'doc_id', 'index', 'child_indices', 'embedding', 'ref_doc_id'], dtype='object')"
228
+ ]
229
+ },
230
+ "metadata": {},
231
+ "execution_count": 104
232
+ }
233
+ ]
234
+ },
235
+ {
236
+ "cell_type": "markdown",
237
+ "source": [
238
+ "### Apply Embedding"
239
+ ],
240
+ "metadata": {
241
+ "id": "21pFDgNdW9rO"
242
+ }
243
+ },
244
+ {
245
+ "cell_type": "code",
246
+ "source": [
247
+ "from openai import OpenAI\n",
248
+ "\n",
249
+ "client = OpenAI()\n",
250
+ "\n",
251
+ "def get_embedding(text):\n",
252
+ "    try:\n",
253
+ "        # Remove newlines\n",
254
+ "        text = text.replace(\"\\n\", \" \")\n",
255
+ "        res = client.embeddings.create(input = [text], model=\"text-embedding-ada-002\")\n",
256
+ "\n",
257
+ "        return res.data[0].embedding\n",
258
+ "\n",
259
+ "    except Exception:\n",
260
+ "        return None"
261
+ ],
262
+ "metadata": {
263
+ "id": "AfS9w9eQAKyu"
264
+ },
265
+ "execution_count": 88,
266
+ "outputs": []
267
+ },
268
+ {
269
+ "cell_type": "code",
270
+ "source": [
271
+ "not bool( embedding_file )"
272
+ ],
273
+ "metadata": {
274
+ "colab": {
275
+ "base_uri": "https://localhost:8080/"
276
+ },
277
+ "id": "p010q7cqXWi3",
278
+ "outputId": "8fb409b1-e0ce-4412-f5a5-46cb475f5ce1"
279
+ },
280
+ "execution_count": 118,
281
+ "outputs": [
282
+ {
283
+ "output_type": "execute_result",
284
+ "data": {
285
+ "text/plain": [
286
+ "False"
287
+ ]
288
+ },
289
+ "metadata": {},
290
+ "execution_count": 118
291
+ }
292
+ ]
293
+ },
294
+ {
295
+ "cell_type": "code",
296
+ "source": [
297
+ "from tqdm.notebook import tqdm\n",
298
+ "\n",
299
+ "if not embedding_file or not os.path.exists(embedding_file):\n",
300
+ "    print(\"Generating embeddings...\")\n",
301
+ "    for index, row in tqdm( df.iterrows() ):\n",
302
+ "        df.at[index, 'embedding'] = get_embedding( row['text'] )\n",
303
+ "\n",
304
+ "    df.to_json(\"./index_with_embedding.json\")\n",
305
+ "\n",
306
+ "else:\n",
307
+ "    print(\"Loaded the embedding file.\")\n",
308
+ "    with open(embedding_file, 'r') as file:\n",
309
+ "        data = json.load(file)\n",
310
+ "    df = pd.DataFrame(data)"
311
+ ],
312
+ "metadata": {
313
+ "colab": {
314
+ "base_uri": "https://localhost:8080/"
315
+ },
316
+ "id": "qC6aeFr3Rmi2",
317
+ "outputId": "7f531fd3-ae10-4ff1-f42e-1a52e90b71b7"
318
+ },
319
+ "execution_count": 121,
320
+ "outputs": [
321
+ {
322
+ "output_type": "stream",
323
+ "name": "stdout",
324
+ "text": [
325
+ "Loaded the embedding file.\n"
326
+ ]
327
+ }
328
+ ]
329
+ },
330
+ {
331
+ "cell_type": "markdown",
332
+ "source": [
333
+ "# User Question"
334
+ ],
335
+ "metadata": {
336
+ "id": "E_qrXwImXrXJ"
337
+ }
338
+ },
339
+ {
340
+ "cell_type": "code",
341
+ "source": [
342
+ "QUESTION = \"How much budget Paul Graham had to spend each day when he was in Florence?\""
343
+ ],
344
+ "metadata": {
345
+ "id": "mHt4bZjYRaiY"
346
+ },
347
+ "execution_count": 293,
348
+ "outputs": []
349
+ },
350
+ {
351
+ "cell_type": "code",
352
+ "source": [
353
+ "QUESTION_emb = get_embedding( QUESTION )"
354
+ ],
355
+ "metadata": {
356
+ "id": "OMQG6DtaVKGl"
357
+ },
358
+ "execution_count": 294,
359
+ "outputs": []
360
+ },
361
+ {
362
+ "cell_type": "code",
363
+ "source": [
364
+ "len( QUESTION_emb )"
365
+ ],
366
+ "metadata": {
367
+ "colab": {
368
+ "base_uri": "https://localhost:8080/"
369
+ },
370
+ "id": "xGTa7cqCX97q",
371
+ "outputId": "d38fbd8e-61f7-4603-d2b2-d1b5ff706cbd"
372
+ },
373
+ "execution_count": 295,
374
+ "outputs": [
375
+ {
376
+ "output_type": "execute_result",
377
+ "data": {
378
+ "text/plain": [
379
+ "1536"
380
+ ]
381
+ },
382
+ "metadata": {},
383
+ "execution_count": 295
384
+ }
385
+ ]
386
+ },
387
+ {
388
+ "cell_type": "markdown",
389
+ "source": [
390
+ "# Calculate Cosine Similarities"
391
+ ],
392
+ "metadata": {
393
+ "id": "BXNzNWrJYWhU"
394
+ }
395
+ },
396
+ {
397
+ "cell_type": "code",
398
+ "source": [
399
+ "BAD_SOURCE_emb = get_embedding( \"The sky is blue.\" )\n",
400
+ "GOOD_SOURCE_emb = get_embedding( \"Paul Graham had budget of $1 per day in Florence.\" )"
401
+ ],
402
+ "metadata": {
403
+ "id": "LqDWcPd4b-ZI"
404
+ },
405
+ "execution_count": 296,
406
+ "outputs": []
407
+ },
408
+ {
409
+ "cell_type": "code",
410
+ "source": [
411
+ "from sklearn.metrics.pairwise import cosine_similarity\n",
412
+ "\n",
413
+ "# An example of how a relevant piece of text achieves a higher similarity score than\n",
414
+ "# a completely unrelated one.\n",
415
+ "print(\"> Bad Response Score:\", cosine_similarity([QUESTION_emb], [BAD_SOURCE_emb]) )\n",
416
+ "print(\"> Good Response Score:\", cosine_similarity([QUESTION_emb], [GOOD_SOURCE_emb]) )"
417
+ ],
418
+ "metadata": {
419
+ "colab": {
420
+ "base_uri": "https://localhost:8080/"
421
+ },
422
+ "id": "OI00eN86YZKB",
423
+ "outputId": "cb4c84a8-97ea-4def-d28a-2fea16514bd7"
424
+ },
425
+ "execution_count": 297,
426
+ "outputs": [
427
+ {
428
+ "output_type": "stream",
429
+ "name": "stdout",
430
+ "text": [
431
+ "> Bad Response Score: [[0.71542785]]\n",
432
+ "> Good Response Score: [[0.94895731]]\n"
433
+ ]
434
+ }
435
+ ]
436
+ },
437
+ {
438
+ "cell_type": "code",
439
+ "source": [
440
+ "cosine_similarities = cosine_similarity( [QUESTION_emb], df['embedding'].tolist() )"
441
+ ],
442
+ "metadata": {
443
+ "id": "iYzyEU6-bwSz"
444
+ },
445
+ "execution_count": 298,
446
+ "outputs": []
447
+ },
448
+ {
449
+ "cell_type": "code",
450
+ "source": [
451
+ "cosine_similarities"
452
+ ],
453
+ "metadata": {
454
+ "colab": {
455
+ "base_uri": "https://localhost:8080/"
456
+ },
457
+ "id": "PNPN7OAXemmH",
458
+ "outputId": "6f1bccd1-9b49-41c6-b660-1b679cceae53"
459
+ },
460
+ "execution_count": 299,
461
+ "outputs": [
462
+ {
463
+ "output_type": "execute_result",
464
+ "data": {
465
+ "text/plain": [
466
+ "array([[0.74283738, 0.72700095, 0.75178514, 0.72451674, 0.71790563,\n",
467
+ " 0.71921944, 0.75048449, 0.74530984, 0.75329709, 0.76653264,\n",
468
+ " 0.76974562, 0.72420627, 0.73249944, 0.76877011, 0.79248341,\n",
469
+ " 0.73697747, 0.71247212, 0.78187566, 0.76031472, 0.73767647,\n",
470
+ " 0.72234783, 0.74282612, 0.75314902, 0.74521852, 0.75741529,\n",
471
+ " 0.7501051 , 0.75077242, 0.76393937, 0.76099337, 0.76104132,\n",
472
+ " 0.74024353, 0.71837665, 0.76223564, 0.71520494, 0.72417841,\n",
473
+ " 0.76740974, 0.74417228, 0.7428031 , 0.74380591, 0.76662179,\n",
474
+ " 0.76371986, 0.74027053, 0.72705364, 0.76053316, 0.72927648,\n",
475
+ " 0.72150454, 0.7451773 , 0.69488923, 0.73433034, 0.77833212,\n",
476
+ " 0.74265443, 0.76117176, 0.7494365 , 0.71624952, 0.72440837,\n",
477
+ " 0.73051577, 0.72781133, 0.74202563, 0.75499305, 0.7157058 ,\n",
478
+ " 0.72145234, 0.76054956, 0.71896691]])"
479
+ ]
480
+ },
481
+ "metadata": {},
482
+ "execution_count": 299
483
+ }
484
+ ]
485
+ },
486
+ {
487
+ "cell_type": "code",
488
+ "source": [
489
+ "import numpy as np\n",
490
+ "\n",
491
+ "# Find the index of the highest score\n",
492
+ "highest_index = np.argmax( cosine_similarities )"
493
+ ],
494
+ "metadata": {
495
+ "id": "g0cBfePFaayw"
496
+ },
497
+ "execution_count": 300,
498
+ "outputs": []
499
+ },
500
+ {
501
+ "cell_type": "code",
502
+ "source": [
503
+ "number_of_chunks_to_retrieve = 3\n",
504
+ "\n",
505
+ "indices = np.argsort(cosine_similarities[0])[::-1][:number_of_chunks_to_retrieve]"
506
+ ],
507
+ "metadata": {
508
+ "id": "vBbHJ7uihfKO"
509
+ },
510
+ "execution_count": 301,
511
+ "outputs": []
512
+ },
513
+ {
514
+ "cell_type": "code",
515
+ "source": [
516
+ "indices"
517
+ ],
518
+ "metadata": {
519
+ "colab": {
520
+ "base_uri": "https://localhost:8080/"
521
+ },
522
+ "id": "1-XI1_7mhlw4",
523
+ "outputId": "ac5a55ad-bfc7-409e-cf2e-5cc1dce75544"
524
+ },
525
+ "execution_count": 302,
526
+ "outputs": [
527
+ {
528
+ "output_type": "execute_result",
529
+ "data": {
530
+ "text/plain": [
531
+ "array([14, 17, 49])"
532
+ ]
533
+ },
534
+ "metadata": {},
535
+ "execution_count": 302
536
+ }
537
+ ]
538
+ },
539
+ {
540
+ "cell_type": "code",
541
+ "source": [
542
+ "# Get the highest scored pieces of text\n",
543
+ "for idx, item in enumerate( df.text.iloc[indices] ):\n",
544
+ "    print(f\"> Chunk {idx+1}\")\n",
545
+ "    print(item)\n",
546
+ "    print(\"----\")"
547
+ ],
548
+ "metadata": {
549
+ "colab": {
550
+ "base_uri": "https://localhost:8080/"
551
+ },
552
+ "id": "JPmhCb9kfB0w",
553
+ "outputId": "aafb01ef-65e1-4630-f849-96a2eaf1ee06"
554
+ },
555
+ "execution_count": 303,
556
+ "outputs": [
557
+ {
558
+ "output_type": "stream",
559
+ "name": "stdout",
560
+ "text": [
561
+ "> Chunk 1\n",
562
+ "of thinking, but at the time it caused a lot of friction. Toward the end of the year I spent much of my time surreptitiously working on On Lisp, which I had by this time gotten a contract to publish.\n",
563
+ "\n",
564
+ "The good part was that I got paid huge amounts of money, especially by art student standards. In Florence, after paying my part of the rent, my budget for everything else had been $7 a day. Now I was getting paid more than 4 times that every hour, even when I was just sitting in a meeting. By living cheaply I not only managed to save enough to go back to RISD, but also paid off my college loans.\n",
565
+ "\n",
566
+ "I learned some useful things at Interleaf, though they were mostly about what not to do. I learned that it's better for technology companies to be run by product people than sales people (though sales is a real skill and people who are good at it are really good at it), that it leads to bugs when code is edited by too many people, that cheap office space is no bargain if it's depressing, that planned meetings are inferior to corridor conversations, that big, bureaucratic customers are a dangerous source of money, and that there's not much overlap between conventional office hours and the optimal time for hacking, or conventional offices and the optimal place for it.\n",
567
+ "\n",
568
+ "But the most important thing I learned, and which I used in both Viaweb and Y Combinator, is that the low end eats the high end: that it's good to\n",
569
+ "----\n",
570
+ "> Chunk 2\n",
571
+ "I took at RISD, but otherwise I was basically teaching myself to paint, and I could do that for free. So in 1993 I dropped out. I hung around Providence for a bit, and then my college friend Nancy Parmet did me a big favor. A rent-controlled apartment in a building her mother owned in New York was becoming vacant. Did I want it? It wasn't much more than my current place, and New York was supposed to be where the artists were. So yes, I wanted it! [7]\n",
572
+ "\n",
573
+ "Asterix comics begin by zooming in on a tiny corner of Roman Gaul that turns out not to be controlled by the Romans. You can do something similar on a map of New York City: if you zoom in on the Upper East Side, there's a tiny corner that's not rich, or at least wasn't in 1993. It's called Yorkville, and that was my new home. Now I was a New York artist — in the strictly technical sense of making paintings and living in New York.\n",
574
+ "\n",
575
+ "I was nervous about money, because I could sense that Interleaf was on the way down. Freelance Lisp hacking work was very rare, and I didn't want to have to program in another language, which in those days would have meant C++ if I was lucky. So with my unerring nose for financial opportunity, I decided to write another book on Lisp. This would be a popular book, the sort of book that could be used as a textbook. I imagined\n",
576
+ "----\n",
577
+ "> Chunk 3\n",
578
+ "from writing essays during most of this time, or I'd never have finished. In late 2015 I spent 3 months writing essays, and when I went back to working on Bel I could barely understand the code. Not so much because it was badly written as because the problem is so convoluted. When you're working on an interpreter written in itself, it's hard to keep track of what's happening at what level, and errors can be practically encrypted by the time you get them.\n",
579
+ "\n",
580
+ "So I said no more essays till Bel was done. But I told few people about Bel while I was working on it. So for years it must have seemed that I was doing nothing, when in fact I was working harder than I'd ever worked on anything. Occasionally after wrestling for hours with some gruesome bug I'd check Twitter or HN and see someone asking \"Does Paul Graham still code?\"\n",
581
+ "\n",
582
+ "Working on Bel was hard but satisfying. I worked on it so intensively that at any given time I had a decent chunk of the code in my head and could write more there. I remember taking the boys to the coast on a sunny day in 2015 and figuring out how to deal with some problem involving continuations while I watched them play in the tide pools. It felt like I was doing life right. I remember that because I was slightly dismayed at how novel it felt. The good news is that I had more moments like this over the next few years.\n",
583
+ "\n",
584
+ "In the summer of 2016 we moved to England. We wanted our kids to see what it was like living in another\n",
585
+ "----\n"
586
+ ]
587
+ }
588
+ ]
589
+ },
590
+ {
591
+ "cell_type": "markdown",
592
+ "source": [
593
+ "# Augment the Prompt"
594
+ ],
595
+ "metadata": {
596
+ "id": "7uvQACqAkHg4"
597
+ }
598
+ },
599
+ {
600
+ "cell_type": "code",
601
+ "source": [
602
+ "try:\n",
603
+ "    # Formulating the system prompt\n",
604
+ "    system_prompt = (\n",
605
+ "        \"You are an assistant and expert in answering questions from chunks of content. \"\n",
606
+ "        \"Only answer AI-related questions; otherwise, say that you cannot answer the question.\"\n",
607
+ "    )\n",
608
+ "\n",
609
+ "    # Combining the retrieved context with the user's question\n",
610
+ "    prompt = (\n",
611
+ "        \"Read the following information, which might contain the context you require to answer the question. You can use the information that starts at the <START_OF_CONTEXT> tag and ends at the <END_OF_CONTEXT> tag. Here is the content:\\n\\n<START_OF_CONTEXT>\\n{}\\n<END_OF_CONTEXT>\\n\\n\"\n",
612
+ "        \"Please provide an informative and accurate answer to the following question based on the available context. Be concise and take your time. \\nQuestion: {}\\nAnswer:\"\n",
613
+ "    )\n",
614
+ "    prompt = prompt.format( \"\\n\\n\".join( df.text.iloc[indices] ), QUESTION )\n",
615
+ "\n",
616
+ "    # Call the OpenAI API\n",
617
+ "    response = client.chat.completions.create(\n",
618
+ "        model='gpt-3.5-turbo-16k',\n",
619
+ "        temperature=0.0,\n",
620
+ "        messages=[\n",
621
+ "            {\"role\": \"system\", \"content\": system_prompt},\n",
622
+ "            {\"role\": \"user\", \"content\": prompt}\n",
623
+ "        ]\n",
624
+ "    )\n",
625
+ "\n",
626
+ "    # Extract the AI's response\n",
627
+ "    res = response.choices[0].message.content.strip()\n",
628
+ "\n",
629
+ "except Exception as e:\n",
630
+ "    print( f\"An error occurred: {e}\" )"
631
+ ],
632
+ "metadata": {
633
+ "id": "MXRdzta5kJ3V"
634
+ },
635
+ "execution_count": 305,
636
+ "outputs": []
637
+ },
638
+ {
639
+ "cell_type": "code",
640
+ "source": [
641
+ "print( res )"
642
+ ],
643
+ "metadata": {
644
+ "colab": {
645
+ "base_uri": "https://localhost:8080/"
646
+ },
647
+ "id": "9tBvJ8oMucha",
648
+ "outputId": "2f4e9f03-a694-41d3-b3fe-7ac14a63b440"
649
+ },
650
+ "execution_count": 306,
651
+ "outputs": [
652
+ {
653
+ "output_type": "stream",
654
+ "name": "stdout",
655
+ "text": [
656
+ "Paul Graham had a budget of $7 a day when he was in Florence.\n"
657
+ ]
658
+ }
659
+ ]
660
+ },
661
+ {
662
+ "cell_type": "markdown",
663
+ "source": [
664
+ "# Without Augmentation"
665
+ ],
666
+ "metadata": {
667
+ "id": "pW-BNCAC2JzE"
668
+ }
669
+ },
670
+ {
671
+ "cell_type": "code",
672
+ "source": [
673
+ "# Formulating the system prompt\n",
674
+ "system_prompt = (\n",
675
+ "    \"You are an assistant and expert in answering questions.\"\n",
676
+ ")\n",
677
+ "\n",
678
+ "# Building the user prompt from the question alone (no retrieved context)\n",
679
+ "prompt = (\n",
680
+ "    \"Be concise and take your time to answer the following question. \\nQuestion: {}\\nAnswer:\"\n",
681
+ ")\n",
682
+ "prompt = prompt.format( QUESTION )\n",
683
+ "\n",
684
+ "# Call the OpenAI API\n",
685
+ "response = client.chat.completions.create(\n",
686
+ "    model='gpt-3.5-turbo-16k',\n",
687
+ "    temperature=.9,\n",
688
+ "    messages=[\n",
689
+ "        {\"role\": \"system\", \"content\": system_prompt},\n",
690
+ "        {\"role\": \"user\", \"content\": prompt}\n",
691
+ "    ]\n",
692
+ ")\n",
693
+ "\n",
694
+ "# Extract the AI's response\n",
695
+ "res = response.choices[0].message.content.strip()"
696
+ ],
697
+ "metadata": {
698
+ "id": "RuyXjzZyuecE"
699
+ },
700
+ "execution_count": 307,
701
+ "outputs": []
702
+ },
703
+ {
704
+ "cell_type": "code",
705
+ "source": [
706
+ "print( res )"
707
+ ],
708
+ "metadata": {
709
+ "colab": {
710
+ "base_uri": "https://localhost:8080/"
711
+ },
712
+ "id": "YAy34tPTzGbh",
713
+ "outputId": "a6eb8efc-ce40-46fc-e307-b8e9772f9657"
714
+ },
715
+ "execution_count": 308,
716
+ "outputs": [
717
+ {
718
+ "output_type": "stream",
719
+ "name": "stdout",
720
+ "text": [
721
+ "There is no specific information available about Paul Graham's daily budget when he was in Florence.\n"
722
+ ]
723
+ }
724
+ ]
725
+ }
726
+ ]
727
+ }
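
The retrieval step in this notebook (score the question embedding against every chunk embedding, then keep the top three) can be tried offline without an API key. The sketch below is a minimal illustration under stated assumptions: `top_k_by_cosine` and the toy three-dimensional vectors are hypothetical stand-ins for the notebook's `cosine_similarity` call and its 1536-dimensional embeddings, and plain NumPy is used in place of scikit-learn.

```python
import numpy as np


def top_k_by_cosine(query_emb, chunk_embs, k=3):
    """Return (indices, scores) of the k chunks most similar to the query.

    Hypothetical helper, not part of the notebook.
    """
    query = np.asarray(query_emb, dtype=float)
    chunks = np.asarray(chunk_embs, dtype=float)

    # Cosine similarity: dot product divided by the product of the norms.
    norms = np.linalg.norm(chunks, axis=1) * np.linalg.norm(query)
    scores = chunks @ query / norms

    # argsort is ascending, so reverse it and keep the first k positions.
    order = np.argsort(scores)[::-1][:k]
    return order, scores[order]


# Toy "embeddings" (made-up values, not real model output).
toy_chunks = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query = [1.0, 0.0, 0.0]

indices, scores = top_k_by_cosine(toy_query, toy_chunks, k=2)
print(indices, scores)
```

The returned positions are meant for positional lookup (for example with `.iloc` on a DataFrame), since `argsort` knows nothing about index labels.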