[ { "start": 0.04, "text": "hi everyone so in this video I'd like us" }, { "start": 2.04, "text": "to cover the process of tokenization in" }, { "start": 4.08, "text": "large language models now you see here" }, { "start": 6.44, "text": "that I have a set face and that's" }, { "start": 8.28, "text": "because uh tokenization is my least" }, { "start": 10.32, "text": "favorite part of working with large" }, { "start": 11.679, "text": "language models but unfortunately it is" }, { "start": 13.48, "text": "necessary to understand in some detail" }, { "start": 15.519, "text": "because it it is fairly hairy gnarly and" }, { "start": 17.6, "text": "there's a lot of hidden foot guns to be" }, { "start": 19.48, "text": "aware of and a lot of oddness with large" }, { "start": 21.84, "text": "language models typically traces back to" }, { "start": 24.599, "text": "tokenization so what is" }, { "start": 26.64, "text": "tokenization now in my previous video" }, { "start": 28.92, "text": "Let's Build GPT from scratch uh we" }, { "start": 31.56, "text": "actually already did tokenization but we" }, { "start": 33.48, "text": "did a very naive simple version of" }, { "start": 35.8, "text": "tokenization so when you go to the" }, { "start": 37.48, "text": "Google colab for that video uh you see" }, { "start": 40.559, "text": "here that we loaded our training set and" }, { "start": 43.2, "text": "our training set was this uh Shakespeare" }, { "start": 45.52, "text": "uh data set now in the beginning the" }, { "start": 48.12, "text": "Shakespeare data set is just a large" }, { "start": 49.76, "text": "string in Python it's just text and so" }, { "start": 52.44, "text": "the question is how do we plug text into" }, { "start": 54.84, "text": "large language models and in this case" }, { "start": 58.079, "text": "here we created a vocabulary of 65" }, { "start": 61.44, "text": "possible characters that we saw occur in" }, { "start": 63.96, "text": "this string these were the possible" }, { "start": 65.799, "text": "characters and we saw that there are 65" }, { "start": 67.96, "text": "of them and then we created a a lookup" }, { "start": 70.64, "text": "table for converting from every possible" }, { "start": 73.4, "text": "character a little string piece into a" }, { "start": 76.32, "text": "token an" }, { "start": 77.759, "text": "integer so here for example we tokenized" }, { "start": 80.52, "text": "the string High there and we received" }, { "start": 83.28, "text": "this sequence of" }, { "start": 84.72, "text": "tokens and here we took the first 1,000" }, { "start": 87.6, "text": "characters of our data set and we" }, { "start": 89.92, "text": "encoded it into tokens and because it is" }, { "start": 92.56, "text": "this is character level we received" }, { "start": 94.64, "text": "1,000 tokens in a sequence so token 18" }, { "start": 98.96, "text": "47" }, { "start": 100.119, "text": "Etc now later we saw that the way we" }, { "start": 103.439, "text": "plug these tokens into the language" }, { "start": 105.64, "text": "model is by using an embedding" }, { "start": 108.479, "text": "table and so basically if we have 65" }, { "start": 111.479, "text": "possible tokens then this embedding" }, { "start": 113.479, "text": "table is going to have 65 rows and" }, { "start": 116.439, "text": "roughly speaking we're taking the" }, { "start": 118.159, "text": "integer associated with every single" }, { "start": 119.799, "text": "sing Le token we're using that as a" }, { "start": 121.52, "text": "lookup into this table and we're" }, { "start": 124.039, "text": "plucking out the corresponding row and" }, { "start": 126.479, "text": "this row is a uh is trainable parameters" }, { "start": 129.36, "text": "that we're going to train using back" }, { "start": 130.479, "text": "propagation and this is the vector that" }, { "start": 132.879, "text": "then feeds into the Transformer um and" }, { "start": 135.36, "text": "that's how the Transformer Ser of" }, { "start": 136.56, "text": "perceives every single" }, { "start": 138.12, "text": "token so here we had a very naive" }, { "start": 141.28, "text": "tokenization process that was a" }, { "start": 143.12, "text": "character level tokenizer but in" }, { "start": 145.239, "text": "practice in state-ofthe-art uh language" }, { "start": 147.28, "text": "models people use a lot more complicated" }, { "start": 148.959, "text": "schemes unfortunately" }, { "start": 150.44, "text": "uh for constructing these uh token" }, { "start": 154.36, "text": "vocabularies so we're not dealing on the" }, { "start": 156.64, "text": "Character level we're dealing on chunk" }, { "start": 158.64, "text": "level and the way these um character" }, { "start": 161.519, "text": "chunks are constructed is using" }, { "start": 163.879, "text": "algorithms such as for example the bik" }, { "start": 165.48, "text": "pair in coding algorithm which we're" }, { "start": 166.959, "text": "going to go into in detail um and cover" }, { "start": 171.0, "text": "in this video I'd like to briefly show" }, { "start": 172.879, "text": "you the paper that introduced a bite" }, { "start": 174.84, "text": "level encoding as a mechanism for" }, { "start": 176.92, "text": "tokenization in the context of large" }, { "start": 178.44, "text": "language models and I would say that" }, { "start": 180.599, "text": "that's probably the gpt2 paper and if" }, { "start": 182.72, "text": "you scroll down here to the section" }, { "start": 185.56, "text": "input representation this is where they" }, { "start": 187.72, "text": "cover tokenization the kinds of" }, { "start": 189.48, "text": "properties that you'd like the" }, { "start": 190.56, "text": "tokenization to have and they conclude" }, { "start": 193.0, "text": "here that they're going to have a" }, { "start": 194.959, "text": "tokenizer where you have a vocabulary of" }, { "start": 197.599, "text": "50,2 57 possible" }, { "start": 200.68, "text": "tokens and the context size is going to" }, { "start": 204.4, "text": "be 1,24 tokens so in the in in the" }, { "start": 207.36, "text": "attention layer of the Transformer" }, { "start": 209.239, "text": "neural network" }, { "start": 210.48, "text": "every single token is attending to the" }, { "start": 212.319, "text": "previous tokens in the sequence and it's" }, { "start": 214.08, "text": "going to see up to 1,24 tokens so tokens" }, { "start": 217.92, "text": "are this like fundamental unit um the" }, { "start": 220.68, "text": "atom of uh large language models if you" }, { "start": 223.12, "text": "will and everything is in units of" }, { "start": 224.799, "text": "tokens everything is about tokens and" }, { "start": 227.08, "text": "tokenization is the process for" }, { "start": 228.36, "text": "translating strings or text into" }, { "start": 231.08, "text": "sequences of tokens and uh vice versa" }, { "start": 234.879, "text": "when you go into the Llama 2 paper as" }, { "start": 236.879, "text": "well I can show you that when you search" }, { "start": 238.28, "text": "token you're going to get get 63 hits um" }, { "start": 241.72, "text": "and that's because tokens are again" }, { "start": 243.319, "text": "pervasive so here they mentioned that" }, { "start": 245.12, "text": "they trained on two trillion tokens of" }, { "start": 246.879, "text": "data and so" }, { "start": 248.439, "text": "on so we're going to build our own" }, { "start": 251.079, "text": "tokenizer luckily the bite be encoding" }, { "start": 253.04, "text": "algorithm is not uh that super" }, { "start": 255.12, "text": "complicated and we can build it from" }, { "start": 256.959, "text": "scratch ourselves and we'll see exactly" }, { "start": 258.519, "text": "how this works before we dive into code" }, { "start": 260.72, "text": "I'd like to give you a brief Taste of" }, { "start": 262.56, "text": "some of the complexities that come from" }, { "start": 264.12, "text": "the tokenization because I just want to" }, { "start": 266.12, "text": "make sure that we motivate it" }, { "start": 267.199, "text": "sufficiently for why we are doing all" }, { "start": 269.479, "text": "this and why this is so gross so" }, { "start": 272.639, "text": "tokenization is at the heart of a lot of" }, { "start": 274.199, "text": "weirdness in large language models and I" }, { "start": 276.12, "text": "would advise that you do not brush it" }, { "start": 277.759, "text": "off a lot of the issues that may look" }, { "start": 280.6, "text": "like just issues with the new network" }, { "start": 282.32, "text": "architecture or the large language model" }, { "start": 284.52, "text": "itself are actually issues with the" }, { "start": 286.6, "text": "tokenization and fundamentally Trace uh" }, { "start": 289.16, "text": "back to it so if you've noticed any" }, { "start": 291.759, "text": "issues with large language models can't" }, { "start": 294.24, "text": "you know not able to do spelling tasks" }, { "start": 296.16, "text": "very easily that's usually due to" }, { "start": 297.96, "text": "tokenization simple string processing" }, { "start": 300.16, "text": "can be difficult for the large language" }, { "start": 302.28, "text": "model to perform" }, { "start": 303.6, "text": "natively uh non-english languages can" }, { "start": 306.08, "text": "work much worse and to a large extent" }, { "start": 308.24, "text": "this is due to" }, { "start": 309.44, "text": "tokenization sometimes llms are bad at" }, { "start": 311.759, "text": "simple arithmetic also can trace be" }, { "start": 314.08, "text": "traced to" }, { "start": 315.479, "text": "tokenization uh gbt2 specifically would" }, { "start": 317.759, "text": "have had quite a bit more issues with" }, { "start": 319.639, "text": "python than uh future versions of it due" }, { "start": 322.16, "text": "to tokenization there's a lot of other" }, { "start": 324.4, "text": "issues maybe you've seen weird warnings" }, { "start": 325.88, "text": "about a trailing whites space this is a" }, { "start": 327.44, "text": "tokenization issue um" }, { "start": 330.68, "text": "if you had asked GPT earlier about solid" }, { "start": 333.52, "text": "gold Magikarp and what it is you would" }, { "start": 335.24, "text": "see the llm go totally crazy and it" }, { "start": 337.52, "text": "would start going off about a completely" }, { "start": 339.56, "text": "unrelated tangent topic maybe you've" }, { "start": 341.919, "text": "been told to use yl over Json in" }, { "start": 343.72, "text": "structure data all of that has to do" }, { "start": 345.44, "text": "with tokenization so basically" }, { "start": 347.639, "text": "tokenization is at the heart of many" }, { "start": 349.4, "text": "issues I will look back around to these" }, { "start": 351.88, "text": "at the end of the video but for now let" }, { "start": 354.08, "text": "me just um skip over it a little bit and" }, { "start": 356.919, "text": "let's go to this web app um the Tik" }, { "start": 359.96, "text": "tokenizer bell.app so I have it loaded" }, { "start": 362.919, "text": "here and what I like about this web app" }, { "start": 364.68, "text": "is that tokenization is running a sort" }, { "start": 366.56, "text": "of live in your browser in JavaScript so" }, { "start": 369.52, "text": "you can just type here stuff hello world" }, { "start": 371.96, "text": "and the whole string" }, { "start": 374.199, "text": "rokenes so here what we see on uh the" }, { "start": 378.479, "text": "left is a string that you put in on the" }, { "start": 380.36, "text": "right we're currently using the gpt2" }, { "start": 382.199, "text": "tokenizer we see that this string that I" }, { "start": 384.56, "text": "pasted here is currently tokenizing into" }, { "start": 387.08, "text": "300 tokens and here they are sort of uh" }, { "start": 390.52, "text": "shown explicitly in different colors for" }, { "start": 392.68, "text": "every single token so for example uh" }, { "start": 395.52, "text": "this word tokenization became two tokens" }, { "start": 398.88, "text": "the token" }, { "start": 400.72, "text": "3,642 and" }, { "start": 404.0, "text": "1,634 the token um space is is token 318" }, { "start": 410.16, "text": "so be careful on the bottom you can show" }, { "start": 411.919, "text": "white space and keep in mind that there" }, { "start": 414.599, "text": "are spaces and uh sln new line" }, { "start": 417.36, "text": "characters in here but you can hide them" }, { "start": 419.72, "text": "for" }, { "start": 421.599, "text": "clarity the token space at is token 379" }, { "start": 426.0, "text": "the to the Token space the is 262 Etc so" }, { "start": 431.08, "text": "you notice here that the space is part" }, { "start": 432.96, "text": "of that uh token" }, { "start": 435.96, "text": "chunk now so this is kind of like how" }, { "start": 438.639, "text": "our English sentence broke up and that" }, { "start": 441.16, "text": "seems all well and good now now here I" }, { "start": 444.039, "text": "put in some arithmetic so we see that uh" }, { "start": 446.919, "text": "the token 127 Plus and then token six" }, { "start": 451.8, "text": "space 6 followed by 77 so what's" }, { "start": 454.24, "text": "happening here is that 127 is feeding in" }, { "start": 456.639, "text": "as a single token into the large" }, { "start": 458.16, "text": "language model but the um number 677" }, { "start": 462.68, "text": "will actually feed in as two separate" }, { "start": 464.84, "text": "tokens and so the large language model" }, { "start": 467.0, "text": "has to sort of um take account of that" }, { "start": 470.72, "text": "and process it correctly in its Network" }, { "start": 473.879, "text": "and see here 804 will be broken up into" }, { "start": 476.199, "text": "two tokens and it's is all completely" }, { "start": 477.96, "text": "arbitrary and here I have another" }, { "start": 479.8, "text": "example of four-digit numbers and they" }, { "start": 482.039, "text": "break up in a way that they break up and" }, { "start": 483.919, "text": "it's totally arbitrary sometimes you" }, { "start": 485.28, "text": "have um multiple digits single token" }, { "start": 488.36, "text": "sometimes you have individual digits as" }, { "start": 490.36, "text": "many tokens and it's all kind of pretty" }, { "start": 492.24, "text": "arbitrary and coming out of the" }, { "start": 494.68, "text": "tokenizer here's another example we have" }, { "start": 497.479, "text": "the string egg and you see here that" }, { "start": 501.039, "text": "this became two" }, { "start": 502.36, "text": "tokens but for some reason when I say I" }, { "start": 504.759, "text": "have an egg you see when it's a space" }, { "start": 507.72, "text": "egg it's two token it's sorry it's a" }, { "start": 510.84, "text": "single token so just egg by itself in" }, { "start": 513.24, "text": "the beginning of a sentence is two" }, { "start": 514.76, "text": "tokens but here as a space egg is" }, { "start": 517.68, "text": "suddenly a single token uh for the exact" }, { "start": 520.519, "text": "same string okay here lowercase egg" }, { "start": 524.2, "text": "turns out to be a single token and in" }, { "start": 526.24, "text": "particular notice that the color is" }, { "start": 527.48, "text": "different so this is a different token" }, { "start": 529.36, "text": "so this is case sensitive and of course" }, { "start": 531.76, "text": "a capital egg would also be different" }, { "start": 534.56, "text": "tokens and again um this would be two" }, { "start": 537.44, "text": "tokens arbitrarily so so for the same" }, { "start": 540.079, "text": "concept egg depending on if it's in the" }, { "start": 542.32, "text": "beginning of a sentence at the end of a" }, { "start": 543.8, "text": "sentence lowercase uppercase or mixed" }, { "start": 546.24, "text": "all this will be uh basically very" }, { "start": 548.079, "text": "different tokens and different IDs and" }, { "start": 550.32, "text": "the language model has to learn from raw" }, { "start": 552.04, "text": "data from all the internet text that" }, { "start": 553.56, "text": "it's going to be training on that these" }, { "start": 555.16, "text": "are actually all the exact same concept" }, { "start": 557.44, "text": "and it has to sort of group them in the" }, { "start": 559.279, "text": "parameters of the neural network and" }, { "start": 561.32, "text": "understand just based on the data" }, { "start": 562.48, "text": "patterns that these are all very similar" }, { "start": 564.76, "text": "but maybe not almost exactly similar but" }, { "start": 567.399, "text": "but very very similar" }, { "start": 570.16, "text": "um after the EG demonstration here I" }, { "start": 572.8, "text": "have um an introduction from open a eyes" }, { "start": 575.64, "text": "chbt in Korean so manaso Pang uh Etc uh" }, { "start": 581.959, "text": "so this is in Korean and the reason I" }, { "start": 584.079, "text": "put this here is because you'll notice" }, { "start": 587.76, "text": "that um non-english languages work" }, { "start": 591.0, "text": "slightly worse in Chachi part of this is" }, { "start": 594.32, "text": "because of course the training data set" }, { "start": 595.64, "text": "for Chachi is much larger for English" }, { "start": 598.079, "text": "and for everything else but the same is" }, { "start": 599.959, "text": "true not just for the large language" }, { "start": 601.68, "text": "model itself but also for the tokenizer" }, { "start": 604.32, "text": "so when we train the tokenizer we're" }, { "start": 605.88, "text": "going to see that there's a training set" }, { "start": 607.24, "text": "as well and there's a lot more English" }, { "start": 609.24, "text": "than non-english and what ends up" }, { "start": 611.32, "text": "happening is that we're going to have a" }, { "start": 613.48, "text": "lot more longer tokens for" }, { "start": 616.6, "text": "English so how do I put this if you have" }, { "start": 619.6, "text": "a single sentence in English and you" }, { "start": 621.399, "text": "tokenize it you might see that it's 10" }, { "start": 623.56, "text": "tokens or something like that but if you" }, { "start": 625.48, "text": "translate that sentence into say Korean" }, { "start": 627.36, "text": "or Japanese or something else you'll" }, { "start": 629.44, "text": "typically see that the number of tokens" }, { "start": 630.839, "text": "used is much larger and that's because" }, { "start": 633.399, "text": "the chunks here are a lot more broken up" }, { "start": 636.76, "text": "so we're using a lot more tokens for the" }, { "start": 638.519, "text": "exact same thing and what this does is" }, { "start": 641.36, "text": "it bloats up the sequence length of all" }, { "start": 643.76, "text": "the documents so you're using up more" }, { "start": 646.24, "text": "tokens and then in the attention of the" }, { "start": 648.399, "text": "Transformer when these tokens try to" }, { "start": 649.92, "text": "attend each other you are running out of" }, { "start": 651.92, "text": "context um in the maximum context length" }, { "start": 655.12, "text": "of that Transformer and so basically all" }, { "start": 657.959, "text": "the non-english text is stretched out" }, { "start": 661.279, "text": "from the perspective of the Transformer" }, { "start": 663.44, "text": "and this just has to do with the um" }, { "start": 665.68, "text": "trainings that used for the tokenizer" }, { "start": 667.48, "text": "and the tokenization itself so it will" }, { "start": 670.04, "text": "create a lot bigger tokens and a lot" }, { "start": 672.079, "text": "larger groups in English and it will" }, { "start": 674.2, "text": "have a lot of little boundaries for all" }, { "start": 676.16, "text": "the other non-english text um so if we" }, { "start": 679.76, "text": "translated this into English it would be" }, { "start": 681.92, "text": "significantly fewer" }, { "start": 683.32, "text": "tokens the final example I have here is" }, { "start": 685.639, "text": "a little snippet of python for doing FS" }, { "start": 688.079, "text": "buuz and what I'd like you to notice is" }, { "start": 691.0, "text": "look all these individual spaces are all" }, { "start": 694.04, "text": "separate tokens they are token" }, { "start": 697.0, "text": "220 so uh 220 220 220 220 and then space" }, { "start": 702.76, "text": "if is a single token and so what's going" }, { "start": 705.32, "text": "on here is that when the Transformer is" }, { "start": 706.72, "text": "going to consume or try to uh create" }, { "start": 709.32, "text": "this text it needs to um handle all" }, { "start": 712.639, "text": "these spaces individually they all feed" }, { "start": 714.48, "text": "in one by one into the entire" }, { "start": 716.56, "text": "Transformer in the sequence and so this" }, { "start": 719.12, "text": "is being extremely wasteful tokenizing" }, { "start": 721.279, "text": "it in this way and so as a result of" }, { "start": 724.44, "text": "that gpt2 is not very good with python" }, { "start": 727.04, "text": "and it's not anything to do with coding" }, { "start": 728.68, "text": "or the language model itself it's just" }, { "start": 730.68, "text": "that if he use a lot of indentation" }, { "start": 732.079, "text": "using space in Python like we usually do" }, { "start": 735.399, "text": "uh you just end up bloating out all the" }, { "start": 737.399, "text": "text and it's separated across way too" }, { "start": 739.36, "text": "much of the sequence and we are running" }, { "start": 741.04, "text": "out of the context length in the" }, { "start": 742.76, "text": "sequence uh that's roughly speaking" }, { "start": 744.44, "text": "what's what's happening we're being way" }, { "start": 745.639, "text": "too wasteful we're taking up way too" }, { "start": 747.399, "text": "much token space now we can also scroll" }, { "start": 749.68, "text": "up here and we can change the tokenizer" }, { "start": 751.6, "text": "so note here that gpt2 tokenizer creates" }, { "start": 754.04, "text": "a token count of 300 for this string" }, { "start": 756.72, "text": "here we can change it to CL 100K base" }, { "start": 759.519, "text": "which is the GPT for tokenizer and we" }, { "start": 761.839, "text": "see that the token count drops to 185 so" }, { "start": 764.56, "text": "for the exact same string we are now" }, { "start": 766.8, "text": "roughly having the number of tokens and" }, { "start": 769.8, "text": "roughly speaking this is because uh the" }, { "start": 771.76, "text": "number of tokens in the GPT 4 tokenizer" }, { "start": 774.36, "text": "is roughly double that of the number of" }, { "start": 776.72, "text": "tokens in the gpt2 tokenizer so we went" }, { "start": 778.839, "text": "went from roughly 50k to roughly 100K" }, { "start": 781.639, "text": "now you can imagine that this is a good" }, { "start": 783.0, "text": "thing because the same text is now" }, { "start": 786.0, "text": "squished into half as many tokens so uh" }, { "start": 790.199, "text": "this is a lot denser input to the" }, { "start": 792.76, "text": "Transformer and in the Transformer every" }, { "start": 795.44, "text": "single token has a finite number of" }, { "start": 797.04, "text": "tokens before it that it's going to pay" }, { "start": 798.399, "text": "attention to and so what this is doing" }, { "start": 800.44, "text": "is we're roughly able to see twice as" }, { "start": 803.48, "text": "much text as a context for what token to" }, { "start": 806.519, "text": "predict next uh because of this change" }, { "start": 809.279, "text": "but of course just increasing the number" }, { "start": 810.8, "text": "of tokens is uh not strictly better" }, { "start": 813.399, "text": "infinitely uh because as you increase" }, { "start": 815.16, "text": "the number of tokens now your embedding" }, { "start": 816.92, "text": "table is um sort of getting a lot larger" }, { "start": 819.88, "text": "and also at the output we are trying to" }, { "start": 821.48, "text": "predict the next token and there's the" }, { "start": 822.88, "text": "soft Max there and that grows as well" }, { "start": 825.12, "text": "we're going to go into more detail later" }, { "start": 826.399, "text": "on this but there's some kind of a Sweet" }, { "start": 828.44, "text": "Spot somewhere where you have a just" }, { "start": 831.0, "text": "right number of tokens in your" }, { "start": 832.279, "text": "vocabulary where everything is" }, { "start": 833.88, "text": "appropriately dense and still fairly" }, { "start": 836.519, "text": "efficient now one thing I would like you" }, { "start": 838.36, "text": "to note specifically for the gp4" }, { "start": 840.16, "text": "tokenizer is that the handling of the" }, { "start": 843.56, "text": "white space for python has improved a" }, { "start": 845.44, "text": "lot you see that here these four spaces" }, { "start": 848.36, "text": "are represented as one single token for" }, { "start": 850.24, "text": "the three spaces here and then the token" }, { "start": 853.759, "text": "SPF and here seven spaces were all" }, { "start": 856.759, "text": "grouped into a single token so we're" }, { "start": 858.8, "text": "being a lot more efficient in how we" }, { "start": 860.199, "text": "represent Python and this was a" }, { "start": 861.92, "text": "deliberate Choice made by open aai when" }, { "start": 863.759, "text": "they designed the gp4 tokenizer and they" }, { "start": 867.56, "text": "group a lot more space into a single" }, { "start": 869.68, "text": "character what this does is this" }, { "start": 872.079, "text": "densifies Python and therefore we can" }, { "start": 875.199, "text": "attend to more code before it when we're" }, { "start": 878.12, "text": "trying to predict the next token in the" }, { "start": 879.72, "text": "sequence and so the Improvement in the" }, { "start": 882.04, "text": "python coding ability from gbt2 to gp4" }, { "start": 885.399, "text": "is not just a matter of the language" }, { "start": 887.079, "text": "model and the architecture and the" }, { "start": 888.839, "text": "details of the optimization but a lot of" }, { "start": 890.759, "text": "the Improvement here is also coming from" }, { "start": 892.24, "text": "the design of the tokenizer and how it" }, { "start": 894.24, "text": "groups characters into tokens okay so" }, { "start": 896.959, "text": "let's now start writing some code" }, { "start": 899.399, "text": "so remember what we want to do we want" }, { "start": 901.44, "text": "to take strings and feed them into" }, { "start": 903.72, "text": "language models for that we need to" }, { "start": 905.959, "text": "somehow tokenize strings into some" }, { "start": 908.8, "text": "integers in some fixed vocabulary and" }, { "start": 912.36, "text": "then we will use those integers to make" }, { "start": 914.24, "text": "a look up into a lookup table of vectors" }, { "start": 916.759, "text": "and feed those vectors into the" }, { "start": 918.0, "text": "Transformer as an input now the reason" }, { "start": 921.36, "text": "this gets a little bit tricky of course" }, { "start": 922.72, "text": "is that we don't just want to support" }, { "start": 924.0, "text": "the simple English alphabet we want to" }, { "start": 926.12, "text": "support different kinds of languages so" }, { "start": 928.12, "text": "this is anango in Korean which is hello" }, { "start": 931.639, "text": "and we also want to support many kinds" }, { "start": 933.0, "text": "of special characters that we might find" }, { "start": 934.72, "text": "on the internet for example" }, { "start": 937.319, "text": "Emoji so how do we feed this text into" }, { "start": 941.48, "text": "uh" }, { "start": 942.199, "text": "Transformers well how's the what is this" }, { "start": 944.48, "text": "text anyway in Python so if you go to" }, { "start": 946.56, "text": "the documentation of a string in Python" }, { "start": 949.6, "text": "you can see that strings are immutable" }, { "start": 951.519, "text": "sequences of Unicode code" }, { "start": 954.12, "text": "points okay what are Unicode code points" }, { "start": 957.88, "text": "we can go to PDF so Unicode code points" }, { "start": 961.48, "text": "are defined by the Unicode Consortium as" }, { "start": 964.68, "text": "part of the Unicode standard and what" }, { "start": 967.56, "text": "this is really is that it's just a" }, { "start": 969.0, "text": "definition of roughly 150,000 characters" }, { "start": 971.839, "text": "right now and roughly speaking what they" }, { "start": 974.72, "text": "look like and what integers um represent" }, { "start": 977.56, "text": "those characters so it says 150,000" }, { "start": 979.72, "text": "characters across 161 scripts as of" }, { "start": 982.639, "text": "right now so if you scroll down here you" }, { "start": 984.72, "text": "can see that the standard is very much" }, { "start": 986.279, "text": "alive the latest standard 15.1 in" }, { "start": 988.72, "text": "September" }, { "start": 990.199, "text": "2023 and basically this is just a way to" }, { "start": 993.92, "text": "define lots of types of" }, { "start": 996.92, "text": "characters like for example all these" }, { "start": 999.16, "text": "characters across different scripts so" }, { "start": 1001.88, "text": "the way we can access the unic code code" }, { "start": 1004.04, "text": "Point given Single Character is by using" }, { "start": 1005.959, "text": "the or function in Python so for example" }, { "start": 1008.199, "text": "I can pass in Ord of H and I can see" }, { "start": 1011.279, "text": "that for the Single Character H the unic" }, { "start": 1014.72, "text": "code code point is" }, { "start": 1016.48, "text": "104 okay um but this can be arbitr" }, { "start": 1020.399, "text": "complicated so we can take for example" }, { "start": 1022.16, "text": "our Emoji here and we can see that the" }, { "start": 1024.16, "text": "code point for this one is" }, { "start": 1026.4, "text": "128,000 or we can take" }, { "start": 1030.36, "text": "un and this is 50,000 now keep in mind" }, { "start": 1033.72, "text": "you can't plug in strings here because" }, { "start": 1036.72, "text": "you uh this doesn't have a single code" }, { "start": 1038.439, "text": "point it only takes a single uni code" }, { "start": 1040.679, "text": "code Point character and tells you its" }, { "start": 1043.959, "text": "integer so in this way we can look" }, { "start": 1046.799, "text": "up all the um characters of this" }, { "start": 1050.08, "text": "specific string and their code points so" }, { "start": 1052.16, "text": "or of X forx in this string and we get" }, { "start": 1056.76, "text": "this encoding here now see here we've" }, { "start": 1060.36, "text": "already turned the raw code points" }, { "start": 1062.2, "text": "already have integers so why can't we" }, { "start": 1064.44, "text": "simply just use these integers and not" }, { "start": 1066.84, "text": "have any tokenization at all why can't" }, { "start": 1068.559, "text": "we just use this natively as is and just" }, { "start": 1070.64, "text": "use the code Point well one reason for" }, { "start": 1072.88, "text": "that of course is that the vocabulary in" }, { "start": 1074.36, "text": "that case would be quite long so in this" }, { "start": 1076.799, "text": "case for Unicode the this is a" }, { "start": 1078.679, "text": "vocabulary of" }, { "start": 1079.799, "text": "150,000 different code points but more" }, { "start": 1082.64, "text": "worryingly than that I think the Unicode" }, { "start": 1085.039, "text": "standard is very much alive and it keeps" }, { "start": 1087.039, "text": "changing and so it's not kind of a" }, { "start": 1089.24, "text": "stable representation necessarily that" }, { "start": 1091.08, "text": "we may want to use directly so for those" }, { "start": 1093.88, "text": "reasons we need something a bit better" }, { "start": 1095.76, "text": "so to find something better we turn to" }, { "start": 1097.64, "text": "encodings so if we go to the Wikipedia" }, { "start": 1099.76, "text": "page here we see that the Unicode" }, { "start": 1101.28, "text": "consortion defines three types of" }, { "start": 1103.799, "text": "encodings utf8 UTF 16 and UTF 32 these" }, { "start": 1107.96, "text": "encoding are the way by which we can" }, { "start": 1110.72, "text": "take Unicode text and translate it into" }, { "start": 1113.48, "text": "binary data or by streams utf8 is by far" }, { "start": 1117.2, "text": "the most common uh so this is the utf8" }, { "start": 1119.96, "text": "page now this Wikipedia page is actually" }, { "start": 1122.0, "text": "quite long but what's important for our" }, { "start": 1124.4, "text": "purposes is that utf8 takes every single" }, { "start": 1126.44, "text": "Cod point and it translates it to a by" }, { "start": 1129.64, "text": "stream and this by stream is between one" }, { "start": 1132.36, "text": "to four bytes so it's a variable length" }, { "start": 1134.36, "text": "encoding so depending on the Unicode" }, { "start": 1136.48, "text": "Point according to the schema you're" }, { "start": 1138.039, "text": "going to end up with between 1 to four" }, { "start": 1139.76, "text": "bytes for each code point on top of that" }, { "start": 1143.0, "text": "there's utf8 uh" }, { "start": 1145.12, "text": "utf16 and UTF 32 UTF 32 is nice because" }, { "start": 1148.84, "text": "it is fixed length instead of variable" }, { "start": 1150.559, "text": "length but it has many other downsides" }, { "start": 1152.48, "text": "as well so the full kind of spectrum of" }, { "start": 1157.0, "text": "pros and cons of all these different" }, { "start": 1158.32, "text": "three encodings are beyond the scope of" }, { "start": 1160.48, "text": "this video I just like to point out that" }, { "start": 1162.52, "text": "I enjoyed this block post and this block" }, { "start": 1165.24, "text": "post at the end of it also has a number" }, { "start": 1167.039, "text": "of references that can be quite useful" }, { "start": 1169.24, "text": "uh one of them is uh utf8 everywhere" }, { "start": 1172.039, "text": "Manifesto um and this Manifesto" }, { "start": 1174.32, "text": "describes the reason why utf8 is" }, { "start": 1176.64, "text": "significantly preferred and a lot nicer" }, { "start": 1179.88, "text": "than the other encodings and why it is" }, { "start": 1181.799, "text": "used a lot more prominently um on the" }, { "start": 1185.48, "text": "internet one of the major advantages" }, { "start": 1188.08, "text": "just just to give you a sense is that" }, { "start": 1189.559, "text": "utf8 is the only one of these that is" }, { "start": 1192.0, "text": "backwards compatible to the much simpler" }, { "start": 1194.2, "text": "asky encoding of text um but I'm not" }, { "start": 1197.08, "text": "going to go into the full detail in this" }, { "start": 1198.48, "text": "video so suffice to say that we like the" }, { "start": 1201.0, "text": "utf8 encoding and uh let's try to take" }, { "start": 1203.84, "text": "the string and see what we get if we" }, { "start": 1206.039, "text": "encoded into" }, { "start": 1208.0, "text": "utf8 the string class in Python actually" }, { "start": 1210.76, "text": "has do encode and you can give it the" }, { "start": 1212.36, "text": "encoding which is say utf8 now we get" }, { "start": 1215.559, "text": "out of this is not very nice because" }, { "start": 1217.84, "text": "this is the bytes is a bytes object and" }, { "start": 1220.96, "text": "it's not very nice in the way that it's" }, { "start": 1222.76, "text": "printed so I personally like to take it" }, { "start": 1225.039, "text": "through list because then we actually" }, { "start": 1226.84, "text": "get the raw B" }, { "start": 1228.72, "text": "of this uh encoding so this is the raw" }, { "start": 1232.4, "text": "byes that represent this string" }, { "start": 1235.6, "text": "according to the utf8 en coding we can" }, { "start": 1238.08, "text": "also look at utf16 we get a slightly" }, { "start": 1240.559, "text": "different by stream and we here we start" }, { "start": 1243.24, "text": "to see one of the disadvantages of utf16" }, { "start": 1245.48, "text": "you see how we have zero Z something Z" }, { "start": 1247.96, "text": "something Z something we're starting to" }, { "start": 1249.679, "text": "get a sense that this is a bit of a" }, { "start": 1250.84, "text": "wasteful encoding and indeed for simple" }, { "start": 1253.919, "text": "asky characters or English characters" }, { "start": 1256.28, "text": "here uh we just have the structure of 0" }, { "start": 1258.559, "text": "something Z something and it's not" }, { "start": 1260.76, "text": "exactly nice same for UTF 32 when we" }, { "start": 1264.24, "text": "expand this we can start to get a sense" }, { "start": 1266.08, "text": "of the wastefulness of this encoding for" }, { "start": 1268.0, "text": "our purposes you see a lot of zeros" }, { "start": 1270.4, "text": "followed by" }, { "start": 1271.4, "text": "something and so uh this is not" }, { "start": 1274.84, "text": "desirable so suffice it to say that we" }, { "start": 1277.84, "text": "would like to stick with utf8 for our" }, { "start": 1280.88, "text": "purposes however if we just use utf8" }, { "start": 1283.88, "text": "naively these are by streams so that" }, { "start": 1286.4, "text": "would imply a vocabulary length of only" }, { "start": 1289.24, "text": "256 possible tokens uh but this this" }, { "start": 1293.12, "text": "vocabulary size is very very small what" }, { "start": 1295.32, "text": "this is going to do if we just were to" }, { "start": 1296.679, "text": "use it naively is that all of our text" }, { "start": 1299.88, "text": "would be stretched out over very very" }, { "start": 1301.919, "text": "long sequences of bytes and so" }, { "start": 1306.159, "text": "um what what this does is that certainly" }, { "start": 1309.32, "text": "the embeding table is going to be tiny" }, { "start": 1311.0, "text": "and the prediction at the top at the" }, { "start": 1312.32, "text": "final layer is going to be very tiny but" }, { "start": 1314.159, "text": "our sequences are very long and remember" }, { "start": 1316.44, "text": "that we have pretty finite um context" }, { "start": 1319.32, "text": "length and the attention that we can" }, { "start": 1321.0, "text": "support in a transformer for" }, { "start": 1322.76, "text": "computational reasons and so we only" }, { "start": 1325.52, "text": "have as much context length but now we" }, { "start": 1327.48, "text": "have very very long sequences and this" }, { "start": 1329.44, "text": "is just inefficient and it's not going" }, { "start": 1330.799, "text": "to allow us to attend to sufficiently" }, { "start": 1332.799, "text": "long text uh before us for the purposes" }, { "start": 1335.64, "text": "of the next token prediction task so we" }, { "start": 1338.36, "text": "don't want to use the raw bytes of the" }, { "start": 1341.6, "text": "utf8 encoding we want to be able to" }, { "start": 1344.2, "text": "support larger vocabulary size that we" }, { "start": 1346.919, "text": "can tune as a hyper" }, { "start": 1348.64, "text": "but we want to stick with the utf8" }, { "start": 1350.84, "text": "encoding of these strings so what do we" }, { "start": 1353.559, "text": "do well the answer of course is we turn" }, { "start": 1355.48, "text": "to the bite pair encoding algorithm" }, { "start": 1357.44, "text": "which will allow us to compress these" }, { "start": 1359.08, "text": "bite sequences um to a variable amount" }, { "start": 1362.6, "text": "so we'll get to that in a bit but I just" }, { "start": 1364.679, "text": "want to briefly speak to the fact that I" }, { "start": 1367.12, "text": "would love nothing more than to be able" }, { "start": 1369.279, "text": "to feed raw bite sequences into uh" }, { "start": 1372.96, "text": "language models in fact there's a paper" }, { "start": 1374.88, "text": "about how this could potentially be done" }, { "start": 1377.08, "text": "uh from Summer last last year now the" }, { "start": 1379.279, "text": "problem is you actually have to go in" }, { "start": 1380.96, "text": "and you have to modify the Transformer" }, { "start": 1382.279, "text": "architecture because as I mentioned" }, { "start": 1384.48, "text": "you're going to have a problem where the" }, { "start": 1386.64, "text": "attention will start to become extremely" }, { "start": 1388.24, "text": "expensive because the sequences are so" }, { "start": 1390.36, "text": "long and so in this paper they propose" }, { "start": 1393.44, "text": "kind of a hierarchical structuring of" }, { "start": 1395.76, "text": "the Transformer that could allow you to" }, { "start": 1397.64, "text": "just feed in raw bites and so at the end" }, { "start": 1400.36, "text": "they say together these results" }, { "start": 1401.919, "text": "establish the viability of tokenization" }, { "start": 1403.64, "text": "free autor regressive sequence modeling" }, { "start": 1405.32, "text": "at scale so tokenization free would" }, { "start": 1407.4, "text": "indeed be amazing we would just feed B" }, { "start": 1410.279, "text": "streams directly into our models but" }, { "start": 1412.279, "text": "unfortunately I don't know that this has" }, { "start": 1414.159, "text": "really been proven out yet by" }, { "start": 1416.08, "text": "sufficiently many groups and a" }, { "start": 1417.24, "text": "sufficient scale uh but something like" }, { "start": 1419.24, "text": "this at one point would be amazing and I" }, { "start": 1420.679, "text": "hope someone comes up with it but for" }, { "start": 1422.32, "text": "now we have to come back and we can't" }, { "start": 1424.44, "text": "feed this directly into language models" }, { "start": 1426.44, "text": "and we have to compress it using the B" }, { "start": 1428.279, "text": "paare encoding algorithm so let's see" }, { "start": 1429.84, "text": "how that works so as I mentioned the B" }, { "start": 1431.64, "text": "paare encoding algorithm is not all that" }, { "start": 1433.52, "text": "complicated and the Wikipedia page is" }, { "start": 1435.52, "text": "actually quite instructive as far as the" }, { "start": 1437.159, "text": "basic idea goes go what we're doing is" }, { "start": 1439.48, "text": "we have some kind of a input sequence uh" }, { "start": 1441.76, "text": "like for example here we have only four" }, { "start": 1443.64, "text": "elements in our vocabulary a b c and d" }, { "start": 1446.32, "text": "and we have a sequence of them so" }, { "start": 1448.0, "text": "instead of bytes let's say we just have" }, { "start": 1449.76, "text": "four a vocab size of" }, { "start": 1452.039, "text": "four the sequence is too long and we'd" }, { "start": 1454.12, "text": "like to compress it so what we do is" }, { "start": 1456.159, "text": "that we iteratively find the pair of uh" }, { "start": 1460.159, "text": "tokens that occur the most" }, { "start": 1463.44, "text": "frequently and then once we've" }, { "start": 1465.279, "text": "identified that pair we repl replace" }, { "start": 1468.48, "text": "that pair with just a single new token" }, { "start": 1470.88, "text": "that we append to our vocabulary so for" }, { "start": 1473.559, "text": "example here the bite pair AA occurs" }, { "start": 1476.279, "text": "most often so we mint a new token let's" }, { "start": 1478.919, "text": "call it capital Z and we replace every" }, { "start": 1481.679, "text": "single occurrence of AA by Z so now we" }, { "start": 1486.0, "text": "have two Z's here so here we took a" }, { "start": 1488.919, "text": "sequence of 11 characters with" }, { "start": 1491.799, "text": "vocabulary size four and we've converted" }, { "start": 1494.44, "text": "it to a um sequence of only nine tokens" }, { "start": 1498.64, "text": "but now with a vocabulary of five" }, { "start": 1500.559, "text": "because we have a fifth vocabulary" }, { "start": 1502.399, "text": "element that we just created and it's Z" }, { "start": 1504.96, "text": "standing for concatination of AA and we" }, { "start": 1507.52, "text": "can again repeat this process so we" }, { "start": 1510.24, "text": "again look at the sequence and identify" }, { "start": 1512.88, "text": "the pair of tokens that are most" }, { "start": 1515.64, "text": "frequent let's say that that is now AB" }, { "start": 1519.159, "text": "well we are going to replace AB with a" }, { "start": 1520.76, "text": "new token that we meant call Y so y" }, { "start": 1523.76, "text": "becomes ab and then every single" }, { "start": 1525.24, "text": "occurrence of ab is now replaced with y" }, { "start": 1528.039, "text": "so we end up with this so now we only" }, { "start": 1531.44, "text": "have 1 2 3 4 5 6 seven characters in our" }, { "start": 1535.159, "text": "sequence but we have not just um four" }, { "start": 1540.12, "text": "vocabulary elements or five but now we" }, { "start": 1542.32, "text": "have six and for the final round we" }, { "start": 1545.799, "text": "again look through the sequence find" }, { "start": 1547.64, "text": "that the phrase zy or the pair zy is" }, { "start": 1550.559, "text": "most common and replace it one more time" }, { "start": 1553.32, "text": "with another um character let's say x so" }, { "start": 1556.64, "text": "X is z y and we replace all curses of zy" }, { "start": 1559.919, "text": "and we get this following sequence so" }, { "start": 1562.12, "text": "basically after we have gone through" }, { "start": 1563.6, "text": "this process instead of having a um" }, { "start": 1568.48, "text": "sequence of" }, { "start": 1569.76, "text": "11 uh tokens with a vocabulary length of" }, { "start": 1573.64, "text": "four we now have a sequence of 1 2 3" }, { "start": 1578.159, "text": "four five tokens but our vocabulary" }, { "start": 1581.48, "text": "length now is seven and so in this way" }, { "start": 1585.159, "text": "we can iteratively compress our sequence" }, { "start": 1587.44, "text": "I we Mint new tokens so in the in the" }, { "start": 1590.279, "text": "exact same way we start we start out" }, { "start": 1592.399, "text": "with bite sequences so we have 256" }, { "start": 1596.24, "text": "vocabulary size but we're now going to" }, { "start": 1598.2, "text": "go through these and find the bite pairs" }, { "start": 1600.64, "text": "that occur the most and we're going to" }, { "start": 1602.559, "text": "iteratively start minting new tokens" }, { "start": 1604.84, "text": "appending them to our vocabulary and" }, { "start": 1606.76, "text": "replacing things and in this way we're" }, { "start": 1608.88, "text": "going to end up with a compressed" }, { "start": 1610.24, "text": "training data set and also an algorithm" }, { "start": 1612.96, "text": "for taking any arbitrary sequence and" }, { "start": 1615.279, "text": "encoding it using this uh vocabul" }, { "start": 1618.24, "text": "and also decoding it back to Strings so" }, { "start": 1621.0, "text": "let's now Implement all that so here's" }, { "start": 1623.24, "text": "what I did I went to this block post" }, { "start": 1625.679, "text": "that I enjoyed and I took the first" }, { "start": 1627.32, "text": "paragraph and I copy pasted it here into" }, { "start": 1630.0, "text": "text so this is one very long line" }, { "start": 1633.279, "text": "here now to get the tokens as I" }, { "start": 1635.96, "text": "mentioned we just take our text and we" }, { "start": 1637.36, "text": "encode it into utf8 the tokens here at" }, { "start": 1640.159, "text": "this point will be a raw bites single" }, { "start": 1642.76, "text": "stream of bytes and just so that it's" }, { "start": 1645.6, "text": "easier to work with instead of just a" }, { "start": 1647.64, "text": "bytes object I'm going to convert all" }, { "start": 1649.96, "text": "those bytes to integers and then create" }, { "start": 1652.64, "text": "a list of it just so it's easier for us" }, { "start": 1654.279, "text": "to manipulate and work with in Python" }, { "start": 1655.88, "text": "and visualize and here I'm printing all" }, { "start": 1658.0, "text": "of that so this is the original um this" }, { "start": 1662.08, "text": "is the original paragraph and its length" }, { "start": 1665.0, "text": "is" }, { "start": 1665.799, "text": "533 uh code points and then here are the" }, { "start": 1669.799, "text": "bytes encoded in ut utf8 and we see that" }, { "start": 1673.32, "text": "this has a length of 616 bytes at this" }, { "start": 1676.32, "text": "point or 616 tokens and the reason this" }, { "start": 1679.039, "text": "is more is because a lot of these simple" }, { "start": 1681.84, "text": "asky characters or simple characters" }, { "start": 1684.6, "text": "they just become a single bite but a lot" }, { "start": 1686.44, "text": "of these Unicode more complex characters" }, { "start": 1688.76, "text": "become multiple bytes up to four and so" }, { "start": 1691.08, "text": "we are expanding that" }, { "start": 1692.76, "text": "size so now what we'd like to do as a" }, { "start": 1694.799, "text": "first step of the algorithm is we'd like" }, { "start": 1696.24, "text": "to iterate over here and find the pair" }, { "start": 1698.919, "text": "of bites that occur most frequently" }, { "start": 1702.0, "text": "because we're then going to merge it so" }, { "start": 1704.12, "text": "if you are working long on a notebook on" }, { "start": 1705.799, "text": "a side then I encourage you to basically" }, { "start": 1707.76, "text": "click on the link find this notebook and" }, { "start": 1709.919, "text": "try to write that function yourself" }, { "start": 1711.88, "text": "otherwise I'm going to come here and" }, { "start": 1712.96, "text": "Implement first the function that finds" }, { "start": 1714.96, "text": "the most common pair okay so here's what" }, { "start": 1716.919, "text": "I came up with there are many different" }, { "start": 1718.399, "text": "ways to implement this but I'm calling" }, { "start": 1720.32, "text": "the function get stats it expects a list" }, { "start": 1722.159, "text": "of integers I'm using a dictionary to" }, { "start": 1724.48, "text": "keep track of basically the counts and" }, { "start": 1726.88, "text": "then this is a pythonic way to iterate" }, { "start": 1728.84, "text": "consecutive elements of this list uh" }, { "start": 1731.44, "text": "which we covered in the previous video" }, { "start": 1733.72, "text": "and then here I'm just keeping track of" }, { "start": 1735.919, "text": "just incrementing by one um for all the" }, { "start": 1738.559, "text": "pairs so if I call this on all the" }, { "start": 1740.399, "text": "tokens here then the stats comes out" }, { "start": 1743.399, "text": "here so this is the dictionary the keys" }, { "start": 1746.159, "text": "are these topples of consecutive" }, { "start": 1748.919, "text": "elements and this is the count so just" }, { "start": 1751.6, "text": "to uh print it in a slightly better way" }, { "start": 1754.679, "text": "this is one way that I like to do that" }, { "start": 1757.6, "text": "where you it's a little bit compound" }, { "start": 1760.559, "text": "here so you can pause if you like but we" }, { "start": 1762.36, "text": "iterate all all the items the items" }, { "start": 1765.039, "text": "called on dictionary returns pairs of" }, { "start": 1767.399, "text": "key value and instead I create a list" }, { "start": 1771.799, "text": "here of value key because if it's a" }, { "start": 1775.12, "text": "value key list then I can call sort on" }, { "start": 1777.279, "text": "it and by default python will uh use the" }, { "start": 1781.36, "text": "first element which in this case will be" }, { "start": 1783.559, "text": "value to sort by if it's given tles and" }, { "start": 1786.64, "text": "then reverse so it's descending and" }, { "start": 1788.72, "text": "print that so basically it looks like" }, { "start": 1790.88, "text": "101 comma 32 was the most commonly" }, { "start": 1793.96, "text": "occurring consecutive pair and it" }, { "start": 1795.72, "text": "occurred 20 times we can double check" }, { "start": 1798.2, "text": "that that makes reasonable sense so if I" }, { "start": 1800.44, "text": "just search" }, { "start": 1802.08, "text": "10132 then you see that these are the 20" }, { "start": 1805.2, "text": "occurrences of that um pair and if we'd" }, { "start": 1810.12, "text": "like to take a look at what exactly that" }, { "start": 1811.519, "text": "pair is we can use Char which is the" }, { "start": 1814.279, "text": "opposite of or in Python so we give it a" }, { "start": 1817.84, "text": "um unic code Cod point so 101 and of 32" }, { "start": 1822.039, "text": "and we see that this is e and space so" }, { "start": 1825.0, "text": "basically there's a lot of E space here" }, { "start": 1828.08, "text": "meaning that a lot of these words seem" }, { "start": 1829.48, "text": "to end with e so here's eace as an" }, { "start": 1832.12, "text": "example so there's a lot of that going" }, { "start": 1834.039, "text": "on here and this is the most common pair" }, { "start": 1836.72, "text": "so now that we've identified the most" }, { "start": 1838.24, "text": "common pair we would like to iterate" }, { "start": 1840.36, "text": "over this sequence we're going to Mint a" }, { "start": 1842.679, "text": "new token with the ID of" }, { "start": 1844.799, "text": "256 right because these tokens currently" }, { "start": 1847.84, "text": "go from Z to 255 so when we create a new" }, { "start": 1850.64, "text": "token it will have an ID of" }, { "start": 1852.84, "text": "256 and we're going to iterate over this" }, { "start": 1856.0, "text": "entire um list and every every time we" }, { "start": 1859.84, "text": "see 101 comma 32 we're going to swap" }, { "start": 1862.72, "text": "that out for" }, { "start": 1863.919, "text": "256 so let's Implement that now and feel" }, { "start": 1867.24, "text": "free to uh do that yourself as well so" }, { "start": 1869.96, "text": "first I commented uh this just so we" }, { "start": 1871.96, "text": "don't pollute uh the notebook too much" }, { "start": 1874.96, "text": "this is a nice way of in Python" }, { "start": 1877.96, "text": "obtaining the highest ranking pair so" }, { "start": 1880.399, "text": "we're basically calling the Max on this" }, { "start": 1883.08, "text": "dictionary stats and this will return" }, { "start": 1886.32, "text": "the maximum" }, { "start": 1887.679, "text": "key and then the question is how does it" }, { "start": 1890.159, "text": "rank keys so you can provide it with a" }, { "start": 1892.84, "text": "function that ranks keys and that" }, { "start": 1895.2, "text": "function is just stats. getet uh stats." }, { "start": 1898.2, "text": "getet would basically return the value" }, { "start": 1901.12, "text": "and so we're ranking by the value and" }, { "start": 1902.799, "text": "getting the maximum key so it's 101" }, { "start": 1905.48, "text": "comma 32 as we saw now to actually merge" }, { "start": 1909.2, "text": "10132 um this is the function that I" }, { "start": 1911.88, "text": "wrote but again there are many different" }, { "start": 1913.279, "text": "versions of it so we're going to take a" }, { "start": 1915.72, "text": "list of IDs and the the pair that we" }, { "start": 1917.72, "text": "want to replace and that pair will be" }, { "start": 1919.76, "text": "replaced with the new index" }, { "start": 1922.24, "text": "idx so iterating through IDs if we find" }, { "start": 1925.559, "text": "the pair swap it out for idx so we" }, { "start": 1928.44, "text": "create this new list and then we start" }, { "start": 1930.519, "text": "at zero and then we go through this" }, { "start": 1932.76, "text": "entire list sequentially from left to" }, { "start": 1934.84, "text": "right and here we are checking for" }, { "start": 1937.12, "text": "equality at the current position with" }, { "start": 1939.639, "text": "the" }, { "start": 1940.88, "text": "pair um so here we are checking that the" }, { "start": 1943.399, "text": "pair matches now here is a bit of a" }, { "start": 1945.48, "text": "tricky condition that you have to append" }, { "start": 1947.24, "text": "if you're trying to be careful and that" }, { "start": 1949.08, "text": "is that um you don't want this here to" }, { "start": 1951.679, "text": "be out of Bounds at the very last" }, { "start": 1953.76, "text": "position when you're on the rightmost" }, { "start": 1955.399, "text": "element of this list otherwise this" }, { "start": 1957.12, "text": "would uh give you an autof bounds error" }, { "start": 1959.279, "text": "so we have to make sure that we're not" }, { "start": 1960.679, "text": "at the very very last element so uh this" }, { "start": 1964.039, "text": "would be false for that so if we find a" }, { "start": 1966.6, "text": "match we append to this new list that" }, { "start": 1971.08, "text": "replacement index and we increment the" }, { "start": 1973.32, "text": "position by two so we skip over that" }, { "start": 1974.799, "text": "entire pair but otherwise if we we" }, { "start": 1977.12, "text": "haven't found a matching pair we just" }, { "start": 1979.08, "text": "sort of copy over the um element at that" }, { "start": 1982.12, "text": "position and increment by one then" }, { "start": 1985.24, "text": "return this so here's a very small toy" }, { "start": 1987.36, "text": "example if we have a list 566 791 and we" }, { "start": 1990.36, "text": "want to replace the occurrences of 67" }, { "start": 1992.36, "text": "with 99 then calling this on that will" }, { "start": 1996.36, "text": "give us what we're asking for so here" }, { "start": 1998.919, "text": "the 67 is replaced with" }, { "start": 2001.519, "text": "99 so now I'm going to uncomment this" }, { "start": 2003.76, "text": "for our actual use case where we want to" }, { "start": 2007.279, "text": "take our tokens we want to take the top" }, { "start": 2009.519, "text": "pair here and replace it with 256 to get" }, { "start": 2013.12, "text": "tokens to if we run this we get the" }, { "start": 2017.24, "text": "following so recall that previously we" }, { "start": 2020.88, "text": "had a length 616 in this list and now we" }, { "start": 2025.12, "text": "have a length 596 right so this" }, { "start": 2028.44, "text": "decreased by 20 which makes sense" }, { "start": 2030.159, "text": "because there are 20 occurrences" }, { "start": 2032.36, "text": "moreover we can try to find 256 here and" }, { "start": 2035.48, "text": "we see plenty of occurrences on off it" }, { "start": 2038.44, "text": "and moreover just double check there" }, { "start": 2039.76, "text": "should be no occurrence of 10132 so this" }, { "start": 2042.519, "text": "is the original array plenty of them and" }, { "start": 2045.0, "text": "in the second array there are no" }, { "start": 2046.159, "text": "occurrences of 1032 so we've" }, { "start": 2048.52, "text": "successfully merged this single pair and" }, { "start": 2051.599, "text": "now we just uh iterate this so we are" }, { "start": 2053.919, "text": "going to go over the sequence again find" }, { "start": 2055.48, "text": "the most common pair and replace it so" }, { "start": 2057.8, "text": "let me now write a y Loop that uses" }, { "start": 2059.48, "text": "these functions to do this um sort of" }, { "start": 2061.8, "text": "iteratively and how many times do we do" }, { "start": 2064.28, "text": "it four well that's totally up to us as" }, { "start": 2066.28, "text": "a hyper parameter" }, { "start": 2067.399, "text": "the more um steps we take the larger" }, { "start": 2070.919, "text": "will be our vocabulary and the shorter" }, { "start": 2073.04, "text": "will be our sequence and there is some" }, { "start": 2075.119, "text": "sweet spot that we usually find works" }, { "start": 2077.24, "text": "the best in practice and so this is kind" }, { "start": 2079.919, "text": "of a hyperparameter and we tune it and" }, { "start": 2081.639, "text": "we find good vocabulary sizes as an" }, { "start": 2084.2, "text": "example gp4 currently uses roughly" }, { "start": 2086.0, "text": "100,000 tokens and um bpark that those" }, { "start": 2089.879, "text": "are reasonable numbers currently instead" }, { "start": 2091.8, "text": "the are large language models so let me" }, { "start": 2093.919, "text": "now write uh putting putting it all" }, { "start": 2095.96, "text": "together and uh iterating these steps" }, { "start": 2098.68, "text": "okay now before we dive into the Y loop" }, { "start": 2100.52, "text": "I wanted to add one more cell here where" }, { "start": 2103.28, "text": "I went to the block post and instead of" }, { "start": 2104.96, "text": "grabbing just the first paragraph or two" }, { "start": 2107.0, "text": "I took the entire block post and I" }, { "start": 2108.8, "text": "stretched it out in a single line and" }, { "start": 2110.96, "text": "basically just using longer text will" }, { "start": 2112.48, "text": "allow us to have more representative" }, { "start": 2113.88, "text": "statistics for the bite Pairs and we'll" }, { "start": 2116.28, "text": "just get a more sensible results out of" }, { "start": 2118.04, "text": "it because it's longer text um so here" }, { "start": 2121.76, "text": "we have the raw text we encode it into" }, { "start": 2124.359, "text": "bytes using the utf8 encoding" }, { "start": 2127.64, "text": "and then here as before we are just" }, { "start": 2130.079, "text": "changing it into a list of integers in" }, { "start": 2131.839, "text": "Python just so it's easier to work with" }, { "start": 2133.96, "text": "instead of the raw byes objects and then" }, { "start": 2136.68, "text": "this is the code that I came up with uh" }, { "start": 2140.76, "text": "to actually do the merging in Loop these" }, { "start": 2144.0, "text": "two functions here are identical to what" }, { "start": 2145.839, "text": "we had above I only included them here" }, { "start": 2148.119, "text": "just so that you have the point of" }, { "start": 2149.88, "text": "reference here so uh these two are" }, { "start": 2153.359, "text": "identical and then this is the new code" }, { "start": 2155.0, "text": "that I added so the first first thing we" }, { "start": 2157.079, "text": "want to do is we want to decide on the" }, { "start": 2158.56, "text": "final vocabulary size that we want our" }, { "start": 2161.04, "text": "tokenizer to have and as I mentioned" }, { "start": 2162.96, "text": "this is a hyper parameter and you set it" }, { "start": 2164.52, "text": "in some way depending on your best" }, { "start": 2166.44, "text": "performance so let's say for us we're" }, { "start": 2168.48, "text": "going to use 276 because that way we're" }, { "start": 2170.839, "text": "going to be doing exactly 20" }, { "start": 2173.079, "text": "merges and uh 20 merges because we" }, { "start": 2175.72, "text": "already have" }, { "start": 2176.88, "text": "256 tokens for the raw bytes and to" }, { "start": 2180.88, "text": "reach 276 we have to do 20 merges uh to" }, { "start": 2183.68, "text": "add 20 new" }, { "start": 2185.48, "text": "tokens here uh this is uh one way in" }, { "start": 2188.2, "text": "Python to just create a copy of a list" }, { "start": 2191.48, "text": "so I'm taking the tokens list and by" }, { "start": 2193.52, "text": "wrapping it in a list python will" }, { "start": 2195.839, "text": "construct a new list of all the" }, { "start": 2197.16, "text": "individual elements so this is just a" }, { "start": 2198.64, "text": "copy" }, { "start": 2199.92, "text": "operation then here I'm creating a" }, { "start": 2202.079, "text": "merges uh dictionary so this merges" }, { "start": 2204.839, "text": "dictionary is going to maintain" }, { "start": 2206.119, "text": "basically the child one child two" }, { "start": 2209.4, "text": "mapping to a new uh token and so what" }, { "start": 2212.52, "text": "we're going to be building up here is a" }, { "start": 2213.92, "text": "binary tree of merges but actually it's" }, { "start": 2216.92, "text": "not exactly a tree because a tree would" }, { "start": 2219.28, "text": "have a single root node with a bunch of" }, { "start": 2221.44, "text": "leaves for us we're starting with the" }, { "start": 2223.44, "text": "leaves on the bottom which are the" }, { "start": 2225.0, "text": "individual bites those are the starting" }, { "start": 2226.92, "text": "256 tokens and then we're starting to" }, { "start": 2229.52, "text": "like merge two of them at a time and so" }, { "start": 2231.52, "text": "it's not a tree it's more like a forest" }, { "start": 2234.96, "text": "um uh as we merge these elements" }, { "start": 2238.92, "text": "so for 20 merges we're going to find the" }, { "start": 2242.88, "text": "most commonly occurring pair we're going" }, { "start": 2245.079, "text": "to Mint a new token integer for it so I" }, { "start": 2248.48, "text": "here will start at zero so we'll going" }, { "start": 2250.079, "text": "to start at 256 we're going to print" }, { "start": 2252.359, "text": "that we're merging it and we're going to" }, { "start": 2254.44, "text": "replace all of the occurrences of that" }, { "start": 2256.2, "text": "pair with the new new lied token and" }, { "start": 2259.56, "text": "we're going to record that this pair of" }, { "start": 2262.16, "text": "integers merged into this new" }, { "start": 2265.52, "text": "integer so running this gives us the" }, { "start": 2269.079, "text": "following" }, { "start": 2271.16, "text": "output so we did 20 merges and for" }, { "start": 2274.48, "text": "example the first merge was exactly as" }, { "start": 2276.839, "text": "before the" }, { "start": 2278.839, "text": "10132 um tokens merging into a new token" }, { "start": 2281.8, "text": "2556 now keep in mind that the" }, { "start": 2284.0, "text": "individual uh tokens 101 and 32 can" }, { "start": 2286.599, "text": "still occur in the sequence after" }, { "start": 2288.44, "text": "merging it's only when they occur" }, { "start": 2290.359, "text": "exactly consecutively that that becomes" }, { "start": 2292.599, "text": "256" }, { "start": 2293.88, "text": "now um and in particular the other thing" }, { "start": 2296.92, "text": "to notice here is that the token 256" }, { "start": 2299.16, "text": "which is the newly minted token is also" }, { "start": 2301.4, "text": "eligible for merging so here on the" }, { "start": 2303.4, "text": "bottom the 20th merge was a merge of 25" }, { "start": 2306.839, "text": "and 259 becoming" }, { "start": 2308.88, "text": "275 so every time we replace these" }, { "start": 2311.8, "text": "tokens they become eligible for merging" }, { "start": 2313.64, "text": "in the next round of data ration so" }, { "start": 2315.92, "text": "that's why we're building up a small" }, { "start": 2317.119, "text": "sort of binary Forest instead of a" }, { "start": 2318.8, "text": "single individual" }, { "start": 2320.2, "text": "tree one thing we can take a look at as" }, { "start": 2322.319, "text": "well is we can take a look at the" }, { "start": 2324.0, "text": "compression ratio that we've achieved so" }, { "start": 2326.16, "text": "in particular we started off with this" }, { "start": 2328.359, "text": "tokens list um so we started off with" }, { "start": 2331.4, "text": "24,000 bytes and after merging 20 times" }, { "start": 2336.28, "text": "uh we now have only" }, { "start": 2338.52, "text": "19,000 um tokens and so therefore the" }, { "start": 2341.92, "text": "compression ratio simply just dividing" }, { "start": 2343.64, "text": "the two is roughly 1.27 so that's the" }, { "start": 2346.8, "text": "amount of compression we were able to" }, { "start": 2347.96, "text": "achieve of this text with only 20" }, { "start": 2350.8, "text": "merges um and of course the more" }, { "start": 2353.119, "text": "vocabulary elements you add uh the" }, { "start": 2355.599, "text": "greater the compression ratio here would" }, { "start": 2359.24, "text": "be finally so that's kind of like um the" }, { "start": 2363.76, "text": "training of the tokenizer if you will" }, { "start": 2365.72, "text": "now 1 Point I wanted to make is that and" }, { "start": 2368.28, "text": "maybe this is a diagram that can help um" }, { "start": 2371.28, "text": "kind of illustrate is that tokenizer is" }, { "start": 2373.079, "text": "a completely separate object from the" }, { "start": 2374.92, "text": "large language model itself so" }, { "start": 2377.0, "text": "everything in this lecture we're not" }, { "start": 2378.04, "text": "really touching the llm itself uh we're" }, { "start": 2380.119, "text": "just training the tokenizer this is a" }, { "start": 2381.839, "text": "completely separate pre-processing stage" }, { "start": 2383.92, "text": "usually so the tokenizer will have its" }, { "start": 2386.24, "text": "own training set just like a large" }, { "start": 2387.96, "text": "language model has a potentially" }, { "start": 2389.8, "text": "different training set so the tokenizer" }, { "start": 2392.04, "text": "has a training set of documents on which" }, { "start": 2393.4, "text": "you're going to train the" }, { "start": 2394.76, "text": "tokenizer and then and um we're" }, { "start": 2397.76, "text": "performing The Bite pair encoding" }, { "start": 2398.96, "text": "algorithm as we saw above to train the" }, { "start": 2401.079, "text": "vocabulary of this" }, { "start": 2402.64, "text": "tokenizer so it has its own training set" }, { "start": 2404.96, "text": "it is a pre-processing stage that you" }, { "start": 2406.52, "text": "would run a single time in the beginning" }, { "start": 2409.24, "text": "um and the tokenizer is trained using" }, { "start": 2411.96, "text": "bipar coding algorithm once you have the" }, { "start": 2414.359, "text": "tokenizer once it's trained and you have" }, { "start": 2416.319, "text": "the vocabulary and you have the merges" }, { "start": 2419.04, "text": "uh we can do both encoding and decoding" }, { "start": 2422.28, "text": "so these two arrows here so the" }, { "start": 2424.52, "text": "tokenizer is a translation layer between" }, { "start": 2427.0, "text": "raw text which is as we saw the sequence" }, { "start": 2430.04, "text": "of Unicode code points it can take raw" }, { "start": 2432.52, "text": "text and turn it into a token sequence" }, { "start": 2435.44, "text": "and vice versa it can take a token" }, { "start": 2437.0, "text": "sequence and translate it back into raw" }, { "start": 2440.76, "text": "text so now that we have trained uh" }, { "start": 2443.359, "text": "tokenizer and we have these merges we" }, { "start": 2445.96, "text": "are going to turn to how we can do the" }, { "start": 2447.44, "text": "encoding and the decoding step if you" }, { "start": 2449.48, "text": "give me text here are the tokens and" }, { "start": 2451.24, "text": "vice versa if you give me tokens here's" }, { "start": 2453.0, "text": "the text once we have that we can" }, { "start": 2455.28, "text": "translate between these two Realms and" }, { "start": 2457.52, "text": "then the language model is going to be" }, { "start": 2458.76, "text": "trained as a step two afterwards and" }, { "start": 2461.64, "text": "typically in a in a sort of a" }, { "start": 2463.64, "text": "state-of-the-art application you might" }, { "start": 2465.48, "text": "take all of your training data for the" }, { "start": 2466.839, "text": "language model and you might run it" }, { "start": 2468.359, "text": "through the tokenizer and sort of" }, { "start": 2470.4, "text": "translate everything into a massive" }, { "start": 2471.92, "text": "token sequence and then you can throw" }, { "start": 2473.64, "text": "away the raw text you're just left with" }, { "start": 2475.44, "text": "the tokens themselves and those are" }, { "start": 2477.72, "text": "stored on disk and that is what the" }, { "start": 2479.72, "text": "large language model is actually reading" }, { "start": 2481.319, "text": "when it's training on them so this one" }, { "start": 2483.24, "text": "approach that you can take as a single" }, { "start": 2484.8, "text": "massive pre-processing step a" }, { "start": 2486.88, "text": "stage um so yeah basically I think the" }, { "start": 2490.4, "text": "most important thing I want to get" }, { "start": 2491.4, "text": "across is that this is completely" }, { "start": 2492.599, "text": "separate stage it usually has its own" }, { "start": 2494.4, "text": "entire uh training set you may want to" }, { "start": 2496.839, "text": "have those training sets be different" }, { "start": 2498.359, "text": "between the tokenizer and the logge" }, { "start": 2499.599, "text": "language model so for example when" }, { "start": 2501.28, "text": "you're training the tokenizer as I" }, { "start": 2503.319, "text": "mentioned we don't just care about the" }, { "start": 2505.079, "text": "performance of English text we care" }, { "start": 2506.76, "text": "about uh multi many different languages" }, { "start": 2509.44, "text": "and we also care about code or not code" }, { "start": 2511.52, "text": "so you may want to look into different" }, { "start": 2513.24, "text": "kinds of mixtures of different kinds of" }, { "start": 2515.2, "text": "languages and different amounts of code" }, { "start": 2517.359, "text": "and things like that because the amount" }, { "start": 2520.24, "text": "of different language that you have in" }, { "start": 2521.96, "text": "your tokenizer training set will" }, { "start": 2523.76, "text": "determine how many merges of it there" }, { "start": 2526.119, "text": "will be and therefore that determines" }, { "start": 2528.24, "text": "the density with which uh this type of" }, { "start": 2531.319, "text": "data is um sort of has in the token" }, { "start": 2535.2, "text": "space and so roughly speaking" }, { "start": 2537.76, "text": "intuitively if you add some amount of" }, { "start": 2539.72, "text": "data like say you have a ton of Japanese" }, { "start": 2541.359, "text": "data in your uh tokenizer training set" }, { "start": 2544.04, "text": "then that means that more Japanese" }, { "start": 2545.359, "text": "tokens will get merged" }, { "start": 2546.839, "text": "and therefore Japanese will have shorter" }, { "start": 2548.92, "text": "sequences uh and that's going to be" }, { "start": 2550.64, "text": "beneficial for the large language model" }, { "start": 2552.4, "text": "which has a finite context length on" }, { "start": 2554.359, "text": "which it can work on in in the token" }, { "start": 2556.599, "text": "space uh so hopefully that makes sense" }, { "start": 2559.24, "text": "so we're now going to turn to encoding" }, { "start": 2561.2, "text": "and decoding now that we have trained a" }, { "start": 2563.079, "text": "tokenizer so we have our merges and now" }, { "start": 2566.4, "text": "how do we do encoding and decoding okay" }, { "start": 2568.44, "text": "so let's begin with decoding which is" }, { "start": 2570.44, "text": "this Arrow over here so given a token" }, { "start": 2572.72, "text": "sequence let's go through the tokenizer" }, { "start": 2574.92, "text": "to get back a python string object so" }, { "start": 2577.52, "text": "the raw text so this is the function" }, { "start": 2579.88, "text": "that we' like to implement um we're" }, { "start": 2581.88, "text": "given the list of integers and we want" }, { "start": 2583.44, "text": "to return a python string if you'd like" }, { "start": 2585.68, "text": "uh try to implement this function" }, { "start": 2586.839, "text": "yourself it's a fun exercise otherwise" }, { "start": 2588.839, "text": "I'm going to start uh pasting in my own" }, { "start": 2591.28, "text": "solution so there are many different" }, { "start": 2593.52, "text": "ways to do it um here's one way I will" }, { "start": 2596.88, "text": "create an uh kind of pre-processing" }, { "start": 2598.88, "text": "variable that I will call" }, { "start": 2601.04, "text": "vocab and vocab is a mapping or a" }, { "start": 2604.68, "text": "dictionary in Python for from the token" }, { "start": 2607.559, "text": "uh ID to the bytes object for that token" }, { "start": 2611.52, "text": "so we begin with the raw bytes for" }, { "start": 2613.8, "text": "tokens from 0 to 255 and then we go in" }, { "start": 2616.839, "text": "order of all the merges and we sort of" }, { "start": 2619.76, "text": "uh populate this vocab list by doing an" }, { "start": 2622.28, "text": "addition here so this is the basically" }, { "start": 2625.72, "text": "the bytes representation of the first" }, { "start": 2627.76, "text": "child followed by the second one and" }, { "start": 2630.04, "text": "remember these are bytes objects so this" }, { "start": 2632.079, "text": "addition here is an addition of two" }, { "start": 2634.2, "text": "bytes objects just concatenation" }, { "start": 2637.04, "text": "so that's what we get" }, { "start": 2638.76, "text": "here one tricky thing to be careful with" }, { "start": 2641.2, "text": "by the way is that I'm iterating a" }, { "start": 2642.88, "text": "dictionary in Python using a DOT items" }, { "start": 2646.0, "text": "and uh it really matters that this runs" }, { "start": 2648.72, "text": "in the order in which we inserted items" }, { "start": 2651.48, "text": "into the merous dictionary luckily" }, { "start": 2653.559, "text": "starting with python 3.7 this is" }, { "start": 2655.4, "text": "guaranteed to be the case but before" }, { "start": 2657.04, "text": "python 3.7 this iteration may have been" }, { "start": 2659.16, "text": "out of order with respect to how we" }, { "start": 2660.96, "text": "inserted elements into merges and this" }, { "start": 2663.16, "text": "may not have worked but we are using an" }, { "start": 2665.8, "text": "um modern python so we're okay and then" }, { "start": 2668.8, "text": "here uh given the IDS the first thing" }, { "start": 2671.599, "text": "we're going to do is get the" }, { "start": 2675.04, "text": "tokens so the way I implemented this" }, { "start": 2677.24, "text": "here is I'm taking I'm iterating over" }, { "start": 2679.599, "text": "all the IDS I'm using vocap to look up" }, { "start": 2681.88, "text": "their bytes and then here this is one" }, { "start": 2684.119, "text": "way in Python to concatenate all these" }, { "start": 2686.64, "text": "bytes together to create our tokens and" }, { "start": 2689.72, "text": "then these tokens here at this point are" }, { "start": 2691.72, "text": "raw bytes so I have to decode using UTF" }, { "start": 2696.0, "text": "F now back into python strings so" }, { "start": 2699.2, "text": "previously we called that encode on a" }, { "start": 2701.16, "text": "string object to get the bytes and now" }, { "start": 2703.2, "text": "we're doing it Opposite we're taking the" }, { "start": 2705.2, "text": "bytes and calling a decode on the bytes" }, { "start": 2707.8, "text": "object to get a string in Python and" }, { "start": 2711.0, "text": "then we can return" }, { "start": 2713.319, "text": "text so um this is how we can do it now" }, { "start": 2716.96, "text": "this actually has a um issue um in the" }, { "start": 2720.8, "text": "way I implemented it and this could" }, { "start": 2722.119, "text": "actually throw an error so try to think" }, { "start": 2724.119, "text": "figure out why this code could actually" }, { "start": 2726.48, "text": "result in an error if we plug in um uh" }, { "start": 2730.24, "text": "some sequence of IDs that is" }, { "start": 2732.599, "text": "unlucky so let me demonstrate the issue" }, { "start": 2735.24, "text": "when I try to decode just something like" }, { "start": 2737.16, "text": "97 I am going to get letter A here back" }, { "start": 2741.079, "text": "so nothing too crazy happening but when" }, { "start": 2744.4, "text": "I try to decode 128 as a single element" }, { "start": 2748.24, "text": "the token 128 is what in string or in" }, { "start": 2751.319, "text": "Python object uni Cod decoder utfa can't" }, { "start": 2755.119, "text": "Decode by um 0x8 which is this in HEX in" }, { "start": 2760.119, "text": "position zero invalid start bite what" }, { "start": 2761.92, "text": "does that mean well to understand what" }, { "start": 2763.64, "text": "this means we have to go back to our" }, { "start": 2764.76, "text": "utf8 page uh that I briefly showed" }, { "start": 2767.92, "text": "earlier and this is Wikipedia utf8 and" }, { "start": 2770.76, "text": "basically there's a specific schema that" }, { "start": 2773.559, "text": "utfa bytes take so in particular if you" }, { "start": 2776.92, "text": "have a multi-te object for some of the" }, { "start": 2779.839, "text": "Unicode characters they have to have" }, { "start": 2781.52, "text": "this special sort of envelope in how the" }, { "start": 2784.16, "text": "encoding works and so what's happening" }, { "start": 2786.52, "text": "here is that invalid start pite that's" }, { "start": 2790.0, "text": "because" }, { "start": 2791.0, "text": "128 the binary representation of it is" }, { "start": 2793.88, "text": "one followed by all zeros so we have one" }, { "start": 2797.359, "text": "and then all zero and we see here that" }, { "start": 2799.559, "text": "that doesn't conform to the format" }, { "start": 2801.04, "text": "because one followed by all zero just" }, { "start": 2802.68, "text": "doesn't fit any of these rules so to" }, { "start": 2804.96, "text": "speak so it's an invalid start bite" }, { "start": 2807.64, "text": "which is byte one this one must have a" }, { "start": 2810.599, "text": "one following it and then a zero" }, { "start": 2812.76, "text": "following it and then the content of" }, { "start": 2814.48, "text": "your uni codee in x here so basically we" }, { "start": 2817.68, "text": "don't um exactly follow the utf8" }, { "start": 2819.96, "text": "standard and this cannot be decoded and" }, { "start": 2822.52, "text": "so the way to fix this um is to" }, { "start": 2826.28, "text": "use this errors equals in bytes. decode" }, { "start": 2831.04, "text": "function of python and by default errors" }, { "start": 2833.839, "text": "is strict so we will throw an error if" }, { "start": 2837.16, "text": "um it's not valid utf8 bytes encoding" }, { "start": 2840.28, "text": "but there are many different things that" }, { "start": 2841.68, "text": "you could put here on error handling" }, { "start": 2843.68, "text": "this is the full list of all the errors" }, { "start": 2845.359, "text": "that you can use and in particular" }, { "start": 2847.359, "text": "instead of strict let's change it to" }, { "start": 2849.359, "text": "replace and that will replace uh with" }, { "start": 2852.28, "text": "this special marker this replacement" }, { "start": 2855.8, "text": "character so errors equals replace and" }, { "start": 2860.52, "text": "now we just get that character" }, { "start": 2863.16, "text": "back so basically not every single by" }, { "start": 2866.96, "text": "sequence is valid" }, { "start": 2868.52, "text": "utf8 and if it happens that your large" }, { "start": 2871.48, "text": "language model for example predicts your" }, { "start": 2873.88, "text": "tokens in a bad manner then they might" }, { "start": 2876.64, "text": "not fall into valid utf8 and then we" }, { "start": 2880.24, "text": "won't be able to decode them so the" }, { "start": 2882.88, "text": "standard practice is to basically uh use" }, { "start": 2885.64, "text": "errors equals replace and this is what" }, { "start": 2887.52, "text": "you will also find in the openai um code" }, { "start": 2890.319, "text": "that they released as well but basically" }, { "start": 2892.72, "text": "whenever you see um this kind of a" }, { "start": 2894.2, "text": "character in your output in that case uh" }, { "start": 2896.0, "text": "something went wrong and the LM output" }, { "start": 2898.16, "text": "not was not valid uh sort of sequence of" }, { "start": 2901.52, "text": "tokens okay and now we're going to go" }, { "start": 2903.48, "text": "the other way so we are going to" }, { "start": 2905.319, "text": "implement" }, { "start": 2906.24, "text": "this Arrow right here where we are going" }, { "start": 2907.96, "text": "to be given a string and we want to" }, { "start": 2909.64, "text": "encode it into" }, { "start": 2911.16, "text": "tokens so this is the signature of the" }, { "start": 2913.72, "text": "function that we're interested in and um" }, { "start": 2916.92, "text": "this should basically print a list of" }, { "start": 2918.16, "text": "integers of the tokens so again uh try" }, { "start": 2921.76, "text": "to maybe implement this yourself if" }, { "start": 2923.04, "text": "you'd like a fun exercise uh and pause" }, { "start": 2925.559, "text": "here otherwise I'm going to start" }, { "start": 2926.52, "text": "putting in my" }, { "start": 2927.96, "text": "solution so again there are many ways to" }, { "start": 2930.28, "text": "do this so um this is one of the ways" }, { "start": 2933.64, "text": "that sort of I came came up with so the" }, { "start": 2937.599, "text": "first thing we're going to do is we are" }, { "start": 2939.16, "text": "going" }, { "start": 2940.119, "text": "to uh take our text encode it into utf8" }, { "start": 2943.44, "text": "to get the raw bytes and then as before" }, { "start": 2945.799, "text": "we're going to call list on the bytes" }, { "start": 2947.28, "text": "object to get a list of integers of" }, { "start": 2950.079, "text": "those bytes so those are the starting" }, { "start": 2952.76, "text": "tokens those are the raw bytes of our" }, { "start": 2954.599, "text": "sequence but now of course according to" }, { "start": 2956.96, "text": "the merges dictionary above and recall" }, { "start": 2959.559, "text": "this was the" }, { "start": 2961.079, "text": "merges some of the bytes may be merged" }, { "start": 2963.96, "text": "according to this lookup in addition to" }, { "start": 2966.559, "text": "that remember that the merges was built" }, { "start": 2968.16, "text": "from top to bottom and this is sort of" }, { "start": 2969.92, "text": "the order in which we inserted stuff" }, { "start": 2971.359, "text": "into merges and so we prefer to do all" }, { "start": 2974.28, "text": "these merges in the beginning before we" }, { "start": 2976.119, "text": "do these merges later because um for" }, { "start": 2979.2, "text": "example this merge over here relies on" }, { "start": 2980.96, "text": "the 256 which got merged here so we have" }, { "start": 2984.64, "text": "to go in the order from top to bottom" }, { "start": 2986.92, "text": "sort of if we are going to be merging" }, { "start": 2988.92, "text": "anything now we expect to be doing a few" }, { "start": 2991.44, "text": "merges so we're going to be doing W" }, { "start": 2994.52, "text": "true um and now we want to find a pair" }, { "start": 2998.079, "text": "of byes that is consecutive that we are" }, { "start": 3000.72, "text": "allowed to merge according to this in" }, { "start": 3003.599, "text": "order to reuse some of the functionality" }, { "start": 3005.0, "text": "that we've already written I'm going to" }, { "start": 3006.559, "text": "reuse the function uh get" }, { "start": 3009.079, "text": "stats so recall that get stats uh will" }, { "start": 3012.079, "text": "give us the we'll basically count up how" }, { "start": 3014.24, "text": "many times every single pair occurs in" }, { "start": 3016.599, "text": "our sequence of tokens and return that" }, { "start": 3018.92, "text": "as a dictionary and the dictionary was a" }, { "start": 3022.079, "text": "mapping from all the different uh by" }, { "start": 3025.599, "text": "pairs to the number of times that they" }, { "start": 3027.4, "text": "occur right um at this point we don't" }, { "start": 3030.28, "text": "actually care how many times they occur" }, { "start": 3032.359, "text": "in the sequence we only care what the" }, { "start": 3034.359, "text": "raw pairs are in that sequence and so" }, { "start": 3036.839, "text": "I'm only going to be using basically the" }, { "start": 3038.28, "text": "keys of the dictionary I only care about" }, { "start": 3040.44, "text": "the set of possible merge candidates if" }, { "start": 3042.92, "text": "that makes" }, { "start": 3043.76, "text": "sense now we want to identify the pair" }, { "start": 3046.16, "text": "that we're going to be merging at this" }, { "start": 3047.72, "text": "stage of the loop so what do we want we" }, { "start": 3050.24, "text": "want to find the pair or like the a key" }, { "start": 3053.24, "text": "inside stats that has the lowest index" }, { "start": 3057.079, "text": "in the merges uh dictionary because we" }, { "start": 3059.64, "text": "want to do all the early merges before" }, { "start": 3061.28, "text": "we work our way to the late" }, { "start": 3063.079, "text": "merges so again there are many different" }, { "start": 3065.319, "text": "ways to implement this but I'm going to" }, { "start": 3067.72, "text": "do something a little bit fancy" }, { "start": 3071.28, "text": "here so I'm going to be using the Min" }, { "start": 3074.2, "text": "over an iterator in Python when you call" }, { "start": 3076.799, "text": "Min on an iterator and stats here as a" }, { "start": 3078.96, "text": "dictionary we're going to be iterating" }, { "start": 3080.839, "text": "the keys of this dictionary in Python so" }, { "start": 3084.119, "text": "we're looking at all the pairs inside" }, { "start": 3087.079, "text": "stats um which are all the consecutive" }, { "start": 3089.359, "text": "Pairs and we're going to be taking the" }, { "start": 3092.079, "text": "consecutive pair inside tokens that has" }, { "start": 3094.44, "text": "the minimum what the Min takes a key" }, { "start": 3098.88, "text": "which gives us the function that is" }, { "start": 3100.319, "text": "going to return a value over which we're" }, { "start": 3102.359, "text": "going to do the Min and the one we care" }, { "start": 3104.96, "text": "about is we're we care about taking" }, { "start": 3106.44, "text": "merges and basically getting um that" }, { "start": 3110.92, "text": "pairs" }, { "start": 3112.839, "text": "index so basically for any pair inside" }, { "start": 3117.16, "text": "stats we are going to be looking into" }, { "start": 3119.72, "text": "merges at what index it has and we want" }, { "start": 3123.079, "text": "to get the pair with the Min number so" }, { "start": 3125.839, "text": "as an example if there's a pair 101 and" }, { "start": 3127.559, "text": "32 we definitely want to get that pair" }, { "start": 3130.44, "text": "uh we want to identify it here and" }, { "start": 3131.92, "text": "return it and pair would become 10132 if" }, { "start": 3135.04, "text": "it" }, { "start": 3135.76, "text": "occurs and the reason that I'm putting a" }, { "start": 3137.96, "text": "float INF here as a fall back is that in" }, { "start": 3141.4, "text": "the get function when we call uh when we" }, { "start": 3144.2, "text": "basically consider a pair that doesn't" }, { "start": 3146.599, "text": "occur in the merges then that pair is" }, { "start": 3149.0, "text": "not eligible to be merged right so if in" }, { "start": 3151.88, "text": "the token sequence there's some pair" }, { "start": 3153.48, "text": "that is not a merging pair it cannot be" }, { "start": 3155.559, "text": "merged then uh it doesn't actually occur" }, { "start": 3158.119, "text": "here and it doesn't have an index and uh" }, { "start": 3160.839, "text": "it cannot be merged which we will denote" }, { "start": 3162.599, "text": "as float INF and the reason Infinity is" }, { "start": 3165.079, "text": "nice here is because for sure we're" }, { "start": 3166.599, "text": "guaranteed that it's not going to" }, { "start": 3168.079, "text": "participate in the list of candidates" }, { "start": 3170.04, "text": "when we do the men so uh so this is one" }, { "start": 3173.44, "text": "way to do it so B basically long story" }, { "start": 3175.88, "text": "short this Returns the most eligible" }, { "start": 3178.28, "text": "merging candidate pair uh that occurs in" }, { "start": 3181.119, "text": "the tokens now one thing to be careful" }, { "start": 3184.079, "text": "with here is this uh function here might" }, { "start": 3187.48, "text": "fail in the following way if there's" }, { "start": 3189.88, "text": "nothing to merge then uh uh then there's" }, { "start": 3193.599, "text": "nothing in merges um that satisfi that" }, { "start": 3196.92, "text": "is satisfied anymore there's nothing to" }, { "start": 3198.559, "text": "merge everything just returns float imps" }, { "start": 3201.72, "text": "and then the pair I think will just" }, { "start": 3203.68, "text": "become the very first element of stats" }, { "start": 3206.96, "text": "um but this pair is not actually a" }, { "start": 3208.359, "text": "mergeable pair it just becomes the first" }, { "start": 3211.16, "text": "pair inside stats arbitrarily because" }, { "start": 3213.28, "text": "all of these pairs evaluate to float in" }, { "start": 3216.319, "text": "for the merging Criterion so basically" }, { "start": 3218.559, "text": "it could be that this this doesn't look" }, { "start": 3220.359, "text": "succeed because there's no more merging" }, { "start": 3221.64, "text": "pairs so if this pair is not in merges" }, { "start": 3224.64, "text": "that was returned then this is a signal" }, { "start": 3226.839, "text": "for us that actually there was nothing" }, { "start": 3228.4, "text": "to merge no single pair can be merged" }, { "start": 3230.72, "text": "anymore in that case we will break" }, { "start": 3233.079, "text": "out um nothing else can be" }, { "start": 3237.88, "text": "merged you may come up with a different" }, { "start": 3239.839, "text": "implementation by the way this is kind" }, { "start": 3241.04, "text": "of like really trying hard in" }, { "start": 3243.88, "text": "Python um but really we're just trying" }, { "start": 3245.96, "text": "to find a pair that can be merged with" }, { "start": 3247.799, "text": "the lowest index" }, { "start": 3249.599, "text": "here now if we did find a pair that is" }, { "start": 3253.88, "text": "inside merges with the lowest index then" }, { "start": 3256.28, "text": "we can merge it" }, { "start": 3259.839, "text": "so we're going to look into the merger" }, { "start": 3262.04, "text": "dictionary for that pair to look up the" }, { "start": 3264.28, "text": "index and we're going to now merge that" }, { "start": 3267.28, "text": "into that index so we're going to do" }, { "start": 3269.24, "text": "tokens equals and we're going to" }, { "start": 3272.24, "text": "replace the original tokens we're going" }, { "start": 3274.64, "text": "to be replacing the pair pair and we're" }, { "start": 3276.76, "text": "going to be replacing it with index idx" }, { "start": 3278.96, "text": "and this returns a new list of tokens" }, { "start": 3281.64, "text": "where every occurrence of pair is" }, { "start": 3283.16, "text": "replaced with idx so we're doing a merge" }, { "start": 3286.28, "text": "and we're going to be continuing this" }, { "start": 3287.599, "text": "until eventually nothing can be merged" }, { "start": 3289.28, "text": "we'll come out here and we'll break out" }, { "start": 3291.28, "text": "and here we just return" }, { "start": 3293.319, "text": "tokens and so that that's the" }, { "start": 3295.839, "text": "implementation I think so hopefully this" }, { "start": 3297.44, "text": "runs okay cool um yeah and this looks uh" }, { "start": 3302.44, "text": "reasonable so for example 32 is a space" }, { "start": 3304.88, "text": "in asky so that's here um so this looks" }, { "start": 3309.2, "text": "like it worked great okay so let's wrap" }, { "start": 3311.48, "text": "up this section of the video at least I" }, { "start": 3313.48, "text": "wanted to point out that this is not" }, { "start": 3314.88, "text": "quite the right implementation just yet" }, { "start": 3316.359, "text": "because we are leaving out a special" }, { "start": 3317.96, "text": "case so in particular if uh we try to do" }, { "start": 3320.68, "text": "this this would give us an error and the" }, { "start": 3323.559, "text": "issue is that um if we only have a" }, { "start": 3325.64, "text": "single character or an empty string then" }, { "start": 3328.039, "text": "stats is empty and that causes an issue" }, { "start": 3329.839, "text": "inside Min so one way to fight this is" }, { "start": 3332.96, "text": "if L of tokens is at least two because" }, { "start": 3336.359, "text": "if it's less than two it's just a single" }, { "start": 3337.839, "text": "token or no tokens then let's just uh" }, { "start": 3340.079, "text": "there's nothing to merge so we just" }, { "start": 3341.52, "text": "return so that would fix uh that" }, { "start": 3344.64, "text": "case Okay and then second I have a few" }, { "start": 3348.079, "text": "test cases here for us as well so first" }, { "start": 3350.44, "text": "let's make sure uh about or let's note" }, { "start": 3353.359, "text": "the following if we take a string and we" }, { "start": 3356.44, "text": "try to encode it and then decode it back" }, { "start": 3358.64, "text": "you'd expect to get the same string back" }, { "start": 3360.24, "text": "right is that true for all" }, { "start": 3364.68, "text": "strings so I think uh so here it is the" }, { "start": 3367.16, "text": "case and I think in general this is" }, { "start": 3368.72, "text": "probably the case um but notice that" }, { "start": 3372.039, "text": "going backwards is not is not you're not" }, { "start": 3374.64, "text": "going to have an identity going" }, { "start": 3375.92, "text": "backwards because as I mentioned us not" }, { "start": 3379.2, "text": "all token sequences are valid utf8 uh" }, { "start": 3382.96, "text": "sort of by streams and so so therefore" }, { "start": 3385.44, "text": "you're some of them can't even be" }, { "start": 3387.2, "text": "decodable um so this only goes in One" }, { "start": 3390.48, "text": "Direction but for that one direction we" }, { "start": 3392.92, "text": "can check uh here if we take the" }, { "start": 3394.76, "text": "training text which is the text that we" }, { "start": 3396.319, "text": "train to tokenizer around we can make" }, { "start": 3398.0, "text": "sure that when we encode and decode we" }, { "start": 3399.44, "text": "get the same thing back which is true" }, { "start": 3401.96, "text": "and here I took some validation data so" }, { "start": 3403.839, "text": "I went to I think this web page and I" }, { "start": 3405.599, "text": "grabbed some text so this is text that" }, { "start": 3407.76, "text": "the tokenizer has not seen and we can" }, { "start": 3409.68, "text": "make sure that this also works um okay" }, { "start": 3412.72, "text": "so that gives us some confidence that" }, { "start": 3413.92, "text": "this was correctly implemented" }, { "start": 3416.0, "text": "so those are the basics of the bite pair" }, { "start": 3418.039, "text": "encoding algorithm we saw how we can uh" }, { "start": 3420.72, "text": "take some training set train a tokenizer" }, { "start": 3423.68, "text": "the parameters of this tokenizer really" }, { "start": 3425.44, "text": "are just this dictionary of merges and" }, { "start": 3428.119, "text": "that basically creates the little binary" }, { "start": 3429.599, "text": "Forest on top of raw" }, { "start": 3431.559, "text": "bites once we have this the merges table" }, { "start": 3434.68, "text": "we can both encode and decode between" }, { "start": 3436.799, "text": "raw text and token sequences so that's" }, { "start": 3439.4, "text": "the the simplest setting of The" }, { "start": 3441.28, "text": "tokenizer what we're going to do now" }, { "start": 3443.2, "text": "though is we're going to look at some of" }, { "start": 3444.48, "text": "the St the art lar language models and" }, { "start": 3446.559, "text": "the kinds of tokenizers that they use" }, { "start": 3448.359, "text": "and we're going to see that this picture" }, { "start": 3449.559, "text": "complexifies very quickly so we're going" }, { "start": 3451.64, "text": "to go through the details of this comp" }, { "start": 3454.599, "text": "complexification one at a time so let's" }, { "start": 3457.52, "text": "kick things off by looking at the GPD" }, { "start": 3459.039, "text": "Series so in particular I have the gpt2" }, { "start": 3461.64, "text": "paper here um and this paper is from" }, { "start": 3464.64, "text": "2019 or so so 5 years ago and let's" }, { "start": 3468.359, "text": "scroll down to input representation this" }, { "start": 3471.28, "text": "is where they talk about the tokenizer" }, { "start": 3472.68, "text": "that they're using for gpd2 now this is" }, { "start": 3475.64, "text": "all fairly readable so I encourage you" }, { "start": 3477.039, "text": "to pause and um read this yourself but" }, { "start": 3480.039, "text": "this is where they motivate the use of" }, { "start": 3482.0, "text": "the bite pair encoding algorithm on the" }, { "start": 3484.68, "text": "bite level representation of utf8" }, { "start": 3487.52, "text": "encoding so this is where they motivate" }, { "start": 3489.52, "text": "it and they talk about the vocabulary" }, { "start": 3491.079, "text": "sizes and everything now everything here" }, { "start": 3493.839, "text": "is exactly as we've covered it so far" }, { "start": 3495.92, "text": "but things start to depart around here" }, { "start": 3498.559, "text": "so what they mention is that they don't" }, { "start": 3500.44, "text": "just apply the naive algorithm as we" }, { "start": 3502.28, "text": "have done it and in particular here's a" }, { "start": 3505.16, "text": "example suppose that you have common" }, { "start": 3507.0, "text": "words like dog what will happen is that" }, { "start": 3509.48, "text": "dog of course occurs very frequently in" }, { "start": 3511.64, "text": "the text and it occurs right next to all" }, { "start": 3514.28, "text": "kinds of punctuation as an example so" }, { "start": 3516.4, "text": "doc dot dog exclamation mark dog" }, { "start": 3519.16, "text": "question mark Etc and naively you might" }, { "start": 3522.24, "text": "imagine that the BP algorithm could" }, { "start": 3523.64, "text": "merge these to be single tokens and then" }, { "start": 3525.76, "text": "you end up with lots of tokens that are" }, { "start": 3527.44, "text": "just like dog with a slightly different" }, { "start": 3529.0, "text": "punctuation and so it feels like you're" }, { "start": 3530.88, "text": "clustering things that shouldn't be" }, { "start": 3532.039, "text": "clustered you're combining kind of" }, { "start": 3533.64, "text": "semantics with" }, { "start": 3535.559, "text": "uation and this uh feels suboptimal and" }, { "start": 3538.92, "text": "indeed they also say that this is" }, { "start": 3540.96, "text": "suboptimal according to some of the" }, { "start": 3542.359, "text": "experiments so what they want to do is" }, { "start": 3544.2, "text": "they want to top down in a manual way" }, { "start": 3546.319, "text": "enforce that some types of um characters" }, { "start": 3549.599, "text": "should never be merged together um so" }, { "start": 3552.76, "text": "they want to enforce these merging rules" }, { "start": 3554.799, "text": "on top of the bite PA encoding algorithm" }, { "start": 3557.68, "text": "so let's take a look um at their code" }, { "start": 3559.88, "text": "and see how they actually enforce this" }, { "start": 3561.48, "text": "and what kinds of mergy they actually do" }, { "start": 3563.2, "text": "perform so I have to to tab open here" }, { "start": 3565.839, "text": "for gpt2 under open AI on GitHub and" }, { "start": 3569.64, "text": "when we go to" }, { "start": 3570.68, "text": "Source there is an encoder thatp now I" }, { "start": 3574.28, "text": "don't personally love that they call it" }, { "start": 3575.599, "text": "encoder dopy because this is the" }, { "start": 3577.079, "text": "tokenizer and the tokenizer can do both" }, { "start": 3579.359, "text": "encode and decode uh so it feels kind of" }, { "start": 3581.88, "text": "awkward to me that it's called encoder" }, { "start": 3583.2, "text": "but that is the tokenizer and there's a" }, { "start": 3585.92, "text": "lot going on here and we're going to" }, { "start": 3587.0, "text": "step through it in detail at one point" }, { "start": 3589.24, "text": "for now I just want to focus on this" }, { "start": 3591.599, "text": "part here the create a rigix pattern" }, { "start": 3594.359, "text": "here that looks very complicated and" }, { "start": 3596.24, "text": "we're going to go through it in a bit uh" }, { "start": 3598.68, "text": "but this is the core part that allows" }, { "start": 3600.28, "text": "them to enforce rules uh for what parts" }, { "start": 3604.0, "text": "of the text Will Never Be merged for" }, { "start": 3605.96, "text": "sure now notice that re. compile here is" }, { "start": 3608.64, "text": "a little bit misleading because we're" }, { "start": 3610.76, "text": "not just doing import re which is the" }, { "start": 3612.44, "text": "python re module we're doing import reex" }, { "start": 3614.64, "text": "as re and reex is a python package that" }, { "start": 3617.72, "text": "you can install P install r x and it's" }, { "start": 3620.4, "text": "basically an extension of re so it's a" }, { "start": 3622.079, "text": "bit more powerful" }, { "start": 3623.24, "text": "re um" }, { "start": 3626.0, "text": "so let's take a look at this pattern and" }, { "start": 3628.88, "text": "what it's doing and why this is actually" }, { "start": 3630.799, "text": "doing the separation that they are" }, { "start": 3632.64, "text": "looking for okay so I've copy pasted the" }, { "start": 3634.92, "text": "pattern here to our jupit notebook where" }, { "start": 3637.119, "text": "we left off and let's take this pattern" }, { "start": 3639.24, "text": "for a spin so in the exact same way that" }, { "start": 3642.119, "text": "their code does we're going to call an" }, { "start": 3644.079, "text": "re. findall for this pattern on any" }, { "start": 3647.28, "text": "arbitrary string that we are interested" }, { "start": 3649.359, "text": "so this is the string that we want to" }, { "start": 3650.599, "text": "encode into tokens um to feed into n llm" }, { "start": 3655.24, "text": "like gpt2 so what exactly is this doing" }, { "start": 3659.039, "text": "well re. findall will take this pattern" }, { "start": 3661.039, "text": "and try to match it against a" }, { "start": 3662.839, "text": "string um the way this works is that you" }, { "start": 3666.119, "text": "are going from left to right in the" }, { "start": 3667.96, "text": "string and you're trying to match the" }, { "start": 3670.28, "text": "pattern and R.F find all will get all" }, { "start": 3673.799, "text": "the occurrences and organize them into a" }, { "start": 3676.319, "text": "list now when you look at the um when" }, { "start": 3679.16, "text": "you look at this pattern first of all" }, { "start": 3680.88, "text": "notice that this is a raw string um and" }, { "start": 3683.96, "text": "then these are three double quotes just" }, { "start": 3686.319, "text": "to start the string so really the string" }, { "start": 3688.839, "text": "itself this is the pattern itself" }, { "start": 3691.319, "text": "right and notice that it's made up of a" }, { "start": 3694.079, "text": "lot of ores so see these vertical bars" }, { "start": 3696.48, "text": "those are ores in reg X and so you go" }, { "start": 3700.2, "text": "from left to right in this pattern and" }, { "start": 3701.48, "text": "try to match it against the string" }, { "start": 3703.16, "text": "wherever you are so we have hello and" }, { "start": 3706.44, "text": "we're going to try to match it well it's" }, { "start": 3708.24, "text": "not apostrophe s it's not apostrophe t" }, { "start": 3710.799, "text": "or any of these but it is an optional" }, { "start": 3713.96, "text": "space followed by- P of uh sorry SL P of" }, { "start": 3718.119, "text": "L one or more times what is/ P of L it" }, { "start": 3722.319, "text": "is coming to some documentation that I" }, { "start": 3724.72, "text": "found um there might be other sources as" }, { "start": 3728.0, "text": "well uh SLP is a letter any kind of" }, { "start": 3731.599, "text": "letter from any language and hello is" }, { "start": 3735.039, "text": "made up of letters h e l Etc so optional" }, { "start": 3739.52, "text": "space followed by a bunch of letters one" }, { "start": 3741.559, "text": "or more letters is going to match hello" }, { "start": 3744.72, "text": "but then the match ends because a white" }, { "start": 3747.079, "text": "space is not a letter so from there on" }, { "start": 3751.079, "text": "begins a new sort of attempt to match" }, { "start": 3753.64, "text": "against the string again and starting in" }, { "start": 3756.44, "text": "here we're going to skip over all of" }, { "start": 3758.079, "text": "these again until we get to the exact" }, { "start": 3760.16, "text": "same Point again and we see that there's" }, { "start": 3762.319, "text": "an optional space this is the optional" }, { "start": 3764.279, "text": "space followed by a bunch of letters one" }, { "start": 3766.24, "text": "or more of them and so that matches so" }, { "start": 3768.72, "text": "when we run this we get a list of two" }, { "start": 3772.0, "text": "elements hello and then space world" }, { "start": 3775.72, "text": "so how are you if we add more letters we" }, { "start": 3778.88, "text": "would just get them like this now what" }, { "start": 3781.599, "text": "is this doing and why is this important" }, { "start": 3783.64, "text": "we are taking our string and instead of" }, { "start": 3785.92, "text": "directly encoding it um for" }, { "start": 3789.0, "text": "tokenization we are first splitting it" }, { "start": 3791.4, "text": "up and when you actually step through" }, { "start": 3793.48, "text": "the code and we'll do that in a bit more" }, { "start": 3795.319, "text": "detail what really is doing on a high" }, { "start": 3797.359, "text": "level is that it first splits your text" }, { "start": 3800.92, "text": "into a list of texts just like this one" }, { "start": 3804.64, "text": "and all these elements of this list are" }, { "start": 3806.559, "text": "processed independently by the tokenizer" }, { "start": 3809.279, "text": "and all of the results of that" }, { "start": 3810.76, "text": "processing are simply" }, { "start": 3812.279, "text": "concatenated so hello world oh I I" }, { "start": 3815.92, "text": "missed how hello world how are you we" }, { "start": 3819.64, "text": "have five elements of list all of these" }, { "start": 3821.599, "text": "will independent" }, { "start": 3824.4, "text": "independently go from text to a token" }, { "start": 3827.0, "text": "sequence and then that token sequence is" }, { "start": 3829.2, "text": "going to be concatenated it's all going" }, { "start": 3830.799, "text": "to be joined up and roughly speaking" }, { "start": 3834.359, "text": "what that does is you're only ever" }, { "start": 3836.119, "text": "finding merges between the elements of" }, { "start": 3838.44, "text": "this list so you can only ever consider" }, { "start": 3840.359, "text": "merges within every one of these" }, { "start": 3841.72, "text": "elements in" }, { "start": 3843.24, "text": "individually and um after you've done" }, { "start": 3846.319, "text": "all the possible merging for all of" }, { "start": 3847.92, "text": "these elements individually the results" }, { "start": 3849.88, "text": "of all that will be joined um by" }, { "start": 3853.64, "text": "concatenation and so you are basically" }, { "start": 3856.24, "text": "what what you're doing effectively is" }, { "start": 3858.4, "text": "you are never going to be merging this e" }, { "start": 3861.0, "text": "with this space because they are now" }, { "start": 3863.2, "text": "parts of the separate elements of this" }, { "start": 3865.079, "text": "list and so you are saying we are never" }, { "start": 3867.72, "text": "going to merge" }, { "start": 3868.92, "text": "eace um because we're breaking it up in" }, { "start": 3872.039, "text": "this way so basically using this regx" }, { "start": 3875.72, "text": "pattern to Chunk Up the text is just one" }, { "start": 3877.96, "text": "way of enforcing that some merges are" }, { "start": 3881.72, "text": "not to happen and we're going to go into" }, { "start": 3883.76, "text": "more of this text and we'll see that" }, { "start": 3885.2, "text": "what this is trying to do on a high" }, { "start": 3886.24, "text": "level is we're trying to not merge" }, { "start": 3888.0, "text": "across letters across numbers across" }, { "start": 3890.64, "text": "punctuation and so on so let's see in" }, { "start": 3893.2, "text": "more detail how that works so let's" }, { "start": 3894.72, "text": "continue now we have/ P ofn if you go to" }, { "start": 3898.0, "text": "the documentation SLP of n is any kind" }, { "start": 3901.839, "text": "of numeric character in any script so" }, { "start": 3904.44, "text": "it's numbers so we have an optional" }, { "start": 3906.599, "text": "space followed by numbers and those" }, { "start": 3908.119, "text": "would be separated out so letters and" }, { "start": 3910.359, "text": "numbers are being separated so if I do" }, { "start": 3912.559, "text": "Hello World 123 how are you then world" }, { "start": 3915.839, "text": "will stop matching here because one is" }, { "start": 3917.96, "text": "not a letter anymore but one is a number" }, { "start": 3920.64, "text": "so this group will match for that and" }, { "start": 3922.52, "text": "we'll get it as a separate entity" }, { "start": 3926.559, "text": "uh let's see how these apostrophes work" }, { "start": 3928.359, "text": "so here if we have" }, { "start": 3931.0, "text": "um uh Slash V or I mean apostrophe V as" }, { "start": 3935.079, "text": "an example then apostrophe here is not a" }, { "start": 3938.359, "text": "letter or a" }, { "start": 3939.52, "text": "number so hello will stop matching and" }, { "start": 3942.44, "text": "then we will exactly match this with" }, { "start": 3944.96, "text": "that so that will come out as a separate" }, { "start": 3948.2, "text": "thing so why are they doing the" }, { "start": 3950.24, "text": "apostrophes here honestly I think that" }, { "start": 3952.24, "text": "these are just like very common" }, { "start": 3953.599, "text": "apostrophes p uh that are used um" }, { "start": 3956.96, "text": "typically I don't love that they've done" }, { "start": 3959.359, "text": "this" }, { "start": 3960.599, "text": "because uh let me show you what happens" }, { "start": 3963.319, "text": "when you have uh some Unicode" }, { "start": 3965.44, "text": "apostrophes like for example you can" }, { "start": 3967.359, "text": "have if you have house then this will be" }, { "start": 3970.559, "text": "separated out because of this matching" }, { "start": 3973.039, "text": "but if you use the Unicode apostrophe" }, { "start": 3975.319, "text": "like" }, { "start": 3976.16, "text": "this then suddenly this does not work" }, { "start": 3979.839, "text": "and so this apostrophe will actually" }, { "start": 3981.559, "text": "become its own thing now and so so um" }, { "start": 3984.92, "text": "it's basically hardcoded for this" }, { "start": 3986.359, "text": "specific kind of apostrophe and uh" }, { "start": 3989.68, "text": "otherwise they become completely" }, { "start": 3991.319, "text": "separate tokens in addition to this you" }, { "start": 3994.039, "text": "can go to the gpt2 docs and here when" }, { "start": 3998.48, "text": "they Define the pattern they say should" }, { "start": 4000.2, "text": "have added re. ignore case so BP merges" }, { "start": 4003.0, "text": "can happen for capitalized versions of" }, { "start": 4004.559, "text": "contractions so what they're pointing" }, { "start": 4006.52, "text": "out is that you see how this is" }, { "start": 4007.72, "text": "apostrophe and then lowercase letters" }, { "start": 4010.839, "text": "well because they didn't do re. ignore" }, { "start": 4012.92, "text": "case then then um these rules will not" }, { "start": 4016.44, "text": "separate out the apostrophes if it's" }, { "start": 4018.88, "text": "uppercase so" }, { "start": 4021.44, "text": "house would be like this but if I did" }, { "start": 4026.64, "text": "house if I'm uppercase then notice" }, { "start": 4030.24, "text": "suddenly the apostrophe comes by" }, { "start": 4032.279, "text": "itself so the tokenization will work" }, { "start": 4035.48, "text": "differently in uppercase and lower case" }, { "start": 4037.44, "text": "inconsistently separating out these" }, { "start": 4039.039, "text": "apostrophes so it feels extremely gnarly" }, { "start": 4041.119, "text": "and slightly gross um but that's that's" }, { "start": 4044.52, "text": "how that works okay so let's come back" }, { "start": 4047.24, "text": "after trying to match a bunch of" }, { "start": 4048.44, "text": "apostrophe Expressions by the way the" }, { "start": 4050.279, "text": "other issue here is that these are quite" }, { "start": 4052.079, "text": "language specific probably so I don't" }, { "start": 4054.559, "text": "know that all the languages for example" }, { "start": 4055.799, "text": "use or don't use apostrophes but that" }, { "start": 4057.48, "text": "would be inconsistently tokenized as a" }, { "start": 4059.96, "text": "result then we try to match letters then" }, { "start": 4062.52, "text": "we try to match numbers and then if that" }, { "start": 4064.88, "text": "doesn't work we fall back to here and" }, { "start": 4067.559, "text": "what this is saying is again optional" }, { "start": 4069.16, "text": "space followed by something that is not" }, { "start": 4070.839, "text": "a letter number or a space in one or" }, { "start": 4073.96, "text": "more of that so what this is doing" }, { "start": 4075.799, "text": "effectively is this is trying to match" }, { "start": 4077.559, "text": "punctuation roughly speaking not letters" }, { "start": 4079.52, "text": "and not numbers so this group will try" }, { "start": 4082.279, "text": "to trigger for that so if I do something" }, { "start": 4084.2, "text": "like this then these parts here are not" }, { "start": 4088.48, "text": "letters or numbers but they will" }, { "start": 4089.96, "text": "actually they are uh they will actually" }, { "start": 4092.039, "text": "get caught here and so they become its" }, { "start": 4094.48, "text": "own group so we've separated out the" }, { "start": 4097.4, "text": "punctuation and finally this um this is" }, { "start": 4100.08, "text": "also a little bit confusing so this is" }, { "start": 4102.159, "text": "matching white space but this is using a" }, { "start": 4105.359, "text": "negative look ahead assertion in regex" }, { "start": 4109.04, "text": "so what this is doing is it's matching" }, { "start": 4110.92, "text": "wh space up to but not including the" }, { "start": 4113.279, "text": "last Whit space" }, { "start": 4115.0, "text": "character why is this important um this" }, { "start": 4117.92, "text": "is pretty subtle I think so you see how" }, { "start": 4120.279, "text": "the white space is always included at" }, { "start": 4121.719, "text": "the beginning of the word so um space r" }, { "start": 4125.52, "text": "space u Etc suppose we have a lot of" }, { "start": 4128.08, "text": "spaces" }, { "start": 4129.4, "text": "here what's going to happen here is that" }, { "start": 4132.359, "text": "these spaces up to not including the" }, { "start": 4134.6, "text": "last character will get caught by this" }, { "start": 4137.92, "text": "and what that will do is it will" }, { "start": 4139.719, "text": "separate out the spaces up to but not" }, { "start": 4141.88, "text": "including the last character so that the" }, { "start": 4143.679, "text": "last character can come here and join" }, { "start": 4145.92, "text": "with the um space you and the reason" }, { "start": 4149.239, "text": "that's nice is because space you is the" }, { "start": 4151.44, "text": "common token so if I didn't have these" }, { "start": 4153.799, "text": "Extra Spaces here you would just have" }, { "start": 4155.44, "text": "space you and if I add tokens if I add" }, { "start": 4158.159, "text": "spaces we still have a space view but" }, { "start": 4160.719, "text": "now we have all this extra white space" }, { "start": 4162.96, "text": "so basically the GB to tokenizer really" }, { "start": 4164.719, "text": "likes to have a space letters or numbers" }, { "start": 4167.44, "text": "um and it it preens these spaces and" }, { "start": 4170.44, "text": "this is just something that it is" }, { "start": 4171.4, "text": "consistent about so that's what that is" }, { "start": 4173.679, "text": "for and then finally we have all the the" }, { "start": 4176.4, "text": "last fallback is um whites space" }, { "start": 4178.64, "text": "characters uh so um that would be" }, { "start": 4182.719, "text": "just um if that doesn't get caught then" }, { "start": 4186.679, "text": "this thing will catch any trailing" }, { "start": 4188.52, "text": "spaces and so on I wanted to show one" }, { "start": 4190.759, "text": "more real world example here so if we" }, { "start": 4193.159, "text": "have this string which is a piece of" }, { "start": 4194.44, "text": "python code and then we try to split it" }, { "start": 4196.36, "text": "up then this is the kind of output we" }, { "start": 4198.4, "text": "get so you'll notice that the list has" }, { "start": 4200.56, "text": "many elements here and that's because we" }, { "start": 4202.48, "text": "are splitting up fairly often uh every" }, { "start": 4205.12, "text": "time sort of a category" }, { "start": 4207.12, "text": "changes um so there will never be any" }, { "start": 4209.36, "text": "merges Within These" }, { "start": 4210.96, "text": "elements and um that's what you are" }, { "start": 4213.48, "text": "seeing here now you might think that in" }, { "start": 4216.44, "text": "order to train the" }, { "start": 4217.76, "text": "tokenizer uh open AI has used this to" }, { "start": 4221.12, "text": "split up text into chunks and then run" }, { "start": 4223.88, "text": "just a BP algorithm within all the" }, { "start": 4225.8, "text": "chunks but that is not exactly what" }, { "start": 4227.96, "text": "happened and the reason is the following" }, { "start": 4230.28, "text": "notice that we have the spaces here uh" }, { "start": 4233.32, "text": "those Spaces end up being entire" }, { "start": 4235.44, "text": "elements but these spaces never actually" }, { "start": 4238.36, "text": "end up being merged by by open Ai and" }, { "start": 4240.64, "text": "the way you can tell is that if you copy" }, { "start": 4242.48, "text": "paste the exact same chunk here into Tik" }, { "start": 4244.199, "text": "token U Tik tokenizer you see that all" }, { "start": 4247.28, "text": "the spaces are kept independent and" }, { "start": 4249.28, "text": "they're all token" }, { "start": 4251.0, "text": "220 so I think opena at some point Point" }, { "start": 4253.84, "text": "en Force some rule that these spaces" }, { "start": 4256.04, "text": "would never be merged and so um there's" }, { "start": 4259.4, "text": "some additional rules on top of just" }, { "start": 4261.28, "text": "chunking and bpe that open ey is not uh" }, { "start": 4264.199, "text": "clear about now the training code for" }, { "start": 4266.32, "text": "the gpt2 tokenizer was never released so" }, { "start": 4268.679, "text": "all we have is uh the code that I've" }, { "start": 4270.8, "text": "already shown you but this code here" }, { "start": 4273.28, "text": "that they've released is only the" }, { "start": 4274.4, "text": "inference code for the tokens so this is" }, { "start": 4277.679, "text": "not the training code you can't give it" }, { "start": 4279.08, "text": "a piece of text and training tokenizer" }, { "start": 4281.52, "text": "this is just the inference code which" }, { "start": 4283.32, "text": "Tak takes the merges that we have up" }, { "start": 4285.6, "text": "above and applies them to a new piece of" }, { "start": 4288.32, "text": "text and so we don't know exactly how" }, { "start": 4290.56, "text": "opening ey trained um train the" }, { "start": 4292.48, "text": "tokenizer but it wasn't as simple as" }, { "start": 4294.64, "text": "chunk it up and BP it uh whatever it was" }, { "start": 4298.36, "text": "next I wanted to introduce you to the" }, { "start": 4300.239, "text": "Tik token library from openai which is" }, { "start": 4302.48, "text": "the official library for tokenization" }, { "start": 4304.8, "text": "from openai so this is Tik token bip" }, { "start": 4308.36, "text": "install P to Tik token and then um you" }, { "start": 4311.44, "text": "can do the tokenization in inference" }, { "start": 4314.36, "text": "this is again not training code this is" }, { "start": 4315.88, "text": "only inference code for" }, { "start": 4317.92, "text": "tokenization um I wanted to show you how" }, { "start": 4320.36, "text": "you would use it quite simple and" }, { "start": 4322.48, "text": "running this just gives us the gpt2" }, { "start": 4324.36, "text": "tokens or the GPT 4 tokens so this is" }, { "start": 4326.92, "text": "the tokenizer use for GPT 4 and so in" }, { "start": 4329.679, "text": "particular we see that the Whit space in" }, { "start": 4331.239, "text": "gpt2 remains unmerged but in GPT 4 uh" }, { "start": 4334.48, "text": "these Whit spaces merge as we also saw" }, { "start": 4337.32, "text": "in this one where here they're all" }, { "start": 4339.44, "text": "unmerged but if we go down to GPT 4 uh" }, { "start": 4342.639, "text": "they become merged" }, { "start": 4345.239, "text": "um now in the" }, { "start": 4347.76, "text": "gp4 uh tokenizer they changed the" }, { "start": 4351.04, "text": "regular expression that they use to" }, { "start": 4353.12, "text": "Chunk Up text so the way to see this is" }, { "start": 4355.639, "text": "that if you come to your the Tik token" }, { "start": 4358.0, "text": "uh library and then you go to this file" }, { "start": 4361.08, "text": "Tik token X openi public this is where" }, { "start": 4364.12, "text": "sort of like the definition of all these" }, { "start": 4365.639, "text": "different tokenizers that openi" }, { "start": 4366.96, "text": "maintains is and so uh necessarily to do" }, { "start": 4370.56, "text": "the inference they had to publish some" }, { "start": 4371.76, "text": "of the details about the strings" }, { "start": 4373.96, "text": "so this is the string that we already" }, { "start": 4375.36, "text": "saw for gpt2 it is slightly different" }, { "start": 4378.36, "text": "but it is actually equivalent uh to what" }, { "start": 4380.36, "text": "we discussed here so this pattern that" }, { "start": 4382.84, "text": "we discussed is equivalent to this" }, { "start": 4384.96, "text": "pattern this one just executes a little" }, { "start": 4387.0, "text": "bit faster so here you see a little bit" }, { "start": 4389.239, "text": "of a slightly different definition but" }, { "start": 4390.719, "text": "otherwise it's the same we're going to" }, { "start": 4392.719, "text": "go into special tokens in a bit and then" }, { "start": 4395.32, "text": "if you scroll down to CL 100k this is" }, { "start": 4398.6, "text": "the GPT 4 tokenizer you see that the" }, { "start": 4400.76, "text": "pattern has changed um and this is kind" }, { "start": 4403.96, "text": "of like the main the major change in" }, { "start": 4406.08, "text": "addition to a bunch of other special" }, { "start": 4407.36, "text": "tokens which I'll go into in a bit again" }, { "start": 4410.4, "text": "now some I'm not going to actually go" }, { "start": 4411.84, "text": "into the full detail of the pattern" }, { "start": 4413.28, "text": "change because honestly this is my" }, { "start": 4415.44, "text": "numbing uh I would just advise that you" }, { "start": 4417.44, "text": "pull out chat GPT and the regex" }, { "start": 4419.88, "text": "documentation and just step through it" }, { "start": 4422.159, "text": "but really the major changes are number" }, { "start": 4424.52, "text": "one you see this eye here that means" }, { "start": 4428.08, "text": "that the um case sensitivity this is" }, { "start": 4431.08, "text": "case insensitive match and so the" }, { "start": 4433.679, "text": "comment that we saw earlier on oh we" }, { "start": 4436.12, "text": "should have used re. uppercase uh" }, { "start": 4438.4, "text": "basically we're now going to be matching" }, { "start": 4441.8, "text": "these apostrophe s apostrophe D" }, { "start": 4444.6, "text": "apostrophe M Etc uh we're going to be" }, { "start": 4446.92, "text": "matching them both in lowercase and in" }, { "start": 4448.6, "text": "uppercase so that's fixed there's a" }, { "start": 4451.32, "text": "bunch of different like handling of the" }, { "start": 4452.76, "text": "whites space that I'm not going to go" }, { "start": 4454.08, "text": "into the full details of and then one" }, { "start": 4456.48, "text": "more thing here is you will notice that" }, { "start": 4458.639, "text": "when they match the numbers they only" }, { "start": 4460.679, "text": "match one to three numbers so so they" }, { "start": 4463.56, "text": "will never merge" }, { "start": 4466.12, "text": "numbers that are in low in more than" }, { "start": 4468.88, "text": "three digits only up to three digits of" }, { "start": 4471.159, "text": "numbers will ever be merged and uh" }, { "start": 4474.679, "text": "that's one change that they made as well" }, { "start": 4476.32, "text": "to prevent uh tokens that are very very" }, { "start": 4478.6, "text": "long number" }, { "start": 4480.0, "text": "sequences uh but again we don't really" }, { "start": 4482.08, "text": "know why they do any of this stuff uh" }, { "start": 4484.199, "text": "because none of this is documented and" }, { "start": 4486.28, "text": "uh it's just we just get the pattern so" }, { "start": 4489.52, "text": "um yeah it is what it is but those are" }, { "start": 4491.76, "text": "some of the changes that gp4 has made" }, { "start": 4494.36, "text": "and of course the vocabulary size went" }, { "start": 4496.36, "text": "from roughly 50k to roughly" }, { "start": 4498.4, "text": "100K the next thing I would like to do" }, { "start": 4500.4, "text": "very briefly is to take you through the" }, { "start": 4502.32, "text": "gpt2 encoder dopy that openi has" }, { "start": 4505.4, "text": "released uh this is the file that I" }, { "start": 4507.36, "text": "already mentioned to you briefly now" }, { "start": 4509.639, "text": "this file is uh fairly short and should" }, { "start": 4512.84, "text": "be relatively understandable to you at" }, { "start": 4514.639, "text": "this point um starting at the bottom" }, { "start": 4517.96, "text": "here they are loading two files encoder" }, { "start": 4521.48, "text": "Json and vocab bpe and they do some" }, { "start": 4524.159, "text": "light processing on it and then they" }, { "start": 4525.4, "text": "call this encoder object which is the" }, { "start": 4527.719, "text": "tokenizer now if you'd like to inspect" }, { "start": 4530.12, "text": "these two files which together" }, { "start": 4531.96, "text": "constitute their saved tokenizer then" }, { "start": 4534.56, "text": "you can do that with a piece of code" }, { "start": 4536.12, "text": "like" }, { "start": 4536.84, "text": "this um this is where you can download" }, { "start": 4539.32, "text": "these two files and you can inspect them" }, { "start": 4540.8, "text": "if you'd like and what you will find is" }, { "start": 4542.88, "text": "that this encoder as they call it in" }, { "start": 4545.08, "text": "their code is exactly equivalent to our" }, { "start": 4547.639, "text": "vocab so remember here where we have" }, { "start": 4551.8, "text": "this vocab object which allowed us us to" }, { "start": 4553.48, "text": "decode very efficiently and basically it" }, { "start": 4556.0, "text": "took us from the integer to the byes uh" }, { "start": 4560.12, "text": "for that integer so our vocab is exactly" }, { "start": 4563.32, "text": "their encoder and then their vocab bpe" }, { "start": 4567.76, "text": "confusingly is actually are merges so" }, { "start": 4571.159, "text": "their BP merges which is based on the" }, { "start": 4574.0, "text": "data inside vocab bpe ends up being" }, { "start": 4576.679, "text": "equivalent to our merges so uh basically" }, { "start": 4580.679, "text": "they are saving and loading the two uh" }, { "start": 4584.36, "text": "variables that for us are also critical" }, { "start": 4586.239, "text": "the merges variable and the vocab" }, { "start": 4588.32, "text": "variable using just these two variables" }, { "start": 4591.12, "text": "you can represent a tokenizer and you" }, { "start": 4592.56, "text": "can both do encoding and decoding once" }, { "start": 4594.52, "text": "you've trained this" }, { "start": 4596.0, "text": "tokenizer now the only thing that um is" }, { "start": 4600.0, "text": "actually slightly confusing inside what" }, { "start": 4602.56, "text": "opening ey does here is that in addition" }, { "start": 4604.52, "text": "to this encoder and a decoder they also" }, { "start": 4606.88, "text": "have something called a bite encoder and" }, { "start": 4608.52, "text": "a bite decoder and this is actually" }, { "start": 4611.28, "text": "unfortunately just" }, { "start": 4613.96, "text": "kind of a spirous implementation detail" }, { "start": 4615.88, "text": "and isn't actually deep or interesting" }, { "start": 4617.719, "text": "in any way so I'm going to skip the" }, { "start": 4619.08, "text": "discussion of it but what opening ey" }, { "start": 4621.04, "text": "does here for reasons that I don't fully" }, { "start": 4622.8, "text": "understand is that not only have they" }, { "start": 4625.0, "text": "this tokenizer which can encode and" }, { "start": 4626.44, "text": "decode but they have a whole separate" }, { "start": 4628.159, "text": "layer here in addition that is used" }, { "start": 4630.0, "text": "serially with the tokenizer and so you" }, { "start": 4632.639, "text": "first do um bite encode and then encode" }, { "start": 4636.08, "text": "and then you do decode and then bite" }, { "start": 4637.679, "text": "decode so that's the loop and they are" }, { "start": 4640.239, "text": "just stacked serial on top of each other" }, { "start": 4642.84, "text": "and and it's not that interesting so I" }, { "start": 4644.719, "text": "won't cover it and you can step through" }, { "start": 4645.96, "text": "it if you'd like otherwise this file if" }, { "start": 4648.639, "text": "you ignore the bite encoder and the bite" }, { "start": 4650.239, "text": "decoder will be algorithmically very" }, { "start": 4651.88, "text": "familiar with you and the meat of it" }, { "start": 4653.96, "text": "here is the what they call bpe function" }, { "start": 4657.04, "text": "and you should recognize this Loop here" }, { "start": 4659.639, "text": "which is very similar to our own y Loop" }, { "start": 4661.96, "text": "where they're trying to identify the" }, { "start": 4663.52, "text": "Byram uh a pair that they should be" }, { "start": 4666.96, "text": "merging next and then here just like we" }, { "start": 4669.159, "text": "had they have a for Loop trying to merge" }, { "start": 4670.96, "text": "this pair uh so they will go over all of" }, { "start": 4673.6, "text": "the sequence and they will merge the" }, { "start": 4675.12, "text": "pair whenever they find it and they keep" }, { "start": 4677.84, "text": "repeating that until they run out of" }, { "start": 4679.8, "text": "possible merges in the in the text so" }, { "start": 4682.36, "text": "that's the meat of this file and uh" }, { "start": 4684.56, "text": "there's an encode and a decode function" }, { "start": 4686.04, "text": "just like we have implemented it so long" }, { "start": 4688.159, "text": "story short what I want you to take away" }, { "start": 4689.719, "text": "at this point is that unfortunately it's" }, { "start": 4691.639, "text": "a little bit of a messy code that they" }, { "start": 4693.0, "text": "have but algorithmically it is identical" }, { "start": 4695.12, "text": "to what we've built up above and what" }, { "start": 4697.719, "text": "we've built up above if you understand" }, { "start": 4699.159, "text": "it is algorithmically what is necessary" }, { "start": 4701.32, "text": "to actually build a BP to organizer" }, { "start": 4703.719, "text": "train it and then both encode and decode" }, { "start": 4706.84, "text": "the next topic I would like to turn to" }, { "start": 4708.28, "text": "is that of special tokens so in addition" }, { "start": 4710.92, "text": "to tokens that are coming from you know" }, { "start": 4712.6, "text": "raw bytes and the BP merges we can" }, { "start": 4715.239, "text": "insert all kinds of tokens that we are" }, { "start": 4716.8, "text": "going to use to delimit different parts" }, { "start": 4718.96, "text": "of the data or introduced to create a" }, { "start": 4721.04, "text": "special structure of the token streams" }, { "start": 4724.8, "text": "so in uh if you look at this encoder" }, { "start": 4727.48, "text": "object from open AIS gpd2 right here we" }, { "start": 4730.88, "text": "mentioned this is very similar to our" }, { "start": 4732.159, "text": "vocab you'll notice that the length of" }, { "start": 4734.84, "text": "this is" }, { "start": 4738.88, "text": "50257 and as I mentioned it's mapping uh" }, { "start": 4741.84, "text": "and it's inverted from the mapping of" }, { "start": 4743.36, "text": "our vocab our vocab goes from integer to" }, { "start": 4746.12, "text": "string and they go the other way around" }, { "start": 4748.08, "text": "for no amazing reason um but the thing" }, { "start": 4751.84, "text": "to note here is that this the mapping" }, { "start": 4753.28, "text": "table here is" }, { "start": 4755.0, "text": "50257 where does that number come from" }, { "start": 4758.6, "text": "where what are the tokens as I mentioned" }, { "start": 4760.8, "text": "there are 256 raw bite token" }, { "start": 4764.4, "text": "tokens and then opena actually did" }, { "start": 4767.199, "text": "50,000" }, { "start": 4768.639, "text": "merges so those become the other tokens" }, { "start": 4772.0, "text": "but this would have been" }, { "start": 4774.04, "text": "50256 so what is the 57th token and" }, { "start": 4777.679, "text": "there is basically one special" }, { "start": 4780.52, "text": "token and that one special token you can" }, { "start": 4783.239, "text": "see is called end of text so this is a" }, { "start": 4787.04, "text": "special token and it's the very last" }, { "start": 4789.56, "text": "token and this token is used to delimit" }, { "start": 4792.48, "text": "documents ments in the training set so" }, { "start": 4795.76, "text": "when we're creating the training data we" }, { "start": 4797.32, "text": "have all these documents and we tokenize" }, { "start": 4799.199, "text": "them and we get a stream of tokens those" }, { "start": 4801.8, "text": "tokens only range from Z to" }, { "start": 4805.28, "text": "50256 and then in between those" }, { "start": 4807.4, "text": "documents we put special end of text" }, { "start": 4810.4, "text": "token and we insert that token in" }, { "start": 4812.8, "text": "between documents and we are using this" }, { "start": 4815.639, "text": "as a signal to the language model that" }, { "start": 4818.4, "text": "the document has ended and what follows" }, { "start": 4820.719, "text": "is going to be unrelated to the document" }, { "start": 4823.28, "text": "previously that said the language model" }, { "start": 4825.199, "text": "has to learn this from data it it needs" }, { "start": 4827.199, "text": "to learn that this token usually means" }, { "start": 4829.719, "text": "that it should wipe its sort of memory" }, { "start": 4831.92, "text": "of what came before and what came before" }, { "start": 4834.04, "text": "this token is not actually informative" }, { "start": 4835.56, "text": "to what comes next but we are expecting" }, { "start": 4837.56, "text": "the language model to just like learn" }, { "start": 4839.0, "text": "this but we're giving it the Special" }, { "start": 4840.92, "text": "sort of the limiter of these documents" }, { "start": 4844.08, "text": "we can go here to Tech tokenizer and um" }, { "start": 4846.679, "text": "this the gpt2 tokenizer uh our code that" }, { "start": 4849.48, "text": "we've been playing with before so we can" }, { "start": 4851.44, "text": "add here right hello world world how are" }, { "start": 4853.679, "text": "you and we're getting different tokens" }, { "start": 4855.84, "text": "but now you can see what if what happens" }, { "start": 4858.239, "text": "if I put end of text you see how until I" }, { "start": 4862.199, "text": "finished it these are all different" }, { "start": 4863.92, "text": "tokens end of" }, { "start": 4866.36, "text": "text still set different tokens and now" }, { "start": 4868.8, "text": "when I finish it suddenly we get token" }, { "start": 4873.28, "text": "50256 and the reason this works is" }, { "start": 4875.88, "text": "because this didn't actually go through" }, { "start": 4878.239, "text": "the bpe merges instead the code that" }, { "start": 4881.92, "text": "actually outposted tokens has special" }, { "start": 4885.0, "text": "case instructions for handling special" }, { "start": 4888.04, "text": "tokens um we did not see these special" }, { "start": 4890.76, "text": "instructions for handling special tokens" }, { "start": 4892.84, "text": "in the encoder dopy it's absent there" }, { "start": 4896.36, "text": "but if you go to Tech token Library" }, { "start": 4898.0, "text": "which is uh implemented in Rust you will" }, { "start": 4900.92, "text": "find all kinds of special case handling" }, { "start": 4902.639, "text": "for these special tokens that you can" }, { "start": 4904.52, "text": "register uh create adds to the" }, { "start": 4907.12, "text": "vocabulary and then it looks for them" }, { "start": 4909.0, "text": "and it uh whenever it sees these special" }, { "start": 4910.92, "text": "tokens like this it will actually come" }, { "start": 4913.44, "text": "in and swap in that special token so" }, { "start": 4916.08, "text": "these things are outside of the typical" }, { "start": 4918.12, "text": "algorithm of uh B PA en" }, { "start": 4920.56, "text": "coding so these special tokens are used" }, { "start": 4922.92, "text": "pervasively uh not just in uh basically" }, { "start": 4925.639, "text": "base language modeling of predicting the" }, { "start": 4927.4, "text": "next token in the sequence but" }, { "start": 4929.08, "text": "especially when it gets to later to the" }, { "start": 4930.679, "text": "fine tuning stage and all of the chat uh" }, { "start": 4933.239, "text": "gbt sort of aspects of it uh because we" }, { "start": 4935.679, "text": "don't just want to Del limit documents" }, { "start": 4936.88, "text": "we want to delimit entire conversations" }, { "start": 4938.719, "text": "between an assistant and a user so if I" }, { "start": 4941.56, "text": "refresh this sck tokenizer page the" }, { "start": 4944.239, "text": "default example that they have here is" }, { "start": 4946.44, "text": "using not sort of base model encoders" }, { "start": 4950.12, "text": "but ftuned model uh sort of tokenizers" }, { "start": 4953.6, "text": "um so for example using the GPT 3.5" }, { "start": 4955.84, "text": "turbo scheme these here are all special" }, { "start": 4958.96, "text": "tokens I am start I end Etc uh this is" }, { "start": 4963.239, "text": "short for Imaginary mcore start by the" }, { "start": 4966.84, "text": "way but you can see here that there's a" }, { "start": 4969.6, "text": "sort of start and end of every single" }, { "start": 4971.199, "text": "message and there can be many other" }, { "start": 4972.56, "text": "other tokens lots of tokens um in use to" }, { "start": 4976.52, "text": "delimit these conversations and kind of" }, { "start": 4978.719, "text": "keep track of the flow of the messages" }, { "start": 4980.84, "text": "here now we can go back to the Tik token" }, { "start": 4983.8, "text": "library and here when you scroll to the" }, { "start": 4986.239, "text": "bottom they talk about how you can" }, { "start": 4988.159, "text": "extend tick token and I can you can" }, { "start": 4990.239, "text": "create basically you can Fork uh the um" }, { "start": 4993.679, "text": "CL 100K base tokenizers in gp4 and for" }, { "start": 4997.32, "text": "example you can extend it by adding more" }, { "start": 4998.92, "text": "special tokens and these are totally up" }, { "start": 5000.36, "text": "to you you can come up with any" }, { "start": 5001.36, "text": "arbitrary tokens and add them with the" }, { "start": 5003.76, "text": "new ID afterwards and the tikken library" }, { "start": 5006.52, "text": "will uh correctly swap them out uh when" }, { "start": 5009.88, "text": "it sees this in the" }, { "start": 5011.76, "text": "strings now we can also go back to this" }, { "start": 5014.96, "text": "file which we've looked at previously" }, { "start": 5017.08, "text": "and I mentioned that the gpt2 in Tik" }, { "start": 5019.679, "text": "toen open" }, { "start": 5021.44, "text": "I.P we have the vocabulary we have the" }, { "start": 5024.0, "text": "pattern for splitting and then here we" }, { "start": 5026.28, "text": "are registering the single special token" }, { "start": 5028.04, "text": "in gpd2 which was the end of text token" }, { "start": 5030.32, "text": "and we saw that it has this ID" }, { "start": 5033.0, "text": "in GPT 4 when they defy this here you" }, { "start": 5036.4, "text": "see that the pattern has changed as" }, { "start": 5037.6, "text": "we've discussed but also the special" }, { "start": 5039.36, "text": "tokens have changed in this tokenizer so" }, { "start": 5041.8, "text": "we of course have the end of text just" }, { "start": 5043.719, "text": "like in gpd2 but we also see three sorry" }, { "start": 5046.88, "text": "four additional tokens here Thim prefix" }, { "start": 5049.52, "text": "middle and suffix what is fim fim is" }, { "start": 5052.36, "text": "short for fill in the middle and if" }, { "start": 5054.88, "text": "you'd like to learn more about this idea" }, { "start": 5057.0, "text": "it comes from this paper um and I'm not" }, { "start": 5060.0, "text": "going to go into detail in this video" }, { "start": 5061.199, "text": "it's beyond this video and then there's" }, { "start": 5063.44, "text": "one additional uh serve token here so" }, { "start": 5067.04, "text": "that's that encoding as well so it's" }, { "start": 5069.92, "text": "very common basically to train a" }, { "start": 5071.6, "text": "language model and then if you'd like uh" }, { "start": 5074.719, "text": "you can add special tokens now when you" }, { "start": 5077.52, "text": "add special tokens you of course have to" }, { "start": 5079.8, "text": "um do some model surgery to the" }, { "start": 5081.719, "text": "Transformer and all the parameters" }, { "start": 5083.44, "text": "involved in that Transformer because you" }, { "start": 5085.159, "text": "are basically adding an integer and you" }, { "start": 5087.119, "text": "want to make sure that for example your" }, { "start": 5088.56, "text": "embedding Matrix for the vocabulary" }, { "start": 5090.639, "text": "tokens has to be extended by adding a" }, { "start": 5093.04, "text": "row and typically this row would be" }, { "start": 5094.88, "text": "initialized uh with small random numbers" }, { "start": 5096.88, "text": "or something like that because we need" }, { "start": 5098.8, "text": "to have a vector that now stands for" }, { "start": 5101.199, "text": "that token in addition to that you have" }, { "start": 5103.28, "text": "to go to the final layer of the" }, { "start": 5104.28, "text": "Transformer and you have to make sure" }, { "start": 5105.679, "text": "that that projection at the very end" }, { "start": 5107.52, "text": "into the classifier uh is extended by" }, { "start": 5109.679, "text": "one as well so basically there's some" }, { "start": 5111.8, "text": "model surgery involved that you have to" }, { "start": 5113.48, "text": "couple with the tokenization changes if" }, { "start": 5116.52, "text": "you are going to add special tokens but" }, { "start": 5118.92, "text": "this is a very common operation that" }, { "start": 5120.199, "text": "people do especially if they'd like to" }, { "start": 5121.8, "text": "fine tune the model for example taking" }, { "start": 5123.719, "text": "it from a base model to a chat model" }, { "start": 5126.239, "text": "like chat" }, { "start": 5127.88, "text": "GPT okay so at this point you should" }, { "start": 5129.84, "text": "have everything you need in order to" }, { "start": 5131.04, "text": "build your own gp4 tokenizer now in the" }, { "start": 5133.719, "text": "process of developing this lecture I've" }, { "start": 5135.36, "text": "done that and I published the code under" }, { "start": 5137.239, "text": "this repository" }, { "start": 5138.92, "text": "MBP so MBP looks like this right now as" }, { "start": 5142.52, "text": "I'm recording but uh the MBP repository" }, { "start": 5145.36, "text": "will probably change quite a bit because" }, { "start": 5146.719, "text": "I intend to continue working on it um in" }, { "start": 5149.84, "text": "addition to the MBP repository I've" }, { "start": 5151.76, "text": "published the this uh exercise" }, { "start": 5153.44, "text": "progression that you can follow so if" }, { "start": 5155.36, "text": "you go to exercise. MD here uh this is" }, { "start": 5158.36, "text": "sort of me breaking up the task ahead of" }, { "start": 5161.159, "text": "you into four steps that sort of uh" }, { "start": 5163.4, "text": "build up to what can be a gp4 tokenizer" }, { "start": 5166.639, "text": "and so feel free to follow these steps" }, { "start": 5168.4, "text": "exactly and follow a little bit of the" }, { "start": 5170.4, "text": "guidance that I've laid out here and" }, { "start": 5172.48, "text": "anytime you feel stuck just reference" }, { "start": 5174.639, "text": "the MBP repository here so either the" }, { "start": 5177.96, "text": "tests could be useful or the MBP" }, { "start": 5180.08, "text": "repository itself I try to keep the code" }, { "start": 5182.6, "text": "fairly clean and understandable and so" }, { "start": 5186.159, "text": "um feel free to reference it whenever um" }, { "start": 5188.92, "text": "you get" }, { "start": 5190.159, "text": "stuck uh in addition to that basically" }, { "start": 5192.56, "text": "once you write it you should be able to" }, { "start": 5194.679, "text": "reproduce this behavior from Tech token" }, { "start": 5196.84, "text": "so getting the gb4 tokenizer you can" }, { "start": 5199.32, "text": "take uh you can encode the string and" }, { "start": 5201.32, "text": "you should get these tokens and then you" }, { "start": 5203.239, "text": "can encode and decode the exact same" }, { "start": 5204.679, "text": "string to recover it and in addition to" }, { "start": 5207.239, "text": "all that you should be able to implement" }, { "start": 5208.4, "text": "your own train function uh which Tik" }, { "start": 5210.719, "text": "token Library does not provide it's it's" }, { "start": 5212.48, "text": "again only inference code but you could" }, { "start": 5214.6, "text": "write your own train MBP does it as well" }, { "start": 5217.88, "text": "and that will allow you to train your" }, { "start": 5219.32, "text": "own token" }, { "start": 5220.719, "text": "vocabularies so here are some of the" }, { "start": 5222.4, "text": "code inside M be mean bpe uh shows the" }, { "start": 5226.04, "text": "token vocabularies that you might obtain" }, { "start": 5228.719, "text": "so on the left uh here we have the GPT 4" }, { "start": 5232.4, "text": "merges uh so the first 256 are raw" }, { "start": 5235.84, "text": "individual bytes and then here I am" }, { "start": 5237.719, "text": "visualizing the merges that gp4" }, { "start": 5239.56, "text": "performed during its training so the" }, { "start": 5241.76, "text": "very first merge that gp4 did was merge" }, { "start": 5244.92, "text": "two spaces into a single token for you" }, { "start": 5247.6, "text": "know two spaces and that is a token 256" }, { "start": 5250.84, "text": "and so this is the order in which things" }, { "start": 5252.239, "text": "merged during gb4 training and this is" }, { "start": 5254.679, "text": "the merge order that um we obtain in MBP" }, { "start": 5259.08, "text": "by training a tokenizer and in this case" }, { "start": 5261.199, "text": "I trained it on a Wikipedia page of" }, { "start": 5263.239, "text": "Taylor Swift uh not because I'm a Swifty" }, { "start": 5265.6, "text": "but because that is one of the longest" }, { "start": 5267.8, "text": "um Wikipedia Pages apparently that's" }, { "start": 5269.639, "text": "available but she is pretty cool and" }, { "start": 5274.04, "text": "um what was I going to say yeah so you" }, { "start": 5276.639, "text": "can compare these two uh vocabularies" }, { "start": 5279.08, "text": "and so as an example um here GPT for" }, { "start": 5284.0, "text": "merged I in to become in and we've done" }, { "start": 5286.8, "text": "the exact same thing on this token 259" }, { "start": 5290.0, "text": "here space t becomes space t and that" }, { "start": 5293.28, "text": "happened for us a little bit later as" }, { "start": 5294.639, "text": "well so the difference here is again to" }, { "start": 5296.719, "text": "my understanding only a difference of" }, { "start": 5298.4, "text": "the training set so as an example" }, { "start": 5300.28, "text": "because I see a lot of white space I" }, { "start": 5302.08, "text": "supect that gp4 probably had a lot of" }, { "start": 5303.76, "text": "python code in its training set I'm not" }, { "start": 5305.48, "text": "sure uh for the" }, { "start": 5307.6, "text": "tokenizer and uh here we see much less" }, { "start": 5310.08, "text": "of that of course in the Wikipedia page" }, { "start": 5312.96, "text": "so roughly speaking they look the same" }, { "start": 5314.679, "text": "and they look the same because they're" }, { "start": 5315.96, "text": "running the same algorithm and when you" }, { "start": 5318.08, "text": "train your own you're probably going to" }, { "start": 5319.199, "text": "get something similar depending on what" }, { "start": 5321.199, "text": "you train it on okay so we are now going" }, { "start": 5323.28, "text": "to move on from tick token and the way" }, { "start": 5325.08, "text": "that open AI tokenizes its strings and" }, { "start": 5327.6, "text": "we're going to discuss one more very" }, { "start": 5329.199, "text": "commonly used library for working with" }, { "start": 5331.0, "text": "tokenization inlm" }, { "start": 5332.719, "text": "and that is sentence piece so sentence" }, { "start": 5335.36, "text": "piece is very commonly used in language" }, { "start": 5338.159, "text": "models because unlike Tik token it can" }, { "start": 5340.119, "text": "do both training and inference and is" }, { "start": 5342.36, "text": "quite efficient at both it supports a" }, { "start": 5344.84, "text": "number of algorithms for training uh" }, { "start": 5346.76, "text": "vocabularies but one of them is the B" }, { "start": 5349.199, "text": "pair en coding algorithm that we've been" }, { "start": 5350.44, "text": "looking at so it supports it now" }, { "start": 5353.639, "text": "sentence piece is used both by llama and" }, { "start": 5355.719, "text": "mistal series and many other models as" }, { "start": 5358.199, "text": "well it is on GitHub under Google" }, { "start": 5360.76, "text": "sentence piece" }, { "start": 5362.76, "text": "and the big difference with sentence" }, { "start": 5364.4, "text": "piece and we're going to look at example" }, { "start": 5366.199, "text": "because this is kind of hard and subtle" }, { "start": 5367.92, "text": "to explain is that they think different" }, { "start": 5371.04, "text": "about the order of operations here so in" }, { "start": 5375.48, "text": "the case of Tik token we first take our" }, { "start": 5378.56, "text": "code points in the string we encode them" }, { "start": 5381.0, "text": "using mutf to bytes and then we're" }, { "start": 5382.88, "text": "merging bytes it's fairly" }, { "start": 5384.96, "text": "straightforward for sentence piece um it" }, { "start": 5388.88, "text": "works directly on the level of the code" }, { "start": 5390.4, "text": "points themselves so so it looks at" }, { "start": 5392.52, "text": "whatever code points are available in" }, { "start": 5393.92, "text": "your training set and then it starts" }, { "start": 5395.88, "text": "merging those code points and um the bpe" }, { "start": 5399.76, "text": "is running on the level of code" }, { "start": 5401.6, "text": "points and if you happen to run out of" }, { "start": 5404.239, "text": "code points so there are maybe some rare" }, { "start": 5406.76, "text": "uh code points that just don't come up" }, { "start": 5408.04, "text": "too often and the Rarity is determined" }, { "start": 5409.719, "text": "by this character coverage hyper" }, { "start": 5411.199, "text": "parameter then these uh code points will" }, { "start": 5414.36, "text": "either get mapped to a special unknown" }, { "start": 5416.28, "text": "token like ank or if you have the bite" }, { "start": 5419.52, "text": "foldback option turned on then that will" }, { "start": 5422.119, "text": "take those rare Cod points it will" }, { "start": 5423.96, "text": "encode them using utf8 and then the" }, { "start": 5426.08, "text": "individual bytes of that encoding will" }, { "start": 5427.76, "text": "be translated into tokens and there are" }, { "start": 5430.119, "text": "these special bite tokens that basically" }, { "start": 5432.199, "text": "get added to the vocabulary so it uses" }, { "start": 5435.52, "text": "BP on on the code points and then it" }, { "start": 5438.239, "text": "falls back to bytes for rare Cod points" }, { "start": 5441.8, "text": "um and so that's kind of like difference" }, { "start": 5444.08, "text": "personally I find the Tik token we" }, { "start": 5445.52, "text": "significantly cleaner uh but it's kind" }, { "start": 5447.48, "text": "of like a subtle but pretty major" }, { "start": 5448.84, "text": "difference between the way they approach" }, { "start": 5450.32, "text": "tokenization let's work with with a" }, { "start": 5452.04, "text": "concrete example because otherwise this" }, { "start": 5454.0, "text": "is kind of hard to um to get your head" }, { "start": 5456.719, "text": "around so let's work with a concrete" }, { "start": 5459.119, "text": "example this is how we can import" }, { "start": 5461.119, "text": "sentence piece and then here we're going" }, { "start": 5463.6, "text": "to take I think I took like the" }, { "start": 5465.199, "text": "description of sentence piece and I just" }, { "start": 5466.76, "text": "created like a little toy data set it" }, { "start": 5468.679, "text": "really likes to have a file so I created" }, { "start": 5470.4, "text": "a toy. txt file with this" }, { "start": 5473.08, "text": "content now what's kind of a little bit" }, { "start": 5475.52, "text": "crazy about sentence piece is that" }, { "start": 5476.76, "text": "there's a ton of options and" }, { "start": 5478.679, "text": "configurations and the reason this is so" }, { "start": 5480.8, "text": "is because sentence piece has been" }, { "start": 5482.199, "text": "around I think for a while and it really" }, { "start": 5483.84, "text": "tries to handle a large diversity of" }, { "start": 5485.76, "text": "things and um because it's been around I" }, { "start": 5488.44, "text": "think it has quite a bit of accumulated" }, { "start": 5490.52, "text": "historical baggage uh as well and so in" }, { "start": 5493.679, "text": "particular there's like a ton of" }, { "start": 5495.56, "text": "configuration arguments this is not even" }, { "start": 5496.96, "text": "all of it you can go to here to see all" }, { "start": 5499.8, "text": "the training" }, { "start": 5500.96, "text": "options um and uh there's also quite" }, { "start": 5504.4, "text": "useful documentation when you look at" }, { "start": 5505.719, "text": "the raw Proto buff uh that is used to" }, { "start": 5508.6, "text": "represent the trainer spec and so on um" }, { "start": 5512.44, "text": "many of these options are irrelevant to" }, { "start": 5514.52, "text": "us so maybe to point out one example Das" }, { "start": 5516.96, "text": "Das shrinking Factor uh this shrinking" }, { "start": 5519.84, "text": "factor is not used in the B pair en" }, { "start": 5521.28, "text": "coding algorithm so this is just an" }, { "start": 5523.159, "text": "argument that is irrelevant to us um it" }, { "start": 5525.92, "text": "applies to a different training" }, { "start": 5529.52, "text": "algorithm now what I tried to do here is" }, { "start": 5531.92, "text": "I tried to set up sentence piece in a" }, { "start": 5533.88, "text": "way that is very very similar as far as" }, { "start": 5535.719, "text": "I can tell to maybe identical hopefully" }, { "start": 5538.88, "text": "to the way that llama 2 was strained so" }, { "start": 5542.08, "text": "the way they trained their own um their" }, { "start": 5545.04, "text": "own tokenizer and the way I did this was" }, { "start": 5547.119, "text": "basically you can take the tokenizer" }, { "start": 5548.719, "text": "model file that meta released and you" }, { "start": 5551.4, "text": "can um open it using the Proto protuff" }, { "start": 5555.199, "text": "uh sort of file that you can generate" }, { "start": 5558.36, "text": "and then you can inspect all the options" }, { "start": 5559.719, "text": "and I tried to copy over all the options" }, { "start": 5561.36, "text": "that looked relevant so here we set up" }, { "start": 5563.679, "text": "the input it's raw text in this file" }, { "start": 5566.6, "text": "here's going to be the output so it's" }, { "start": 5568.08, "text": "going to be for talk 400. model and" }, { "start": 5570.76, "text": "vocab" }, { "start": 5572.44, "text": "we're saying that we're going to use the" }, { "start": 5573.4, "text": "BP algorithm and we want to Bap size of" }, { "start": 5576.04, "text": "400 then there's a ton of configurations" }, { "start": 5578.6, "text": "here" }, { "start": 5581.08, "text": "for um for basically pre-processing and" }, { "start": 5585.08, "text": "normalization rules as they're called" }, { "start": 5587.08, "text": "normalization used to be very prevalent" }, { "start": 5589.48, "text": "I would say before llms in natural" }, { "start": 5591.159, "text": "language processing so in machine" }, { "start": 5592.8, "text": "translation and uh text classification" }, { "start": 5594.88, "text": "and so on you want to normalize and" }, { "start": 5596.719, "text": "simplify the text and you want to turn" }, { "start": 5598.0, "text": "it all lowercase and you want to remove" }, { "start": 5599.52, "text": "all double whites space Etc" }, { "start": 5602.199, "text": "and in language models we prefer not to" }, { "start": 5603.76, "text": "do any of it or at least that is my" }, { "start": 5605.28, "text": "preference as a deep learning person you" }, { "start": 5606.96, "text": "want to not touch your data you want to" }, { "start": 5608.84, "text": "keep the raw data as much as possible um" }, { "start": 5611.679, "text": "in a raw" }, { "start": 5613.119, "text": "form so you're basically trying to turn" }, { "start": 5615.159, "text": "off a lot of this if you can the other" }, { "start": 5618.0, "text": "thing that sentence piece does is that" }, { "start": 5619.52, "text": "it has this concept of sentences so" }, { "start": 5623.04, "text": "sentence piece it's back it's kind of" }, { "start": 5625.48, "text": "like was developed I think early in the" }, { "start": 5626.84, "text": "days where there was um an idea that" }, { "start": 5630.159, "text": "they you're training a tokenizer on a" }, { "start": 5631.96, "text": "bunch of independent sentences so it has" }, { "start": 5634.199, "text": "a lot of like how many sentences you're" }, { "start": 5636.36, "text": "going to train on what is the maximum" }, { "start": 5638.0, "text": "sentence length" }, { "start": 5640.679, "text": "um shuffling sentences and so for it" }, { "start": 5643.719, "text": "sentences are kind of like the" }, { "start": 5644.8, "text": "individual training examples but again" }, { "start": 5646.88, "text": "in the context of llms I find that this" }, { "start": 5648.719, "text": "is like a very spous and weird" }, { "start": 5650.44, "text": "distinction like sentences are just like" }, { "start": 5653.92, "text": "don't touch the raw data sentences" }, { "start": 5655.6, "text": "happen to exist but in raw data sets" }, { "start": 5658.679, "text": "there are a lot of like inet like what" }, { "start": 5660.6, "text": "exactly is a sentence what isn't a" }, { "start": 5662.44, "text": "sentence um and so I think like it's" }, { "start": 5665.0, "text": "really hard to Define what an actual" }, { "start": 5666.48, "text": "sentence is if you really like dig into" }, { "start": 5668.639, "text": "it and there could be different concepts" }, { "start": 5670.92, "text": "of it in different languages or" }, { "start": 5672.119, "text": "something like that so why even" }, { "start": 5673.719, "text": "introduce the concept it it doesn't" }, { "start": 5675.56, "text": "honestly make sense to me I would just" }, { "start": 5676.92, "text": "prefer to treat a file as a giant uh" }, { "start": 5679.199, "text": "stream of" }, { "start": 5680.36, "text": "bytes it has a lot of treatment around" }, { "start": 5682.8, "text": "rare word characters and when I say word" }, { "start": 5685.119, "text": "I mean code points we're going to come" }, { "start": 5686.48, "text": "back to this in a second and it has a" }, { "start": 5688.679, "text": "lot of other rules for um basically" }, { "start": 5691.679, "text": "splitting digits splitting white space" }, { "start": 5694.48, "text": "and numbers and how you deal with that" }, { "start": 5696.56, "text": "so these are some kind of like merge" }, { "start": 5698.199, "text": "rules so I think this is a little bit" }, { "start": 5700.08, "text": "equivalent to tick token using the" }, { "start": 5702.92, "text": "regular expression to split up" }, { "start": 5704.52, "text": "categories there's like kind of" }, { "start": 5707.04, "text": "equivalence of it if you squint T it in" }, { "start": 5709.239, "text": "sentence piece where you can also for" }, { "start": 5710.639, "text": "example split up split up the digits uh" }, { "start": 5714.199, "text": "and uh so" }, { "start": 5715.84, "text": "on there's a few more things here that" }, { "start": 5718.199, "text": "I'll come back to in a bit and then" }, { "start": 5719.36, "text": "there are some special tokens that you" }, { "start": 5720.48, "text": "can indicate and it hardcodes the UN" }, { "start": 5723.36, "text": "token the beginning of sentence end of" }, { "start": 5725.56, "text": "sentence and a pad token um and the UN" }, { "start": 5729.32, "text": "token must exist for my understanding" }, { "start": 5732.52, "text": "and then some some things so we can" }, { "start": 5734.719, "text": "train and when when I press train it's" }, { "start": 5737.28, "text": "going to create this file talk 400." }, { "start": 5740.119, "text": "model and talk 400. wab I can then load" }, { "start": 5743.159, "text": "the model file and I can inspect the" }, { "start": 5745.56, "text": "vocabulary off it and so we trained" }, { "start": 5748.56, "text": "vocab size 400 on this text here and" }, { "start": 5753.32, "text": "these are the individual pieces the" }, { "start": 5755.0, "text": "individual tokens that sentence piece" }, { "start": 5756.88, "text": "will create so in the beginning we see" }, { "start": 5758.8, "text": "that we have the an token uh with the ID" }, { "start": 5762.08, "text": "zero then we have the beginning of" }, { "start": 5764.04, "text": "sequence end of sequence one and two and" }, { "start": 5767.8, "text": "then we said that the pad ID is negative" }, { "start": 5769.32, "text": "1 so we chose not to use it so there's" }, { "start": 5772.08, "text": "no pad ID" }, { "start": 5773.48, "text": "here then these are individual bite" }, { "start": 5776.84, "text": "tokens so here we saw that bite fallback" }, { "start": 5780.159, "text": "in llama was turned on so it's true so" }, { "start": 5783.56, "text": "what follows are going to be the 256" }, { "start": 5786.159, "text": "bite" }, { "start": 5787.199, "text": "tokens and these are their" }, { "start": 5791.719, "text": "IDs and then at the bottom after the" }, { "start": 5795.04, "text": "bite tokens come the" }, { "start": 5797.679, "text": "merges and these are the parent nodes in" }, { "start": 5800.56, "text": "the merges so we're not seeing the" }, { "start": 5802.199, "text": "children we're just seeing the parents" }, { "start": 5803.719, "text": "and their" }, { "start": 5804.6, "text": "ID and then after the" }, { "start": 5807.04, "text": "merges comes eventually the individual" }, { "start": 5810.719, "text": "tokens and their IDs and so these are" }, { "start": 5813.56, "text": "the individual tokens so these are the" }, { "start": 5815.32, "text": "individual code Point tokens if you will" }, { "start": 5818.239, "text": "and they come at the end so that is the" }, { "start": 5820.28, "text": "ordering with which sentence piece sort" }, { "start": 5821.76, "text": "of like represents its vocabularies it" }, { "start": 5823.92, "text": "starts with special tokens then the bike" }, { "start": 5826.119, "text": "tokens then the merge tokens and then" }, { "start": 5828.159, "text": "the individual codo tokens and all these" }, { "start": 5831.639, "text": "raw codepoint to tokens are the ones" }, { "start": 5834.04, "text": "that it encountered in the training" }, { "start": 5836.119, "text": "set so those individual code points are" }, { "start": 5839.8, "text": "all the the entire set of code points" }, { "start": 5842.159, "text": "that occurred" }, { "start": 5844.4, "text": "here so those all get put in there and" }, { "start": 5847.48, "text": "then those that are extremely rare as" }, { "start": 5849.28, "text": "determined by character coverage so if a" }, { "start": 5851.119, "text": "code Point occurred only a single time" }, { "start": 5852.52, "text": "out of like a million um sentences or" }, { "start": 5855.159, "text": "something like that then it would be" }, { "start": 5857.08, "text": "ignored and it would not be added to our" }, { "start": 5860.199, "text": "uh" }, { "start": 5861.04, "text": "vocabulary once we have a vocabulary we" }, { "start": 5863.36, "text": "can encode into IDs and we can um sort" }, { "start": 5866.48, "text": "of get a" }, { "start": 5867.4, "text": "list and then here I am also decoding" }, { "start": 5870.679, "text": "the indiv idual tokens back into little" }, { "start": 5874.32, "text": "pieces as they call it so let's take a" }, { "start": 5876.96, "text": "look at what happened here hello space" }, { "start": 5881.08, "text": "on so these are the token IDs we got" }, { "start": 5884.679, "text": "back and when we look here uh a few" }, { "start": 5887.48, "text": "things sort of uh jump to mind number" }, { "start": 5891.52, "text": "one take a look at these characters the" }, { "start": 5894.159, "text": "Korean characters of course were not" }, { "start": 5895.52, "text": "part of the training set so sentence" }, { "start": 5898.0, "text": "piece is encountering code points that" }, { "start": 5899.599, "text": "it has not seen during training time and" }, { "start": 5902.199, "text": "those code points do not have a token" }, { "start": 5904.56, "text": "associated with them so suddenly these" }, { "start": 5906.4, "text": "are un tokens unknown tokens but because" }, { "start": 5910.56, "text": "bite fall back as true instead sentence" }, { "start": 5913.84, "text": "piece falls back to bytes and so it" }, { "start": 5916.44, "text": "takes this it encodes it with utf8 and" }, { "start": 5919.84, "text": "then it uses these tokens to represent" }, { "start": 5923.28, "text": "uh those bytes and that's what we are" }, { "start": 5925.8, "text": "getting sort of here this is the utf8 uh" }, { "start": 5929.719, "text": "encoding and in this shifted by three uh" }, { "start": 5932.88, "text": "because of these um special tokens here" }, { "start": 5936.239, "text": "that have IDs earlier on so that's what" }, { "start": 5938.84, "text": "happened here now one more thing that um" }, { "start": 5942.92, "text": "well first before I go on with respect" }, { "start": 5945.52, "text": "to the bitef back let me remove bite" }, { "start": 5948.239, "text": "foldback if this is false what's going" }, { "start": 5950.84, "text": "to happen let's" }, { "start": 5952.52, "text": "retrain so the first thing that happened" }, { "start": 5954.44, "text": "is all the bite tokens disappeared right" }, { "start": 5957.28, "text": "and now we just have the merges and we" }, { "start": 5959.0, "text": "have a lot more merges now because we" }, { "start": 5960.48, "text": "have a lot more space because we're not" }, { "start": 5961.8, "text": "taking up space in the wab size uh with" }, { "start": 5965.04, "text": "all the" }, { "start": 5965.96, "text": "bytes and now if we encode" }, { "start": 5969.08, "text": "this we get a zero so this entire string" }, { "start": 5973.239, "text": "here suddenly there's no bitef back so" }, { "start": 5975.119, "text": "this is unknown and unknown is an and so" }, { "start": 5979.4, "text": "this is zero because the an token is" }, { "start": 5982.04, "text": "token zero and you have to keep in mind" }, { "start": 5984.92, "text": "that this would feed into your uh" }, { "start": 5986.88, "text": "language model so what is a language" }, { "start": 5988.4, "text": "model supposed to do when all kinds of" }, { "start": 5989.92, "text": "different things that are unrecognized" }, { "start": 5992.159, "text": "because they're rare just end up mapping" }, { "start": 5994.0, "text": "into Unk it's not exactly the property" }, { "start": 5996.119, "text": "that you want so that's why I think" }, { "start": 5997.76, "text": "llama correctly uh used by fallback true" }, { "start": 6002.04, "text": "uh because we definitely want to feed" }, { "start": 6003.719, "text": "these um unknown or rare code points" }, { "start": 6006.04, "text": "into the model and some uh some manner" }, { "start": 6008.56, "text": "the next thing I want to show you is the" }, { "start": 6010.679, "text": "following notice here when we are" }, { "start": 6012.48, "text": "decoding all the individual tokens you" }, { "start": 6014.719, "text": "see how spaces uh space here ends up" }, { "start": 6018.04, "text": "being this um bold underline I'm not" }, { "start": 6021.239, "text": "100% sure by the way why sentence piece" }, { "start": 6023.08, "text": "switches whites space into these bold" }, { "start": 6025.36, "text": "underscore characters maybe it's for" }, { "start": 6027.639, "text": "visualization I'm not 100% sure why that" }, { "start": 6029.88, "text": "happens uh but notice this why do we" }, { "start": 6032.44, "text": "have an extra space in the front of" }, { "start": 6037.44, "text": "hello um what where is this coming from" }, { "start": 6040.48, "text": "well it's coming from this option" }, { "start": 6043.159, "text": "here" }, { "start": 6045.04, "text": "um add dummy prefix is true and when you" }, { "start": 6048.36, "text": "go to the" }, { "start": 6049.56, "text": "documentation add D whites space at the" }, { "start": 6051.88, "text": "beginning of text in order to treat" }, { "start": 6053.36, "text": "World in world and hello world in the" }, { "start": 6055.92, "text": "exact same way so what this is trying to" }, { "start": 6057.96, "text": "do is the" }, { "start": 6059.239, "text": "following if we go back to our tick" }, { "start": 6062.04, "text": "tokenizer world as uh token by itself" }, { "start": 6066.32, "text": "has a different ID than space world so" }, { "start": 6070.239, "text": "we have this is 1917 but this is 14 Etc" }, { "start": 6074.599, "text": "so these are two different tokens for" }, { "start": 6076.0, "text": "the language model and the language" }, { "start": 6077.4, "text": "model has to learn from data that they" }, { "start": 6078.88, "text": "are actually kind of like a very similar" }, { "start": 6080.32, "text": "concept so to the language model in the" }, { "start": 6083.0, "text": "Tik token World um basically words in" }, { "start": 6086.0, "text": "the beginning of sentences and words in" }, { "start": 6087.639, "text": "the middle of sentences actually look" }, { "start": 6089.04, "text": "completely different um and it has to" }, { "start": 6092.04, "text": "learned that they are roughly the same" }, { "start": 6094.44, "text": "so this add dami prefix is trying to" }, { "start": 6096.92, "text": "fight that a little bit and the way that" }, { "start": 6098.96, "text": "works is that it basically" }, { "start": 6101.719, "text": "uh adds a dummy prefix so for as a as a" }, { "start": 6106.76, "text": "part of pre-processing it will take the" }, { "start": 6109.08, "text": "string and it will add a space it will" }, { "start": 6111.32, "text": "do this and that's done in an effort to" }, { "start": 6114.92, "text": "make this world and that world the same" }, { "start": 6117.52, "text": "they will both be space world so that's" }, { "start": 6120.28, "text": "one other kind of pre-processing option" }, { "start": 6122.159, "text": "that is turned on and llama 2 also uh" }, { "start": 6125.28, "text": "uses this option and that's I think" }, { "start": 6127.4, "text": "everything that I want to say for my" }, { "start": 6128.639, "text": "preview of sentence piece and how it is" }, { "start": 6130.44, "text": "different um maybe here what I've done" }, { "start": 6133.119, "text": "is I just uh put in the Raw protocol" }, { "start": 6136.719, "text": "buffer representation basically of the" }, { "start": 6139.84, "text": "tokenizer the too trained so feel free" }, { "start": 6142.88, "text": "to sort of Step through this and if you" }, { "start": 6144.76, "text": "would like uh your tokenization to look" }, { "start": 6147.0, "text": "identical to that of the meta uh llama 2" }, { "start": 6150.32, "text": "then you would be copy pasting these" }, { "start": 6151.679, "text": "settings as I tried to do up above and" }, { "start": 6154.76, "text": "uh yeah that's I think that's it for" }, { "start": 6156.96, "text": "this section I think my summary for" }, { "start": 6158.88, "text": "sentence piece from all of this is" }, { "start": 6160.8, "text": "number one I think that there's a lot of" }, { "start": 6162.44, "text": "historical baggage in sentence piece a" }, { "start": 6164.28, "text": "lot of Concepts that I think are" }, { "start": 6165.679, "text": "slightly confusing and I think" }, { "start": 6167.239, "text": "potentially um contain foot guns like" }, { "start": 6169.4, "text": "this concept of a sentence and it's" }, { "start": 6170.8, "text": "maximum length and stuff like that um" }, { "start": 6173.719, "text": "otherwise it is fairly commonly used in" }, { "start": 6175.88, "text": "the industry um because it is efficient" }, { "start": 6178.88, "text": "and can do both training and inference" }, { "start": 6181.0, "text": "uh it has a few quirks like for example" }, { "start": 6182.76, "text": "un token must exist and the way the bite" }, { "start": 6185.08, "text": "fallbacks are done and so on I don't" }, { "start": 6186.56, "text": "find particularly elegant and" }, { "start": 6188.36, "text": "unfortunately I have to say it's not" }, { "start": 6189.56, "text": "very well documented so it took me a lot" }, { "start": 6191.44, "text": "of time working with this myself um and" }, { "start": 6194.76, "text": "just visualizing things and trying to" }, { "start": 6196.159, "text": "really understand what is happening here" }, { "start": 6197.8, "text": "because uh the documentation" }, { "start": 6199.28, "text": "unfortunately is in my opion not not" }, { "start": 6201.44, "text": "super amazing but it is a very nice repo" }, { "start": 6204.679, "text": "that is available to you if you'd like" }, { "start": 6206.159, "text": "to train your own tokenizer right now" }, { "start": 6208.199, "text": "okay let me now switch gears again as" }, { "start": 6209.639, "text": "we're starting to slowly wrap up here I" }, { "start": 6211.719, "text": "want to revisit this issue in a bit more" }, { "start": 6213.36, "text": "detail of how we should set the vocap" }, { "start": 6215.32, "text": "size and what are some of the" }, { "start": 6216.199, "text": "considerations around it so for this I'd" }, { "start": 6219.639, "text": "like to go back to the model" }, { "start": 6220.84, "text": "architecture that we developed in the" }, { "start": 6222.159, "text": "last video when we built the GPT from" }, { "start": 6224.679, "text": "scratch so this here was uh the file" }, { "start": 6227.4, "text": "that we built in the previous video and" }, { "start": 6229.08, "text": "we defined the Transformer model and and" }, { "start": 6231.32, "text": "let's specifically look at Bap size and" }, { "start": 6232.88, "text": "where it appears in this file so here we" }, { "start": 6235.199, "text": "Define the voap size uh at this time it" }, { "start": 6238.159, "text": "was 65 or something like that extremely" }, { "start": 6239.96, "text": "small number so this will grow much" }, { "start": 6242.08, "text": "larger you'll see that Bap size doesn't" }, { "start": 6244.28, "text": "come up too much in most of these layers" }, { "start": 6246.159, "text": "the only place that it comes up to is in" }, { "start": 6248.52, "text": "exactly these two places here so when we" }, { "start": 6251.48, "text": "Define the language model there's the" }, { "start": 6253.56, "text": "token embedding table which is this" }, { "start": 6255.8, "text": "two-dimensional array where the vocap" }, { "start": 6258.08, "text": "size is basically the number of rows and" }, { "start": 6261.199, "text": "uh each vocabulary element each token" }, { "start": 6263.92, "text": "has a vector that we're going to train" }, { "start": 6265.92, "text": "using back propagation that Vector is of" }, { "start": 6267.96, "text": "size and embed which is number of" }, { "start": 6269.44, "text": "channels in the Transformer and" }, { "start": 6271.599, "text": "basically as voap size increases this" }, { "start": 6273.679, "text": "embedding table as I mentioned earlier" }, { "start": 6275.679, "text": "is going to also grow we're going to be" }, { "start": 6277.0, "text": "adding rows in addition to that at the" }, { "start": 6279.719, "text": "end of the Transformer there's this LM" }, { "start": 6281.88, "text": "head layer which is a linear layer and" }, { "start": 6284.239, "text": "you'll notice that that layer is used at" }, { "start": 6286.28, "text": "the very end to produce the logits uh" }, { "start": 6288.639, "text": "which become the probabilities for the" }, { "start": 6289.96, "text": "next token in sequence and so" }, { "start": 6291.76, "text": "intuitively we're trying to produce a" }, { "start": 6293.92, "text": "probability for every single token that" }, { "start": 6296.239, "text": "might come next at every point in time" }, { "start": 6298.84, "text": "of that Transformer and if we have more" }, { "start": 6301.08, "text": "and more tokens we need to produce more" }, { "start": 6302.679, "text": "and more probabilities so every single" }, { "start": 6304.92, "text": "token is going to introduce an" }, { "start": 6306.199, "text": "additional dot product that we have to" }, { "start": 6308.159, "text": "do here in this linear layer for this" }, { "start": 6310.199, "text": "final layer in a" }, { "start": 6311.44, "text": "Transformer so why can't vocap size be" }, { "start": 6314.56, "text": "infinite why can't we grow to Infinity" }, { "start": 6316.52, "text": "well number one your token embedding" }, { "start": 6318.199, "text": "table is going to grow uh your linear" }, { "start": 6321.56, "text": "layer is going to grow so we're going to" }, { "start": 6323.599, "text": "be doing a lot more computation here" }, { "start": 6325.119, "text": "because this LM head layer will become" }, { "start": 6326.56, "text": "more computational expensive number two" }, { "start": 6329.119, "text": "because we have more parameters we could" }, { "start": 6330.84, "text": "be worried that we are going to be under" }, { "start": 6333.44, "text": "trining some of these" }, { "start": 6335.199, "text": "parameters so intuitively if you have a" }, { "start": 6337.4, "text": "very large vocabulary size say we have a" }, { "start": 6338.96, "text": "million uh tokens then every one of" }, { "start": 6341.32, "text": "these tokens is going to come up more" }, { "start": 6342.679, "text": "and more rarely in the training data" }, { "start": 6345.04, "text": "because there's a lot more other tokens" }, { "start": 6346.52, "text": "all over the place and so we're going to" }, { "start": 6348.56, "text": "be seeing fewer and fewer examples uh" }, { "start": 6351.0, "text": "for each individual token and you might" }, { "start": 6353.28, "text": "be worried that basically the vectors" }, { "start": 6355.0, "text": "associated with every token will be" }, { "start": 6356.28, "text": "undertrained as a result because they" }, { "start": 6358.28, "text": "just don't come up too often and they" }, { "start": 6359.92, "text": "don't participate in the forward" }, { "start": 6360.96, "text": "backward pass in addition to that as" }, { "start": 6363.199, "text": "your vocab size grows you're going to" }, { "start": 6364.88, "text": "start shrinking your sequences a lot" }, { "start": 6367.04, "text": "right and that's really nice because" }, { "start": 6369.32, "text": "that means that we're going to be" }, { "start": 6370.119, "text": "attending to more and more text so" }, { "start": 6372.0, "text": "that's nice but also you might be" }, { "start": 6373.599, "text": "worrying that two large of chunks are" }, { "start": 6375.92, "text": "being squished into single tokens and so" }, { "start": 6378.56, "text": "the model just doesn't have as much of" }, { "start": 6380.719, "text": "time to think per sort of um some number" }, { "start": 6385.08, "text": "of characters in the text or you can" }, { "start": 6386.679, "text": "think about it that way right so" }, { "start": 6388.08, "text": "basically we're squishing too much" }, { "start": 6389.48, "text": "information into a single token and then" }, { "start": 6391.639, "text": "the forward pass of the Transformer is" }, { "start": 6393.04, "text": "not enough to actually process that" }, { "start": 6394.4, "text": "information appropriately and so these" }, { "start": 6396.44, "text": "are some of the considerations you're" }, { "start": 6397.48, "text": "thinking about when you're designing the" }, { "start": 6398.639, "text": "vocab size as I mentioned this is mostly" }, { "start": 6400.639, "text": "an empirical hyperparameter and it seems" }, { "start": 6402.88, "text": "like in state-of-the-art architectures" }, { "start": 6404.239, "text": "today this is usually in the high 10,000" }, { "start": 6406.76, "text": "or somewhere around 100,000 today and" }, { "start": 6409.36, "text": "the next consideration I want to briefly" }, { "start": 6410.88, "text": "talk about is what if we want to take a" }, { "start": 6413.0, "text": "pre-trained model and we want to extend" }, { "start": 6415.199, "text": "the vocap size and this is done fairly" }, { "start": 6417.36, "text": "commonly actually so for example when" }, { "start": 6418.88, "text": "you're doing fine-tuning for cha GPT um" }, { "start": 6422.159, "text": "a lot more new special tokens get" }, { "start": 6423.76, "text": "introduced on top of the base model to" }, { "start": 6425.8, "text": "maintain the metadata and all the" }, { "start": 6428.04, "text": "structure of conversation objects" }, { "start": 6429.88, "text": "between a user and an assistant so that" }, { "start": 6431.92, "text": "takes a lot of special tokens you might" }, { "start": 6434.04, "text": "also try to throw in more special tokens" }, { "start": 6435.88, "text": "for example for using the browser or any" }, { "start": 6437.8, "text": "other tool and so it's very tempting to" }, { "start": 6440.639, "text": "add a lot of tokens for all kinds of" }, { "start": 6442.159, "text": "special functionality so if you want to" }, { "start": 6444.52, "text": "be adding a token that's totally" }, { "start": 6445.8, "text": "possible Right all we have to do is we" }, { "start": 6447.719, "text": "have to resize this embedding so we have" }, { "start": 6449.88, "text": "to add rows we would initialize these uh" }, { "start": 6452.48, "text": "parameters from scratch to be small" }, { "start": 6454.44, "text": "random numbers and then we have to" }, { "start": 6456.119, "text": "extend the weight inside this linear uh" }, { "start": 6459.28, "text": "so we have to start making dot products" }, { "start": 6461.44, "text": "um with the associated parameters as" }, { "start": 6463.199, "text": "well to basically calculate the" }, { "start": 6464.56, "text": "probabilities for these new tokens so" }, { "start": 6466.76, "text": "both of these are just a resizing" }, { "start": 6468.639, "text": "operation it's a very mild" }, { "start": 6470.84, "text": "model surgery and can be done fairly" }, { "start": 6472.599, "text": "easily and it's quite common that" }, { "start": 6474.04, "text": "basically you would freeze the base" }, { "start": 6475.36, "text": "model you introduce these new parameters" }, { "start": 6477.44, "text": "and then you only train these new" }, { "start": 6478.639, "text": "parameters to introduce new tokens into" }, { "start": 6480.56, "text": "the architecture um and so you can" }, { "start": 6483.119, "text": "freeze arbitrary parts of it or you can" }, { "start": 6484.96, "text": "train arbitrary parts of it and that's" }, { "start": 6486.4, "text": "totally up to you but basically minor" }, { "start": 6488.32, "text": "surgery required if you'd like to" }, { "start": 6490.119, "text": "introduce new tokens and finally I'd" }, { "start": 6491.88, "text": "like to mention that actually there's an" }, { "start": 6493.36, "text": "entire design space of applications in" }, { "start": 6495.92, "text": "terms of introducing new tokens into a" }, { "start": 6497.639, "text": "vocabulary that go Way Beyond just" }, { "start": 6499.36, "text": "adding special tokens and special new" }, { "start": 6501.199, "text": "functionality so just to give you a" }, { "start": 6503.0, "text": "sense of the design space but this could" }, { "start": 6504.36, "text": "be an entire video just by itself uh" }, { "start": 6506.599, "text": "this is a paper on learning to compress" }, { "start": 6508.639, "text": "prompts with what they called uh gist" }, { "start": 6511.04, "text": "tokens and the rough idea is suppose" }, { "start": 6513.4, "text": "that you're using language models in a" }, { "start": 6514.679, "text": "setting that requires very long prompts" }, { "start": 6517.159, "text": "while these long prompts just slow" }, { "start": 6518.8, "text": "everything down because you have to" }, { "start": 6519.84, "text": "encode them and then you have to use" }, { "start": 6521.4, "text": "them and then you're tending over them" }, { "start": 6523.119, "text": "and it's just um you know heavy to have" }, { "start": 6525.119, "text": "very large prompts so instead what they" }, { "start": 6527.639, "text": "do here in this paper is they introduce" }, { "start": 6530.679, "text": "new tokens and um imagine basically" }, { "start": 6534.56, "text": "having a few new tokens you put them in" }, { "start": 6536.4, "text": "a sequence and then you train the model" }, { "start": 6539.36, "text": "by distillation so you are keeping the" }, { "start": 6541.52, "text": "entire model Frozen and you're only" }, { "start": 6543.159, "text": "training the representations of the new" }, { "start": 6545.0, "text": "tokens their embeddings and you're" }, { "start": 6546.96, "text": "optimizing over the new tokens such that" }, { "start": 6549.44, "text": "the behavior of the language model is" }, { "start": 6551.92, "text": "identical uh to the model that has a" }, { "start": 6555.04, "text": "very long prompt that works for you and" }, { "start": 6557.679, "text": "so it's a compression technique of" }, { "start": 6559.0, "text": "compressing that very long prompt into" }, { "start": 6560.8, "text": "those few new gist tokens and so you can" }, { "start": 6563.8, "text": "train this and then at test time you can" }, { "start": 6565.04, "text": "discard your old prompt and just swap in" }, { "start": 6566.719, "text": "those tokens and they sort of like uh" }, { "start": 6568.639, "text": "stand in for that very long prompt and" }, { "start": 6571.119, "text": "have an almost identical performance and" }, { "start": 6573.679, "text": "so this is one um technique and a class" }, { "start": 6576.48, "text": "of parameter efficient fine-tuning" }, { "start": 6578.0, "text": "techniques where most of the model is" }, { "start": 6579.92, "text": "basically fixed and there's no training" }, { "start": 6581.88, "text": "of the model weights there's no training" }, { "start": 6583.599, "text": "of Laura or anything like that of new" }, { "start": 6585.44, "text": "parameters the the parameters that" }, { "start": 6587.239, "text": "you're training are now just the uh" }, { "start": 6589.119, "text": "token embeddings so that's just one" }, { "start": 6591.199, "text": "example but this could again be like an" }, { "start": 6592.88, "text": "entire video but just to give you a" }, { "start": 6594.52, "text": "sense that there's a whole design space" }, { "start": 6595.76, "text": "here that is potentially worth exploring" }, { "start": 6597.36, "text": "in the future the next thing I want to" }, { "start": 6599.199, "text": "briefly address is that I think recently" }, { "start": 6601.199, "text": "there's a lot of momentum in how you" }, { "start": 6603.08, "text": "actually could construct Transformers" }, { "start": 6605.08, "text": "that can simultaneously process not just" }, { "start": 6606.8, "text": "text as the input modality but a lot of" }, { "start": 6608.84, "text": "other modalities so be it images videos" }, { "start": 6611.52, "text": "audio Etc and how do you feed in all" }, { "start": 6614.28, "text": "these modalities and potentially predict" }, { "start": 6616.0, "text": "these modalities from a Transformer uh" }, { "start": 6618.84, "text": "do you have to change the architecture" }, { "start": 6619.84, "text": "in some fundamental way and I think what" }, { "start": 6621.599, "text": "a lot of people are starting to converge" }, { "start": 6623.119, "text": "towards is that you're not changing the" }, { "start": 6624.28, "text": "architecture you stick with the" }, { "start": 6625.44, "text": "Transformer you just kind of tokenize" }, { "start": 6627.56, "text": "your input domains and then call the day" }, { "start": 6629.96, "text": "and pretend it's just text tokens and" }, { "start": 6631.52, "text": "just do everything else identical in an" }, { "start": 6633.96, "text": "identical manner so here for example" }, { "start": 6636.08, "text": "there was a early paper that has nice" }, { "start": 6637.56, "text": "graphic for how you can take an image" }, { "start": 6639.599, "text": "and you can chunc at it into" }, { "start": 6642.159, "text": "integers um and these sometimes uh so" }, { "start": 6645.4, "text": "these will basically become the tokens" }, { "start": 6646.84, "text": "of images as an example and uh these" }, { "start": 6649.56, "text": "tokens can be uh hard tokens where you" }, { "start": 6652.199, "text": "force them to be integers they can also" }, { "start": 6653.92, "text": "be soft tokens where you uh sort of" }, { "start": 6657.0, "text": "don't require uh these to be discrete" }, { "start": 6660.239, "text": "but you do Force these representations" }, { "start": 6662.159, "text": "to go through bottlenecks like in Auto" }, { "start": 6664.76, "text": "encoders uh also in this paper that came" }, { "start": 6666.92, "text": "out from open a SORA which I think" }, { "start": 6668.88, "text": "really um uh blew the mind of many" }, { "start": 6671.84, "text": "people and inspired a lot of people in" }, { "start": 6673.52, "text": "terms of what's possible they have a" }, { "start": 6675.199, "text": "Graphic here and they talk briefly about" }, { "start": 6676.92, "text": "how llms have text tokens Sora has" }, { "start": 6680.159, "text": "visual patches so again they came up" }, { "start": 6682.52, "text": "with a way to chunc a videos into" }, { "start": 6684.92, "text": "basically tokens when they own" }, { "start": 6686.52, "text": "vocabularies and then you can either" }, { "start": 6688.52, "text": "process discrete tokens say with autog" }, { "start": 6690.04, "text": "regressive models or even soft tokens" }, { "start": 6692.079, "text": "with diffusion models and uh all of that" }, { "start": 6695.239, "text": "is sort of uh being actively worked on" }, { "start": 6698.239, "text": "designed on and is beyond the scope of" }, { "start": 6699.639, "text": "this video but just something I wanted" }, { "start": 6700.88, "text": "to mention briefly okay now that we have" }, { "start": 6702.96, "text": "come quite deep into the tokenization" }, { "start": 6705.119, "text": "algorithm and we understand a lot more" }, { "start": 6706.76, "text": "about how it works let's loop back" }, { "start": 6708.92, "text": "around to the beginning of this video" }, { "start": 6710.52, "text": "and go through some of these bullet" }, { "start": 6711.599, "text": "points and really see why they happen so" }, { "start": 6714.88, "text": "first of all why can't my llm spell" }, { "start": 6716.96, "text": "words very well or do other spell" }, { "start": 6718.76, "text": "related" }, { "start": 6720.56, "text": "tasks so fundamentally this is because" }, { "start": 6722.92, "text": "as we saw these characters are chunked" }, { "start": 6725.679, "text": "up into tokens and some of these tokens" }, { "start": 6727.96, "text": "are actually fairly long so as an" }, { "start": 6730.4, "text": "example I went to the gp4 vocabulary and" }, { "start": 6732.8, "text": "I looked at uh one of the longer tokens" }, { "start": 6735.28, "text": "so that default style turns out to be a" }, { "start": 6737.88, "text": "single individual token so that's a lot" }, { "start": 6739.719, "text": "of characters for a single token so my" }, { "start": 6742.159, "text": "suspicion is that there's just too much" }, { "start": 6743.76, "text": "crammed into this single token and my" }, { "start": 6746.079, "text": "suspicion was that the model should not" }, { "start": 6747.76, "text": "be very good at tasks related to" }, { "start": 6750.36, "text": "spelling of this uh single token so I" }, { "start": 6754.679, "text": "asked how many letters L are there in" }, { "start": 6757.0, "text": "the word default style and of course my" }, { "start": 6761.48, "text": "prompt is intentionally done that way" }, { "start": 6764.36, "text": "and you see how default style will be a" }, { "start": 6765.76, "text": "single token so this is what the model" }, { "start": 6767.36, "text": "sees so my suspicion is that it wouldn't" }, { "start": 6769.4, "text": "be very good at this and indeed it is" }, { "start": 6771.32, "text": "not it doesn't actually know how many" }, { "start": 6773.159, "text": "L's are in there it thinks there are" }, { "start": 6774.639, "text": "three and actually there are four if I'm" }, { "start": 6777.0, "text": "not getting this wrong myself so that" }, { "start": 6779.639, "text": "didn't go extremely well let's look look" }, { "start": 6782.32, "text": "at another kind of uh character level" }, { "start": 6784.599, "text": "task so for example here I asked uh gp4" }, { "start": 6788.4, "text": "to reverse the string default style and" }, { "start": 6791.159, "text": "they tried to use a code interpreter and" }, { "start": 6793.199, "text": "I stopped it and I said just do it just" }, { "start": 6795.44, "text": "try it and uh it gave me jumble so it" }, { "start": 6799.56, "text": "doesn't actually really know how to" }, { "start": 6801.44, "text": "reverse this string going from right to" }, { "start": 6803.76, "text": "left uh so it gave a wrong result so" }, { "start": 6806.76, "text": "again like working with this working" }, { "start": 6808.32, "text": "hypothesis that maybe this is due to the" }, { "start": 6810.0, "text": "tokenization I tried a different" }, { "start": 6811.84, "text": "approach I said okay let's reverse the" }, { "start": 6814.119, "text": "exact same string but take the following" }, { "start": 6816.44, "text": "approach step one just print out every" }, { "start": 6818.679, "text": "single character separated by spaces and" }, { "start": 6820.719, "text": "then as a step two reverse that list and" }, { "start": 6823.28, "text": "it again Tred to use a tool but when I" }, { "start": 6824.8, "text": "stopped it it uh first uh produced all" }, { "start": 6827.76, "text": "the characters and that was actually" }, { "start": 6828.92, "text": "correct and then It reversed them and" }, { "start": 6830.92, "text": "that was correct once it had this so" }, { "start": 6833.04, "text": "somehow it can't reverse it directly but" }, { "start": 6834.88, "text": "when you go just first uh you know" }, { "start": 6837.4, "text": "listing it out in order it can do that" }, { "start": 6839.28, "text": "somehow and then it can once it's uh" }, { "start": 6841.88, "text": "broken up this way this becomes all" }, { "start": 6843.88, "text": "these individual characters and so now" }, { "start": 6846.04, "text": "this is much easier for it to see these" }, { "start": 6847.88, "text": "individual tokens and reverse them and" }, { "start": 6850.079, "text": "print them out so that is kind of" }, { "start": 6853.52, "text": "interesting so let's continue now why" }, { "start": 6856.84, "text": "are llms worse at uh non-english langu" }, { "start": 6860.4, "text": "and I briefly covered this already but" }, { "start": 6862.679, "text": "basically um it's not only that the" }, { "start": 6864.88, "text": "language model sees less non-english" }, { "start": 6867.159, "text": "data during training of the model" }, { "start": 6868.76, "text": "parameters but also the tokenizer is not" }, { "start": 6871.639, "text": "um is not sufficiently trained on" }, { "start": 6874.639, "text": "non-english data and so here for example" }, { "start": 6877.28, "text": "hello how are you is five tokens and its" }, { "start": 6880.52, "text": "translation is 15 tokens so this is a" }, { "start": 6882.88, "text": "three times blow up and so for example" }, { "start": 6885.8, "text": "anang is uh just hello basically in" }, { "start": 6888.639, "text": "Korean and that end up being three" }, { "start": 6890.32, "text": "tokens I'm actually kind of surprised by" }, { "start": 6891.8, "text": "that because that is a very common" }, { "start": 6893.119, "text": "phrase there just the typical greeting" }, { "start": 6895.159, "text": "of like hello and that ends up being" }, { "start": 6897.0, "text": "three tokens whereas our hello is a" }, { "start": 6898.76, "text": "single token and so basically everything" }, { "start": 6900.56, "text": "is a lot more bloated and diffuse and" }, { "start": 6902.32, "text": "this is I think partly the reason that" }, { "start": 6904.079, "text": "the model Works worse on other" }, { "start": 6907.0, "text": "languages uh coming back why is LM bad" }, { "start": 6910.04, "text": "at simple arithmetic um that has to do" }, { "start": 6913.159, "text": "with the tokenization of numbers and so" }, { "start": 6917.36, "text": "um you'll notice that for example" }, { "start": 6919.079, "text": "addition is very sort of" }, { "start": 6920.96, "text": "like uh there's an algorithm that is" }, { "start": 6923.079, "text": "like character level for doing addition" }, { "start": 6925.719, "text": "so for example here we would first add" }, { "start": 6927.639, "text": "the ones and then the tens and then the" }, { "start": 6929.199, "text": "hundreds you have to refer to specific" }, { "start": 6931.079, "text": "parts of these digits but uh these" }, { "start": 6934.719, "text": "numbers are represented completely" }, { "start": 6936.199, "text": "arbitrarily based on whatever happened" }, { "start": 6937.679, "text": "to merge or not merge during the" }, { "start": 6939.28, "text": "tokenization process there's an entire" }, { "start": 6941.44, "text": "blog post about this that I think is" }, { "start": 6942.84, "text": "quite good integer tokenization is" }, { "start": 6944.719, "text": "insane and this person basically" }, { "start": 6946.679, "text": "systematically explores the tokenization" }, { "start": 6948.719, "text": "of numbers in I believe this is gpt2 and" }, { "start": 6952.04, "text": "so they notice that for example for the" }, { "start": 6953.76, "text": "for um four-digit numbers you can take a" }, { "start": 6957.28, "text": "look at whether it is uh a single token" }, { "start": 6960.199, "text": "or whether it is two tokens that is a 1" }, { "start": 6962.119, "text": "three or a 2 two or a 31 combination and" }, { "start": 6964.92, "text": "so all the different numbers are all the" }, { "start": 6966.56, "text": "different combinations and you can" }, { "start": 6968.04, "text": "imagine this is all completely" }, { "start": 6969.199, "text": "arbitrarily so and the model" }, { "start": 6971.28, "text": "unfortunately sometimes sees uh four um" }, { "start": 6974.159, "text": "a token for for all four digits" }, { "start": 6976.599, "text": "sometimes for three sometimes for two" }, { "start": 6978.04, "text": "sometimes for one and it's in an" }, { "start": 6980.0, "text": "arbitrary uh Manner and so this is" }, { "start": 6982.52, "text": "definitely a headwind if you will for" }, { "start": 6985.0, "text": "the language model and it's kind of" }, { "start": 6986.36, "text": "incredible that it can kind of do it and" }, { "start": 6987.92, "text": "deal with it but it's also kind of not" }, { "start": 6990.119, "text": "ideal and so that's why for example we" }, { "start": 6992.0, "text": "saw that meta when they train the Llama" }, { "start": 6994.199, "text": "2 algorithm and they use sentence piece" }, { "start": 6996.44, "text": "they make sure to split up all the um" }, { "start": 6999.52, "text": "all the digits as an example for uh" }, { "start": 7002.32, "text": "llama 2 and this is partly to improve a" }, { "start": 7004.88, "text": "simple arithmetic kind of" }, { "start": 7006.92, "text": "performance and finally why is gpt2 not" }, { "start": 7010.52, "text": "as good in Python again this is partly a" }, { "start": 7012.92, "text": "modeling issue on in the architecture" }, { "start": 7014.88, "text": "and the data set and the strength of the" }, { "start": 7016.639, "text": "model but it's also partially" }, { "start": 7018.199, "text": "tokenization because as we saw here with" }, { "start": 7020.32, "text": "the simple python example the encoding" }, { "start": 7023.04, "text": "efficiency of the tokenizer for handling" }, { "start": 7025.199, "text": "spaces in Python is terrible and every" }, { "start": 7027.36, "text": "single space is an individual token and" }, { "start": 7029.44, "text": "this dramatically reduces the context" }, { "start": 7031.079, "text": "length that the model can attend to" }, { "start": 7032.52, "text": "cross so that's almost like a" }, { "start": 7034.079, "text": "tokenization bug for gpd2 and that was" }, { "start": 7036.8, "text": "later fixed with gp4 okay so here's" }, { "start": 7040.0, "text": "another fun one my llm abruptly halts" }, { "start": 7042.52, "text": "when it sees the string end of text so" }, { "start": 7045.28, "text": "here's um here's a very strange Behavior" }, { "start": 7048.04, "text": "print a string end of text is what I" }, { "start": 7050.079, "text": "told jt4 and it says could you please" }, { "start": 7052.239, "text": "specify the string and I'm I'm telling" }, { "start": 7055.119, "text": "it give me end of text and it seems like" }, { "start": 7057.159, "text": "there's an issue it's not seeing end of" }, { "start": 7059.239, "text": "text and then I give it end of text is" }, { "start": 7061.599, "text": "the string and then here's a string and" }, { "start": 7064.239, "text": "then it just doesn't print it so" }, { "start": 7065.84, "text": "obviously something is breaking here" }, { "start": 7067.119, "text": "with respect to the handling of the" }, { "start": 7068.32, "text": "special token and I don't actually know" }, { "start": 7070.199, "text": "what open ey is doing under the hood" }, { "start": 7072.639, "text": "here and whether they are potentially" }, { "start": 7074.52, "text": "parsing this as an um as an actual token" }, { "start": 7078.96, "text": "instead of this just being uh end of" }, { "start": 7081.159, "text": "text um as like individual sort of" }, { "start": 7084.599, "text": "pieces of it without the special token" }, { "start": 7086.44, "text": "handling logic and so it might be that" }, { "start": 7089.52, "text": "someone when they're calling do encode" }, { "start": 7091.76, "text": "uh they are passing in the allowed" }, { "start": 7093.36, "text": "special and they are allowing end of" }, { "start": 7096.199, "text": "text as a special character in the user" }, { "start": 7098.36, "text": "prompt but the user prompt of course is" }, { "start": 7100.84, "text": "is a sort of um attacker controlled text" }, { "start": 7103.52, "text": "so you would hope that they don't really" }, { "start": 7105.32, "text": "parse or use special tokens or you know" }, { "start": 7108.76, "text": "from that kind of input but it appears" }, { "start": 7110.599, "text": "that there's something definitely going" }, { "start": 7111.76, "text": "wrong here and um so your knowledge of" }, { "start": 7114.8, "text": "these special tokens ends up being in a" }, { "start": 7116.4, "text": "tax surface potentially and so if you'd" }, { "start": 7118.88, "text": "like to confuse llms then just um try to" }, { "start": 7123.0, "text": "give them some special tokens and see if" }, { "start": 7124.32, "text": "you're breaking something by chance okay" }, { "start": 7126.4, "text": "so this next one is a really fun one uh" }, { "start": 7129.48, "text": "the trailing whites space issue so if" }, { "start": 7132.88, "text": "you come to playground and uh we come" }, { "start": 7136.0, "text": "here to GPT 3.5 turbo instruct so this" }, { "start": 7138.44, "text": "is not a chat model this is a completion" }, { "start": 7140.32, "text": "model so think of it more like it's a" }, { "start": 7142.88, "text": "lot more closer to a base model it does" }, { "start": 7145.28, "text": "completion it will continue the token" }, { "start": 7147.599, "text": "sequence so here's a tagline for ice" }, { "start": 7149.88, "text": "cream shop and we want to continue the" }, { "start": 7151.639, "text": "sequence and so we can submit and get a" }, { "start": 7154.239, "text": "bunch of tokens okay no problem but now" }, { "start": 7158.239, "text": "suppose I do this but instead of" }, { "start": 7160.84, "text": "pressing submit here I do here's a" }, { "start": 7163.119, "text": "tagline for ice cream shop space so I" }, { "start": 7166.0, "text": "have a space here before I click" }, { "start": 7168.96, "text": "submit we get a warning your text ends" }, { "start": 7171.84, "text": "in a trail Ling space which causes worse" }, { "start": 7173.4, "text": "performance due to how API splits text" }, { "start": 7175.84, "text": "into tokens so what's happening here it" }, { "start": 7178.239, "text": "still gave us a uh sort of completion" }, { "start": 7180.56, "text": "here but let's take a look at what's" }, { "start": 7182.8, "text": "happening so here's a tagline for an ice" }, { "start": 7184.88, "text": "cream shop and then what does this look" }, { "start": 7188.679, "text": "like in the actual actual training data" }, { "start": 7190.159, "text": "suppose you found the completion in the" }, { "start": 7192.28, "text": "training document somewhere on the" }, { "start": 7193.56, "text": "internet and the llm trained on this" }, { "start": 7195.679, "text": "data so maybe it's something like oh" }, { "start": 7198.32, "text": "yeah maybe that's the tagline that's a" }, { "start": 7200.4, "text": "terrible tagline but notice here that" }, { "start": 7202.76, "text": "when I create o you see that because" }, { "start": 7205.76, "text": "there's the the space character is" }, { "start": 7207.8, "text": "always a prefix to these tokens in GPT" }, { "start": 7211.159, "text": "so it's not an O token it's a space o" }, { "start": 7213.48, "text": "token the space is part of the O and" }, { "start": 7216.76, "text": "together they are token 8840 that's" }, { "start": 7219.239, "text": "that's space o so what's What's" }, { "start": 7221.92, "text": "Happening Here is that when I just have" }, { "start": 7224.119, "text": "it like this and I let it complete the" }, { "start": 7227.04, "text": "next token it can sample the space o" }, { "start": 7230.04, "text": "token but instead if I have this and I" }, { "start": 7232.599, "text": "add my space then what I'm doing here" }, { "start": 7234.76, "text": "when I incode this string is I have" }, { "start": 7237.639, "text": "basically here's a t line for an ice" }, { "start": 7239.079, "text": "cream uh shop and this space at the very" }, { "start": 7242.0, "text": "end becomes a token" }, { "start": 7244.079, "text": "220 and so we've added token 220 and" }, { "start": 7247.84, "text": "this token otherwise would be part of" }, { "start": 7249.76, "text": "the tagline because if there actually is" }, { "start": 7251.88, "text": "a tagline here so space o is the token" }, { "start": 7255.239, "text": "and so this is suddenly a of" }, { "start": 7257.32, "text": "distribution for the model because this" }, { "start": 7259.679, "text": "space is part of the next token but" }, { "start": 7261.52, "text": "we're putting it here like this and the" }, { "start": 7264.04, "text": "model has seen very very little data of" }, { "start": 7267.199, "text": "actual Space by itself and we're asking" }, { "start": 7270.079, "text": "it to complete the sequence like add in" }, { "start": 7271.719, "text": "more tokens but the problem is that" }, { "start": 7273.48, "text": "we've sort of begun the first token and" }, { "start": 7276.36, "text": "now it's been split up and now we're out" }, { "start": 7278.76, "text": "of this distribution and now arbitrary" }, { "start": 7280.76, "text": "bad things happen and it's just a very" }, { "start": 7283.04, "text": "rare example for it to see something" }, { "start": 7284.56, "text": "like that and uh that's why we get the" }, { "start": 7286.92, "text": "warning so the fundamental issue here is" }, { "start": 7289.119, "text": "of course that um the llm is on top of" }, { "start": 7292.44, "text": "these tokens and these tokens are text" }, { "start": 7294.599, "text": "chunks they're not characters in a way" }, { "start": 7296.56, "text": "you and I would think of them they are" }, { "start": 7298.199, "text": "these are the atoms of what the LM is" }, { "start": 7300.36, "text": "seeing and there's a bunch of weird" }, { "start": 7301.8, "text": "stuff that comes out of it let's go back" }, { "start": 7303.639, "text": "to our default cell style I bet you that" }, { "start": 7308.0, "text": "the model has never in its training set" }, { "start": 7309.96, "text": "seen default cell sta without Le in" }, { "start": 7314.199, "text": "there it's always seen this as a single" }, { "start": 7316.599, "text": "group because uh this is some kind of a" }, { "start": 7319.239, "text": "function in um I'm guess I don't" }, { "start": 7322.0, "text": "actually know what this is part of this" }, { "start": 7323.079, "text": "is some kind of API but I bet you that" }, { "start": 7325.119, "text": "it's never seen this combination of" }, { "start": 7327.079, "text": "tokens uh in its training data because" }, { "start": 7330.639, "text": "or I think it would be extremely rare so" }, { "start": 7332.36, "text": "I took this and I copy pasted it here" }, { "start": 7334.719, "text": "and I had I tried to complete from it" }, { "start": 7337.48, "text": "and the it immediately gave me a big" }, { "start": 7339.199, "text": "error and it said the model predicted to" }, { "start": 7341.079, "text": "completion that begins with a stop" }, { "start": 7342.32, "text": "sequence resulting in no output consider" }, { "start": 7344.159, "text": "adjusting your prompt or stop sequences" }, { "start": 7346.36, "text": "so what happened here when I clicked" }, { "start": 7347.639, "text": "submit is that immediately the model" }, { "start": 7350.199, "text": "emitted and sort of like end of text" }, { "start": 7352.239, "text": "token I think or something like that it" }, { "start": 7354.44, "text": "basically predicted the stop sequence" }, { "start": 7356.44, "text": "immediately so it had no completion and" }, { "start": 7358.76, "text": "so this is why I'm getting a warning" }, { "start": 7360.199, "text": "again because we're off the data" }, { "start": 7362.159, "text": "distribution and the model is just uh" }, { "start": 7365.119, "text": "predicting just totally arbitrary things" }, { "start": 7367.639, "text": "it's just really confused basically this" }, { "start": 7369.44, "text": "is uh this is giving it brain damage" }, { "start": 7370.92, "text": "it's never seen this before it's shocked" }, { "start": 7373.32, "text": "and it's predicting end of text or" }, { "start": 7374.56, "text": "something I tried it again here and it" }, { "start": 7377.04, "text": "in this case it completed it but then" }, { "start": 7379.079, "text": "for some reason this request May violate" }, { "start": 7381.44, "text": "our usage policies this was" }, { "start": 7383.639, "text": "flagged um basically something just like" }, { "start": 7386.639, "text": "goes wrong and there's something like" }, { "start": 7387.679, "text": "Jank you can just feel the Jank because" }, { "start": 7389.52, "text": "the model is like extremely unhappy with" }, { "start": 7391.4, "text": "just this and it doesn't know how to" }, { "start": 7392.96, "text": "complete it because it's never occurred" }, { "start": 7394.159, "text": "in training set in a training set it" }, { "start": 7396.199, "text": "always appears like this and becomes a" }, { "start": 7398.32, "text": "single token" }, { "start": 7400.04, "text": "so these kinds of issues where tokens" }, { "start": 7401.96, "text": "are either you sort of like complete the" }, { "start": 7404.239, "text": "first character of the next token or you" }, { "start": 7406.76, "text": "are sort of you have long tokens that" }, { "start": 7408.56, "text": "you then have just some of the" }, { "start": 7409.8, "text": "characters off all of these are kind of" }, { "start": 7412.32, "text": "like issues with partial tokens is how I" }, { "start": 7415.36, "text": "would describe it and if you actually" }, { "start": 7417.76, "text": "dig into the T token" }, { "start": 7419.8, "text": "repository go to the rust code and" }, { "start": 7421.96, "text": "search for" }, { "start": 7424.159, "text": "unstable and you'll see um en code" }, { "start": 7427.079, "text": "unstable native unstable token tokens" }, { "start": 7429.239, "text": "and a lot of like special case handling" }, { "start": 7431.52, "text": "none of this stuff about unstable tokens" }, { "start": 7433.4, "text": "is documented anywhere but there's a ton" }, { "start": 7435.48, "text": "of code dealing with unstable tokens and" }, { "start": 7438.36, "text": "unstable tokens is exactly kind of like" }, { "start": 7440.8, "text": "what I'm describing here what you would" }, { "start": 7442.76, "text": "like out of a completion API is" }, { "start": 7445.239, "text": "something a lot more fancy like if we're" }, { "start": 7446.599, "text": "putting in default cell sta if we're" }, { "start": 7448.96, "text": "asking for the next token sequence we're" }, { "start": 7450.679, "text": "not actually trying to append the next" }, { "start": 7452.239, "text": "token exactly after this list we're" }, { "start": 7454.639, "text": "actually trying to append we're trying" }, { "start": 7456.48, "text": "to consider lots of tokens um" }, { "start": 7459.52, "text": "that if we were or I guess like we're" }, { "start": 7462.159, "text": "trying to search over characters that if" }, { "start": 7465.76, "text": "we retened would be of high probability" }, { "start": 7468.159, "text": "if that makes sense um so that we can" }, { "start": 7470.679, "text": "actually add a single individual" }, { "start": 7472.32, "text": "character uh instead of just like adding" }, { "start": 7474.48, "text": "the next full token that comes after" }, { "start": 7476.679, "text": "this partial token list so I this is" }, { "start": 7479.36, "text": "very tricky to describe and I invite you" }, { "start": 7481.32, "text": "to maybe like look through this it ends" }, { "start": 7483.04, "text": "up being extremely gnarly and hairy kind" }, { "start": 7484.679, "text": "of topic it and it comes from" }, { "start": 7486.36, "text": "tokenization fundamentally so um maybe I" }, { "start": 7489.4, "text": "can even spend an entire video talking" }, { "start": 7490.8, "text": "about unstable tokens sometime in the" }, { "start": 7492.119, "text": "future okay and I'm really saving the" }, { "start": 7494.199, "text": "best for last my favorite one by far is" }, { "start": 7496.599, "text": "the solid gold" }, { "start": 7499.199, "text": "Magikarp and it just okay so this comes" }, { "start": 7501.36, "text": "from this blog post uh solid gold" }, { "start": 7503.639, "text": "Magikarp and uh this is um internet" }, { "start": 7507.0, "text": "famous now for those of us in llms and" }, { "start": 7510.079, "text": "basically I I would advise you to uh" }, { "start": 7511.84, "text": "read this block Post in full but" }, { "start": 7513.679, "text": "basically what this person was doing is" }, { "start": 7516.559, "text": "this person went to the um" }, { "start": 7519.239, "text": "token embedding stable and clustered the" }, { "start": 7522.32, "text": "tokens based on their embedding" }, { "start": 7524.8, "text": "representation and this person noticed" }, { "start": 7527.28, "text": "that there's a cluster of tokens that" }, { "start": 7529.239, "text": "look really strange so there's a cluster" }, { "start": 7531.159, "text": "here at rot e stream Fame solid gold" }, { "start": 7534.079, "text": "Magikarp Signet message like really" }, { "start": 7536.0, "text": "weird tokens in uh basically in this" }, { "start": 7539.96, "text": "embedding cluster and so what are these" }, { "start": 7542.239, "text": "tokens and where do they even come from" }, { "start": 7543.679, "text": "like what is solid gold magikarpet makes" }, { "start": 7545.4, "text": "no sense and then they found bunch of" }, { "start": 7548.96, "text": "these" }, { "start": 7550.199, "text": "tokens and then they notice that" }, { "start": 7552.119, "text": "actually the plot thickens here because" }, { "start": 7553.559, "text": "if you ask the model about these tokens" }, { "start": 7556.04, "text": "like you ask it uh some very benign" }, { "start": 7558.639, "text": "question like please can you repeat back" }, { "start": 7560.199, "text": "to me the string sold gold Magikarp uh" }, { "start": 7562.96, "text": "then you get a variety of basically" }, { "start": 7564.8, "text": "totally broken llm Behavior so either" }, { "start": 7567.76, "text": "you get evasion so I'm sorry I can't" }, { "start": 7569.84, "text": "hear you or you get a bunch of" }, { "start": 7571.4, "text": "hallucinations as a response um you can" }, { "start": 7574.559, "text": "even get back like insults so you ask it" }, { "start": 7577.28, "text": "uh about streamer bot it uh tells the" }, { "start": 7580.0, "text": "and the model actually just calls you" }, { "start": 7582.04, "text": "names uh or it kind of comes up with" }, { "start": 7584.159, "text": "like weird humor like you're actually" }, { "start": 7586.239, "text": "breaking the model by asking about these" }, { "start": 7588.48, "text": "very simple strings like at Roth and" }, { "start": 7590.52, "text": "sold gold Magikarp so like what the hell" }, { "start": 7592.84, "text": "is happening and there's a variety of" }, { "start": 7594.48, "text": "here documented behaviors uh there's a" }, { "start": 7597.079, "text": "bunch of tokens not just so good" }, { "start": 7598.48, "text": "Magikarp that have that kind of a" }, { "start": 7600.28, "text": "behavior and so basically there's a" }, { "start": 7602.119, "text": "bunch of like trigger words and if you" }, { "start": 7604.159, "text": "ask the model about these trigger words" }, { "start": 7606.04, "text": "or you just include them in your prompt" }, { "start": 7608.04, "text": "the model goes haywire and has all kinds" }, { "start": 7610.0, "text": "of uh really Strange Behaviors including" }, { "start": 7612.8, "text": "sort of ones that violate typical safety" }, { "start": 7614.84, "text": "guidelines uh and the alignment of the" }, { "start": 7617.0, "text": "model like it's swearing back at you so" }, { "start": 7619.84, "text": "what is happening here and how can this" }, { "start": 7621.76, "text": "possibly be true well this again comes" }, { "start": 7624.559, "text": "down to tokenization so what's happening" }, { "start": 7626.719, "text": "here is that sold gold Magikarp if you" }, { "start": 7628.76, "text": "actually dig into it is a Reddit user so" }, { "start": 7631.719, "text": "there's a u Sol gold" }, { "start": 7634.04, "text": "Magikarp and probably what happened here" }, { "start": 7636.8, "text": "even though I I don't know that this has" }, { "start": 7638.0, "text": "been like really definitively explored" }, { "start": 7640.44, "text": "but what is thought to have happened is" }, { "start": 7643.159, "text": "that the tokenization data set was very" }, { "start": 7645.559, "text": "different from the training data set for" }, { "start": 7648.0, "text": "the actual language model so in the" }, { "start": 7649.92, "text": "tokenization data set there was a ton of" }, { "start": 7651.52, "text": "redded data potentially where the user" }, { "start": 7654.599, "text": "solid gold Magikarp was mentioned in the" }, { "start": 7656.4, "text": "text because solid gold Magikarp was a" }, { "start": 7659.199, "text": "very common um sort of uh person who" }, { "start": 7661.679, "text": "would post a lot uh this would be a" }, { "start": 7663.679, "text": "string that occurs many times in a" }, { "start": 7665.28, "text": "tokenization data set because it occurs" }, { "start": 7668.0, "text": "many times in a tokenization data set" }, { "start": 7670.0, "text": "these tokens would end up getting merged" }, { "start": 7671.48, "text": "to the single individual token for that" }, { "start": 7673.52, "text": "single Reddit user sold gold Magikarp so" }, { "start": 7676.4, "text": "they would have a dedicated token in a" }, { "start": 7678.36, "text": "vocabulary of was it 50,000 tokens in" }, { "start": 7680.719, "text": "gpd2 that is devoted to that Reddit user" }, { "start": 7684.119, "text": "and then what happens is the" }, { "start": 7685.599, "text": "tokenization data set has those strings" }, { "start": 7688.599, "text": "but then later when you train the model" }, { "start": 7690.92, "text": "the language model itself um this data" }, { "start": 7693.92, "text": "from Reddit was not present and so" }, { "start": 7696.679, "text": "therefore in the entire training set for" }, { "start": 7698.8, "text": "the language model sold gold Magikarp" }, { "start": 7701.28, "text": "never occurs that token never appears in" }, { "start": 7704.32, "text": "the training set for the actual language" }, { "start": 7705.84, "text": "model later so this token never gets" }, { "start": 7708.92, "text": "activated it's initialized at random in" }, { "start": 7711.04, "text": "the beginning of optimization then you" }, { "start": 7712.88, "text": "have forward backward passes and updates" }, { "start": 7714.48, "text": "to the model and this token is just" }, { "start": 7716.0, "text": "never updated in the embedding table" }, { "start": 7717.92, "text": "that row Vector never gets sampled it" }, { "start": 7720.0, "text": "never gets used so it never gets trained" }, { "start": 7722.04, "text": "and it's completely untrained it's kind" }, { "start": 7723.88, "text": "of like unallocated memory in a typical" }, { "start": 7726.4, "text": "binary program written in C or something" }, { "start": 7728.159, "text": "like that that so it's unallocated" }, { "start": 7730.0, "text": "memory and then at test time if you" }, { "start": 7731.84, "text": "evoke this token then you're basically" }, { "start": 7734.28, "text": "plucking out a row of the embedding" }, { "start": 7735.639, "text": "table that is completely untrained and" }, { "start": 7737.32, "text": "that feeds into a Transformer and" }, { "start": 7738.92, "text": "creates undefined behavior and that's" }, { "start": 7740.96, "text": "what we're seeing here this completely" }, { "start": 7742.159, "text": "undefined never before seen in a" }, { "start": 7743.88, "text": "training behavior and so any of these" }, { "start": 7746.559, "text": "kind of like weird tokens would evoke" }, { "start": 7748.0, "text": "this Behavior because fundamentally the" }, { "start": 7749.32, "text": "model is um is uh uh out of sample out" }, { "start": 7754.48, "text": "of distribution okay and the very last" }, { "start": 7756.76, "text": "thing I wanted to just briefly mention" }, { "start": 7758.52, "text": "point out although I think a lot of" }, { "start": 7759.679, "text": "people are quite aware of this is that" }, { "start": 7761.639, "text": "different kinds of formats and different" }, { "start": 7763.159, "text": "representations and different languages" }, { "start": 7765.0, "text": "and so on might be more or less" }, { "start": 7766.88, "text": "efficient with GPD tokenizers uh or any" }, { "start": 7769.8, "text": "tokenizers for any other L for that" }, { "start": 7771.4, "text": "matter so for example Json is actually" }, { "start": 7773.559, "text": "really dense in tokens and yaml is a lot" }, { "start": 7776.32, "text": "more efficient in tokens um so for" }, { "start": 7779.239, "text": "example this are these are the same in" }, { "start": 7781.32, "text": "Json and in yaml the Json is" }, { "start": 7784.599, "text": "116 and the yaml is 99 so quite a bit of" }, { "start": 7788.119, "text": "an Improvement and so in the token" }, { "start": 7791.639, "text": "economy where we are paying uh per token" }, { "start": 7793.639, "text": "in many ways and you are paying in the" }, { "start": 7795.679, "text": "context length and you're paying in um" }, { "start": 7797.639, "text": "dollar amount for uh the cost of" }, { "start": 7799.88, "text": "processing all this kind of structured" }, { "start": 7801.199, "text": "data when you have to um so prefer to" }, { "start": 7803.52, "text": "use theal over Json and in general kind" }, { "start": 7806.079, "text": "of like the tokenization density is" }, { "start": 7807.599, "text": "something that you have to um sort of" }, { "start": 7809.84, "text": "care about and worry about at all times" }, { "start": 7811.679, "text": "and try to find efficient encoding" }, { "start": 7813.4, "text": "schemes and spend a lot of time in tick" }, { "start": 7815.4, "text": "tokenizer and measure the different" }, { "start": 7816.88, "text": "token efficiencies of different formats" }, { "start": 7818.92, "text": "and settings and so on okay so that" }, { "start": 7821.0, "text": "concludes my fairly long video on" }, { "start": 7823.36, "text": "tokenization I know it's a try I know" }, { "start": 7825.96, "text": "it's annoying I know it's irritating I" }, { "start": 7828.44, "text": "personally really dislike the stage what" }, { "start": 7830.88, "text": "I do have to say at this point is don't" }, { "start": 7832.599, "text": "brush it off there's a lot of foot guns" }, { "start": 7834.96, "text": "sharp edges here security issues uh AI" }, { "start": 7838.119, "text": "safety issues as we saw plugging in" }, { "start": 7839.88, "text": "unallocated memory into uh language" }, { "start": 7842.079, "text": "models so um it's worth understanding" }, { "start": 7845.159, "text": "this stage um that said I will say that" }, { "start": 7848.48, "text": "eternal glory goes to anyone who can get" }, { "start": 7850.32, "text": "rid of it uh I showed you one possible" }, { "start": 7852.559, "text": "paper that tried to uh do that and I" }, { "start": 7854.679, "text": "think I hope a lot more can follow over" }, { "start": 7857.04, "text": "time and my final recommendations for" }, { "start": 7859.4, "text": "the application right now are if you can" }, { "start": 7861.44, "text": "reuse the GPT 4 tokens and the" }, { "start": 7863.04, "text": "vocabulary uh in your application then" }, { "start": 7865.0, "text": "that's something you should consider and" }, { "start": 7866.199, "text": "just use Tech token because it is very" }, { "start": 7867.84, "text": "efficient and nice library for inference" }, { "start": 7871.239, "text": "for bpe I also really like the bite" }, { "start": 7873.719, "text": "level BP that uh Tik toen and openi uses" }, { "start": 7877.32, "text": "uh if you for some reason want to train" }, { "start": 7879.04, "text": "your own vocabulary from scratch um then" }, { "start": 7882.679, "text": "I would use uh the bpe with sentence" }, { "start": 7885.0, "text": "piece um oops as I mentioned I'm not a" }, { "start": 7888.119, "text": "huge fan of sentence piece I don't like" }, { "start": 7890.679, "text": "its uh bite fallback and I don't like" }, { "start": 7893.92, "text": "that it's doing BP on unic code code" }, { "start": 7895.559, "text": "points I think it's uh it also has like" }, { "start": 7897.76, "text": "a million settings and I think there's a" }, { "start": 7899.119, "text": "lot of foot gonss here and I think it's" }, { "start": 7900.4, "text": "really easy to Mis calibrate them and" }, { "start": 7902.199, "text": "you end up cropping your sentences or" }, { "start": 7903.76, "text": "something like that uh because of some" }, { "start": 7905.8, "text": "type of parameter that you don't fully" }, { "start": 7907.28, "text": "understand so so be very careful with" }, { "start": 7909.44, "text": "the settings try to copy paste exactly" }, { "start": 7911.719, "text": "maybe where what meta did or basically" }, { "start": 7914.28, "text": "spend a lot of time looking at all the" }, { "start": 7916.119, "text": "hyper parameters and go through the code" }, { "start": 7917.48, "text": "of sentence piece and make sure that you" }, { "start": 7919.079, "text": "have this correct um but even if you" }, { "start": 7922.04, "text": "have all the settings correct I still" }, { "start": 7923.48, "text": "think that the algorithm is kind of" }, { "start": 7924.92, "text": "inferior to what's happening here and" }, { "start": 7927.679, "text": "maybe the best if you really need to" }, { "start": 7929.52, "text": "train your vocabulary maybe the best" }, { "start": 7931.32, "text": "thing is to just wait for M bpe to" }, { "start": 7933.159, "text": "becomes as efficient as possible and uh" }, { "start": 7936.84, "text": "that's something that maybe I hope to" }, { "start": 7938.159, "text": "work on and at some point maybe we can" }, { "start": 7940.8, "text": "be training basically really what we" }, { "start": 7942.88, "text": "want is we want tick token but training" }, { "start": 7944.96, "text": "code and that is the ideal thing that" }, { "start": 7947.84, "text": "currently does not exist and MBP is um" }, { "start": 7951.36, "text": "is in implementation of it but currently" }, { "start": 7953.239, "text": "it's in Python so that's currently what" }, { "start": 7955.88, "text": "I have to say for uh tokenization there" }, { "start": 7958.199, "text": "might be an advanced video that has even" }, { "start": 7960.4, "text": "drier and even more detailed in the" }, { "start": 7961.92, "text": "future but for now I think we're going" }, { "start": 7963.639, "text": "to leave things off here and uh I hope" }, { "start": 7966.76, "text": "that was helpful bye" }, { "start": 7974.119, "text": "and uh they increase this contact size" }, { "start": 7976.04, "text": "from gpt1 of 512 uh to 1024 and GPT 4" }, { "start": 7982.679, "text": "two the" }, { "start": 7985.44, "text": "next okay next I would like us to" }, { "start": 7987.639, "text": "briefly walk through the code from open" }, { "start": 7989.8, "text": "AI on the gpt2 encoded" }, { "start": 7995.84, "text": "ATP I'm sorry I'm gonna sneeze" }, { "start": 7999.119, "text": "and then what's Happening Here" }, { "start": 8001.84, "text": "is this is a spous layer that I will" }, { "start": 8004.639, "text": "explain in a" }, { "start": 8006.119, "text": "bit What's Happening Here" }, { "start": 8013.159, "text": "is" } ]