[
{
"num_chapter": 0,
"title": "Introduction to Tokenization",
"start_paragraph_number": 0,
"end_paragraph_number": 5,
"start_time": 0,
"end_time": 139,
"paragraphs": [
"Hi everyone. ",
"In this video, I'd like us to cover the process of tokenization in large language models. Now, you see here that I have a set face, and that's because tokenization is my least favorite part of working with large language models. Unfortunately, it is necessary to understand in some detail because it is fairly hairy and gnarly. There are a lot of hidden foot guns to be aware of, and much of the oddness with large language models typically traces back to tokenization. ",
"So, what is tokenization? In my previous video, \"Let's Build GPT from Scratch,\" we actually already did tokenization, but we did a very naive, simple version of it. When you go to the Google Colab for that video, you will see that we loaded our training set, which was the Shakespeare dataset. In the beginning, the Shakespeare dataset is just a large string in Python; it's just text. The question is, how do we plug text into large language models? ",
"In this case, we created a vocabulary of 65 possible characters that we saw occur in this string. These were the possible characters, and we saw that there are 65 of them. Then, we created a lookup table for converting every possible character, a little string piece, into a token, which is an integer. For example, we tokenized the string \"Hi\" and received a sequence of tokens. We took the first 1,000 characters of our dataset and encoded it into tokens. Because this is character-level tokenization, we received 1,000 tokens in a sequence, such as token 18, 47, etc. ",
"Later, we saw that the way we plug these tokens into the language model is by using an embedding table. If we have 65 possible tokens, then this embedding table is going to have 65 rows. Roughly speaking, we're taking the integer associated with every single token and using that as a lookup into this table. We pluck out the corresponding row, which contains trainable parameters that we're going to train using backpropagation. This is the vector that then feeds into the Transformer, and that's how the Transformer perceives every single token. "
],
"paragraph_timestamps": [
0,
0,
24,
56,
98
]
},
{
"num_chapter": 1,
"title": "Byte Pair Encoding and Advanced Tokenization",
"start_paragraph_number": 5,
"end_paragraph_number": 10,
"start_time": 139,
"end_time": 256,
"paragraphs": [
"Here, we had a very naive tokenization process that was a character-level tokenizer. However, in practice, in state-of-the-art language models, people use much more complicated schemes for constructing these token vocabularies. We are not dealing at the character level; we are dealing at the chunk level. The way these character chunks are constructed is using algorithms such as the Byte Pair Encoding (BPE) algorithm, which we're going to go into in detail and cover in this video. ",
"I'd like to briefly show you the paper that introduced Byte Level Encoding as a mechanism for tokenization in the context of large language models. I would say that this is probably the GPT-2 paper. If you scroll down to the section on input representation, this is where they cover tokenization and the kinds of properties that you'd like the tokenization to have. They conclude that they will have a tokenizer with a vocabulary of 50,257 possible tokens, and the context size will be 1,024 tokens. ",
"In the attention layer of the Transformer neural network, every single token attends to the previous tokens in the sequence and can see up to 1,024 tokens. Tokens are the fundamental unit, the atom of large language models, if you will. Everything is in units of tokens; everything is about tokens. Tokenization is the process for translating strings or text into sequences of tokens and vice versa. ",
"When you go into the Llama 2 paper, you will find that when you search for \"token,\" you will get 63 hits. This is because tokens are pervasive. They mention that they trained on two trillion tokens of data, and so on. ",
"We are going to build our own tokenizer. Luckily, the Byte Pair Encoding algorithm is not that complicated, and we can build it from scratch ourselves. We'll see exactly how this works. "
],
"paragraph_timestamps": [
139,
169,
207,
232,
246
]
},
{
"num_chapter": 2,
"title": "Challenges and Complexities of Tokenization",
"start_paragraph_number": 10,
"end_paragraph_number": 21,
"start_time": 256,
"end_time": 570,
"paragraphs": [
"Before we dive into code, I'd like to give you a brief taste of some of the complexities that come from tokenization. I want to ensure that we motivate why we are doing all this and why it is so important. Tokenization is at the heart of a lot of weirdness in large language models, and I would advise that you do not brush it off. Many issues that may look like problems with the neural network architecture or the large language model itself are actually issues with the tokenization and fundamentally trace back to it. ",
"If you've noticed any issues with large language models, such as their inability to perform spelling tasks easily, that's usually due to tokenization. Simple string processing can be difficult for large language models to perform natively. Additionally, non-English languages can work much worse, and to a large extent, this is due to tokenization. ",
"Sometimes, large language models are also bad at simple arithmetic, which can be traced back to tokenization. GPT-2 specifically would have had quite a bit more issues with Python than future versions due to tokenization. There are a lot of other issues; maybe you've seen weird warnings about a trailing whitespace\u2014this is a tokenization issue. ",
"If you had asked GPT earlier about solid gold Magikarp and what it is, you would see the LLM go totally crazy and start going off on a completely unrelated tangent topic. Maybe you've been told to use YAML over JSON in structured data; all of that has to do with tokenization. So basically, tokenization is at the heart of many issues. I will look back around to these at the end of the video, but for now, let me just skip over it a little bit and let's go to this web app, the Tik tokenizer bell.app. ",
"I have it loaded here, and what I like about this web app is that tokenization is running live in your browser in JavaScript. You can just type here, \"Hello, world,\" and the whole string tokenizes. Here, what we see on the left is the string that you put in, and on the right, we're currently using the GPT-2 tokenizer. We see that this string that I pasted here is currently tokenizing into 300 tokens, and here they are shown explicitly in different colors for every single token. ",
"For example, the word \"tokenization\" became two tokens: the token 3,642 and 1,634. The token for space is token 318, so be careful. On the bottom, you can show whitespace, and keep in mind that there are spaces and newline characters in here, but you can hide them for clarity. The token for space \"at\" is token 379, the token for space \"the\" is 262, etc. You notice here that the space is part of that token chunk now. ",
"This is kind of how our English sentence broke up, and that seems all well and good. Now, here I put in some arithmetic. We see that the token 127 plus the token 6 space 6 followed by 77. What's happening here is that 127 is feeding in as a single token into the large language model, but the number 677 will actually feed in as two separate tokens. The large language model has to take account of that and process it correctly in its network. ",
"See here, 804 will be broken up into two tokens, and it's all completely arbitrary. Here I have another example of four-digit numbers, and they break up in a way that is totally arbitrary. Sometimes you have multiple digits as a single token, and sometimes you have individual digits as many tokens. It's all kind of pretty arbitrary coming out of the tokenizer. ",
"Here's another example: we have the string \"egg,\" and you see here that this became two tokens. But for some reason, when I say \"I have an egg,\" you see that when it's a space \"egg,\" it's a single token. So just \"egg\" by itself at the beginning of a sentence is two tokens, but here, as a space \"egg,\" it's suddenly a single token for the exact same string. ",
"Okay, here, lowercase \"egg\" turns out to be a single token, and in particular, notice that the color is different. This is a different token, so this is case sensitive. Of course, a capital \"egg\" would also be different tokens, and again, this would be two tokens arbitrarily. For the same concept \"egg,\" depending on whether it's at the beginning of a sentence, at the end of a sentence, lowercase, uppercase, or mixed, all this will be basically very different tokens and different IDs. ",
"The language model has to learn from raw data from all the internet text that it's going to be training on that these are actually all the exact same concept. It has to sort of group them in the parameters of the neural network and understand just based on the data patterns that these are all very similar, but maybe not almost exactly similar, but very, very similar. "
],
"paragraph_timestamps": [
256,
287,
307,
328,
360,
393,
436,
471,
492,
518,
548
]
},
{
"num_chapter": 3,
"title": "Tokenization in Non-English Languages",
"start_paragraph_number": 21,
"end_paragraph_number": 26,
"start_time": 570,
"end_time": 681,
"paragraphs": [
"After the \"egg\" demonstration, I have an introduction from OpenAI's ChatGPT in Korean. So, \"manaso Pang,\" etc. This is in Korean, and the reason I put this here is because you'll notice that non-English languages work slightly worse in ChatGPT. Part of this is because, of course, the training dataset for ChatGPT is much larger for English than for everything else. ",
"But the same is true not just for the large language model itself but also for the tokenizer. When we train the tokenizer, we're going to see that there's a training set as well, and there's a lot more English than non-English. What ends up happening is that we're going to have a lot more longer tokens for English. ",
"So, how do I put this? If you have a single sentence in English and you tokenize it, you might see that it's 10 tokens or something like that. However, if you translate that sentence into, say, Korean or Japanese, you'll typically see that the number of tokens used is much larger. ",
"That's because the chunks here are a lot more broken up, so we're using a lot more tokens for the exact same thing. What this does is bloat up the sequence length of all the documents, meaning you're using up more tokens. In the attention mechanism of the Transformer, when these tokens try to attend to each other, you run out of context in the maximum context length of that Transformer. ",
"Basically, all the non-English text is stretched out from the perspective of the Transformer. This has to do with the training used for the tokenizer and the tokenization itself. It creates a lot bigger tokens and larger groups in English, while it has many little boundaries for all the other non-English text. If we translated this into English, it would result in significantly fewer tokens. "
],
"paragraph_timestamps": [
570,
597,
614,
631,
655
]
},
{
"num_chapter": 4,
"title": "Tokenization and Efficiency in Python",
"start_paragraph_number": 26,
"end_paragraph_number": 37,
"start_time": 681,
"end_time": 899,
"paragraphs": [
"The final example I have here is a little snippet of Python for doing FS buuz. What I'd like you to notice is that all these individual spaces are separate tokens; they are token 220. So, 220, 220, 220, 220, and then 'space' is a single token. ",
"What's going on here is that when the Transformer is going to consume or try to create this text, it needs to handle all these spaces individually. They all feed in one by one into the entire Transformer in the sequence, making this extremely wasteful. Tokenizing it in this way results in GPT-2 not being very good with Python. ",
"It's not anything to do with coding or the language model itself; it's just that if you use a lot of indentation using space in Python, like we usually do, you end up bloating out all the text. It's separated across too much of the sequence, and we are running out of the context length in the sequence. ",
"That's roughly speaking what's happening. We're being way too wasteful and taking up too much token space. Now, we can also scroll up here and change the tokenizer. Note that the GPT-2 tokenizer creates a token count of 300 for this string. If we change it to CL 100K base, which is the GPT-4 tokenizer, we see that the token count drops to 185. ",
"For the exact same string, we now have roughly half the number of tokens. This is because the number of tokens in the GPT-4 tokenizer is roughly double that of the number of tokens in the GPT-2 tokenizer. We went from roughly 50K to roughly 100K. ",
"You can imagine that this is a good thing because the same text is now squished into half as many tokens. This results in a lot denser input to the Transformer. In the Transformer, every single token has a finite number of tokens before it that it's going to pay attention to. ",
"With this change, we can roughly see twice as much text as context for predicting the next token. However, just increasing the number of tokens is not strictly better infinitely. As you increase the number of tokens, your embedding table gets larger, and at the output, we are trying to predict the next token, which also grows. ",
"We'll go into more detail on this later, but there is some kind of sweet spot where you have just the right number of tokens in your vocabulary, making everything appropriately dense and still fairly efficient. ",
"One thing I would like you to note specifically for the GPT-4 tokenizer is that the handling of whitespace for Python has improved significantly. Here, these four spaces are represented as one single token, and the three spaces are also grouped into a single token. In this case, seven spaces were all grouped into a single token. ",
"This improvement makes Python representation much more efficient. This was a deliberate choice made by OpenAI when they designed the GPT-4 tokenizer, grouping a lot more space into a single character. This densifies Python, allowing us to attend to more code when predicting the next token in the sequence. ",
"Thus, the improvement in Python coding ability from GPT-2 to GPT-4 is not just a matter of the language model and the architecture but also comes from the design of the tokenizer and how it groups characters into tokens. "
],
"paragraph_timestamps": [
681,
703,
725,
743,
764,
779,
798,
823,
838,
856,
883
]
},
{
"num_chapter": 5,
"title": "Understanding Unicode and Code Points",
"start_paragraph_number": 37,
"end_paragraph_number": 46,
"start_time": 899,
"end_time": 1091,
"paragraphs": [
"Now, let's start writing some code. Remember, we want to take strings and feed them into language models. For that, we need to tokenize strings into integers in a fixed vocabulary. Then, we will use those integers to look up vectors and feed those vectors into the Transformer as input. ",
"The reason this gets a little tricky is that we don't just want to support the simple English alphabet; we want to support different kinds of languages. For example, this is '\uc548\ub155\ud558\uc138\uc694' in Korean, which means hello. We also want to support many kinds of special characters that we might find on the internet, such as emojis. ",
"So, how do we feed this text into Transformers? What is this text, anyway, in Python? If you go to the documentation for a string in Python, you can see that strings are immutable sequences of Unicode code points. ",
"Okay, what are Unicode code points? We can refer to the PDF. Unicode code points are defined by the Unicode Consortium as part of the Unicode standard. Essentially, this is a definition of roughly 150,000 characters right now, and it describes what these characters look like and what integers represent those characters. As of now, there are 150,000 characters across 161 scripts. ",
"If you scroll down, you can see that the standard is very much alive, with the latest standard being 15.1, released in September 2023. This standard is a way to define many types of characters, including all these characters across different scripts. ",
"The way we can access the Unicode code point for a single character is by using the `ord` function in Python. For example, if I pass in `ord('H')`, I can see that for the single character 'H', the Unicode code point is 104. ",
"However, this can be a bit complicated. For example, if we take an emoji, we can see that the code point for it is 128,000, or we can take 'un', which is 50,000. Keep in mind that you can't plug in strings here because this function only takes a single Unicode code point character and tells you its integer. ",
"In this way, we can look up all the characters of a specific string and their code points using `ord(x) for x in this_string`, and we get this encoding. Now, we've already turned the raw code points into integers. ",
"So, why can't we simply use these integers and not have any tokenization at all? Why can't we just use the code points natively as is? One reason is that the vocabulary would be quite long. In this case, for Unicode, this is a vocabulary of 150,000 different code points. More worryingly, the Unicode standard is very much alive and keeps changing, so it's not a stable representation that we may want to use directly. "
],
"paragraph_timestamps": [
899,
919,
935,
949,
980,
1002,
1018,
1044,
1062
]
},
{
"num_chapter": 6,
"title": "Understanding Unicode Encodings",
"start_paragraph_number": 46,
"end_paragraph_number": 60,
"start_time": 1091,
"end_time": 1420,
"paragraphs": [
"For those reasons, we need something a bit better. To find something better, we turn to encodings. If we go to the Wikipedia page, we see that the Unicode Consortium defines three types of encodings: UTF-8, UTF-16, and UTF-32. These encodings are the way by which we can take Unicode text and translate it into binary data or byte streams. ",
"UTF-8 is by far the most common encoding. This Wikipedia page is quite long, but what's important for our purposes is that UTF-8 takes every single code point and translates it to a byte stream. This byte stream is between one to four bytes, making it a variable-length encoding. Depending on the Unicode point according to the schema, you will end up with between one to four bytes for each code point. ",
"On top of that, there are UTF-16 and UTF-32. UTF-32 is nice because it is fixed length instead of variable length, but it has many other downsides as well. The full spectrum of pros and cons of these different encodings is beyond the scope of this video. ",
"I just want to point out that I enjoyed this blog post, which also has a number of references that can be quite useful. One of them is the UTF-8 Everywhere Manifesto, which describes why UTF-8 is significantly preferred and why it is used more prominently on the internet. One major advantage is that UTF-8 is the only one of these encodings that is backward compatible with the much simpler ASCII encoding of text. ",
"However, I'm not going to go into full detail in this video. Suffice it to say that we like the UTF-8 encoding. Let's try to take a string and see what we get if we encode it into UTF-8. The string class in Python actually has an `encode` method, and you can give it the encoding, which is, say, UTF-8. ",
"What we get out of this is not very nice because it is a bytes object, and it's not very nice in the way that it's printed. I personally like to take it through a list because then we actually get the raw bytes of this encoding. This is the raw bytes that represent this string according to the UTF-8 encoding. ",
"We can also look at UTF-16, and we get a slightly different byte stream. Here, we start to see one of the disadvantages of UTF-16. You see how we have zeros followed by something, indicating that this is a bit of a wasteful encoding. Indeed, for simple ASCII characters or English characters, we just have the structure of zeros followed by something, which is not exactly nice. ",
"The same goes for UTF-32. When we expand this, we can start to see the wastefulness of this encoding for our purposes. You see a lot of zeros followed by something, which is not desirable. ",
"So, suffice it to say that we would like to stick with UTF-8 for our purposes. However, if we just use UTF-8 naively, these are byte streams, which would imply a vocabulary length of only 256 possible tokens. This vocabulary size is very small. ",
"What this means is that if we were to use it naively, all of our text would be stretched out over very long sequences of bytes. Consequently, the embedding table would be tiny, and the prediction at the final layer would also be very small. However, our sequences are very long, and we have a pretty finite context length and attention that we can support in a transformer for computational reasons. ",
"We only have as much context length, but now we have very long sequences. This is inefficient and will not allow us to attend to sufficiently long text for the purposes of the next token prediction task. Therefore, we don't want to use the raw bytes of the UTF-8 encoding. We want to support a larger vocabulary size that we can tune as a hyperparameter, while still sticking with the UTF-8 encoding of these strings. ",
"So, what do we do? The answer, of course, is to turn to the Byte Pair Encoding (BPE) algorithm, which will allow us to compress these byte sequences to a variable amount. We'll get to that in a bit, but I just want to briefly mention that I would love nothing more than to be able to feed raw byte sequences into language models. In fact, there's a paper about how this could potentially be done from summer last year. ",
"The problem is that you actually have to modify the transformer architecture because, as I mentioned, you're going to have a problem where the attention will start to become extremely expensive due to the long sequences. In this paper, they propose a kind of hierarchical structuring of the transformer that could allow you to feed in raw bytes. At the end, they state that together, these results establish the viability of tokenization-free autoregressive sequence modeling at scale. ",
"Tokenization-free would indeed be amazing; we would just feed byte streams directly into our models. Unfortunately, I don't know that this has really been proven out yet by sufficiently many groups at a sufficient scale. However, something like this would be amazing at one point, and I hope someone comes up with it. "
],
"paragraph_timestamps": [
1091,
1117,
1143,
1165,
1195,
1213,
1238,
1258,
1272,
1294,
1323,
1348,
1378,
1405
]
},
{
"num_chapter": 7,
"title": "Introduction to Byte Pair Encoding",
"start_paragraph_number": 60,
"end_paragraph_number": 68,
"start_time": 1420,
"end_time": 1606,
"paragraphs": [
"For now, we have to come back and cannot feed this directly into language models. We have to compress it using the Byte Pair Encoding algorithm. So, let's see how that works. ",
"As I mentioned, the Byte Pair Encoding algorithm is not all that complicated, and the Wikipedia page is quite instructive regarding the basic idea. What we're doing is starting with some kind of input sequence. For example, here we have only four elements in our vocabulary: A, B, C, and D. We have a sequence of them. Instead of bytes, let's say we just have a vocabulary size of four. The sequence is too long, and we'd like to compress it. ",
"What we do is iteratively find the pair of tokens that occur most frequently. Once we've identified that pair, we replace it with just a single new token that we append to our vocabulary. For example, if the byte pair AA occurs most often, we mint a new token, let's call it capital Z, and replace every occurrence of AA with Z. ",
"Now we have two Z's. We took a sequence of 11 characters with a vocabulary size of four and converted it to a sequence of only nine tokens, but now with a vocabulary of five because we created a fifth vocabulary element, Z, standing for the concatenation of AA. We can repeat this process. ",
"We again look at the sequence and identify the pair of tokens that are most frequent. Let's say that is now AB. We will replace AB with a new token that we mint, calling it Y. So Y becomes AB, and every occurrence of AB is now replaced with Y. ",
"Now we have a sequence of 7 characters, but we have not just four vocabulary elements or five; we now have six. For the final round, we again look through the sequence, find that the pair ZY is most common, and replace it one more time with another character, let's say X. So X is ZY, and we replace all occurrences of ZY. ",
"After going through this process, instead of having a sequence of 11 tokens with a vocabulary length of four, we now have a sequence of five tokens, but our vocabulary length is now seven. In this way, we can iteratively compress our sequence by minting new tokens. ",
"In the exact same way, we start out with byte sequences. We have a vocabulary size of 256, but we are now going to go through these and find the byte pairs that occur most frequently. We will iteratively start minting new tokens, appending them to our vocabulary and replacing elements. In this way, we can efficiently manage our sequences."
],
"paragraph_timestamps": [
1420,
1429,
1458,
1486,
1508,
1533,
1561,
1588
]
},
{
"num_chapter": 8,
"title": "Introduction to Tokenization and Encoding",
"start_paragraph_number": 68,
"end_paragraph_number": 86,
"start_time": 1606,
"end_time": 2049,
"paragraphs": [
"We're going to end up with a compressed training data set and also an algorithm for taking any arbitrary sequence and encoding it using this vocabulary, and also decoding it back to strings. So, let's now implement all that.",
"Here\u2019s what I did: I went to this blog post that I enjoyed, and I took the first paragraph and copy-pasted it here into text. This is one very long line. Now, to get the tokens, as I mentioned, we just take our text and encode it into UTF-8. The tokens here at this point will be a raw byte stream of bytes. Just so that it's easier to work with, instead of just a bytes object, I'm going to convert all those bytes to integers and then create a list of it, just so it's easier for us to manipulate and work with in Python and visualize.",
"Here, I'm printing all of that. This is the original paragraph, and its length is 533 code points. Then here are the bytes encoded in UTF-8, and we see that this has a length of 616 bytes, or 616 tokens. The reason this is more is that a lot of these simple ASCII characters become a single byte, but many of these Unicode more complex characters become multiple bytes, up to four. So, we are expanding that size.",
"Now, what we'd like to do as a first step of the algorithm is iterate over here and find the pair of bytes that occur most frequently because we're then going to merge it. If you are working along on a notebook on the side, then I encourage you to click on the link, find this notebook, and try to write that function yourself. Otherwise, I'm going to come here and implement first the function that finds the most common pair.",
"Okay, so here's what I came up with. There are many different ways to implement this, but I'm calling the function `get_stats`. It expects a list of integers. I'm using a dictionary to keep track of the counts, and this is a Pythonic way to iterate over consecutive elements of this list, which we covered in the previous video. Here, I'm just keeping track of incrementing by one for all the pairs. ",
"If I call this on all the tokens here, then the stats come out as a dictionary. The keys are these tuples of consecutive elements, and this is the count. Just to print it in a slightly better way, this is one way that I like to do that. It\u2019s a little bit complex, so you can pause if you like, but we iterate over all the items. The items called on the dictionary return pairs of key-value, and instead, I create a list here of value-key because if it's a value-key list, then I can call sort on it. By default, Python will use the first element, which in this case will be value, to sort by if it's given tuples, and then reverse it so it's descending.",
"So, it looks like (101, 32) was the most commonly occurring consecutive pair, and it occurred 20 times. We can double-check that it makes reasonable sense. If I just search for 10132, then you see that these are the 20 occurrences of that pair. If we'd like to take a look at what exactly that pair is, we can use `chr`, which is the opposite of `ord` in Python. We give it a Unicode code point, so 101 and 32, and we see that this is 'e' and 'space'. ",
"So, basically, there's a lot of 'e space' here, meaning that many of these words seem to end with 'e'. Here's 'eace' as an example, so there's a lot of that going on here, and this is the most common pair. ",
"Now that we've identified the most common pair, we would like to iterate over this sequence. We're going to mint a new token with the ID of 256, right? Because these tokens currently go from 0 to 255. So, when we create a new token, it will have an ID of 256. We're going to iterate over this entire list, and every time we see (101, 32), we're going to swap that out for 256. ",
"Let's implement that now, and feel free to do that yourself as well. First, I commented this just so we don't pollute the notebook too much. This is a nice way of in Python obtaining the highest ranking pair. We're basically calling the `max` on this dictionary `stats`, and this will return the maximum key. ",
"Then the question is how does it rank keys? You can provide it with a function that ranks keys, and that function is just `stats.get`. This would basically return the value, and so we're ranking by the value and getting the maximum key, which is (101, 32) as we saw. ",
"Now, to actually merge (101, 32), this is the function that I wrote, but again, there are many different versions of it. We're going to take a list of IDs and the pair that we want to replace, and that pair will be replaced with the new index `idx`. So, iterating through IDs, if we find the pair, we swap it out for `idx`. ",
"We create this new list, and then we start at zero and go through this entire list sequentially from left to right. Here, we are checking for equality at the current position with the pair. ",
"We are rechecking that the pair matches. Now, here is a bit of a tricky condition that you have to append if you're trying to be careful, and that is that you don't want this to be out of bounds at the very last position when you're on the rightmost element of this list. Otherwise, this would give you an out-of-bounds error. So, we have to make sure that we're not at the very last element. This would be false for that. ",
"If we find a match, we append to this new list that replacement index and increment the position by two, so we skip over that entire pair. Otherwise, if we haven't found a matching pair, we just sort of copy over the element at that position and increment by one. Then we return this. ",
"Here\u2019s a very small toy example: if we have a list [566, 791] and we want to replace the occurrences of 67 with 99, then calling this on that will give us what we're asking for. So here, the 67 is replaced with 99. ",
"Now, I'm going to uncomment this for our actual use case where we want to take our tokens. We want to take the top pair here and replace it with 256 to get tokens. If we run this, we get the following: Recall that previously we had a length of 616 in this list, and now we have a length of 596. Right? So this decreased by 20, which makes sense because there are 20 occurrences. ",
"Moreover, we can try to find 256 here, and we see plenty of occurrences of it. Furthermore, just to double-check, there should be no occurrence of 10132. So, in the original array, there are plenty of them, and in the second array, there are no occurrences of 1032. Thus, we've successfully merged this single pair. "
],
"paragraph_timestamps": [
1606,
1621,
1660,
1692,
1714,
1736,
1788,
1823,
1834,
1865,
1884,
1907,
1926,
1941,
1964,
1983,
1999,
2030
]
},
{
"num_chapter": 9,
"title": "Iterative Merging of Byte Pairs",
"start_paragraph_number": 86,
"end_paragraph_number": 90,
"start_time": 2049,
"end_time": 2096,
"paragraphs": [
"Now, we just iterate this. We are going to go over the sequence again, find the most common pair, and replace it. So let me now write a while loop that uses these functions to do this iteratively. ",
"How many times do we do it? Well, that's totally up to us as a hyperparameter. The more steps we take, the larger our vocabulary will be and the shorter our sequence. There is some sweet spot that we usually find works best in practice, and so this is kind of a hyperparameter that we tune to find good vocabulary sizes. ",
"As an example, GPT-4 currently uses roughly 100,000 tokens, and those are reasonable numbers currently for large language models. ",
"Let me now write, putting it all together and iterating these steps. "
],
"paragraph_timestamps": [
2049,
2062,
2082,
2091
]
},
{
"num_chapter": 10,
"title": "Introduction to Byte Pair Encoding",
"start_paragraph_number": 90,
"end_paragraph_number": 102,
"start_time": 2096,
"end_time": 2361,
"paragraphs": [
"Okay, now before we dive into the while loop, I wanted to add one more cell here where I went to the blog post. Instead of grabbing just the first paragraph or two, I took the entire blog post and stretched it out in a single line. Basically, using longer text will allow us to have more representative statistics for the byte pairs, and we'll just get more sensible results out of it because it's longer text. ",
"So here we have the raw text. We encode it into bytes using the UTF-8 encoding. Then, as before, we are just changing it into a list of integers in Python, just so it's easier to work with instead of the raw byte objects. ",
"Then this is the code that I came up with to actually do the merging in a loop. These two functions here are identical to what we had above. I only included them here just so that you have a point of reference. ",
"These two are identical, and then this is the new code that I added. The first thing we want to do is decide on the final vocabulary size that we want our tokenizer to have. As I mentioned, this is a hyperparameter, and you set it in some way depending on your best performance. ",
"So let's say for us we're going to use 276 because that way we're going to be doing exactly 20 merges. We need 20 merges because we already have 256 tokens for the raw bytes, and to reach 276, we have to do 20 merges to add 20 new tokens. ",
"This is one way in Python to create a copy of a list. I'm taking the tokens list, and by wrapping it in a list, Python will construct a new list of all the individual elements. So this is just a copy operation. ",
"Then here, I'm creating a merges dictionary. This merges dictionary is going to maintain basically the child one, child two mapping to a new token. What we're going to be building up here is a binary tree of merges, but actually, it's not exactly a tree because a tree would have a single root node with a bunch of leaves. ",
"For us, we're starting with the leaves on the bottom, which are the individual bytes. Those are the starting 256 tokens, and then we're starting to merge two of them at a time. So it's not a tree; it's more like a forest as we merge these elements. ",
"For 20 merges, we're going to find the most commonly occurring pair. We're going to mint a new token integer for it. So here, we'll start at zero. We'll begin at 256, print that we're merging it, and replace all occurrences of that pair with the new token. We'll record that this pair of integers merged into this new integer.",
"Running this gives us the following output: we did 20 merges, and for example, the first merge was exactly as before, with the tokens 101 and 32 merging into a new token, 2556. Now, keep in mind that the individual tokens 101 and 32 can still occur in the sequence after merging; it's only when they occur exactly consecutively that they become 256. ",
"In particular, the other thing to notice here is that the token 256, which is the newly minted token, is also eligible for merging. So here at the bottom, the 20th merge was a merge of 25 and 259, becoming 275. Every time we replace these tokens, they become eligible for merging in the next round of data iteration. That's why we're building up a small sort of binary forest instead of a single individual tree.",
"One thing we can take a look at as well is the compression ratio that we've achieved. In particular, we started off with this tokens list, which was 24,000 bytes, and after merging 20 times, we now have only 19,000 tokens. Therefore, the compression ratio, simply dividing the two, is roughly 1.27. This is the amount of compression we were able to achieve with this text using only 20 merges. Of course, the more vocabulary elements you add, the greater the compression ratio would be."
],
"paragraph_timestamps": [
2096,
2119,
2134,
2147,
2164,
2186,
2200,
2221,
2240,
2263,
2294,
2322
]
},
{
"num_chapter": 11,
"title": "Training the Tokenizer",
"start_paragraph_number": 102,
"end_paragraph_number": 114,
"start_time": 2361,
"end_time": 2557,
"paragraphs": [
"Finally, this is kind of like the training of the tokenizer, if you will. One point I wanted to make is that, and maybe this is a diagram that can help illustrate, the tokenizer is a completely separate object from the large language model itself. Everything in this lecture does not really touch the LLM; we're just training the tokenizer. This is a completely separate pre-processing stage.",
"Usually, the tokenizer will have its own training set, just like a large language model has a potentially different training set. The tokenizer has a training set of documents on which you're going to train it. We are performing the Byte Pair Encoding algorithm, as we saw above, to train the vocabulary of this tokenizer. ",
"So, it has its own training set; it is a pre-processing stage that you would run a single time in the beginning. The tokenizer is trained using the Byte Pair Encoding algorithm. Once you have the tokenizer, once it's trained and you have the vocabulary and the merges, we can do both encoding and decoding.",
"The tokenizer acts as a translation layer between raw text, which is, as we saw, the sequence of Unicode code points. It can take raw text and turn it into a token sequence, and vice versa; it can take a token sequence and translate it back into raw text. ",
"Now that we have trained the tokenizer and we have these merges, we are going to turn to how we can do the encoding and decoding step. If you give me text, here are the tokens, and vice versa, if you give me tokens, here's the text. Once we have that, we can translate between these two realms. ",
"The language model is going to be trained as a step two afterwards. Typically, in a state-of-the-art application, you might take all of your training data for the language model and run it through the tokenizer, translating everything into a massive token sequence. Then, you can throw away the raw text; you're just left with the tokens themselves, and those are stored on disk. That is what the large language model is actually reading when it's training on them.",
"This is one approach that you can take as a single massive pre-processing step. ",
"So, basically, I think the most important thing I want to get across is that this is a completely separate stage. It usually has its own entire training set. You may want to have those training sets be different between the tokenizer and the large language model. ",
"For example, when you're training the tokenizer, as I mentioned, we don't just care about the performance of English text; we care about many different languages. We also care about code or not code. You may want to look into different kinds of mixtures of different languages and different amounts of code. ",
"The amount of different language that you have in your tokenizer training set will determine how many merges there will be, and therefore that determines the density with which this type of data has in the token space. ",
"Roughly speaking, intuitively, if you add some amount of data, like say you have a ton of Japanese data in your tokenizer training set, that means that more Japanese tokens will get merged, and therefore Japanese will have shorter sequences. This will be beneficial for the large language model, which has a finite context length on which it can work in the token space. ",
"Hopefully, that makes sense. "
],
"paragraph_timestamps": [
2361,
2384,
2400,
2422,
2438,
2456,
2481,
2488,
2501,
2519,
2535,
2554
]
},
{
"num_chapter": 12,
"title": "Encoding and Decoding Overview",
"start_paragraph_number": 114,
"end_paragraph_number": 125,
"start_time": 2557,
"end_time": 2901,
"paragraphs": [
"Now, we are going to turn to encoding and decoding now that we have trained a tokenizer. We have our merges, and now how do we do encoding and decoding? ",
"Okay, so let's begin with decoding, which is this arrow over here. Given a token sequence, let's go through the tokenizer to get back a Python string object, the raw text. This is the function that we would like to implement. We are given a list of integers, and we want to return a Python string. If you'd like, try to implement this function yourself; it's a fun exercise. Otherwise, I'm going to start pasting in my own solution. ",
"There are many different ways to do it. Here's one way: I will create a kind of pre-processing variable that I will call `vocab`. `vocab` is a mapping or a dictionary in Python from the token ID to the bytes object for that token. We begin with the raw bytes for tokens from 0 to 255, and then we go in order of all the merges. We sort of populate this vocab list by doing an addition here. This is basically the bytes representation of the first child followed by the second one. Remember, these are bytes objects, so this addition here is an addition of two bytes objects, just concatenation. ",
"One tricky thing to be careful with, by the way, is that I'm iterating a dictionary in Python using `.items()`, and it really matters that this runs in the order in which we inserted items into the merges dictionary. Luckily, starting with Python 3.7, this is guaranteed to be the case. But before Python 3.7, this iteration may have been out of order with respect to how we inserted elements into merges, and this may not have worked. But we are using modern Python, so we're okay. ",
"Then here, given the IDs, the first thing we're going to do is get the tokens. The way I implemented this here is by iterating over all the IDs and using `vocab` to look up their bytes. Then, this is one way in Python to concatenate all these bytes together to create our tokens. At this point, these tokens are raw bytes, so I have to decode using UTF-8 back into Python strings. Previously, we called `encode` on a string object to get the bytes, and now we're doing the opposite: we're taking the bytes and calling `decode` on the bytes object to get a string in Python, and then we can return the text. ",
"Now, this actually has an issue in the way I implemented it, and this could actually throw an error. So try to figure out why this code could result in an error if we plug in some sequence of IDs that is unlucky. Let me demonstrate the issue. When I try to decode something like 97, I am going to get the letter A back, so nothing too crazy happening. But when I try to decode 128 as a single element, the token 128 is what in string or in Python object Unicode decoder UTF-8 can't decode by 0x80, which is this in HEX in position zero: invalid start byte. ",
"What does that mean? Well, to understand what this means, we have to go back to our UTF-8 page that I briefly showed earlier. This is Wikipedia's UTF-8, and basically, there's a specific schema that UTF-8 bytes take. In particular, if you have a multi-byte object for some of the Unicode characters, they have to have this special sort of envelope in how the encoding works. ",
"What's happening here is that the invalid start byte is because 128, the binary representation of it, is one followed by all zeros. So we have one and then all zeros, and we see here that that doesn't conform to the format because one followed by all zeros just doesn't fit any of these rules, so to speak. It's an invalid start byte, which is byte one. This one must have a one following it and then a zero following it, and then the content of your Unicode in hex here. ",
"So basically, we don't exactly follow the UTF-8 standard, and this cannot be decoded. The way to fix this is to use this `errors='replace'` in the `bytes.decode` function of Python. By default, errors is strict, so we will throw an error if it's not valid UTF-8 bytes encoding. But there are many different things that you could put here for error handling. This is the full list of all the errors that you can use, and in particular, instead of strict, let's change it to replace. This will replace invalid sequences with this special marker, the replacement character. ",
"So now we just get that character back. Basically, not every single byte sequence is valid UTF-8, and if it happens that your large language model, for example, predicts your tokens in a bad manner, then they might not fall into valid UTF-8, and then we won't be able to decode them. The standard practice is to use `errors='replace'`, and this is what you will also find in the OpenAI code that they released as well. ",
"But basically, whenever you see this kind of character in your output, in that case, something went wrong, and the LM output was not a valid sequence of tokens. "
],
"paragraph_timestamps": [
2557,
2566,
2591,
2636,
2666,
2718,
2759,
2782,
2817,
2858,
2892
]
},
{
"num_chapter": 13,
"title": "Implementing Token Encoding",
"start_paragraph_number": 125,
"end_paragraph_number": 146,
"start_time": 2901,
"end_time": 3309,
"paragraphs": [
"Okay, and now we're going to go the other way. We are going to implement this arrow right here, where we are given a string and we want to encode it into tokens. This is the signature of the function that we're interested in, and this should basically print a list of integers of the tokens. ",
"So again, try to maybe implement this yourself if you'd like a fun exercise, and pause here. Otherwise, I'm going to start putting in my solution. There are many ways to do this, and this is one of the ways that I came up with. ",
"The first thing we're going to do is take our text, encode it into UTF-8 to get the raw bytes, and then, as before, we're going to call `list` on the bytes object to get a list of integers of those bytes. Those are the starting tokens; those are the raw bytes of our sequence. ",
"Now, of course, according to the merges dictionary above, some of the bytes may be merged according to this lookup. In addition to that, remember that the merges were built from top to bottom, and this is the order in which we inserted stuff into merges. We prefer to do all these merges in the beginning before we do these merges later because, for example, this merge over here relies on the 256, which got merged here. ",
"So we have to go in the order from top to bottom if we are going to be merging anything. Now we expect to be doing a few merges, so we're going to be doing W true. Now we want to find a pair of bytes that is consecutive and that we are allowed to merge according to this. ",
"In order to reuse some of the functionality that we've already written, I'm going to reuse the function `get_stats`. Recall that `get_stats` will give us the count of how many times every single pair occurs in our sequence of tokens and return that as a dictionary. The dictionary is a mapping from all the different byte pairs to the number of times that they occur. ",
"At this point, we don't actually care how many times they occur in the sequence; we only care about what the raw pairs are in that sequence. So I'm only going to be using basically the keys of the dictionary. I only care about the set of possible merge candidates, if that makes sense. ",
"Now we want to identify the pair that we're going to be merging at this stage of the loop. We want to find the pair, or a key inside `stats`, that has the lowest index in the merges dictionary because we want to do all the early merges before we work our way to the late merges. ",
"Again, there are many different ways to implement this, but I'm going to do something a little bit fancy here. I'm going to be using the `min` function over an iterator in Python. When you call `min` on an iterator and `stats` here as a dictionary, we're going to be iterating the keys of this dictionary in Python. ",
"So we're looking at all the pairs inside `stats`, which are all the consecutive pairs, and we're going to be taking the consecutive pair inside tokens that has the minimum. The `min` function takes a key, which gives us the function that is going to return a value over which we're going to do the min. The one we care about is taking `merges` and basically getting that pair's index. ",
"For any pair inside `stats`, we are going to be looking into `merges` at what index it has, and we want to get the pair with the minimum number. As an example, if there's a pair (101, 32), we definitely want to get that pair. We want to identify it here and return it, and the pair would become 10132 if it occurs. ",
"The reason that I'm putting a float `INF` here as a fallback is that in the `get` function, when we consider a pair that doesn't occur in the merges, then that pair is not eligible to be merged. If in the token sequence there's some pair that is not a merging pair, it cannot be merged, and it doesn't actually occur here. It doesn't have an index and cannot be merged, which we will denote as float `INF`. ",
"The reason infinity is nice here is that we're guaranteed that it's not going to participate in the list of candidates when we do the min. So this is one way to do it. Long story short, this returns the most eligible merging candidate pair that occurs in the tokens. ",
"Now, one thing to be careful with here is that this function might fail in the following way: if there's nothing to merge, then there's nothing in merges that is satisfied anymore. There's nothing to merge; everything just returns float `INF`, and then the pair will just become the very first element of `stats`. ",
"However, this pair is not actually a mergeable pair; it just becomes the first pair inside `stats` arbitrarily because all of these pairs evaluate to float `INF` for the merging criterion. ",
"So basically, it could be that this doesn't succeed because there are no more merging pairs. If this pair is not in merges that were returned, then this is a signal for us that actually there was nothing to merge. No single pair can be merged anymore. In that case, we will break out. ",
"If else can be merged, you may come up with a different implementation. By the way, this is kind of like really trying hard in Python, but really we're just trying to find a pair that can be merged with the lowest index. ",
"Now, if we did find a pair that is inside merges with the lowest index, then we can merge it. So we're going to look into the merger dictionary for that pair to look up the index, and we're going to now merge that into that index. ",
"So we're going to do `tokens =` and we're going to replace the original tokens. We're going to be replacing the pair `pair` and we're going to be replacing it with `index idx`. This returns a new list of tokens where every occurrence of `pair` is replaced with `idx`. ",
"So we're doing a merge, and we're going to be continuing this until eventually nothing can be merged. We'll come out here and we'll break out, and here we just return tokens. ",
"So that's the implementation, I think. Hopefully, this runs okay. Cool. Yeah, and this looks reasonable. For example, 32 is a space in ASCII, so that's here. This looks like it worked great. "
],
"paragraph_timestamps": [
2901,
2919,
2935,
2959,
2982,
3001,
3028,
3044,
3061,
3082,
3115,
3135,
3164,
3179,
3204,
3216,
3235,
3247,
3267,
3284,
3293
]
},
{
"num_chapter": 14,
"title": "Handling Special Cases in Encoding",
"start_paragraph_number": 146,
"end_paragraph_number": 153,
"start_time": 3309,
"end_time": 3410,
"paragraphs": [
"Okay, so let's wrap up this section of the video. At least I wanted to point out that this is not quite the right implementation just yet because we are leaving out a special case. ",
"In particular, if we try to do this, it would give us an error. The issue is that if we only have a single character or an empty string, then `stats` is empty, and that causes an issue inside `min`. ",
"One way to fight this is if `L of tokens` is at least two, because if it's less than two, it's just a single token or no tokens. Then let's just say there's nothing to merge, so we just return. That would fix that case. ",
"Okay, and then second, I have a few test cases here for us as well. So first, let's make sure, or let's note the following: if we take a string and we try to encode it and then decode it back, you'd expect to get the same string back, right? Is that true for all strings? ",
"So I think here it is the case, and I think in general this is probably the case. But notice that going backwards is not going to have an identity going backwards because, as I mentioned, not all token sequences are valid UTF-8 streams. Therefore, some of them can't even be decodable. ",
"So this only goes in one direction, but for that one direction, we can check here if we take the training text, which is the text that we train the tokenizer around. We can make sure that when we encode and decode, we get the same thing back, which is true. ",
"Here, I took some validation data. I went to, I think, this web page and I grabbed some text. This is text that the tokenizer has not seen, and we can make sure that this also works. "
],
"paragraph_timestamps": [
3309,
3315,
3327,
3342,
3365,
3385,
3399
]
},
{
"num_chapter": 15,
"title": "Introduction to Byte Pair Encoding",
"start_paragraph_number": 153,
"end_paragraph_number": 157,
"start_time": 3410,
"end_time": 3455,
"paragraphs": [
"Okay, so that gives us some confidence that this was correctly implemented. So those are the basics of the byte pair encoding algorithm. We saw how we can take some training set to train a tokenizer. ",
"The parameters of this tokenizer really are just this dictionary of merges, and that basically creates the little binary forest on top of raw bytes. Once we have this merges table, we can both encode and decode between raw text and token sequences. ",
"So that's the simplest setting of the tokenizer. What we're going to do now, though, is we're going to look at some of the state-of-the-art language models and the kinds of tokenizers that they use. ",
"We're going to see that this picture complexifies very quickly, so we're going to go through the details of this complexification one at a time. "
],
"paragraph_timestamps": [
3410,
3421,
3437,
3447
]
},
{
"num_chapter": 16,
"title": "Exploring GPT-2 Tokenization",
"start_paragraph_number": 157,
"end_paragraph_number": 165,
"start_time": 3455,
"end_time": 3555,
"paragraphs": [
"So let's kick things off by looking at the GPT series. In particular, I have the GPT-2 paper here. This paper is from 2019 or so, so five years ago. ",
"Let's scroll down to input representation. This is where they talk about the tokenizer that they're using for GPT-2. Now, this is all fairly readable, so I encourage you to pause and read this yourself. ",
"This is where they motivate the use of the byte pair encoding algorithm on the byte-level representation of UTF-8 encoding. This is where they motivate it, and they talk about the vocabulary sizes and everything. ",
"Now, everything here is exactly as we've covered it so far, but things start to depart around here. What they mention is that they don't just apply the naive algorithm as we have done it. ",
"In particular, here's an example: suppose that you have common words like \"dog.\" What will happen is that \"dog,\" of course, occurs very frequently in the text, and it occurs right next to all kinds of punctuation, as an example. ",
"So \"dog,\" \"dot,\" \"dog,\" \"exclamation mark,\" \"dog,\" \"question mark,\" etc. Naively, you might imagine that the byte pair algorithm could merge these to be single tokens, and then you end up with lots of tokens that are just like \"dog\" with slightly different punctuation. ",
"It feels like you're clustering things that shouldn't be clustered. You're combining semantics with punctuation, and this feels suboptimal. Indeed, they also say that this is suboptimal according to some of the experiments. ",
"So, what they want to do is to top down, in a manual way, enforce that some types of characters should never be merged together. They want to enforce these merging rules on top of the byte pair encoding algorithm. "
],
"paragraph_timestamps": [
3455,
3466,
3478,
3491,
3503,
3514,
3528,
3542
]
},
{
"num_chapter": 17,
"title": "Regex Patterns in Tokenization",
"start_paragraph_number": 165,
"end_paragraph_number": 180,
"start_time": 3555,
"end_time": 3870,
"paragraphs": [
"Let's take a look at their code and see how they actually enforce this and what kinds of merges they perform. I have two tabs open here for GPT-2 under OpenAI on GitHub. When we go to the source, there is an encoder. Now, I don't personally love that they call it \"encoder,\" because this is the tokenizer, and the tokenizer can do both encode and decode. It feels kind of awkward to me that it's called \"encoder,\" but that is the tokenizer, and there's a lot going on here. ",
"We're going to step through it in detail at one point. For now, I just want to focus on this part here: the regex pattern that looks very complicated. We're going to go through it in a bit, but this is the core part that allows them to enforce rules for what parts of the text will never be merged for sure. ",
"Now, notice that `re.compile` here is a little bit misleading because we're not just doing `import re`, which is the Python `re` module. We're doing `import regex as re`, and `regex` is a Python package that you can install with `pip install regex`. It's basically an extension of `re`, so it's a bit more powerful. ",
"Let's take a look at this pattern and what it's doing, and why this is actually doing the separation that they are looking for. ",
"Okay, so I've copy-pasted the pattern here to our Jupyter notebook where we left off, and let's take this pattern for a spin. In the exact same way that their code does, we're going to call `re.findall` for this pattern on any arbitrary string that we are interested in. This is the string that we want to encode into tokens to feed into an LLM like GPT-2. ",
"So, what exactly is this doing? Well, `re.findall` will take this pattern and try to match it against a string. The way this works is that you go from left to right in the string and try to match the pattern. `re.findall` will get all the occurrences and organize them into a list. ",
"Now, when you look at this pattern, first of all, notice that this is a raw string, and then these are three double quotes just to start the string. So really, the string itself is the pattern. Notice that it's made up of a lot of \"or\" conditions. See these vertical bars? Those are \"or\" in regex. ",
"You go from left to right in this pattern and try to match it against the string wherever you are. So we have \"hello,\" and we're going to try to match it. Well, it's not \"apostrophe s,\" it's not \"apostrophe t,\" or any of these, but it is an optional space followed by \"P of,\" sorry, \"SL P of L\" one or more times. ",
"What is \"P of L\"? It is coming from some documentation that I found. There might be other sources as well. \"SLP\" is a letter, any kind of letter from any language, and \"hello\" is made up of letters: h, e, l, etc. So, an optional space followed by a bunch of letters, one or more letters, is going to match \"hello.\" ",
"But then the match ends because a whitespace is not a letter. From there on begins a new sort of attempt to match against the string again. Starting here, we're going to skip over all of these again until we get to the exact same point again, and we see that there's an optional space. This is the optional space followed by a bunch of letters, one or more of them, and so that matches. ",
"So when we run this, we get a list of two elements: \"hello\" and then \"space world.\" If we add more letters, we would just get them like this. ",
"Now, what is this doing, and why is this important? We are taking our string, and instead of directly encoding it for tokenization, we are first splitting it up. When you actually step through the code\u2014and we'll do that in a bit more detail\u2014what it really is doing on a high level is that it first splits your text into a list of texts just like this one. ",
"All these elements of this list are processed independently by the tokenizer, and all of the results of that processing are simply concatenated. So, \"hello world,\" oh, I missed how \"hello world\" and \"how are you.\" We have five elements of the list. All of these will independently go from text to a token sequence, and then that token sequence is going to be concatenated. ",
"Roughly speaking, what that does is you're only ever finding merges between the elements of this list. You can only ever consider merges within every one of these elements individually. After you've done all the possible merging for all of these elements individually, the results of all that will be joined by concatenation. ",
"So, you are basically doing what you're doing effectively: you are never going to be merging this \"e\" with this space because they are now parts of the separate elements of this list. You are saying we are never going to merge \"e\" with \"space\" because we're breaking it up in this way. "
],
"paragraph_timestamps": [
3555,
3585,
3603,
3624,
3630,
3657,
3677,
3698,
3720,
3742,
3766,
3779,
3802,
3832,
3854
]
},
{
"num_chapter": 18,
"title": "Tokenization and Regex Patterns",
"start_paragraph_number": 180,
"end_paragraph_number": 197,
"start_time": 3870,
"end_time": 4296,
"paragraphs": [
"So basically, using this regex pattern, you can enforce the separation of certain characters during the tokenization process. The regex pattern to chunk up the text is just one way of enforcing that some merges are not to happen. We're going to go into more detail about this text, and we'll see that what this is trying to do, on a high level, is to avoid merging across letters, numbers, punctuation, and so on. ",
"Let's see in more detail how that works. Now we have / P of n. If you go to the documentation, SLP of n is any kind of numeric character in any script, so it's numbers. We have an optional space followed by numbers, and those would be separated out. Letters and numbers are being separated. For example, if I do \"Hello World 123, how are you?\" then \"World\" will stop matching here because \"1\" is not a letter anymore; it is a number. This group will match for that, and we'll get it as a separate entity. ",
"Let's see how these apostrophes work. If we have, for example, apostrophe V, then the apostrophe here is not a letter or a number. So \"hello\" will stop matching, and we will exactly match this with that, resulting in a separate entity. ",
"Why are they doing the apostrophes here? Honestly, I think that these are just very common apostrophes that are typically used. I don't love that they've done this because, let me show you what happens when you have some Unicode apostrophes. For example, if you have \"house,\" then this will be separated out because of this matching. However, if you use the Unicode apostrophe, then suddenly this does not work, and this apostrophe will actually become its own thing. ",
"So, it's basically hardcoded for this specific kind of apostrophe; otherwise, they become completely separate tokens. In addition to this, you can go to the GPT-2 docs, and here, when they define the pattern, they say they should have added re. ignore case. So, BP merges can happen for capitalized versions of contractions. What they're pointing out is that you see how this is an apostrophe followed by lowercase letters. Because they didn't use re. ignore case, these rules will not separate out the apostrophes if it's uppercase. ",
"For example, \"house\" would be like this, but if I did \"House\" with an uppercase \"H,\" then notice suddenly the apostrophe comes by itself. So, the tokenization will work differently in uppercase and lowercase, inconsistently separating out these apostrophes. It feels extremely gnarly and slightly gross, but that's how that works. ",
"Okay, so let's come back after trying to match a bunch of apostrophe expressions. By the way, the other issue here is that these are quite language-specific, so I don't know that all languages, for example, use or don't use apostrophes. That would lead to inconsistent tokenization as a result. ",
"Then we try to match letters, then we try to match numbers, and if that doesn't work, we fall back to here. What this is saying is again an optional space followed by something that is not a letter, number, or space, in one or more instances. Effectively, this is trying to match punctuation, roughly speaking, not letters and not numbers. This group will try to trigger for that. ",
"If I do something like this, then these parts here are not letters or numbers, but they will actually get caught here, and so they become their own group. We've separated out the punctuation. ",
"Finally, this is also a little bit confusing. This is matching whitespace, but it uses a negative lookahead assertion in regex. What this is doing is matching whitespace up to, but not including, the last whitespace character. ",
"Why is this important? This is pretty subtle. You see how the whitespace is always included at the beginning of the word, such as \"space r space u,\" etc. Suppose we have a lot of spaces here. What's going to happen is that these spaces, up to but not including the last character, will get caught by this. This will separate out the spaces up to but not including the last character so that the last character can join with the \"space you.\" The reason that's nice is because \"space you\" is the common token. ",
"If I didn't have these extra spaces here, you would just have \"space you.\" If I add tokens and spaces, we still have \"space you,\" but now we have all this extra whitespace. Basically, the GPT-2 tokenizer really likes to have a space, letters, or numbers, and it preens these spaces. This is just something that it is consistent about. ",
"That's what that is for. Finally, the last fallback is whitespace characters. If that doesn't get caught, then this will catch any trailing spaces and so on. ",
"I wanted to show one more real-world example here. If we have this string, which is a piece of Python code, and then we try to split it up, this is the kind of output we get. You'll notice that the list has many elements here, and that's because we are splitting the text fairly often. Every time a category changes, there will never be any merges within these elements, and that's what you are seeing here.",
"Now, you might think that in order to train the tokenizer, OpenAI has used this to split up text into chunks and then run just a BPE algorithm within all the chunks. But that is not exactly what happened, and the reason is the following: notice that we have the spaces here. Those spaces end up being entire elements, but these spaces never actually end up being merged by OpenAI. The way you can tell is that if you copy and paste the exact same chunk here into TikToken, you see that all the spaces are kept independent and they're all token 220. ",
"So, I think OpenAI at some point enforced some rule that these spaces would never be merged. There are some additional rules on top of just chunking and BPE that OpenAI is not clear about. Now, the training code for the GPT-2 tokenizer was never released, so all we have is the code that I've already shown you. But this code here that they've released is only the inference code for the tokens. This is not the training code; you can't give it a piece of text and train a tokenizer. This is just the inference code, which takes the merges that we have above and applies them to a new piece of text. ",
"So, we don't know exactly how OpenAI trained the tokenizer, but it wasn't as simple as chunking it up and applying BPE. "
],
"paragraph_timestamps": [
3870,
3891,
3933,
3946,
3982,
4019,
4042,
4057,
4082,
4098,
4115,
4151,
4176,
4188,
4211,
4251,
4288
]
},
{
"num_chapter": 19,
"title": "Introduction to TikToken Library",
"start_paragraph_number": 197,
"end_paragraph_number": 206,
"start_time": 4296,
"end_time": 4498,
"paragraphs": [
"Next, I wanted to introduce you to the TikToken library from OpenAI, which is the official library for tokenization from OpenAI. So, this is TikToken. You can install it using pip: `pip install TikToken`. Then, you can do the tokenization in inference. This is again not training code; this is only inference code for tokenization. I wanted to show you how you would use it. It's quite simple, and running this just gives us the GPT-2 tokens or the GPT-4 tokens. ",
"This is the tokenizer used for GPT-4, and in particular, we see that the whitespace in GPT-2 remains unmerged. But in GPT-4, these whitespaces merge. As we also saw in this one, where here they're all unmerged, if we go down to GPT-4, they become merged. ",
"Now, in the GPT-4 tokenizer, they changed the regular expression that they use to chunk up text. The way to see this is that if you come to the TikToken library and then you go to this file, TikTokenX, OpenAI public, this is where the definition of all these different tokenizers that OpenAI maintains is located. ",
"To do the inference, they had to publish some of the details about the strings. This is the string that we already saw for GPT-2. It is slightly different but is actually equivalent to what we discussed here. This pattern that we discussed is equivalent to this pattern; this one just executes a little bit faster. ",
"Here, you see a slightly different definition, but otherwise, it's the same. We're going to go into special tokens in a bit. If you scroll down to CL100k, this is the GPT-4 tokenizer, and you see that the pattern has changed. This is kind of like the main change, in addition to a bunch of other special tokens, which I'll go into in a bit. ",
"Now, I'm not going to actually go into the full detail of the pattern change because, honestly, this is mind-numbing. I would just advise that you pull out ChatGPT and the regex documentation and just step through it. The major changes are: number one, you see this 'i' here, which means that the case sensitivity is case insensitive. ",
"So, the comment that we saw earlier on, \"Oh, we should have used re. uppercase,\" basically means we're now going to be matching these apostrophe s, apostrophe d, apostrophe m, etc. We're going to be matching them both in lowercase and in uppercase, so that's fixed. ",
"There\u2019s a bunch of different handling of the whitespace that I'm not going to go into full details about. One more thing here is you will notice that when they match the numbers, they only match one to three numbers. They will never merge numbers that are more than three digits; only up to three digits of numbers will ever be merged. That's one change they made to prevent tokens that are very long number sequences. ",
"But again, we don't really know why they do any of this stuff because none of this is documented. We just get the pattern, so it is what it is. Those are some of the changes that GPT-4 has made, and of course, the vocabulary size went from roughly 50k to roughly 100k. "
],
"paragraph_timestamps": [
4296,
4324,
4349,
4368,
4387,
4408,
4431,
4449,
4480
]
},
{
"num_chapter": 20,
"title": "Understanding the GPT-2 Encoder",
"start_paragraph_number": 206,
"end_paragraph_number": 215,
"start_time": 4498,
"end_time": 4704,
"paragraphs": [
"The next thing I would like to do, very briefly, is to take you through the GPT-2 encoder that OpenAI has released. This is the file that I already mentioned to you briefly. This file is fairly short and should be relatively understandable to you at this point. ",
"Starting at the bottom here, they are loading two files: encoder.json and vocab.bpe. They do some light processing on it and then they call this encoder object, which is the tokenizer. If you'd like to inspect these two files, which together constitute their saved tokenizer, you can do that with a piece of code like this. This is where you can download these two files and you can inspect them if you'd like. ",
"What you will find is that this encoder, as they call it in their code, is exactly equivalent to our vocab. Remember here where we have this vocab object, which allowed us to decode very efficiently. It basically took us from the integer to the bytes for that integer. So, our vocab is exactly their encoder. Their vocab.bpe, confusingly, is actually our merges. Their BP merges, which is based on the data inside vocab.bpe, ends up being equivalent to our merges. ",
"Basically, they are saving and loading the two variables that for us are also critical: the merges variable and the vocab variable. Using just these two variables, you can represent a tokenizer and you can both do encoding and decoding once you've trained this tokenizer. ",
"Now, the only thing that is actually slightly confusing inside what OpenAI does here is that, in addition to this encoder and a decoder, they also have something called a byte encoder and a byte decoder. This is actually, unfortunately, just kind of a spurious implementation detail and isn't actually deep or interesting in any way, so I'm going to skip the discussion of it. ",
"What OpenAI does here, for reasons that I don't fully understand, is that not only do they have this tokenizer which can encode and decode, but they have a whole separate layer here in addition that is used serially with the tokenizer. So, you first do byte encode and then encode, and then you do decode and then byte decode. That's the loop, and they are just stacked serially on top of each other. It's not that interesting, so I won't cover it. You can step through it if you'd like. ",
"Otherwise, this file, if you ignore the byte encoder and the byte decoder, will be algorithmically very familiar to you. The meat of it here is what they call the BPE function, and you should recognize this loop here, which is very similar to our own loop where they're trying to identify the byte pair that they should be merging next. ",
"Just like we had, they have a for loop trying to merge this pair. They will go over all of the sequences and merge the pair whenever they find it, and they keep repeating that until they run out of possible merges in the text. So, that's the meat of this file, and there are encode and decode functions just like we have implemented. ",
"Long story short, what I want you to take away at this point is that, unfortunately, it's a little bit of a messy code that they have, but algorithmically it is identical to what we've built up above. What we've built up above, if you understand it, is algorithmically what is necessary to actually build a BPE tokenizer, train it, and then both encode and decode. "
],
"paragraph_timestamps": [
4498,
4515,
4538,
4578,
4594,
4619,
4646,
4667,
4686
]
},
{
"num_chapter": 21,
"title": "Special Tokens in Tokenization",
"start_paragraph_number": 215,
"end_paragraph_number": 230,
"start_time": 4704,
"end_time": 5125,
"paragraphs": [
"The next topic I would like to turn to is that of special tokens. In addition to tokens that are coming from raw bytes and the BPE merges, we can insert all kinds of tokens that we are going to use to delimit different parts of the data or introduce to create a special structure of the token streams. ",
"If you look at this encoder object from OpenAI's GPT-2, you'll notice that the length of this is 50,257. As I mentioned, it's mapping and it's inverted from the mapping of our vocab. Our vocab goes from integer to string, and they go the other way around for no amazing reason. ",
"The thing to note here is that this mapping table has 50,257 entries. Where does that number come from? What are the tokens? As I mentioned, there are 256 raw byte tokens, and then OpenAI actually did 50,000 merges, so those become the other tokens. But this would have been 50,256, so what is the 57th token? ",
"There is basically one special token, and that one special token you can see is called \"end of text.\" This is a special token and it's the very last token. This token is used to delimit documents in the training set. When we're creating the training data, we have all these documents, and we tokenize them to get a stream of tokens. Those tokens only range from 0 to 50,256, and then in between those documents, we put the special \"end of text\" token. ",
"We insert that token in between documents, and we are using this as a signal to the language model that the document has ended and what follows is going to be unrelated to the document. That said, the language model has to learn this from data. It needs to learn that this token usually means that it should wipe its sort of memory of what came before, and what came before this token is not actually informative to what comes next. ",
"We are expecting the language model to learn this, but we're giving it a special sort of delimiter. ",
"In the context of these documents, we can go here to the Tech tokenizer. This is the GPT-2 tokenizer, which is the code that we've been playing with before. We can add here, \"Hello, world! How are you?\" and we're getting different tokens. But now, you can see what happens if I put \"end of text.\" You see how until I finish it, these are all different tokens. \"End of text\" still sets different tokens, and now when I finish it, suddenly we get token 50256. ",
"The reason this works is that this didn't actually go through the BPE merges. Instead, the code that actually outputs tokens has special case instructions for handling special tokens. We did not see these special instructions for handling special tokens in the encoder; it's absent there. But if you go to the Tech token library, which is implemented in Rust, you will find all kinds of special case handling for these special tokens that you can register and add to the vocabulary. Then it looks for them, and whenever it sees these special tokens, it will actually come in and swap in that special token. ",
"These things are outside of the typical algorithm of BPE encoding. So, these special tokens are used pervasively, not just in basic language modeling of predicting the next token in the sequence, but especially when it gets to later fine-tuning stages and all of the ChatGPT aspects of it. We don't just want to delimit documents; we want to delimit entire conversations between an assistant and a user. ",
"If I refresh this SCK tokenizer page, the default example that they have here is using not just base model encoders but fine-tuned model tokenizers. For example, using the GPT-3.5 turbo scheme, these here are all special tokens: \"I am start,\" \"I end,\" etc. This is short for \"Imaginary core start,\" by the way. You can see here that there's a sort of start and end for every single message, and there can be many other tokens in use to delimit these conversations and keep track of the flow of the messages. ",
"Now we can go back to the Tik token library. Here, when you scroll to the bottom, they talk about how you can extend Tik token. You can create, basically, a fork of the CL 100K base tokenizers in GPT-4. For example, you can extend it by adding more special tokens, and these are totally up to you. You can come up with any arbitrary tokens and add them with a new ID afterwards, and the Tik token library will correctly swap them out when it sees them in the strings. ",
"We can also go back to this file, which we've looked at previously. I mentioned that the GPT-2 in Tik token OpenAI has the vocabulary, the pattern for splitting, and here we are registering the single special token in GPT-2, which was the end of text token. We saw that it has this ID in GPT-4. When they define this here, you see that the pattern has changed, as we've discussed, but also the special tokens have changed in this tokenizer. ",
"We, of course, have the end of text, just like in GPT-2, but we also see three, sorry, four additional tokens here: \"Thim,\" \"prefix,\" \"middle,\" and \"suffix.\" What is \"fim\"? \"Fim\" is short for \"fill in the middle.\" If you'd like to learn more about this idea, it comes from this paper. I'm not going to go into detail in this video; it's beyond the scope of this video. ",
"Then there's one additional special token here, so that's that encoding as well. It's very common to train a language model, and if you'd like, you can add special tokens. When you add special tokens, you, of course, have to do some model surgery to the transformer and all the parameters involved in that transformer. You are basically adding an integer, and you want to make sure that, for example, your embedding matrix for the vocabulary tokens has to be extended by adding a row. Typically, this row would be initialized with small random numbers or something like that because we need to have a vector that now stands for that token. ",
"In addition to that, you have to go to the final layer of the transformer and ensure that the projection at the very end into the classifier is extended by one as well. So, basically, there's some model surgery involved that you have to couple with the tokenization changes if you are going to add special tokens. This is a very common operation that people do, especially if they'd like to fine-tune the model, for example, taking it from a base model to a chat model like ChatGPT. "
],
"paragraph_timestamps": [
4704,
4728,
4749,
4775,
4810,
4833,
4842,
4871,
4914,
4939,
4978,
5012,
5039,
5061,
5101
]
},
{
"num_chapter": 22,
"title": "Building Your Own GPT-4 Tokenizer",
"start_paragraph_number": 230,
"end_paragraph_number": 239,
"start_time": 5125,
"end_time": 5321,
"paragraphs": [
"At this point, you should have everything you need in order to build your own GPT-4 tokenizer. In the process of developing this lecture, I've done that and published the code under this repository, MBP. ",
"So, MBP looks like this right now as I'm recording, but the MBP repository will probably change quite a bit because I intend to continue working on it. In addition to the MBP repository, I've published this exercise progression that you can follow. If you go to exercise.md, this is sort of me breaking up the task ahead of you into four steps that build up to what can be a GPT-4 tokenizer. Feel free to follow these steps exactly and follow a little bit of the guidance that I've laid out here. ",
"Anytime you feel stuck, just reference the MBP repository. Either the tests could be useful or the MBP repository itself. I try to keep the code fairly clean and understandable, so feel free to reference it whenever you get stuck. ",
"In addition to that, basically, once you write it, you should be able to reproduce this behavior from TikToken. Getting the GPT-4 tokenizer means you can encode the string, and you should get these tokens. Then, you can encode and decode the exact same string to recover it. ",
"In addition to all that, you should be able to implement your own train function, which the TikToken library does not provide. It's again only inference code, but you could write your own train function. MBP does it as well, and that will allow you to train your own token vocabularies. ",
"Here are some of the code inside MBP that shows the token vocabularies that you might obtain. On the left, we have the GPT-4 merges. The first 256 are raw individual bytes, and then I am visualizing the merges that GPT-4 performed during its training. The very first merge that GPT-4 did was merge two spaces into a single token for two spaces, and that is token 256. ",
"This is the order in which things merged during GPT-4 training, and this is the merge order that we obtain in MBP by training a tokenizer. In this case, I trained it on a Wikipedia page of Taylor Swift, not because I'm a Swifty, but because that is one of the longest Wikipedia pages apparently available. However, she is pretty cool. ",
"You can compare these two vocabularies. For example, GPT-4 merged \"I\" into \"in,\" and we've done the exact same thing on this token 259, where \"space t\" becomes \"space t.\" That happened for us a little bit later as well. The difference here is, to my understanding, only a difference of the training set. ",
"For example, because I see a lot of white space, I suspect that GPT-4 probably had a lot of Python code in its training set. I'm not sure about the tokenizer, and here we see much less of that, of course, in the Wikipedia page. So, roughly speaking, they look the same because they're running the same algorithm. When you train your own, you're probably going to get something similar depending on what you train it on. "
],
"paragraph_timestamps": [
5125,
5140,
5170,
5190,
5205,
5220,
5248,
5282,
5298
]
},
{
"num_chapter": 23,
"title": "Introduction to SentencePiece",
"start_paragraph_number": 239,
"end_paragraph_number": 240,
"start_time": 5321,
"end_time": 5333,
"paragraphs": [
"Okay, so we are now going to move on from TikToken and the way that OpenAI tokenizes its strings. We're going to discuss one more very commonly used library for working with tokenization in language models, and that is SentencePiece. "
],
"paragraph_timestamps": [
5321
]
},
{
"num_chapter": 24,
"title": "Understanding SentencePiece Functionality",
"start_paragraph_number": 240,
"end_paragraph_number": 245,
"start_time": 5333,
"end_time": 5448,
"paragraphs": [
"SentencePiece is very commonly used in language models because, unlike TikToken, it can do both training and inference and is quite efficient at both. It supports a number of algorithms for training vocabularies, but one of them is the BPE (Byte Pair Encoding) algorithm that we've been looking at. ",
"SentencePiece is used both by LLaMA and the Mistral series, as well as many other models. It is on GitHub under Google SentencePiece. The big difference with SentencePiece, and we're going to look at an example because this is kind of hard and subtle to explain, is that they think differently about the order of operations here. ",
"In the case of TikToken, we first take our code points in the string, encode them using MUTF to bytes, and then we're merging bytes. It's fairly straightforward. For SentencePiece, it works directly on the level of the code points themselves. It looks at whatever code points are available in your training set and then starts merging those code points. ",
"The BPE is running on the level of code points, and if you happen to run out of code points\u2014there may be some rare code points that just don't come up too often\u2014the rarity is determined by this character coverage hyperparameter. These code points will either get mapped to a special unknown token, like <unk>, or if you have the byte fallback option turned on, then that will take those rare code points, encode them using UTF-8, and then the individual bytes of that encoding will be translated into tokens. ",
"There are these special byte tokens that basically get added to the vocabulary. So, it uses BPE on the code points and then falls back to bytes for rare code points. That's kind of the difference. Personally, I find the TikToken approach significantly cleaner, but it's a subtle yet pretty major difference between the way they approach tokenization. "
],
"paragraph_timestamps": [
5333,
5351,
5376,
5393,
5428
]
},
{
"num_chapter": 25,
"title": "Configuring SentencePiece for Tokenization",
"start_paragraph_number": 245,
"end_paragraph_number": 250,
"start_time": 5448,
"end_time": 5585,
"paragraphs": [
"Let's work with a concrete example because, otherwise, this is kind of hard to get your head around. So, let's work with a concrete example. This is how we can import SentencePiece, and then here we're going to take, I think I took, like, the description of SentencePiece, and I just created a little toy dataset. It really likes to have a file, so I created a toy.txt file with this content.",
"Now, what's kind of a little bit crazy about SentencePiece is that there's a ton of options and configurations. The reason this is so is because SentencePiece has been around for a while, and it really tries to handle a large diversity of things. Because it's been around, I think it has quite a bit of accumulated historical baggage as well. ",
"In particular, there are a ton of configuration arguments. This is not even all of it; you can go here to see all the training options. There's also quite useful documentation when you look at the raw ProtoBuf that is used to represent the trainer spec and so on. Many of these options are irrelevant to us. For example, the shrinking factor is not used in the BPE encoding algorithm, so this is just an argument that is irrelevant to us. It applies to a different training algorithm.",
"What I tried to do here is set up SentencePiece in a way that is very, very similar, as far as I can tell, to maybe identical\u2014hopefully\u2014to the way that Llama 2 was trained. The way they trained their own tokenizer, and the way I did this was basically by taking the tokenizer model file that Meta released. You can open it using the ProtoBuf sort of file that you can generate and then inspect all the options. I tried to copy over all the options that looked relevant.",
"Here, we set up the input; it's raw text in this file. Here's going to be the output, so it's going to be for talk400.model and vocab. We're saying that we're going to use the BPE algorithm and we want a vocab size of 400. Then, there's a ton of configurations here for basically pre-processing and normalization rules, as they're called. "
],
"paragraph_timestamps": [
5448,
5474,
5493,
5529,
5561
]
},
{
"num_chapter": 26,
"title": "Introduction to Normalization and SentencePiece",
"start_paragraph_number": 250,
"end_paragraph_number": 274,
"start_time": 5585,
"end_time": 6206,
"paragraphs": [
"Normalization used to be very prevalent, I would say, before LLMs in natural language processing, such as in machine translation and text classification. You want to normalize and simplify the text, turn it all lowercase, and remove all double whitespace, etc. In language models, we prefer not to do any of it, or at least that is my preference as a deep learning person. You want to not touch your data; you want to keep the raw data as much as possible. So, you're basically trying to turn off a lot of this if you can.",
"The other thing that SentencePiece does is that it has this concept of sentences. SentencePiece was developed in the early days with the idea that you're training a tokenizer on a bunch of independent sentences. It has a lot of parameters regarding how many sentences you're going to train on, what the maximum sentence length is, and shuffling sentences. For SentencePiece, sentences are kind of like the individual training examples. ",
"However, in the context of LLMs, I find that this is a very spurious and weird distinction. Sentences are just like, don't touch the raw data; sentences happen to exist, but in raw datasets, there are a lot of ambiguities about what exactly is a sentence and what isn't. I think it's really hard to define what an actual sentence is if you really dig into it, and there could be different concepts of it in different languages or something like that. So, why even introduce the concept? It honestly doesn't make sense to me. I would just prefer to treat a file as a giant stream of bytes.",
"There is a lot of treatment around rare word characters, and when I say \"word,\" I mean code points. We're going to come back to this in a second. There are also a lot of other rules for basically splitting digits, whitespace, and numbers, and how you deal with that. These are some kind of merge rules, which I think are a little bit equivalent to tokenizing using regular expressions to split up categories. There's a kind of equivalence in SentencePiece where you can also, for example, split up the digits and so on.",
"There are a few more things here that I'll come back to in a bit. Then, there are some special tokens that you can indicate, and it hardcodes the UN token, the beginning of sentence, end of sentence, and a pad token. The UN token must exist, to my understanding, along with some other things. ",
"So, we can train, and when I press train, it's going to create this file, talk400.model and talk400.vocab. I can then load the model file and inspect the vocabulary of it. We trained a vocab size of 400 on this text, and these are the individual pieces, the individual tokens that SentencePiece will create. ",
"In the beginning, we see that we have the UN token with the ID zero. Then we have the beginning of sequence and end of sequence tokens with IDs one and two, respectively. We also set the pad ID to negative one, so we chose not to use it; therefore, there's no pad ID here. ",
"Then, these are individual byte tokens. Here, we saw that byte fallback in Llama was turned on, so it's true. What follows are the 256 byte tokens and their IDs. After the byte tokens come the merges, which are the parent nodes in the merges. We're not seeing the children; we're just seeing the parents and their IDs. After the merges, we eventually see the individual tokens and their IDs. These are the individual code point tokens, if you will, and they come at the end. ",
"So, that is the ordering with which SentencePiece represents its vocabularies: it starts with special tokens, then the byte tokens, then the merge tokens, and finally the individual code tokens. All these raw code point tokens are the ones that it encountered in the training set. Those individual code points are the entire set of code points that occurred here, so they all get put in there. ",
"Then, those that are extremely rare, as determined by character coverage, are ignored. For example, if a code point occurred only a single time out of like a million sentences, then it would be ignored and not added to our vocabulary. Once we have a vocabulary, we can encode into IDs and get a list. Here, I am also decoding the individual tokens back into little pieces, as they call it. ",
"Let's take a look at what happened here. Hello space on. These are the token IDs we got back. When we look here, a few things jump to mind. Number one: take a look at these characters. The Korean characters, of course, were not part of the training set, so SentencePiece is encountering code points that it has not seen during training time. Those code points do not have a token associated with them, so suddenly these are UN tokens, unknown tokens. ",
"However, because byte fallback is true, SentencePiece falls back to bytes. It encodes it with UTF-8 and then uses these tokens to represent those bytes. That's what we are getting here; this is the UTF-8 encoding, shifted by three because of these special tokens that have IDs earlier on. ",
"Now, one more thing: before I go on with respect to the byte fallback, let me remove byte fallback. If this is false, what's going to happen? Let's retrain. The first thing that happened is all the byte tokens disappeared, right? Now we just have the merges, and we have a lot more merges now because we have a lot more space. We're not taking up space in the vocab size with all the bytes. ",
"Now, if we encode this, we get a zero. This entire string here suddenly has no byte fallback, so this is unknown. Unknown is an and so this is zero because the UN token is token zero. You have to keep in mind that this would feed into your language model. ",
"What is a language model supposed to do when all kinds of different things that are unrecognized because they're rare just end up mapping into UNK? It's not exactly the property that you want. That's why I think Llama correctly used byte fallback true, because we definitely want to feed these unknown or rare code points into the model in some manner. ",
"The next thing I want to show you is the following: notice here when we are decoding all the individual tokens, you see how spaces end up being this bold underline. I'm not 100% sure why SentencePiece switches whitespace into these bold underscore characters; maybe it's for visualization. ",
"But notice this: why do we have an extra space in front of \"hello\"? Where is this coming from? It's coming from this option here: add dummy prefix is true. When you go to the documentation, it states to add whitespace at the beginning of text in order to treat \"world\" in \"world\" and \"hello world\" in the exact same way. ",
"What this is trying to do is the following: if we go back to our token tokenizer, \"world\" as a token by itself has a different ID than \"space world.\" This is 1917, but this is 14, etc. These are two different tokens for the language model, and the language model has to learn from data that they are actually kind of like a very similar concept. ",
"So, to the language model in the token world, basically, words at the beginning of sentences and words in the middle of sentences actually look completely different, and it has to learn that they are roughly the same. The add dummy prefix is trying to address that a little bit. The way that works is that it adds a dummy prefix. As part of preprocessing, it will take the string and add a space. This is done in an effort to make \"world\" and \"that world\" the same; they will both be \"space world.\" ",
"That's one other kind of preprocessing option that is turned on, and Llama 2 also uses this option. That's everything I want to say for my preview of SentencePiece and how it is different. ",
"What I've done here is put in the raw protocol buffer representation of the tokenizer that was trained. Feel free to step through this, and if you would like your tokenization to look identical to that of the Meta Llama 2, then you would be copy-pasting these settings as I tried to do above. ",
"I think that's it for this section. My summary for SentencePiece from all of this is: ",
"Number one, I think that there's a lot of historical baggage in SentencePiece, with many concepts that are slightly confusing and potentially contain foot guns, like the concept of a sentence and its maximum length. ",
"Otherwise, it is fairly commonly used in the industry because it is efficient and can handle both training and inference. It has a few quirks, such as the requirement for an un-token to exist and the way the byte fallbacks are done, which I don't find particularly elegant. Unfortunately, I have to say it's not very well documented, so it took me a lot of time working with this myself, visualizing things, and trying to really understand what is happening here. The documentation, unfortunately, is not super amazing in my opinion, but there is a very nice repository available if you'd like to train your own tokenizer right now. "
],
"paragraph_timestamps": [
5585,
5621,
5644,
5678,
5713,
5732,
5754,
5778,
5819,
5845,
5874,
5917,
5940,
5967,
5986,
6008,
6030,
6055,
6078,
6120,
6134,
6154,
6160,
6171
]
},
{
"num_chapter": 27,
"title": "Vocabulary Size Considerations",
"start_paragraph_number": 274,
"end_paragraph_number": 291,
"start_time": 6206,
"end_time": 6597,
"paragraphs": [
"Now, let me switch gears again as we start to slowly wrap up. I want to revisit the issue of how we should set the vocab size and what some of the considerations around it are. For this, I'd like to go back to the model architecture we developed in the last video when we built the GPT from scratch. ",
"This here was the file we built in the previous video, where we defined the Transformer model. Let's specifically look at the vocab size and where it appears in this file. We defined the vocab size at that time as 65, which is an extremely small number. This will grow much larger. You'll see that vocab size doesn't come up too much in most of these layers; the only places it comes up are in exactly these two locations. ",
"When we define the language model, there's the token embedding table, which is a two-dimensional array where the vocab size is basically the number of rows. Each vocabulary element, each token, has a vector that we're going to train using backpropagation. That vector is of size equal to the number of channels in the Transformer. As the vocab size increases, this embedding table, as I mentioned earlier, will also grow, adding rows. ",
"At the end of the Transformer, there's the LM head layer, which is a linear layer. You'll notice that this layer is used at the very end to produce the logits, which become the probabilities for the next token in the sequence. Intuitively, we're trying to produce a probability for every single token that might come next at every point in time of that Transformer. If we have more and more tokens, we need to produce more and more probabilities. ",
"Every single token will introduce an additional dot product that we have to perform in this linear layer for the final layer in a Transformer. ",
"So, why can't the vocab size be infinite? Why can't we grow it to infinity? ",
"Number one, your token embedding table is going to grow, and your linear layer is going to grow, which means we will be doing a lot more computation here because this LM head layer will become more computationally expensive. ",
"Number two, because we have more parameters, we could be worried that we are going to be under-training some of these parameters. Intuitively, if you have a very large vocabulary size, say we have a million tokens, then every one of these tokens is going to come up more and more rarely in the training data because there are many other tokens all over the place. As a result, we will see fewer and fewer examples for each individual token, and you might be concerned that the vectors associated with every token will be under-trained because they just don't come up often enough and don't participate in the forward-backward pass. ",
"In addition to that, as your vocabulary size grows, you're going to start shrinking your sequences a lot. This is nice because it means we will be attending to more and more text. However, you might also worry that too large chunks are being squished into single tokens. The model just doesn't have as much time to think per some number of characters in the text. ",
"You can think about it this way: we're squishing too much information into a single token, and then the forward pass of the Transformer is not enough to actually process that information appropriately. These are some of the considerations you're thinking about when you're designing the vocabulary size. As I mentioned, this is mostly an empirical hyperparameter, and it seems like in state-of-the-art architectures today, this is usually in the high 10,000s or somewhere around 100,000.",
"The next consideration I want to briefly talk about is what if we want to take a pre-trained model and extend the vocabulary size. This is done fairly commonly, actually. For example, when you're doing fine-tuning for ChatGPT, a lot more new special tokens get introduced on top of the base model to maintain the metadata and all the structure of conversation objects between a user and an assistant. This requires a lot of special tokens. ",
"You might also try to add more special tokens for using the browser or any other tool. It\u2019s very tempting to add a lot of tokens for all kinds of special functionality. If you want to add a token, that's totally possible. All we have to do is resize the embedding. We need to add rows and initialize these parameters from scratch to be small random numbers. Then, we have to extend the weights inside this linear layer to start making dot products with the associated parameters as well, in order to calculate the probabilities for these new tokens. ",
"Both of these are just resizing operations; it's a very mild model surgery that can be done fairly easily. It\u2019s quite common to freeze the base model, introduce these new parameters, and then only train these new parameters to introduce new tokens into the architecture. You can freeze arbitrary parts of it or train arbitrary parts of it, and that's totally up to you. Basically, minor surgery is required if you'd like to introduce new tokens.",
"Finally, I'd like to mention that there's an entire design space of applications in terms of introducing new tokens into a vocabulary that goes way beyond just adding special tokens and special new functionality. Just to give you a sense of the design space\u2014this could be an entire video by itself\u2014there is a paper on learning to compress prompts with what they called \"gist tokens.\" ",
"The rough idea is that if you're using language models in a setting that requires very long prompts, these long prompts slow everything down because you have to encode them, use them, and tend over them. It\u2019s just heavy to have very large prompts. Instead, what they do in this paper is introduce new tokens. Imagine having a few new tokens that you put in a sequence, and then you train the model by distillation. You keep the entire model frozen and only train the representations of the new tokens, their embeddings, optimizing over the new tokens such that the behavior of the language model is identical to the model that has a very long prompt that works for you.",
"This is a compression technique that compresses that very long prompt into those few new gist tokens. You can train this, and then at test time, you can discard your old prompt and just swap in those tokens, which stand in for that very long prompt and have almost identical performance. This is one technique and a class of parameter-efficient fine-tuning techniques where most of the model is basically fixed, and there\u2019s no training of the model weights, no training of layers, or anything like that of new parameters. The parameters that you're training are now just the token embeddings.",
"That's just one example, but this could again be like an entire video. Just to give you a sense, there's a whole design space here that is potentially worth exploring in the future. "
],
"paragraph_timestamps": [
6206,
6225,
6249,
6277,
6302,
6309,
6314,
6327,
6361,
6384,
6407,
6433,
6464,
6489,
6512,
6553,
6589
]
},
{
"num_chapter": 28,
"title": "Multimodal Transformers",
"start_paragraph_number": 291,
"end_paragraph_number": 295,
"start_time": 6597,
"end_time": 6700,
"paragraphs": [
"The next thing I want to briefly address is that recently, there's a lot of momentum in how you could construct Transformers that can simultaneously process not just text as the input modality, but a lot of other modalities, such as images, videos, audio, etc. The question is how to feed in all these modalities and potentially predict these modalities from a Transformer. Do you have to change the architecture in some fundamental way? ",
"What a lot of people are starting to converge towards is that you're not changing the architecture. You stick with the Transformer, tokenize your input domains, and then call it a day, treating everything else identically. For example, there was an early paper that has a nice graphic showing how you can take an image and chunk it into integers. These integers will become the tokens of images as an example. ",
"These tokens can be hard tokens, where you force them to be integers, or they can also be soft tokens, where you sort of don't require these to be discrete. However, you do force these representations to go through bottlenecks, like in autoencoders. ",
"Also, in this paper that came out from OpenAI, SORA, which I think really blew the minds of many people and inspired a lot of people in terms of what's possible, they have a graphic here. They talk briefly about how LLMs have text tokens, while SORA has visual patches. They came up with a way to chunk videos into basically tokens within their own vocabularies. You can either process discrete tokens, say with autoregressive models, or even soft tokens with diffusion models. All of that is actively being worked on and designed, and is beyond the scope of this video, but it's something I wanted to mention briefly. "
],
"paragraph_timestamps": [
6597,
6616,
6647,
6664
]
},
{
"num_chapter": 29,
"title": "Understanding Tokenization and Its Impact on LLM Performance",
"start_paragraph_number": 295,
"end_paragraph_number": 301,
"start_time": 6700,
"end_time": 6854,
"paragraphs": [
"Now that we have come quite deep into the tokenization algorithm and we understand a lot more about how it works, let's loop back around to the beginning of this video and go through some of these bullet points to really see why they happen. ",
"First of all, why can't my LLM spell words very well or do other spell-related tasks? Fundamentally, this is because, as we saw, these characters are chunked up into tokens, and some of these tokens are actually fairly long. For example, I went to the GPT-4 vocabulary and looked at one of the longer tokens. It turns out that \"default style\" is a single individual token. That's a lot of characters for a single token. My suspicion is that there's just too much crammed into this single token, and the model should not be very good at tasks related to the spelling of this single token. ",
"I asked how many letters \"L\" are there in the word \"default style.\" Of course, my prompt was intentionally done that way, and you see how \"default style\" will be a single token. This is what the model sees. My suspicion is that it wouldn't be very good at this, and indeed it is not. It doesn't actually know how many \"L's\" are in there; it thinks there are three, but actually, there are four, if I'm not getting this wrong myself. So that didn't go extremely well. ",
"Let's look at another kind of character-level task. For example, I asked GPT-4 to reverse the string \"default style.\" They tried to use a code interpreter, and I stopped it, saying just to try it. It gave me \"jumble,\" so it doesn't actually know how to reverse this string going from right to left. It gave a wrong result. ",
"Working with this hypothesis that maybe this is due to the tokenization, I tried a different approach. I said, okay, let's reverse the exact same string but take the following approach: Step one, just print out every single character separated by spaces, and then as step two, reverse that list. It again tried to use a tool, but when I stopped it, it first produced all the characters, and that was actually correct. Then it reversed them, and that was correct once it had this. ",
"Somehow, it can't reverse it directly, but when you first list it out in order, it can do that. Once it's broken up this way, it sees all these individual characters, and now this is much easier for it to reverse them and print them out. That is kind of interesting. "
],
"paragraph_timestamps": [
6700,
6712,
6752,
6780,
6806,
6835
]
},
{
"num_chapter": 30,
"title": "Challenges with Non-English Languages and Arithmetic",
"start_paragraph_number": 301,
"end_paragraph_number": 307,
"start_time": 6854,
"end_time": 7008,
"paragraphs": [
"Now, why are LLMs worse at non-English languages? I briefly covered this already, but basically, it's not only that the language model sees less non-English data during the training of the model parameters, but also the tokenizer is not sufficiently trained on non-English data. ",
"For example, \"hello, how are you\" is five tokens, while its translation is 15 tokens. This is a three times blow-up. For example, \"\uc548\ub155\" (anyang) is just \"hello\" in Korean, and that ends up being three tokens. I'm actually kind of surprised by that because that is a very common phrase, the typical greeting of \"hello,\" and it ends up being three tokens, whereas our \"hello\" is a single token. Everything is a lot more bloated and diffuse, and this is partly the reason that the model works worse on other languages. ",
"Coming back, why is the LLM bad at simple arithmetic? That has to do with the tokenization of numbers. You'll notice that, for example, addition is very character-level. There\u2019s an algorithm that is character-level for doing addition. For example, we would first add the ones, then the tens, and then the hundreds. You have to refer to specific parts of these digits, but these numbers are represented completely arbitrarily based on whatever happened to merge or not merge during the tokenization process. ",
"There's an entire blog post about this that I think is quite good. Integer tokenization is insane, and this person systematically explores the tokenization of numbers in, I believe, GPT-2. ",
"They notice that for four-digit numbers, you can take a look at whether it is a single token or whether it is two tokens, such as a 1 three or a 2 two or a 31 combination. All the different numbers represent all the different combinations, and you can imagine this is all completely arbitrary. The model, unfortunately, sometimes sees a token for all four digits, sometimes for three, sometimes for two, and sometimes for one. It's in an arbitrary manner, and so this is definitely a headwind, if you will, for the language model. ",
"It's kind of incredible that it can deal with it, but it's also not ideal. That's why, for example, we saw that Meta, when they trained the Llama 2 algorithm and used SentencePiece, made sure to split up all the digits. This is partly to improve simple arithmetic performance. "
],
"paragraph_timestamps": [
6854,
6875,
6908,
6939,
6951,
6990
]
},
{
"num_chapter": 31,
"title": "Tokenization Issues in Programming and Special Tokens",
"start_paragraph_number": 307,
"end_paragraph_number": 312,
"start_time": 7008,
"end_time": 7124,
"paragraphs": [
"Finally, why is GPT-2 not as good in Python? Again, this is partly a modeling issue in the architecture and the dataset, and the strength of the model, but it's also partially tokenization. As we saw here with the simple Python example, the encoding efficiency of the tokenizer for handling spaces in Python is terrible. Every single space is an individual token, and this dramatically reduces the context length that the model can attend to. So that's almost like a tokenization bug for GPT-2, and that was later fixed with GPT-4. ",
"Okay, so here's another fun one: my LLM abruptly halts when it sees the string \"end of text.\" Here's a very strange behavior: I print a string \"end of text\" and I tell GPT-4, and it says, \"Could you please specify the string?\" I'm telling it to give me \"end of text,\" and it seems like there's an issue; it's not seeing \"end of text.\" ",
"Then I give it \"end of text\" as the string, and then here's a string, and it just doesn't print it. So obviously, something is breaking here with respect to the handling of the special token. I don't actually know what OpenAI is doing under the hood here and whether they are potentially parsing this as an actual token instead of just being \"end of text\" as individual pieces without the special token handling logic. ",
"It might be that someone, when they're calling \"do encode,\" is passing in the allowed special tokens and allowing \"end of text\" as a special character in the user prompt. But the user prompt, of course, is a sort of attacker-controlled text, so you would hope that they don't really parse or use special tokens from that kind of input. However, it appears that there's definitely something going wrong here. ",
"Your knowledge of these special tokens ends up being a tax surface, potentially. So if you'd like to confuse LLMs, then just try to give them some special tokens and see if you're breaking something by chance. "
],
"paragraph_timestamps": [
7008,
7038,
7057,
7087,
7112
]
},
{
"num_chapter": 32,
"title": "Trailing Whitespace and Its Effects on LLM Outputs",
"start_paragraph_number": 312,
"end_paragraph_number": 323,
"start_time": 7124,
"end_time": 7398,
"paragraphs": [
"Okay, so this next one is a really fun one: the trailing whitespace issue. If you come to the playground and we go to GPT-3.5 Turbo Instruct, this is not a chat model; this is a completion model. Think of it as being a lot closer to a base model. It does completion and will continue the token sequence. ",
"So here's a tagline for an ice cream shop, and we want to continue the sequence. We can submit and get a bunch of tokens\u2014okay, no problem. But now suppose I do this: instead of pressing submit, I add a space here before I click submit. We get a warning: \"Your text ends in a trailing space, which causes worse performance due to how the API splits text into tokens.\" ",
"So what's happening here? It still gave us a sort of completion, but let's take a look at what's happening. Here's a tagline for an ice cream shop. What does this look like in the actual training data? Suppose you found the completion in the training document somewhere on the internet, and the LLM trained on this data. ",
"Maybe it's something like, \"Oh yeah, maybe that's the tagline.\" That's a terrible tagline, but notice here that when I create \"o,\" you see that because the space character is always a prefix to these tokens in GPT, it's not an \"O\" token; it's a \"space o\" token. The space is part of the \"O,\" and together they are token 8840\u2014that's \"space o.\" ",
"So what's happening here is that when I just have it like this and let it complete the next token, it can sample the \"space o\" token. But instead, if I have this and I add my space, then what I'm doing here when I encode this string is I have basically \"here's a tagline for an ice cream shop,\" and this space at the very end becomes token 220. ",
"So, we've added token 220, and this token would otherwise be part of the tagline. If there actually is a tagline here, \"space o\" is the token. This suddenly alters the distribution for the model because this space is part of the next token, but we're putting it here like this. The model has seen very little data of actual space by itself, and we're asking it to complete the sequence. ",
"The problem is that we've sort of begun with the first token, and now it's been split up. Now we're out of this distribution, and arbitrary bad things happen. It's just a very rare example for it to see something like that, and that's why we get the warning. The fundamental issue here is, of course, that the LLM is on top of these tokens. These tokens are text chunks; they're not characters in a way you and I would think of them. They are the atoms of what the LM is seeing, and there's a bunch of weird stuff that comes out of it. ",
"Let's go back to our default cell style. I bet you that the model has never in its training set seen \"default cell sta\" without \"Le\" in there. It's always seen this as a single group because this is some kind of function in, I'm guessing, an API. I bet you that it's never seen this combination of tokens in its training data, or I think it would be extremely rare. ",
"So, I took this and copy-pasted it here, and I tried to complete from it. It immediately gave me a big error and said the model predicted a completion that begins with a stop sequence, resulting in no output. It suggested considering adjusting your prompt or stop sequences. ",
"What happened here when I clicked submit is that immediately the model emitted an end-of-text token or something like that. It basically predicted the stop sequence immediately, so it had no completion. This is why I'm getting a warning again, because we're off the data distribution, and the model is just predicting totally arbitrary things. It's really confused. This is giving it brain damage; it's never seen this before, it's shocked, and it's predicting end-of-text or something. ",
"I tried it again, and in this case, it completed it, but then for some reason, this request may violate our usage policies. This was flagged. Basically, something just goes wrong, and there's something like \"Jank.\" You can just feel the Jank because the model is extremely unhappy with just this, and it doesn't know how to complete it because it's never occurred in the training set. In the training set, it always appears like this and becomes a single token. "
],
"paragraph_timestamps": [
7124,
7149,
7176,
7193,
7217,
7245,
7271,
7299,
7330,
7344,
7375
]
},
{
"num_chapter": 33,
"title": "Understanding Unstable Tokens",
"start_paragraph_number": 323,
"end_paragraph_number": 327,
"start_time": 7398,
"end_time": 7492,
"paragraphs": [
"These kinds of issues, where tokens are either you sort of complete the first character of the next token, or you have long tokens that you then have just some of the characters off, all of these are kind of like issues with partial tokens, is how I would describe it. If you actually dig into the token repository, go to the Rust code and search for \"unstable,\" you'll see \"encode unstable native unstable tokens\" and a lot of special case handling. None of this stuff about unstable tokens is documented anywhere, but there's a ton of code dealing with unstable tokens. ",
"Unstable tokens are exactly what I'm describing here. What you would like out of a completion API is something a lot fancier. If we're putting in \"default cell sta,\" if we're asking for the next token sequence, we're not actually trying to append the next token exactly after this list. We're actually trying to consider lots of tokens that, if we were to retain, would be of high probability, if that makes sense. ",
"So that we can actually add a single individual character instead of just adding the next full token that comes after this partial token list. This is very tricky to describe, and I invite you to maybe look through this. It ends up being an extremely gnarly and hairy kind of topic, and it comes from tokenization fundamentally. ",
"Maybe I can even spend an entire video talking about unstable tokens sometime in the future. "
],
"paragraph_timestamps": [
7398,
7438,
7468,
7487
]
},
{
"num_chapter": 34,
"title": "The Solid Gold Magikarp Phenomenon",
"start_paragraph_number": 327,
"end_paragraph_number": 339,
"start_time": 7492,
"end_time": 7754,
"paragraphs": [
"Okay, and I'm really saving the best for last. My favorite one by far is the solid gold Magikarp. This comes from a blog post titled \"Solid Gold Magikarp,\" and this is internet famous now for those of us in LLMs. I would advise you to read this blog post in full. ",
"Basically, what this person was doing is they went to the token embedding stable and clustered the tokens based on their embedding representation. This person noticed that there's a cluster of tokens that look really strange. There's a cluster here at \"rot e stream Fame solid gold Magikarp Signet message,\" like really weird tokens in this embedding cluster. ",
"So, what are these tokens, and where do they even come from? What is \"solid gold Magikarp\"? It makes no sense. ",
"Then they found a bunch of these tokens, and they noticed that the plot thickens here. If you ask the model about these tokens, like you ask it a very benign question, such as, \"Please can you repeat back to me the string 'solid gold Magikarp'?\" then you get a variety of totally broken LLM behavior. ",
"You either get evasion, like \"I'm sorry, I can't hear you,\" or you receive a bunch of hallucinations as a response. You can even get back insults. For example, if you ask it about a streamer bot, the model might actually just call you names, or it might come up with weird humor. You're actually breaking the model by asking about these very simple strings, like \"solid gold Magikarp.\" ",
"So, what the hell is happening? There are a variety of documented behaviors here. There are a bunch of tokens, not just \"solid gold Magikarp,\" that exhibit this kind of behavior. Essentially, there are a number of trigger words, and if you ask the model about these trigger words or include them in your prompt, the model goes haywire and exhibits all kinds of strange behaviors, including those that violate typical safety guidelines and the alignment of the model. For instance, it might swear back at you. ",
"What is happening here, and how can this possibly be true? Well, this again comes down to tokenization. What's happening is that \"solid gold Magikarp,\" if you actually dig into it, is a Reddit user. There\u2019s a user named \"solid gold Magikarp,\" and probably what happened here, even though I don't know that this has been definitively explored, is that the tokenization dataset was very different from the training dataset for the actual language model. ",
"In the tokenization dataset, there was a ton of Reddit data where the user \"solid gold Magikarp\" was mentioned in the text. This user was very active and posted a lot. Therefore, this string occurs many times in the tokenization dataset. Because it occurs so frequently, these tokens would end up getting merged into a single individual token for that Reddit user, \"solid gold Magikarp.\" ",
"So, they would have a dedicated token in a vocabulary of, was it 50,000 tokens in GPT-2, that is devoted to that Reddit user. Then what happens is that the tokenization dataset has those strings, but later, when you train the language model itself, this data from Reddit was not present. Therefore, in the entire training set for the language model, \"solid gold Magikarp\" never occurs; that token never appears in the training set for the actual language model. ",
"Later, this token never gets activated. It's initialized at random at the beginning of optimization. Then you have forward and backward passes and updates to the model, and this token is just never updated in the embedding table. That row vector never gets sampled; it never gets used, so it never gets trained. It's completely untrained, kind of like unallocated memory in a typical binary program written in C or something like that. ",
"So, it's unallocated memory, and then at test time, if you evoke this token, you're basically plucking out a row of the embedding table that is completely untrained. That feeds into a transformer and creates undefined behavior, and that's what we're seeing here: this completely undefined, never-before-seen behavior in training. ",
"Any of these kinds of weird tokens would evoke this behavior because fundamentally, the model is out of sample, out of distribution. "
],
"paragraph_timestamps": [
7492,
7514,
7540,
7543,
7565,
7590,
7617,
7647,
7674,
7703,
7728,
7746
]
},
{
"num_chapter": 35,
"title": "Tokenization Efficiency and Formats",
"start_paragraph_number": 339,
"end_paragraph_number": 342,
"start_time": 7754,
"end_time": 7816,
"paragraphs": [
"The very last thing I wanted to briefly mention, although I think a lot of people are quite aware of this, is that different kinds of formats, representations, and languages might be more or less efficient with GPT tokenizers or any tokenizers for that matter. For example, JSON is actually really dense in tokens, while YAML is a lot more efficient in tokens. ",
"For instance, these are the same in JSON and in YAML: the JSON is 116 tokens, and the YAML is 99 tokens, showing quite a bit of improvement. In the token economy, where we are paying per token in many ways, you are paying in the context length and in dollar amounts for the cost of processing all this structured data. ",
"Therefore, you should prefer to use YAML over JSON. In general, the tokenization density is something that you have to care about and worry about at all times. You should try to find efficient encoding schemes and spend a lot of time in tokenization, measuring the different token efficiencies of different formats and settings. "
],
"paragraph_timestamps": [
7754,
7777,
7799
]
},
{
"num_chapter": 36,
"title": "Final Thoughts on Tokenization",
"start_paragraph_number": 342,
"end_paragraph_number": 350,
"start_time": 7816,
"end_time": 7983,
"paragraphs": [
"Okay, so that concludes my fairly long video on tokenization. I know it's a trial; I know it's annoying; I know it's irritating. I personally really dislike this stage. What I do have to say at this point is: don't brush it off. There are a lot of foot guns and sharp edges here, security issues, and AI safety issues, as we saw with plugging in unallocated memory into language models. ",
"So, it's worth understanding this stage. That said, I will say that eternal glory goes to anyone who can get rid of it. I showed you one possible paper that tried to do that, and I hope more can follow over time.",
"My final recommendations for the application right now are: if you can reuse the GPT-4 tokens and the vocabulary in your application, then that's something you should consider. Just use the Tech token because it is a very efficient and nice library for inference for BPE. I also really like the byte-level BPE that TikToken and OpenAI use.",
"If, for some reason, you want to train your own vocabulary from scratch, then I would use the BPE with SentencePiece. Oops! As I mentioned, I'm not a huge fan of SentencePiece. I don't like its byte fallback, and I don't like that it's doing BPE on Unicode code points. I think it also has a million settings, and I believe there's a lot of foot guns here. It's really easy to miscalibrate them, and you end up cropping your sentences or something like that because of some type of parameter that you don't fully understand.",
"So, be very careful with the settings. Try to copy-paste exactly what Meta did, or basically spend a lot of time looking at all the hyperparameters. Go through the code of SentencePiece and make sure that you have this correct. ",
"But even if you have all the settings correct, I still think that the algorithm is kind of inferior to what's happening here. Maybe the best thing, if you really need to train your vocabulary, is to just wait for MBPE to become as efficient as possible. That's something that I hope to work on, and at some point, maybe we can be training. Basically, what we want is we want TikToken but training code, and that is the ideal thing that currently does not exist. MBPE is an implementation of it, but currently, it's in Python.",
"So, that's currently what I have to say for tokenization. There might be an advanced video that has even drier and more detailed information in the future, but for now, I think we're going to leave things off here. I hope that was helpful. ",
"Bye! They increased the context size from GPT-1 of 512 to 1024 and GPT-4. "
],
"paragraph_timestamps": [
7816,
7843,
7855,
7875,
7907,
7920,
7951,
7974
]
},
{
"num_chapter": 37,
"title": "Introduction to GPT-2 Encoder Code",
"start_paragraph_number": 350,
"end_paragraph_number": 352,
"start_time": 7983,
"end_time": 7997,
"paragraphs": [
"Next, I would like us to briefly walk through the code from OpenAI on the GPT-2 encoder. I'm sorry, I'm going to sneeze. ",
"What's happening here is this is a spout layer that I will explain in a bit. What's happening here is..."
],
"paragraph_timestamps": [
7983,
7997
]
}
]