Milan Straka committed
Commit: 41cfe08
Parent(s): 354716a

Better formulation.
README.md CHANGED
@@ -14,7 +14,9 @@ tags:
 ## Version History
 
 - **version 1.1**: Version 1.1 was released in Jan 2024, with a change to the
-  tokenizer; the
+  tokenizer; the model parameters were mostly kept the same, but the embeddings
+  were enlarged (by copying suitable rows) to correspond to the updated
+  tokenizer.
 
   The tokenizer in the initial release (a) contained a hole (51959 did not
   correspond to any token), and (b) mapped several tokens (unseen during training
@@ -25,7 +27,7 @@ tags:
 
   In version 1.1, the tokenizer was modified by (a) removing the hole, (b)
   mapping all tokens to a unique ID. That also required increasing the
-  vocabulary
+  vocabulary size and embeddings weights (by replicating the embedding of the
   `[UNK]` token). Without finetuning, version 1.1 and version 1.0 gives exactly
   the same results on any input, and the tokens in version 1.0 that mapped to
   a different ID than the `[UNK]` token map to the same ID in version 1.1.
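The embedding enlargement the new text describes, growing the matrix and filling the new rows with copies of the `[UNK]` embedding so the model behaves identically before finetuning, can be sketched in a few lines of PyTorch. This is a minimal illustration, not the authors' actual conversion script; the vocabulary sizes, embedding dimension, and `[UNK]` token ID below are placeholders, since none of them (other than the hole at 51959) are stated in this excerpt.

```python
import torch

# Sketch: enlarge an embedding matrix for an extended vocabulary by
# replicating the [UNK] row (all sizes and IDs here are illustrative).
old_vocab_size, new_vocab_size, dim = 51960, 52000, 768
unk_id = 3  # placeholder [UNK] token ID

old_embeddings = torch.randn(old_vocab_size, dim)  # stands in for trained weights

# The added rows are exact copies of the [UNK] embedding, so tokens that
# previously mapped to the [UNK] ID look up the same vector under their
# new, unique IDs.
extra_rows = old_embeddings[unk_id].expand(new_vocab_size - old_vocab_size, dim)
new_embeddings = torch.cat([old_embeddings, extra_rows], dim=0)

assert new_embeddings.shape == (new_vocab_size, dim)
assert torch.equal(new_embeddings[new_vocab_size - 1], old_embeddings[unk_id])
```

Copying the row, rather than reinitializing it (as the generic `resize_token_embeddings` in `transformers` does), is what preserves the property the README claims: without finetuning, version 1.1 produces exactly the same results as version 1.0 on any input.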