Milan Straka committed
Commit: 354716a
Parent(s): 8a75d1f
Release version 1.1 of RobeCzech.
In version 1.1, the tokenizer was modified by (a) removing the hole and (b)
mapping all tokens to unique IDs. That also required increasing the vocabulary
size and the embedding weights (by replicating the embedding of the `[UNK]`
token). Without finetuning, version 1.1 and version 1.0 give exactly the same
results on any input, and the tokens that in version 1.0 mapped to a different
ID than the `[UNK]` token keep the same IDs in version 1.1. However, the sizes
of the embeddings (and the LM head weights and biases) are different, so the
weights of version 1.1 are not compatible with the configuration of
version 1.0 and vice versa.
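The conversion described above can be sketched roughly as follows; this is an illustration, not the script used for the release. It grows the vocabulary-sized tensors of a v1.0 checkpoint by replicating the `[UNK]` row. The parameter names are the standard `transformers` RoBERTa ones, the `grow` helper is hypothetical, and `UNK_ID = 3` / `NEW_VOCAB_SIZE = 51997` are the values mentioned in this commit.

```python
import torch
from transformers import AutoModelForMaskedLM

# Load the 1.0 checkpoint (the "v1.0" revision tag is referenced in the README below).
model = AutoModelForMaskedLM.from_pretrained("ufal/robeczech-base", revision="v1.0")
state = model.state_dict()

UNK_ID = 3              # ID of the `[UNK]` token, per the commit message
NEW_VOCAB_SIZE = 51997  # vocab_size of version 1.1, per the config.json change

def grow(tensor):
    """Extend the vocabulary dimension (dim 0) by repeating the `[UNK]` row."""
    missing = NEW_VOCAB_SIZE - tensor.shape[0]
    unk_row = tensor[UNK_ID:UNK_ID + 1]
    return torch.cat([tensor, unk_row.repeat_interleave(missing, dim=0)], dim=0)

# The word embeddings and the LM head weights and biases all have a
# vocabulary-sized dimension, so each is extended the same way.
for name in ("roberta.embeddings.word_embeddings.weight",
             "lm_head.decoder.weight", "lm_head.decoder.bias", "lm_head.bias"):
    if name in state:
        state[name] = grow(state[name])

# Saving `state` together with a config whose vocab_size is 51997 yields
# weights shaped like the version 1.1 release.
```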
- README.md +32 -1
- config.json +1 -1
- model.safetensors +2 -2
- pytorch_model.bin +2 -2
- tf_model.h5 +2 -2
- tokenizer.json +0 -0
- vocab.json +0 -0
README.md
CHANGED
@@ -11,7 +11,38 @@ tags:
 
 # Model Card for RobeCzech
 
-
+## Version History
+
+- **version 1.1**: Version 1.1 was released in Jan 2024, with a change to the
+  tokenizer; the weights are unmodified.
+
+  The tokenizer in the initial release (a) contained a hole (ID 51959 did not
+  correspond to any token), and (b) mapped several tokens (unseen during training
+  but required by the BBPE tokenizer) to the same ID as the `[UNK]` token (3).
+  That sometimes caused problems, as in https://huggingface.co/ufal/robeczech-base/discussions/4.
+  See https://huggingface.co/ufal/robeczech-base/discussions/4#64b8f6a7f1f8e6ea5860b314
+  for more information.
+
+  In version 1.1, the tokenizer was modified by (a) removing the hole and (b)
+  mapping all tokens to unique IDs. That also required increasing the
+  vocabulary size and the embedding weights (by replicating the embedding of
+  the `[UNK]` token). Without finetuning, version 1.1 and version 1.0 give
+  exactly the same results on any input, and the tokens that in version 1.0
+  mapped to a different ID than the `[UNK]` token keep the same IDs in version 1.1.
+
+  However, the sizes of the embeddings (and the LM head weights and biases) are
+  different, so the weights of version 1.1 are not compatible with the
+  configuration of version 1.0 and vice versa.
+
+- **version 1.0**: Initial version released in May 2021 (with the tokenization
+  issues described above).
+
+  If you want to load the pretrained model, configuration, or tokenizer of
+  version 1.0, you can use
+  ```python
+  from_pretrained("ufal/robeczech-base", revision="v1.0")
+  ```
+  to create an `AutoModel`, an `AutoConfig`, or an `AutoTokenizer`.
 
 # Model Details
 
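To make the `from_pretrained` snippet above concrete, here is a small sketch; it assumes the default branch now serves version 1.1, and the Czech example sentence is arbitrary. It loads both revisions and checks the claim that, without finetuning, they give the same results on any input.

```python
import torch
from transformers import AutoModel, AutoTokenizer

text = "RobeCzech je monolingvální český jazykový model."  # arbitrary example input

# Version 1.0 via the pinned revision, version 1.1 from the default branch.
tok_v10 = AutoTokenizer.from_pretrained("ufal/robeczech-base", revision="v1.0")
model_v10 = AutoModel.from_pretrained("ufal/robeczech-base", revision="v1.0")
tok_v11 = AutoTokenizer.from_pretrained("ufal/robeczech-base")
model_v11 = AutoModel.from_pretrained("ufal/robeczech-base")

with torch.no_grad():
    h_v10 = model_v10(**tok_v10(text, return_tensors="pt")).last_hidden_state
    h_v11 = model_v11(**tok_v11(text, return_tensors="pt")).last_hidden_state

# Per the version history above, the hidden states should match.
print(torch.allclose(h_v10, h_v11))
```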
config.json
CHANGED
@@ -18,5 +18,5 @@
   "num_hidden_layers": 12,
   "pad_token_id": 1,
   "type_vocab_size": 1,
-  "vocab_size":
+  "vocab_size": 51997
 }
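The changed value can be confirmed straight from the Hub; a minimal sketch, assuming the same revision tags as above (the v1.0 value is not shown in the rendered diff, so it is simply printed):

```python
from transformers import AutoConfig

# vocab_size of the current (1.1) release; expected to be 51997 per this commit.
print(AutoConfig.from_pretrained("ufal/robeczech-base").vocab_size)

# vocab_size of the 1.0 release (its old value is not rendered in the diff above).
print(AutoConfig.from_pretrained("ufal/robeczech-base", revision="v1.0").vocab_size)
```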
model.safetensors
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:1471bbf101e701ff889b87a5dd0dc87bd862e688b77f30fdb5c78f571750a1b9
+size 504141580
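The three model files are Git LFS pointers, so only their hashes and sizes change. As a quick sketch (using `huggingface_hub` and `hashlib`; not part of the commit), a downloaded file can be checked against the sha256 recorded above:

```python
import hashlib
from huggingface_hub import hf_hub_download

EXPECTED = "1471bbf101e701ff889b87a5dd0dc87bd862e688b77f30fdb5c78f571750a1b9"

# Download (or reuse from cache) the safetensors weights of the current release.
path = hf_hub_download("ufal/robeczech-base", "model.safetensors")

sha256 = hashlib.sha256()
with open(path, "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):  # hash in 1 MiB chunks
        sha256.update(chunk)

print(sha256.hexdigest() == EXPECTED)
```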
pytorch_model.bin
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:d352e06e70edb7626a1dfd59ad2481f9c7d01996d6c8b137b1ad9c38ae6c5a37
+size 504184434
tf_model.h5
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:70e3cc7e936ff540d2d8e345ddfe7c40662f40732941d7bfee6b1125ac0b5e7a
+size 665719492
tokenizer.json
CHANGED
The diff for this file is too large to render. See raw diff.

vocab.json
CHANGED
The diff for this file is too large to render. See raw diff.