Milan Straka committed on
Commit 354716a
1 Parent(s): 8a75d1f

Release version 1.1 of RobeCzech.

In version 1.1, the tokenizer was modified by (a) removing the hole (ID 51959
did not correspond to any token) and (b) mapping every token to a unique ID.
That also required increasing the vocabulary size and the embedding weights
(by replicating the embedding of the `[UNK]` token). Without finetuning,
version 1.1 and version 1.0 give exactly the same results on any input, and
the tokens in version 1.0 that mapped to a different ID than the `[UNK]` token
map to the same ID in version 1.1.

However, the sizes of the embeddings (and of the LM head weights and biases)
are different, so the weights of version 1.1 are not compatible with the
configuration of version 1.0 and vice versa.
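
To illustrate why replicating the `[UNK]` embedding keeps the outputs identical while still changing the tensor shapes, here is a minimal toy sketch. It is not the script used for this release: the token strings, IDs, the two-dimensional embeddings, and the helper names `split_shared_ids`, `extend_embeddings`, and `embed` are all made up for illustration, the hole is not modeled, and only an input embedding table is shown (the real model also has LM head weights and biases).

```python
UNK = "[UNK]"


def split_shared_ids(vocab, unk=UNK):
    """Give every token that shares the [UNK] ID a fresh ID appended at the end."""
    new_vocab = dict(vocab)
    next_id = max(vocab.values()) + 1
    for token, idx in vocab.items():
        if token != unk and idx == vocab[unk]:
            new_vocab[token] = next_id
            next_id += 1
    return new_vocab


def extend_embeddings(embeddings, old_vocab, new_vocab, unk=UNK):
    """Append one copy of the [UNK] row for every newly assigned ID."""
    extra = max(new_vocab.values()) + 1 - len(embeddings)
    unk_row = embeddings[old_vocab[unk]]
    return embeddings + [list(unk_row) for _ in range(extra)]


def embed(tokens, vocab, embeddings):
    """Look up the embedding row of each token."""
    return [embeddings[vocab[t]] for t in tokens]


# Toy "version 1.0": two rare tokens share the ID of [UNK] (3).
old_vocab = {"<s>": 0, "<pad>": 1, "</s>": 2, "[UNK]": 3,
             "ahoj": 4, "rare_a": 3, "rare_b": 3}
old_emb = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1], [4.0, 4.1]]

# Toy "version 1.1": unique IDs, embedding table grown with [UNK] copies.
new_vocab = split_shared_ids(old_vocab)
new_emb = extend_embeddings(old_emb, old_vocab, new_vocab)

sentence = ["<s>", "ahoj", "rare_a", "rare_b", "</s>"]
same_output = embed(sentence, old_vocab, old_emb) == embed(sentence, new_vocab, new_emb)
shapes_differ = len(old_emb) != len(new_emb)
```

In this toy setup `same_output` is true because the new rows are exact copies of the `[UNK]` row, while `shapes_differ` is true because the table has more rows, which mirrors the incompatibility between the two versions' weights and configurations.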

Files changed (7)
  1. README.md +32 -1
  2. config.json +1 -1
  3. model.safetensors +2 -2
  4. pytorch_model.bin +2 -2
  5. tf_model.h5 +2 -2
  6. tokenizer.json +0 -0
  7. vocab.json +0 -0
README.md CHANGED
@@ -11,7 +11,38 @@ tags:

  # Model Card for RobeCzech

- **If you are having issues with the tokenizer, please see https://huggingface.co/ufal/robeczech-base/discussions/4#64b8f6a7f1f8e6ea5860b314.**
+ ## Version History
+
+ - **version 1.1**: Version 1.1 was released in Jan 2024, with a change to the
+ tokenizer; the weights are unmodified.
+
+ The tokenizer in the initial release (a) contained a hole (51959 did not
+ correspond to any token), and (b) mapped several tokens (unseen during training
+ but required by the BBPE tokenizer) to the same ID as the `[UNK]` token (3).
+ That sometimes caused problems, as in https://huggingface.co/ufal/robeczech-base/discussions/4.
+ See https://huggingface.co/ufal/robeczech-base/discussions/4#64b8f6a7f1f8e6ea5860b314
+ for more information.
+
+ In version 1.1, the tokenizer was modified by (a) removing the hole, (b)
+ mapping all tokens to a unique ID. That also required increasing the
+ vocabulary sizes and embeddings weights (by replicating the embedding of the
+ `[UNK]` token). Without finetuning, version 1.1 and version 1.0 gives exactly
+ the same results on any input, and the tokens in version 1.0 that mapped to
+ a different ID than the `[UNK]` token map to the same ID in version 1.1.
+
+ However, the sizes of the embeddings (and LM head weights and biases) are
+ different, so the weights of the version 1.1 are not compatible with the
+ configuration of version 1.0 and vice versa.
+
+ - **version 1.0**: Initial version released in May 2021 (with the tokenization
+ issues described above).
+
+ If you want to load a pretrained model, configuration, or a tokenizer of
+ version 1.0, you can use
+ ```python
+ from_pretrained("ufal/robeczech-base", revision="v1.0")
+ ```
+ to create an `AutoModel`, an `AutoConfig`, or an `AutoTokenizer`.

  # Model Details

config.json CHANGED
@@ -18,5 +18,5 @@
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "type_vocab_size": 1,
- "vocab_size": 51961
+ "vocab_size": 51997
  }
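
The `vocab_size` change in this hunk is what makes the two versions' weights and configurations mutually incompatible. The following sketch, under stated assumptions, shows the arithmetic and one way to pin a revision when loading. The helper names (`added_embedding_rows`, `config_matches_weights`, `load_version`) are hypothetical; the repository name and the `v1.0` tag are taken from the README text in this commit, and the sketch assumes that the `v1.0` revision still carries the old `vocab_size` of 51961.

```python
REPO_ID = "ufal/robeczech-base"

# vocab_size values taken from the config.json diff in this commit.
VOCAB_SIZE_BY_VERSION = {"1.0": 51961, "1.1": 51997}


def added_embedding_rows():
    """Number of rows by which the vocabulary (and embedding table) grew."""
    return VOCAB_SIZE_BY_VERSION["1.1"] - VOCAB_SIZE_BY_VERSION["1.0"]


def config_matches_weights(config_vocab_size, embedding_rows):
    """Weights and configuration only fit together if both numbers agree."""
    return config_vocab_size == embedding_rows


def load_version(revision="main"):
    """Load config, tokenizer, and model from one revision of the repository.

    transformers is imported lazily so the helpers above work without it.
    Calling this function needs network access or a populated local cache.
    """
    from transformers import AutoConfig, AutoModel, AutoTokenizer

    config = AutoConfig.from_pretrained(REPO_ID, revision=revision)
    tokenizer = AutoTokenizer.from_pretrained(REPO_ID, revision=revision)
    model = AutoModel.from_pretrained(REPO_ID, revision=revision)
    return config, tokenizer, model
```

For example, `config, tokenizer, model = load_version("v1.0")` would fetch all three pieces from the same revision, which avoids mixing a version 1.0 configuration with version 1.1 weights.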
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:8f1fe5a8d6de3910c79af9ef587453aafc3a098ba131521c23d092af6f65e8ee
- size 506605544
+ oid sha256:1471bbf101e701ff889b87a5dd0dc87bd862e688b77f30fdb5c78f571750a1b9
+ size 504141580
pytorch_model.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:6ada3c49cf56eda3228362a987ef813d4cec59bbb70730237d0d86bad6f0111c
- size 506663689
+ oid sha256:d352e06e70edb7626a1dfd59ad2481f9c7d01996d6c8b137b1ad9c38ae6c5a37
+ size 504184434
tf_model.h5 CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:a135a8b68b32fba27b5950b203787cfb5f06b714122ecc216b1b2a83808e27c0
- size 667860748
+ oid sha256:70e3cc7e936ff540d2d8e345ddfe7c40662f40732941d7bfee6b1125ac0b5e7a
+ size 665719492
tokenizer.json CHANGED
The diff for this file is too large to render. See raw diff
 
vocab.json CHANGED
The diff for this file is too large to render. See raw diff