mkocian committed
Commit 49736c4
Parent(s): b15da03

Create README.md
# Small-E-Czech

Small-E-Czech is an [Electra](https://arxiv.org/abs/2003.10555)-small model pretrained on a Czech corpus created at Seznam.cz. Like other pretrained models, it should be finetuned on a downstream task of interest before use.

### How to use the discriminator in transformers
```python
from transformers import ElectraForPreTraining, ElectraTokenizerFast
import torch

discriminator = ElectraForPreTraining.from_pretrained("seznam/small-e-czech")
# strip_accents=False keeps Czech diacritics intact during tokenization
tokenizer = ElectraTokenizerFast.from_pretrained(
    "seznam/small-e-czech", strip_accents=False
)

# the original sentence and a corrupted copy ("mé" replaced by "kočka")
sentence = "Za hory, za doly, mé zlaté parohy"
fake_sentence = "Za hory, za doly, kočka zlaté parohy"

fake_sentence_tokens = ["[CLS]"] + tokenizer.tokenize(fake_sentence) + ["[SEP]"]
fake_inputs = tokenizer.encode(fake_sentence, return_tensors="pt")
discriminator_outputs = discriminator(fake_inputs)
# sigmoid over the logits gives, per token, the probability of being a fake
predictions = torch.nn.Sigmoid()(discriminator_outputs[0]).cpu().detach().numpy()

# print the tokens and their scores in aligned columns
for token in fake_sentence_tokens:
    print("{:>7s}".format(token), end="")
print()

for prediction in predictions.squeeze():
    print("{:7.1f}".format(prediction), end="")
print()
```

The output shows, for each token, the discriminator's probability that the token does not belong in the sentence (i.e. that it was faked by the generator):

```
  [CLS]     za   hory      ,     za    dol    ##y      ,  kočka  zlaté   paro   ##hy  [SEP]
    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.8    0.3    0.2    0.1    0.0
```
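
Continuing from the snippet above, the scores can also be printed next to their tokens directly; this is a minimal sketch, and the 0.5 threshold is an arbitrary choice for illustration, not part of the model:

```python
# flag tokens whose replaced-token probability exceeds an arbitrary 0.5 threshold
for token, score in zip(fake_sentence_tokens, predictions.squeeze()):
    marker = "  <-- likely replaced" if score > 0.5 else ""
    print(f"{token:>8s}  {score:.2f}{marker}")
```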

### Finetuning

For instructions on how to finetune the model on a new task, see the official HuggingFace transformers [tutorial](https://huggingface.co/transformers/training.html).
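
As a rough sketch of what finetuning can look like (not taken from the tutorial; the toy sentiment examples, `num_labels=2`, and the output directory below are placeholder assumptions), a classification head can be put on top of the pretrained encoder and trained with the `Trainer` API:

```python
import torch
from transformers import (
    ElectraForSequenceClassification,
    ElectraTokenizerFast,
    Trainer,
    TrainingArguments,
)

# toy labeled data; replace with a real Czech dataset for your task
texts = ["skvělý film", "hrozná nuda"]
labels = [1, 0]

tokenizer = ElectraTokenizerFast.from_pretrained(
    "seznam/small-e-czech", strip_accents=False
)
encodings = tokenizer(texts, truncation=True, padding=True)

class ToyDataset(torch.utils.data.Dataset):
    """Wraps tokenizer output and labels as a torch Dataset for Trainer."""
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item
    def __len__(self):
        return len(self.labels)

# num_labels=2 is a placeholder for a binary classification task
model = ElectraForSequenceClassification.from_pretrained(
    "seznam/small-e-czech", num_labels=2
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="small-e-czech-finetuned", num_train_epochs=1),
    train_dataset=ToyDataset(encodings, labels),
)
trainer.train()
```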