arpelarpe committed on
Commit c87c19e
1 Parent(s): cf64859

Update README.md

Files changed (1)
  1. README.md +71 -1
README.md CHANGED
@@ -1 +1,71 @@
- # Punctuation Restoration for danish
+ ---
+ language: da
+ tags:
+ - bert
+ - punctuation restoration
+ license: apache-2.0
+ datasets:
+ - custom
+
+ ---
+
+ # Bert Punctuation Restoration Danish
+ This model performs punctuation restoration for Danish. The method used is sequence classification, similar to how NER models are trained.
+
+ ## Model description
+ A BERT-based model for restoring punctuation in Danish text. The task is treated as token classification, in the style of NER: each input token is assigned a label indicating the punctuation (if any) that should follow it, and the predicted labels are used to rebuild the punctuated text.
+
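+ To make the token-classification framing concrete, here is a minimal sketch of how per-word labels can be mapped back to punctuated text. The label names and the `restore` helper are illustrative assumptions, not this model's actual label scheme.
+
+ ```python
+ # Illustrative sketch only: the label names below are assumptions,
+ # not necessarily the label set used by this model.
+ PUNCT = {"COMMA": ",", "PERIOD": ".", "QUESTION": "?"}  # hypothetical label -> symbol map
+
+ def restore(words, labels):
+     """Rebuild punctuated text from per-word punctuation labels (NER-style tagging)."""
+     return " ".join(word + PUNCT.get(label, "") for word, label in zip(words, labels))
+
+ words = "hej med dig mit navn det er rasmus".split()
+ labels = ["O", "O", "COMMA", "O", "O", "O", "O", "PERIOD"]  # hypothetical model output
+ print(restore(words, labels))  # -> "hej med dig, mit navn det er rasmus."
+ ```
+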
+ ## Intended uses & limitations
+ Use it through the custom punctfix library, due to the extra inference code required around the model.
+
+ ### How to use
+
+ You can use this model through the punctfix library:
+
+ ```python
+ from punctfix import Autopunct
+
+ # Load the Danish punctuation restoration model
+ model = Autopunct(language="da")
+
+ # Unpunctuated Danish input ("hi there, my name is Rasmus, and it is me who trained this nice model")
+ example_text = "hej med dig mit navn det er rasmus og det er mig som har trænet denne lækre model"
+
+ print(model.punctuate(example_text))
+ ```
+
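+ If you prefer to call the model directly rather than going through punctfix, a token-classification pipeline along these lines should work. The model id below is a placeholder for this repository, and mapping the predicted labels back to punctuation marks depends on the label scheme, which punctfix otherwise handles for you.
+
+ ```python
+ # Sketch of direct use via transformers; the model id is a placeholder for this repo,
+ # and interpreting the predicted labels is left to the caller.
+ from transformers import pipeline
+
+ tagger = pipeline("token-classification", model="<this-repo-id>")  # placeholder id
+
+ text = "hej med dig mit navn det er rasmus og det er mig som har trænet denne lækre model"
+ for prediction in tagger(text):
+     print(prediction["word"], prediction["entity"], round(float(prediction["score"]), 3))
+ ```
+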
+ ### Limitations and bias
+
+ ## Training data
+
+ To Do
+
+ ## Training procedure
+
+ ### Preprocessing
+
+ TODO
+
+ The texts are lowercased and tokenized using WordPiece with a vocabulary size of 30,000. The inputs of the model are then of the form:
+
+ ```
+ [CLS] Sentence A [SEP] Sentence B [SEP]
+ ```
+
+ With probability 0.5, sentence A and sentence B correspond to two consecutive sentences in the original corpus; otherwise, sentence B is a random sentence from the corpus. Note that what is considered a sentence here is a consecutive span of text, usually longer than a single sentence. The only constraint is that the two "sentences" have a combined length of less than 512 tokens.
+
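+ A rough sketch of that pairing step, assuming a plain list of text spans and a generic tokenizer (the names here are illustrative, not code from this repository):
+
+ ```python
+ # Rough sketch of the sentence-pair sampling described above; `spans` and
+ # `tokenizer` are assumed inputs, not part of this repository.
+ import random
+
+ def sample_pair(spans, tokenizer, max_tokens=512):
+     """Pick span A and either its successor (p=0.5) or a random span, keeping under the token limit."""
+     while True:
+         i = random.randrange(len(spans) - 1)
+         a = spans[i]
+         b = spans[i + 1] if random.random() < 0.5 else random.choice(spans)
+         pair = f"[CLS] {a} [SEP] {b} [SEP]"
+         if len(tokenizer.tokenize(pair)) < max_tokens:
+             return pair
+ ```
+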
+ The details of the masking procedure for each sentence are the following (see the sketch after this list):
+ - 15% of the tokens are masked.
+ - In 80% of the cases, the masked tokens are replaced by `[MASK]`.
+ - In 10% of the cases, the masked tokens are replaced by a random token (different from the one they replace).
+ - In the remaining 10% of cases, the masked tokens are left as is.
+
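+ A minimal sketch of that 15% / 80-10-10 scheme, assuming a plain list of token strings and a generic vocabulary (not code from this repository):
+
+ ```python
+ # Minimal sketch of the 15% / 80-10-10 masking scheme described above;
+ # `vocab` is an assumed list of vocabulary tokens, not part of this repository.
+ import random
+
+ def mask_tokens(tokens, vocab, mask_token="[MASK]"):
+     masked = list(tokens)
+     for i, token in enumerate(tokens):
+         if random.random() >= 0.15:   # select 15% of tokens for masking
+             continue
+         r = random.random()
+         if r < 0.8:                   # 80%: replace with [MASK]
+             masked[i] = mask_token
+         elif r < 0.9:                 # 10%: replace with a different random token
+             masked[i] = random.choice([t for t in vocab if t != token])
+         # remaining 10%: keep the original token unchanged
+     return masked
+ ```
+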
+ ## Evaluation results
+
+ TODO
+
+ When fine-tuned on downstream tasks, this model achieves the following results:
+
+ Results:
+
+ | Task  | MNLI | QQP  | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE  |
+ |:-----:|:----:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:|
+ | Score | 82.2 | 88.5 | 89.2 | 91.3  | 51.3 | 85.8  | 87.5 | 59.9 |