Update README.md
---
language: da
tags:
- bert
- punctuation restoration
license: apache-2.0
datasets:
- custom
---

# Bert Punctuation Restoration Danish

This model performs the punctuation restoration task in Danish. The method used is sequence classification, similar to how NER models are trained.

## Model description
A BERT model fine-tuned for punctuation restoration in Danish, using the same classification setup that NER models are trained with.

## Intended uses & limitations
Use it through the custom `punctfix` library, due to the extra inference code it requires.

### How to use

You can use this model through the `punctfix` library:

```python
from punctfix import Autopunct

# Load the Danish punctuation restoration model
model = Autopunct(language="da")

example_text = "hej med dig mit navn det er rasmus og det er mig som har trænet denne lækre model"

# Print the text with punctuation restored
print(model.punctuate(example_text))
```
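
If you only want to inspect the raw per-token predictions, the underlying checkpoint can also be loaded with the standard `transformers` token-classification pipeline. This is a minimal sketch under the assumption that the model is published as an ordinary token-classification checkpoint; the repository ID is a placeholder, and the raw labels still need the library's post-processing to become punctuated text:

```python
from transformers import pipeline

# Placeholder repository ID -- substitute the actual repo this card belongs to
nlp = pipeline("token-classification", model="<this-model-repo>")

# Prints each word with its predicted label and confidence;
# the label scheme is model-specific and not documented in this card.
for prediction in nlp("hej med dig mit navn det er rasmus"):
    print(prediction["word"], prediction["entity"], round(prediction["score"], 3))
```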

### Limitations and bias

## Training data

To Do

## Training procedure

### Preprocessing

TODO

The texts are lowercased and tokenized using WordPiece with a vocabulary size of 30,000. The inputs of the model are then of the form:

```
[CLS] Sentence A [SEP] Sentence B [SEP]
```

With probability 0.5, sentence A and sentence B correspond to two consecutive sentences in the original corpus, and in the other cases sentence B is another random sentence from the corpus. Note that what is considered a sentence here is a consecutive span of text, usually longer than a single sentence. The only constraint is that the result with the two "sentences" has a combined length of less than 512 tokens.

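As an illustration of that pair format, here is a small sketch using a stock WordPiece BERT tokenizer from `transformers`; the checkpoint name is only an example and not necessarily the tokenizer used for this model:

```python
from transformers import AutoTokenizer

# Any BERT WordPiece tokenizer shows the same pair packing;
# "bert-base-uncased" is used purely as an example checkpoint.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

encoded = tokenizer("first segment of text", "second segment of text")
# The pair is packed as: [CLS] <tokens of A> [SEP] <tokens of B> [SEP]
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
```
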
The details of the masking procedure for each sentence are the following (a sketch follows the list):
- 15% of the tokens are masked.
- In 80% of the cases, the masked tokens are replaced by `[MASK]`.
- In 10% of the cases, the masked tokens are replaced by a random token, different from the one they replace.
- In the remaining 10% of cases, the masked tokens are left as is.

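A minimal sketch of that 80/10/10 rule in plain Python; the tiny vocabulary and token list are stand-ins, independent of any particular tokenizer:

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """BERT-style masking: 15% of tokens are selected; of those,
    80% become [MASK], 10% become a random different token, 10% stay as is."""
    masked = list(tokens)
    for i, token in enumerate(tokens):
        if random.random() >= mask_prob:
            continue  # token not selected for masking
        roll = random.random()
        if roll < 0.8:
            masked[i] = "[MASK]"
        elif roll < 0.9:
            masked[i] = random.choice([t for t in vocab if t != token])
        # else: leave the token unchanged
    return masked

# Stand-in vocabulary and sentence, just to show the behaviour
vocab = ["hej", "med", "dig", "mit", "navn", "det", "er", "rasmus"]
print(mask_tokens(["hej", "med", "dig", "mit", "navn"], vocab))
```
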
## Evaluation results

TODO

When fine-tuned on downstream tasks, this model achieves the following results:

| Task | MNLI | QQP  | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE  |
|:----:|:----:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:|
|      | 82.2 | 88.5 | 89.2 | 91.3  | 51.3 | 85.8  | 87.5 | 59.9 |