# Bert Punctuation Restoration Danish

This model performs the punctuation restoration task in Danish. The method used is sequence classification, similar to how NER models are trained.
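
Because the task is cast as token classification, each word in the input gets a label describing the punctuation to attach to it, exactly as an NER model assigns entity labels. As a rough illustration of that framing, here is a minimal sketch assuming the model is published on the Hub as `Alvenir/bert-punct-restoration-da` and uses punctuation-mark label names; both the id and the label inventory are assumptions, not confirmed by this card:

```python
# Sketch only: punctuation restoration viewed as token classification.
# The model id and the label names below are assumptions for illustration.
from transformers import pipeline

pipe = pipeline("token-classification", model="Alvenir/bert-punct-restoration-da")

for pred in pipe("mit navn det er rasmus"):
    # Each prediction pairs a word with a label, e.g. "O" for no
    # punctuation or a mark such as "." / "," to append after the word.
    print(pred["word"], pred["entity"])
```

In practice you should use the `punctfix` package described below, which handles the label-to-text reassembly for you.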

## Model description

TODO

### How to use

The model requires some additional inference code, so we created an awesome little pip package, `punctfix`, for inference.
The inference code is based on Hugging Face's `TokenClassificationPipeline`.
```python
>>> from punctfix import PunctFixer
>>> model = PunctFixer(language="da")

>>> example_text = "mit navn det er rasmus og jeg kommer fra firmaet alvenir det er mig som har trænet denne lækre model"
>>> print(model.punctuate(example_text))
Mit navn det er Rasmus og jeg kommer fra firmaet Alvenir. Det er mig som har trænet denne lækre model.

>>> example_text = "en dag bliver vi sku glade for at vi nu kan sætte punktummer og kommaer i en sætning det fungerer da meget godt ikke"
>>> print(model.punctuate(example_text))
En dag bliver vi sku glade for, at vi nu kan sætte punktummer og kommaer i en sætning. Det fungerer da meget godt, ikke?
```
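
The package is published on PyPI, so `pip install punctfix` should be the only setup needed (assuming the published name matches the import). For non-Danish readers, the first example reads roughly "My name is Rasmus and I come from the company Alvenir. It is me who trained this lovely model.", and the second "One day we will be damn glad that we can now put periods and commas in a sentence. It works quite well, doesn't it?". Conceptually, the fixer maps each word's predicted label back onto the text; a hypothetical sketch of that reassembly step follows (the `rebuild` helper and its label scheme are illustrative, not the package's actual internals):

```python
# Hypothetical sketch of the reassembly step a punctuation fixer performs.
# Assumed label scheme: "O" means no punctuation, anything else is a
# mark to append after the word.
def rebuild(labeled_words: list[tuple[str, str]]) -> str:
    out = []
    capitalize_next = True  # capitalize the start of each sentence
    for word, label in labeled_words:
        if capitalize_next:
            word = word.capitalize()
            capitalize_next = False
        out.append(word if label == "O" else word + label)
        capitalize_next = label in {".", "?", "!"}
    return " ".join(out)

print(rebuild([("hej", ","), ("verden", "."), ("det", "O"), ("virker", ".")]))
# -> "Hej, verden. Det virker."
```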

## Training data

TODO

## Training procedure

TODO

### Preprocessing

TODO

## Evaluation results

TODO