Update README.md
---
language: da
tags:
- bert
- punctuation restoration
license: apache-2.0
datasets:
- custom
---

# Bert Punctuation Restoration Danish

This model performs the punctuation restoration task in Danish. The method used is sequence classification, similar to how NER models are trained.

## Model description
A BERT model fine-tuned for punctuation restoration in Danish, using the same classification setup that NER models are trained with.

## Intended uses & limitations
Use it through the custom `punctfix` library, due to the extra inference code it requires.

### How to use

You can use this model through the `punctfix` library:

```python
from punctfix import Autopunct

# Load the Danish punctuation restoration model
model = Autopunct(language="da")

example_text = "hej med dig mit navn det er rasmus og det er mig som har trænet denne lækre model"

# Print the text with punctuation restored
print(model.punctuate(example_text))
```
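
If you only want to inspect the raw per-token predictions, the underlying checkpoint can also be loaded with the standard `transformers` token-classification pipeline. This is a minimal sketch under the assumption that the model is published as an ordinary token-classification checkpoint; the repository ID is a placeholder, and the raw labels still need the library's post-processing to become punctuated text:

```python
from transformers import pipeline

# Placeholder repository ID -- substitute the actual repo this card belongs to
nlp = pipeline("token-classification", model="<this-model-repo>")

# Prints each word with its predicted label and confidence;
# the label scheme is model-specific and not documented in this card.
for prediction in nlp("hej med dig mit navn det er rasmus"):
    print(prediction["word"], prediction["entity"], round(prediction["score"], 3))
```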

### Limitations and bias

## Training data

To Do

## Training procedure

### Preprocessing

TODO

The texts are lowercased and tokenized using WordPiece with a vocabulary size of 30,000. The inputs of the model are then of the form:

```
[CLS] Sentence A [SEP] Sentence B [SEP]
```

With probability 0.5, sentence A and sentence B correspond to two consecutive sentences in the original corpus, and in the other cases sentence B is another random sentence from the corpus. Note that what is considered a sentence here is a consecutive span of text, usually longer than a single sentence. The only constraint is that the result with the two "sentences" has a combined length of less than 512 tokens.

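As an illustration of that pair format, here is a small sketch using a stock WordPiece BERT tokenizer from `transformers`; the checkpoint name is only an example and not necessarily the tokenizer used for this model:

```python
from transformers import AutoTokenizer

# Any BERT WordPiece tokenizer shows the same pair packing;
# "bert-base-uncased" is used purely as an example checkpoint.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

encoded = tokenizer("first segment of text", "second segment of text")
# The pair is packed as: [CLS] <tokens of A> [SEP] <tokens of B> [SEP]
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
```
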
The details of the masking procedure for each sentence are the following (a sketch follows the list):
- 15% of the tokens are masked.
- In 80% of the cases, the masked tokens are replaced by `[MASK]`.
- In 10% of the cases, the masked tokens are replaced by a random token, different from the one they replace.
- In the remaining 10% of cases, the masked tokens are left as is.

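A minimal sketch of that 80/10/10 rule in plain Python; the tiny vocabulary and token list are stand-ins, independent of any particular tokenizer:

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """BERT-style masking: 15% of tokens are selected; of those,
    80% become [MASK], 10% become a random different token, 10% stay as is."""
    masked = list(tokens)
    for i, token in enumerate(tokens):
        if random.random() >= mask_prob:
            continue  # token not selected for masking
        roll = random.random()
        if roll < 0.8:
            masked[i] = "[MASK]"
        elif roll < 0.9:
            masked[i] = random.choice([t for t in vocab if t != token])
        # else: leave the token unchanged
    return masked

# Stand-in vocabulary and sentence, just to show the behaviour
vocab = ["hej", "med", "dig", "mit", "navn", "det", "er", "rasmus"]
print(mask_tokens(["hej", "med", "dig", "mit", "navn"], vocab))
```
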
## Evaluation results

TODO

When fine-tuned on downstream tasks, this model achieves the following results:

| Task | MNLI | QQP  | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE  |
|:----:|:----:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:|
|      | 82.2 | 88.5 | 89.2 | 91.3  | 51.3 | 85.8  | 87.5 | 59.9 |