arpelarpe committed
Commit cc69a8f
1 Parent(s): c87c19e

Update README.md

Files changed (1)
  1. README.md +15 -38
README.md CHANGED
@@ -10,62 +10,39 @@ datasets:
---

# Bert Punctuation Restoration Danish
- This model performs the punctuation restoration task in Danish. The method used is sequence classification similar to the NER models
+ This model performs the punctuation restoration task in Danish. The method used is sequence classification, similar to how NER models
are trained.

## Model description
- Amazing description of a model that does stuff
-
- ## Intended uses & limitations
- Use it through custom library do to extra inference code.
+ TODO

### How to use

- You can use this model directly with a pipeline for masked language modeling:
+ The model requires some additional inference code, so we created a little pip package, `punctfix`, for inference.
+ The inference code is based on the `TokenClassificationPipeline` from Hugging Face `transformers`.

```python
- from punctfix import Autopunct
- model = Autopunct(language="da")
+ >>> from punctfix import PunctFixer
+ >>> model = PunctFixer(language="da")

- example_text = "hej med dig mit navn det er rasmus og det er mig som har trænet denne lækre model"
+ >>> example_text = "mit navn det er rasmus og jeg kommer fra firmaet alvenir det er mig som har trænet denne lækre model"
+ >>> print(model.punctuate(example_text))
+ Mit navn det er Rasmus og jeg kommer fra firmaet Alvenir. Det er mig som har trænet denne lækre model.

- print(model.punctuate(example_test))
+ >>> example_text = "en dag bliver vi sku glade for at vi nu kan sætte punktummer og kommaer i en sætning det fungerer da meget godt ikke"
+ >>> print(model.punctuate(example_text))
+ En dag bliver vi sku glade for, at vi nu kan sætte punktummer og kommaer i en sætning. Det fungerer da meget godt, ikke?
```

- ### Limitations and bias
-
## Training data
-
- To Do
+ To Do
+
## Training procedure
+ To Do

### Preprocessing

TODO
- The texts are lowercased and tokenized using WordPiece and a vocabulary size of 30,000. The inputs of the model are
- then of the form:
-
- ```
- [CLS] Sentence A [SEP] Sentence B [SEP]
- ```
-
- With probability 0.5, sentence A and sentence B correspond to two consecutive sentences in the original corpus and in
- the other cases, it's another random sentence in the corpus. Note that what is considered a sentence here is a
- consecutive span of text usually longer than a single sentence. The only constrain is that the result with the two
- "sentences" has a combined length of less than 512 tokens.
-
- The details of the masking procedure for each sentence are the following:
- - 15% of the tokens are masked.
- - In 80% of the cases, the masked tokens are replaced by `[MASK]`.
- - In 10% of the cases, the masked tokens are replaced by a random token (different) from the one they replace.
- - In the 10% remaining cases, the masked tokens are left as is.

## Evaluation results
TODO
- When fine-tuned on downstream tasks, this model achieves the following results:
-
- Results:
-
- | Task | MNLI | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE |
- |:----:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:|
- | | 82.2 | 88.5 | 89.2 | 91.3 | 51.3 | 85.8 | 87.5 | 59.9 |
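
For context on the approach this commit describes (sequence classification in the style of NER training, wrapped by `punctfix`): below is a minimal sketch of how such a model could be queried through the plain token-classification pipeline that the package is based on. The checkpoint name and the punctuation label set are placeholders for illustration; neither is specified in this commit, and the `PunctFixer` example above is the supported path.

```python
# Minimal sketch, NOT the punctfix implementation: punctuation restoration
# framed as token classification, using the Hugging Face pipeline that
# punctfix is based on. The checkpoint name is a placeholder and the label
# scheme ("O", ".", ",", "?") is an assumption for illustration.
from transformers import pipeline

nlp = pipeline(
    "token-classification",
    model="example-org/bert-punct-restoration-da",  # placeholder model id
    aggregation_strategy="first",  # merge word pieces back into whole words
)

text = "mit navn det er rasmus og jeg kommer fra firmaet alvenir"
for pred in nlp(text):
    # Each prediction pairs a word with the punctuation label predicted for
    # the position after it; a wrapper like punctfix would rebuild the
    # punctuated, capitalized sentence from these per-word labels.
    print(pred["word"], pred["entity_group"], round(pred["score"], 3))
```

Installing the package itself is presumably `pip install punctfix`; the install step is not shown in this commit.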