---
license: mit
language:
- tr
tags:
- punctuation restoration
- punctuation prediction
widget:
- text: "Türkiye toprakları üzerindeki ilk yerleşmeler Yontma Taş Devri'nde başlar Doğu Trakya'da Traklar olmak üzere Hititler Frigler Lidyalılar ve Dor istilası sonucu Yunanistan'dan kaçan Akalar tarafından kurulan İyon medeniyeti gibi çeşitli eski Anadolu medeniyetlerinin ardından Makedonya kralı Büyük İskender'in egemenliğiyle ve fetihleriyle birlikte Helenistik Dönem başladı"
---

# Transformer Based Punctuation Restoration Models for Turkish

<div float="center">
    <a href="https://github.com/uygarkurt/Turkish-Punctuation-Restoration">
        <img alt="open-source-image"
        src="https://img.shields.io/badge/GitHub-repository-green?logo=GitHub">
    </a>
</div>
<div align="center">
    <p>Liked our work? Give us a ⭐ on GitHub!</p>
</div>

This repository hosts the BERT model used in the paper [Transformer Based Punctuation Restoration for Turkish](https://ieeexplore.ieee.org/document/10286690). The aim of this work is to correctly place pre-determined punctuation marks in a given text. We present three pre-trained transformer models that predict **period (.)**, **comma (,)**, and **question (?)** marks for the Turkish language.

## Usage <a class="anchor" id="usage"></a>

### Inference <a class="anchor" id="inference"></a>
The recommended usage is via Hugging Face. You can run inference with the pre-trained BERT model using the following code:
```python
from transformers import pipeline

pipe = pipeline(task="token-classification", model="uygarkurt/bert-restore-punctuation-turkish")

sample_text = "Türkiye toprakları üzerindeki ilk yerleşmeler Yontma Taş Devri'nde başlar Doğu Trakya'da Traklar olmak üzere Hititler Frigler Lidyalılar ve Dor istilası sonucu Yunanistan'dan kaçan Akalar tarafından kurulan İyon medeniyeti gibi çeşitli eski Anadolu medeniyetlerinin ardından Makedonya kralı Büyük İskender'in egemenliğiyle ve fetihleriyle birlikte Helenistik Dönem başladı"

out = pipe(sample_text)
```

To use a different pre-trained model, simply replace the `model` argument with one of the other [available models](#models) we provide.
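The pipeline returns one classification entry per token rather than punctuated text. A minimal sketch of the post-processing step is shown below; the label names (`PERIOD`, `COMMA`, `QUESTION`, `O`) are assumptions for illustration, so check the model's actual label set via `model.config.id2label` before relying on them:

```python
# Sketch: rebuild punctuated text from token-classification output.
# Label names below are ASSUMED for illustration; inspect the model's
# config (id2label) for the real label set.
PUNCT = {"PERIOD": ".", "COMMA": ",", "QUESTION": "?"}

def restore(tokens):
    """tokens: list of dicts with 'word' and 'entity' keys, one per token,
    in the shape a token-classification pipeline typically returns."""
    words = []
    for tok in tokens:
        # Append the predicted mark (if any) directly after the word.
        words.append(tok["word"] + PUNCT.get(tok["entity"], ""))
    return " ".join(words)

# Hypothetical pipeline output for the input "bugün hava nasıl":
sample = [
    {"word": "bugün", "entity": "O"},
    {"word": "hava", "entity": "O"},
    {"word": "nasıl", "entity": "QUESTION"},
]
print(restore(sample))  # bugün hava nasıl?
```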

## Data <a class="anchor" id="data"></a>
The dataset is provided in the `data/` directory as train, validation, and test splits.

It can be summarized as follows:

|    Split    |  Total  | Period (.) | Comma (,) | Question (?) |
|:-----------:|:-------:|:----------:|:---------:|:------------:|
|    Train    | 1471806 |   124817   |   98194   |     9816     |
| Validation  |  180326 |    15306   |   11980   |     1199     |
|   Test      |  182487 |    15524   |   12242   |     1255     |

## Available Models <a class="anchor" id="models"></a>
We experimented with BERT, ELECTRA, and ConvBERT. The pre-trained models can be accessed via Hugging Face.

BERT: https://huggingface.co./uygarkurt/bert-restore-punctuation-turkish \
ELECTRA: https://huggingface.co./uygarkurt/electra-restore-punctuation-turkish \
ConvBERT: https://huggingface.co./uygarkurt/convbert-restore-punctuation-turkish

## Results <a class="anchor" id="results"></a>
`Precision`, `Recall`, and `F1` scores for each model and punctuation mark are summarized below.

|   Model  |          |  PERIOD  |          |          |  COMMA   |          |          | QUESTION |          |          | OVERALL  |          |
|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|
|Score Type|     P    |     R    |    F1    |     P    |     R    |    F1    |     P    |     R    |    F1    |     P    |     R    |    F1    |
|   BERT   | 0.972602 | 0.947504 | 0.959952 | 0.576145 | 0.700010 | 0.632066 | 0.927642 | 0.911342 | 0.919420 | 0.825506 | 0.852952 | 0.837146 |
|  ELECTRA | 0.972602 | 0.948689 | 0.960497 | 0.576800 | 0.710208 | 0.636590 | 0.920325 | 0.921074 | 0.920699 | 0.823242 | 0.859990 | 0.839262 |
| ConvBERT | 0.972731 | 0.946791 | 0.959585 | 0.576964 | 0.708124 | 0.635851 | 0.922764 | 0.913849 | 0.918285 | 0.824153 | 0.856254 | 0.837907 |
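As a quick sanity check, each `F1` column is the harmonic mean of the corresponding `P` and `R` columns, matching the published figures up to rounding:

```python
# F1 is the harmonic mean of precision (p) and recall (r).
def f1(p, r):
    return 2 * p * r / (p + r)

# BERT, PERIOD column from the table above:
print(round(f1(0.972602, 0.947504), 4))  # 0.9599
```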

## Citation <a class="anchor" id="citation"></a>
```
@INPROCEEDINGS{10286690,
    author={Kurt, Uygar and Çayır, Aykut},
    booktitle={2023 8th International Conference on Computer Science and Engineering (UBMK)}, 
    title={Transformer Based Punctuation Restoration for Turkish}, 
    year={2023},
    volume={},
    number={},
    pages={169-174},
    doi={10.1109/UBMK59864.2023.10286690}
}
```