Go Inoue
committed on
Commit
·
fd39d7b
1
Parent(s):
89437c2
Fix typo
README.md
CHANGED
@@ -3,7 +3,7 @@ language:
|
|
3 |
- ar
|
4 |
license: apache-2.0
|
5 |
widget:
|
6 |
-
- text: "
|
7 |
---
|
8 |
|
9 |
# bert-base-camelbert-msa
|
@@ -18,7 +18,7 @@ We release eight models with different sizes and variants as follows:
|
|
18 |
|-|-|:-:|-:|-:|
|
19 |
||`bert-base-camelbert-mix`|CA,DA,MSA|167GB|17.3B|
|
20 |
||`bert-base-camelbert-ca`|CA|6GB|847M|
|
21 |
-
|
22 |
||`bert-base-camelbert-msa`|MSA|107GB|12.6B|
|
23 |
||`bert-base-camelbert-msa-half`|MSA|53GB|6.3B|
|
24 |
||`bert-base-camelbert-msa-quarter`|MSA|27GB|3.1B|
|
@@ -37,27 +37,27 @@ You can use this model directly with a pipeline for masked language modeling:
|
|
37 |
```python
|
38 |
>>> from transformers import pipeline
|
39 |
>>> unmasker = pipeline('fill-mask', model='CAMeL-Lab/bert-base-camelbert-da')
|
40 |
-
>>> unmasker("
|
41 |
-
[{'sequence': '[CLS]
|
42 |
'score': 0.062508225440979,
|
43 |
'token': 18,
|
44 |
'token_str': '.'},
|
45 |
-
{'sequence': '[CLS]
|
46 |
'score': 0.033172328025102615,
|
47 |
'token': 4295,
|
48 |
-
'token_str': '
|
49 |
-
{'sequence': '[CLS]
|
50 |
'score': 0.029575437307357788,
|
51 |
'token': 3696,
|
52 |
-
'token_str': '
|
53 |
-
{'sequence': '[CLS]
|
54 |
'score': 0.02724040113389492,
|
55 |
'token': 11449,
|
56 |
-
'token_str': '
|
57 |
-
{'sequence': '[CLS]
|
58 |
'score': 0.01564178802073002,
|
59 |
'token': 3088,
|
60 |
-
'token_str': '
|
61 |
```
|
62 |
|
63 |
Here is how to use this model to get the features of a given text in PyTorch:
|
@@ -65,7 +65,7 @@ Here is how to use this model to get the features of a given text in PyTorch:
|
|
65 |
from transformers import AutoTokenizer, AutoModel
|
66 |
tokenizer = AutoTokenizer.from_pretrained('CAMeL-Lab/bert-base-camelbert-da')
|
67 |
model = AutoModel.from_pretrained('CAMeL-Lab/bert-base-camelbert-da')
|
68 |
-
text = "
|
69 |
encoded_input = tokenizer(text, return_tensors='pt')
|
70 |
output = model(**encoded_input)
|
71 |
```
|
@@ -75,18 +75,14 @@ and in TensorFlow:
|
|
75 |
from transformers import AutoTokenizer, TFAutoModel
|
76 |
tokenizer = AutoTokenizer.from_pretrained('CAMeL-Lab/bert-base-camelbert-da')
|
77 |
model = TFAutoModel.from_pretrained('CAMeL-Lab/bert-base-camelbert-da')
|
78 |
-
text = "
|
79 |
encoded_input = tokenizer(text, return_tensors='tf')
|
80 |
output = model(encoded_input)
|
81 |
```
|
82 |
|
83 |
## Training data
|
84 |
-
-
|
85 |
-
-
|
86 |
-
- [Abu El-Khair Corpus](http://www.abuelkhair.net/index.php/en/arabic/abu-el-khair-corpus)
|
87 |
-
- [OSIAN corpus](https://vlo.clarin.eu/search;jsessionid=31066390B2C9E8C6304845BA79869AC1?1&q=osian)
|
88 |
-
- [Arabic Wikipedia](https://archive.org/details/arwiki-20190201)
|
89 |
-
- The unshuffled version of the Arabic [OSCAR corpus](https://oscar-corpus.com/)
|
90 |
|
91 |
## Training procedure
|
92 |
We use [the original implementation](https://github.com/google-research/bert) released by Google for pre-training.
|
|
|
3 |
- ar
|
4 |
license: apache-2.0
|
5 |
widget:
|
6 |
+
- text: "الهدف من الحياة هو [MASK] ."
|
7 |
---
|
8 |
|
9 |
# bert-base-camelbert-msa
|
|
|
18 |
|-|-|:-:|-:|-:|
|
19 |
||`bert-base-camelbert-mix`|CA,DA,MSA|167GB|17.3B|
|
20 |
||`bert-base-camelbert-ca`|CA|6GB|847M|
|
21 |
+
|✔|`bert-base-camelbert-da`|DA|54GB|5.8B|
|
22 |
||`bert-base-camelbert-msa`|MSA|107GB|12.6B|
|
23 |
||`bert-base-camelbert-msa-half`|MSA|53GB|6.3B|
|
24 |
||`bert-base-camelbert-msa-quarter`|MSA|27GB|3.1B|
|
|
|
37 |
```python
|
38 |
>>> from transformers import pipeline
|
39 |
>>> unmasker = pipeline('fill-mask', model='CAMeL-Lab/bert-base-camelbert-da')
|
40 |
+
>>> unmasker("الهدف من الحياة هو [MASK] .")
|
41 |
+
[{'sequence': '[CLS] الهدف من الحياة هو.. [SEP]',
|
42 |
'score': 0.062508225440979,
|
43 |
'token': 18,
|
44 |
'token_str': '.'},
|
45 |
+
{'sequence': '[CLS] الهدف من الحياة هو الموت. [SEP]',
|
46 |
'score': 0.033172328025102615,
|
47 |
'token': 4295,
|
48 |
+
'token_str': 'الموت'},
|
49 |
+
{'sequence': '[CLS] الهدف من الحياة هو الحياة. [SEP]',
|
50 |
'score': 0.029575437307357788,
|
51 |
'token': 3696,
|
52 |
+
'token_str': 'الحياة'},
|
53 |
+
{'sequence': '[CLS] الهدف من الحياة هو الرحيل. [SEP]',
|
54 |
'score': 0.02724040113389492,
|
55 |
'token': 11449,
|
56 |
+
'token_str': 'الرحيل'},
|
57 |
+
{'sequence': '[CLS] الهدف من الحياة هو الحب. [SEP]',
|
58 |
'score': 0.01564178802073002,
|
59 |
'token': 3088,
|
60 |
+
'token_str': 'الحب'}]
|
61 |
```
|
62 |
|
63 |
Here is how to use this model to get the features of a given text in PyTorch:
|
|
|
65 |
from transformers import AutoTokenizer, AutoModel
|
66 |
tokenizer = AutoTokenizer.from_pretrained('CAMeL-Lab/bert-base-camelbert-da')
|
67 |
model = AutoModel.from_pretrained('CAMeL-Lab/bert-base-camelbert-da')
|
68 |
+
text = "مرحبا يا عالم."
|
69 |
encoded_input = tokenizer(text, return_tensors='pt')
|
70 |
output = model(**encoded_input)
|
71 |
```
|
|
|
75 |
from transformers import AutoTokenizer, TFAutoModel
|
76 |
tokenizer = AutoTokenizer.from_pretrained('CAMeL-Lab/bert-base-camelbert-da')
|
77 |
model = TFAutoModel.from_pretrained('CAMeL-Lab/bert-base-camelbert-da')
|
78 |
+
text = "مرحبا يا عالم."
|
79 |
encoded_input = tokenizer(text, return_tensors='tf')
|
80 |
output = model(encoded_input)
|
81 |
```
|
82 |
|
83 |
## Training data
|
84 |
+
- DA
|
85 |
+
- A collection of dialectal Arabic data described in our paper.
|
86 |
|
87 |
## Training procedure
|
88 |
We use [the original implementation](https://github.com/google-research/bert) released by Google for pre-training.
|