Added downstream task results
README.md
Pretrained model on the Hindi language using a masked language modeling (MLM) objective. It was trained as part of the [Flax/Jax Community Week](https://discuss.huggingface.co/t/pretrain-roberta-from-scratch-in-hindi/7091), organized by [HuggingFace](https://huggingface.co/), with TPU usage sponsored by Google.
## Model description
RoBERTa Hindi is a transformers model pretrained on a large corpus of Hindi data in a self-supervised fashion.
### How to use
You can use this model directly with a pipeline for masked language modeling:
```python
>>> from transformers import pipeline
>>> # The rest of this example was elided in the diff view; the lines below
>>> # are a minimal reconstruction, and the checkpoint name is an assumption.
>>> unmasker = pipeline('fill-mask', model='flax-community/roberta-hindi')
>>> unmasker("हम आपके सुखद <mask> की कामना करते हैं")
```
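The model can also be loaded directly to extract features from Hindi text. Below is a minimal sketch; the checkpoint name `flax-community/roberta-hindi` is an assumption, not something confirmed by this card:

```python
from transformers import AutoTokenizer, AutoModel

# checkpoint name assumed for illustration; substitute the actual model id
tokenizer = AutoTokenizer.from_pretrained("flax-community/roberta-hindi")
model = AutoModel.from_pretrained("flax-community/roberta-hindi")

# encode a sentence and run it through the encoder
inputs = tokenizer("नमस्ते दुनिया", return_tensors="pt")
outputs = model(**inputs)

# contextual embeddings: (batch_size, sequence_length, hidden_size)
print(outputs.last_hidden_state.shape)
```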
## Training data
The RoBERTa model was pretrained on the union of the following datasets (a loading sketch follows the list):
- [OSCAR](https://huggingface.co/datasets/oscar) is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
- [mC4](https://huggingface.co/datasets/mc4) is a multilingual, colossal, cleaned version of Common Crawl's web crawl corpus.
- [Hindi Text Short Summarization Corpus](https://www.kaggle.com/disisbig/hindi-text-short-summarization-corpus) is a collection of ~330k articles with their headlines, collected from Hindi news websites.
- [Old Newspapers Hindi](https://www.kaggle.com/crazydiv/oldnewspapershindi) is a cleaned subset of the HC Corpora newspapers collection.
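As referenced above, here is a minimal sketch of streaming the Hindi portion of OSCAR with the `datasets` library; the config name `unshuffled_deduplicated_hi` is assumed from the public OSCAR release:

```python
from datasets import load_dataset

# stream the Hindi split of OSCAR rather than downloading it all;
# the config name is an assumption based on the public OSCAR release
oscar_hi = load_dataset("oscar", "unshuffled_deduplicated_hi",
                        split="train", streaming=True)

# peek at the first document
for example in oscar_hi.take(1):
    print(example["text"][:200])
```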
## Training procedure
### Preprocessing
The texts are tokenized using a byte-level version of Byte-Pair Encoding (BPE) with a vocabulary size of 50265. The inputs of the model take pieces of 512 contiguous tokens that may span over documents. The beginning of a new document is marked with `<s>` and the end of one by `</s>`.
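A quick way to see this wrapping in practice; again, the checkpoint name is an assumption:

```python
from transformers import AutoTokenizer

# checkpoint name assumed for illustration
tokenizer = AutoTokenizer.from_pretrained("flax-community/roberta-hindi")

enc = tokenizer("हम आपके सुखद दिन की कामना करते हैं")
# the encoded sequence is wrapped in <s> ... </s>
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# vocabulary size should match the 50265 quoted above
print(tokenizer.vocab_size)
```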
RoBERTa Hindi is evaluated on downstream tasks. The results are summarized below.

| Task                    | Task Type            | IndicBERT | HindiBERTa | Indic Transformers Hindi BERT | RoBERTa Hindi Guj San | RoBERTa Hindi |
|-------------------------|----------------------|-----------|------------|-------------------------------|-----------------------|---------------|
| BBC News Classification | Genre Classification | **76.44** | 66.86      | **77.6**                      | 64.9                  | 73.67         |
| WikiNER                 | Token Classification | -         | 90.68      | **95.09**                     | 89.61                 | **92.76**     |
| IITP Product Reviews    | Sentiment Analysis   | **78.01** | 73.23      | **78.39**                     | 66.16                 | 75.53         |
| IITP Movie Reviews      | Sentiment Analysis   | 60.97     | 52.26      | **70.65**                     | 49.35                 | **61.29**     |
## Team Members
- Kartik Godawat ([dk-crazydiv](https://huggingface.co/dk-crazydiv))