Update README.md
README.md (changed)
@@ -4,7 +4,7 @@
 {}
 ---
 
-# Varta
+# Varta-BERT
 
 <!-- Provide a quick summary of what the model is/does. -->
 
@@ -18,15 +18,11 @@ The dataset and the model are introduced in [this paper](https://arxiv.org/abs/2
 
 
 ## Uses
-
-You can use the raw model for masked language modelling, but it is mostly intended to be fine-tuned on a downstream task.
+You can use the raw model for masked language modeling, but it is mostly intended to be fine-tuned on a downstream task.
 
 Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification or question answering. For tasks such as text generation, you should look at our [Varta-T5](https://huggingface.co/rahular/varta-t5) model.
 
 ## Bias, Risks, and Limitations
-
-<!-- This section is meant to convey both technical and sociotechnical limitations. -->
-
 This work is mainly dedicated to the curation of a new multilingual dataset for Indic languages, many of which are low-resource languages. During data collection, we face several limitations that can potentially result in ethical concerns. Some of the important ones are mentioned below: <br>
 
 - Our dataset contains only those articles written by DailyHunt's partner publishers. This has the potential to result in a bias towards a particular narrative or ideology that can affect the representativeness and diversity of the dataset.
@@ -36,7 +32,7 @@ This work is mainly dedicated to the curation of a new multilingual dataset for
 
 ## How to Get Started with the Model
 
-You can use this model directly
+You can use this model directly for masked language modeling.
 
 ```python
 from transformers import AutoTokenizer, AutoModelForMaskedLM
@@ -63,15 +59,12 @@ With 34.5 million non-English article-headline pairs, it is the largest document
 - We train the model for a total of 1M steps which takes 10 days to finish.
 - We use an effective batch size of 4096 and train the model on TPU v3-128 chips.
 
-
+Since data sizes across languages in Varta vary from 1.5K (Bhojpuri) to 14.4M articles (Hindi), we use standard temperature-based sampling to upsample data when necessary.
 
 ### Evaluation Results
 Please see [the paper](https://arxiv.org/pdf/2305.05858.pdf).
 
 ## Citation
-
-<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-
 ```
 @misc{aralikatte2023varta,
 title={V\=arta: A Large-Scale Headline-Generation Dataset for Indic Languages},
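The Uses section of the card notes that the checkpoint is mainly meant to be fine-tuned on whole-sentence tasks such as sequence classification. A minimal sketch of how a task head is typically attached with the transformers AutoClasses; the hub id `rahular/varta-bert` and the three-label setup are assumptions, not taken from the visible hunks.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical fine-tuning setup: the hub id and label count are placeholders.
# The pretrained encoder weights are reused and a fresh classification head is
# initialised on top of them.
model_name = "rahular/varta-bert"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

# From here, training proceeds as for any BERT-style encoder, e.g. with
# transformers.Trainer or a plain PyTorch loop over a labelled Indic-language dataset.
```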
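The quick-start block in the "How to Get Started with the Model" hunk is cut off after the import line by the diff context. A minimal sketch of how a masked-LM checkpoint like this is usually loaded and queried; the hub id `rahular/varta-bert` and the Hindi example sentence are assumptions, not content shown in the visible hunks.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Hub id assumed from the card's naming (its sibling model is rahular/varta-t5);
# adjust if the actual checkpoint id differs.
model_name = "rahular/varta-bert"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Illustrative Hindi input: "The capital of India is [MASK]."
text = f"भारत की राजधानी {tokenizer.mask_token} है।"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Pick the highest-scoring token for each [MASK] position and decode it.
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_ids = logits[0, mask_positions].argmax(dim=-1)
print(tokenizer.decode(predicted_ids))
```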
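The newly added line on pretraining data mentions temperature-based sampling to upsample low-resource languages. A minimal sketch assuming the common formulation p_i ∝ (n_i / N)^(1/T); the temperature value below is illustrative and not taken from the paper.

```python
# Temperature-based sampling over per-language article counts, assuming
# p_i ∝ (n_i / N) ** (1 / T). Counts and temperature are illustrative only.
def sampling_probs(counts: dict, temperature: float = 3.0) -> dict:
    total = sum(counts.values())
    scaled = {lang: (n / total) ** (1.0 / temperature) for lang, n in counts.items()}
    norm = sum(scaled.values())
    return {lang: w / norm for lang, w in scaled.items()}

# The two extremes quoted in the card: Hindi vs. Bhojpuri.
print(sampling_probs({"hi": 14_400_000, "bho": 1_500}))
```

With the extremes quoted in the card (1.5K vs. 14.4M articles), the low-resource language's sampling probability rises from roughly 0.01% of examples to a few percent at this illustrative temperature, which is the intended upsampling effect.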