|
--- |
|
license: apache-2.0 |
|
language: |
|
- af |
|
- ar |
|
- bg |
|
- bn |
|
- de |
|
- el |
|
- en |
|
- es |
|
- et |
|
- eu |
|
- fa |
|
- fi |
|
- fr |
|
- he |
|
- hi |
|
- hu |
|
- id |
|
- it |
|
- ja |
|
- jv |
|
- ka |
|
- kk |
|
- ko |
|
- ml |
|
- mr |
|
- ms |
|
- my |
|
- nl |
|
- pt |
|
- ru |
|
- sw |
|
- ta |
|
- te |
|
- th |
|
- tl |
|
- tr |
|
- ur |
|
- vi |
|
- yo |
|
- zh |
|
--- |
|
|
|
|
|
# Model Card for EntityCS-39-MLM-xlmr-base |
|
|
|
- Paper: https://aclanthology.org/2022.findings-emnlp.499.pdf |
|
- Repository: https://github.com/huawei-noah/noah-research/tree/master/NLP/EntityCS |
|
- Point of Contact: [Fenia Christopoulou](mailto:[email protected]), [Chenxi Whitehouse](mailto:[email protected]) |
|
|
|
## Model Description |
|
|
|
This model has been trained on the EntityCS corpus, an English corpus built from Wikipedia in which entities are code-switched into different languages.
The corpus is available at [https://huggingface.co./huawei-noah/entity_cs](https://huggingface.co./huawei-noah/entity_cs); see that page for more details.
|
To train models on the corpus, we first employ the conventional 80-10-10 MLM objective, where 15% of sentence subwords are considered as masking candidates. From those, we replace subwords
with [MASK] 80% of the time, with Random subwords (from the entire vocabulary) 10% of the time, and leave the remaining 10% unchanged (Same).
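
For illustration, here is a minimal sketch (not the repository code) of how this 80-10-10 selection can be applied to a sequence of subword IDs; `mask_id` and `vocab_size` are assumed to come from the tokenizer, and `-100` marks positions the loss ignores.

```python
# Minimal sketch of the conventional 80-10-10 MLM masking (illustrative only).
import random

def mlm_mask(input_ids, mask_id, vocab_size, mlm_prob=0.15):
    labels = [-100] * len(input_ids)       # -100 = position ignored by the MLM loss
    masked = list(input_ids)
    for i, token in enumerate(input_ids):
        if random.random() < mlm_prob:     # 15% of subwords become masking candidates
            labels[i] = token              # the original subword is the prediction target
            r = random.random()
            if r < 0.8:                    # 80%: replace with [MASK]
                masked[i] = mask_id
            elif r < 0.9:                  # 10%: replace with a Random subword
                masked[i] = random.randrange(vocab_size)
            # remaining 10%: keep the subword unchanged (Same)
    return masked, labels
```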
|
|
|
To integrate entity-level cross-lingual knowledge into the model, we propose Entity Prediction objectives, where we only mask subwords belonging
to an entity. By predicting the masked entities in EntityCS sentences, we expect the model to capture the semantics of the same entity in different
languages.
Two different masking strategies are proposed for predicting entities: Whole Entity Prediction (`WEP`) and Partial Entity Prediction (`PEP`).
|
|
|
In WEP, motivated by [Sun et al. (2019)](https://arxiv.org/abs/1904.09223), where whole word masking is also adopted, we consider all the words (and consequently subwords) inside
an entity as masking candidates. Then, 80% of the time we mask every subword inside an entity, and
20% of the time we keep the subwords intact. Note that, as our goal is to predict the entire masked
entity, we do not allow replacing with Random subwords, since this can introduce noise and result
in the model predicting incorrect entities. After entities are masked, we remove the entity indicators
`<e>`, `</e>` from the sentences before feeding them to the model.
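
A minimal sketch of the WEP decision is shown below (illustrative only, not the repository implementation); `entity_spans` is assumed to hold `(start, end)` subword indices of entities, computed after the `<e>`, `</e>` markers have been stripped.

```python
# Minimal sketch of Whole Entity Prediction (WEP) masking (illustrative only).
import random

def wep_mask(input_ids, entity_spans, mask_id):
    labels = [-100] * len(input_ids)
    masked = list(input_ids)
    for start, end in entity_spans:
        labels[start:end] = input_ids[start:end]      # every entity subword is a prediction target
        if random.random() < 0.8:                     # 80%: mask the whole entity
            masked[start:end] = [mask_id] * (end - start)
        # 20%: keep the entity subwords intact; Random replacement is never used in WEP
    return masked, labels
```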
|
|
|
For PEP, we also consider all entities as masking candidates. In contrast to WEP, we do not force
subwords belonging to one entity to be either all masked or all unmasked. Instead, each individual
entity subword is masked 80% of the time. For the remaining 20% of the masking candidates, we experiment with three different replacements. First,
PEP<sub>MRS</sub> corresponds to the conventional 80-10-10 masking strategy, where 10% of the remaining
subwords are replaced with Random subwords and the other 10% are kept unchanged. In the second
setting, PEP<sub>MS</sub>, we remove the 10% Random subwords substitution, i.e. we predict the 80% masked
subwords and the 10% Same subwords from the masking candidates. In the third setting, PEP<sub>M</sub>, we
further remove the 10% Same subwords prediction, essentially predicting only the masked subwords.
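
The three PEP variants differ only in what happens to the 20% of entity subwords that are not masked; the sketch below (illustrative only, under the same assumptions as above) makes that difference explicit.

```python
# Minimal sketch of the Partial Entity Prediction (PEP) variants (illustrative only).
import random

def pep_mask(input_ids, entity_spans, mask_id, vocab_size, variant="PEP_MS"):
    labels = [-100] * len(input_ids)
    masked = list(input_ids)
    for start, end in entity_spans:
        for i in range(start, end):                    # each entity subword is treated independently
            r = random.random()
            if r < 0.8:                                # 80%: mask and predict the subword
                labels[i] = input_ids[i]
                masked[i] = mask_id
            elif r < 0.9:                              # the 10% "Random" slot
                if variant == "PEP_MRS":
                    labels[i] = input_ids[i]
                    masked[i] = random.randrange(vocab_size)
                # PEP_MS / PEP_M: Random substitution removed, subword left untouched
            else:                                      # the 10% "Same" slot
                if variant in ("PEP_MRS", "PEP_MS"):
                    labels[i] = input_ids[i]           # keep the subword but still predict it
                # PEP_M: Same prediction removed as well
    return masked, labels
```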
|
|
|
Prior work has shown that combining Entity Prediction with MLM is effective for cross-lingual transfer ([Jiang et al., 2020](https://aclanthology.org/2020.emnlp-main.479/)), therefore we also investigate the
combination of the Entity Prediction objectives with MLM on non-entity subwords. Specifically, when combined with MLM, we lower the
entity masking probability (p) to 50% to keep roughly the same overall masking percentage.
This results in the following objectives: WEP + MLM, PEP<sub>MRS</sub> + MLM, PEP<sub>MS</sub> + MLM, PEP<sub>M</sub> + MLM.
|
|
|
This model was trained with the **MLM** objective on the EntityCS corpus covering 39 languages.
|
|
|
|
|
## Training Details |
|
|
|
We start from the [XLM-R-base](https://huggingface.co./xlm-roberta-base) model and train for 1 epoch on 8 Nvidia V100 32GB GPUs.
We set the batch size to 16 and gradient accumulation steps to 2, resulting in an effective batch size of 256.
For speed-up, we use fp16 mixed precision.
We use the sampling strategy proposed by [Conneau and Lample (2019)](https://dl.acm.org/doi/pdf/10.5555/3454287.3454921), where high-resource languages are down-sampled and low-resource languages are sampled more frequently.
We only train the embeddings and the last two layers of the model.
We randomly choose 100 sentences from each language to serve as a validation set, on which we measure the perplexity every 10K training steps.
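
As a rough illustration of this partial fine-tuning (a sketch assuming the standard Hugging Face `XLMRobertaForMaskedLM` class; the exact training code lives in the repository linked below), only the embeddings and the last two encoder layers are left trainable:

```python
# Illustrative sketch: freeze everything except the embeddings and the last two
# encoder layers (10 and 11) of the 12-layer XLM-R-base model.
from transformers import XLMRobertaForMaskedLM

model = XLMRobertaForMaskedLM.from_pretrained("xlm-roberta-base")

trainable_prefixes = (
    "roberta.embeddings",        # token/position embeddings (tied to the MLM head decoder)
    "roberta.encoder.layer.10",  # second-to-last encoder layer
    "roberta.encoder.layer.11",  # last encoder layer
)
for name, param in model.named_parameters():
    param.requires_grad = name.startswith(trainable_prefixes)
```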
|
|
|
**This checkpoint corresponds to the one with the lowest perplexity on the validation set.**
|
|
|
|
|
## Usage |
|
|
|
The current model can be used for further fine-tuning on downstream tasks. |
|
In the paper, we focused on entity-related tasks, such as NER, Word Sense Disambiguation and Slot Filling. |
|
|
|
Alternatively, it can be used directly (without fine-tuning) for probing tasks, i.e. predicting missing words, such as [X-FACTR](https://aclanthology.org/2020.emnlp-main.479/).
|
|
|
For results on each downstream task, please refer to the [paper](https://aclanthology.org/2022.findings-emnlp.499.pdf). |
|
|
|
|
|
## How to Get Started with the Model |
|
|
|
Use the code in the official repository to get started with training: https://github.com/huawei-noah/noah-research/tree/master/NLP/EntityCS
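
Alternatively, the minimal sketch below loads the checkpoint with the `transformers` library for fine-tuning or mask-filling probing; the model identifier is assumed from the model card title (`huawei-noah/EntityCS-39-MLM-xlmr-base`) and may need adjusting.

```python
# Minimal sketch; the model ID below is an assumption based on the model card title.
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model_id = "huawei-noah/EntityCS-39-MLM-xlmr-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Mask-filling probe; XLM-R models use <mask> as the mask token.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill_mask("The capital of France is <mask>."))
```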
|
|
|
## Citation |
|
|
|
**BibTeX** |
|
|
|
```bibtex
@inproceedings{whitehouse-etal-2022-entitycs,
    title     = "{E}ntity{CS}: Improving Zero-Shot Cross-lingual Transfer with Entity-Centric Code Switching",
    author    = "Whitehouse, Chenxi and
                 Christopoulou, Fenia and
                 Iacobacci, Ignacio",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
    month     = dec,
    year      = "2022",
    address   = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url       = "https://aclanthology.org/2022.findings-emnlp.499",
    pages     = "6698--6714"
}
```
|
|
|
**APA** |
|
|
|
```text
Whitehouse, C., Christopoulou, F., & Iacobacci, I. (2022). EntityCS: Improving Zero-Shot Cross-lingual Transfer with Entity-Centric Code Switching. In Findings of the Association for Computational Linguistics: EMNLP 2022.
```
|
|