arxiv:1802.06893

Learning Word Vectors for 157 Languages

Published on Feb 19, 2018

Abstract

Distributed word representations, or word vectors, have recently been applied to many tasks in natural language processing, leading to state-of-the-art performance. A key ingredient in the successful application of these representations is to train them on very large corpora and to use the resulting pre-trained models in downstream tasks. In this paper, we describe how we trained such high-quality word representations for 157 languages. We used two sources of data to train these models: the free online encyclopedia Wikipedia and data from the Common Crawl project. We also introduce three new word analogy datasets to evaluate these word vectors, for French, Hindi and Polish. Finally, we evaluate our pre-trained word vectors on 10 languages for which evaluation datasets exist, showing very strong performance compared to previous models.
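To make the word analogy evaluation mentioned above concrete, here is a minimal sketch of one common way to score analogy questions of the form "a is to b as c is to ?": pick the vocabulary word whose vector has the highest cosine similarity to b - a + c, excluding the three query words. The abstract does not state the exact protocol used in the paper, so this scoring rule is an assumption, and the three-dimensional toy vectors, the `TOY_VECTORS` table, and the function names are hypothetical illustrations rather than real pre-trained vectors or an existing API.

```python
import math

# Hypothetical 3-dimensional toy vectors; real pre-trained vectors are far larger.
TOY_VECTORS = {
    "france": [1.0, 0.0, 0.0],
    "paris":  [1.0, 0.0, 1.0],
    "poland": [0.0, 1.0, 0.0],
    "warsaw": [0.0, 1.0, 1.0],
    "india":  [0.5, 0.5, 0.0],
    "delhi":  [0.5, 0.5, 1.0],
}


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def solve_analogy(vectors, a, b, c):
    """Answer 'a is to b as c is to ?' by maximizing cos(d, b - a + c).

    The query words a, b and c are excluded from the candidates.
    """
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


def analogy_accuracy(vectors, questions):
    """Fraction of (a, b, c, expected) questions answered correctly."""
    correct = sum(solve_analogy(vectors, a, b, c) == expected
                  for a, b, c, expected in questions)
    return correct / len(questions)


QUESTIONS = [
    ("france", "paris", "poland", "warsaw"),
    ("poland", "warsaw", "india", "delhi"),
]

if __name__ == "__main__":
    print(solve_analogy(TOY_VECTORS, "france", "paris", "poland"))
    print(analogy_accuracy(TOY_VECTORS, QUESTIONS))
```

With real vectors, the dictionary would be replaced by a lookup into a pre-trained model, and the accuracy would be averaged over a full analogy dataset for the language being evaluated.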
