Commit d0d8d15 · 1 Parent(s): fb27ada
Update README.md
README.md CHANGED
@@ -24,6 +24,7 @@ language:
 - ta
 - te
 - yo
+- de
 tags:
 - kenlm
 - perplexity
@@ -37,11 +38,14 @@ datasets:
 duplicated_from: edugp/kenlm
 ---
 
+# Fork of `edugp/kenlm`
+
+* adds German wikipedia model.
+
 # KenLM models
 This repo contains several KenLM models trained on different tokenized datasets and languages.
 KenLM models are probabilistic n-gram language models. One use case of these models is fast perplexity estimation for [filtering or sampling large datasets](https://huggingface.co/bertin-project/bertin-roberta-base-spanish). For example, one could use a KenLM model trained on French Wikipedia to run inference on a large dataset and filter out samples that are very unlikely to appear on Wikipedia (high perplexity), or very simple, non-informative sentences that may appear repeatedly (low perplexity).
 
-At the root of this repo you will find different directories named after the dataset the models were trained on (e.g. `wikipedia`, `oscar`). Within each directory, you will find several models trained on different language subsets of the dataset (e.g. `en (English)`, `es (Spanish)`, `fr (French)`). For each language you will find three different files:
 * `{language}.arpa.bin`: The trained KenLM model binary
 * `{language}.sp.model`: The trained SentencePiece model used for tokenization
 * `{language}.sp.vocab`: The vocabulary file for the SentencePiece model
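As a minimal sketch of the perplexity-filtering use case described in the README above: the snippet below assumes the `kenlm` and `sentencepiece` Python packages are installed, and assumes the German Wikipedia model added by this fork is stored as `wikipedia/de.arpa.bin` alongside a matching `wikipedia/de.sp.model` tokenizer (paths inferred from the layout described in the README, not confirmed by this commit). The perplexity thresholds are placeholders.

```python
# Hypothetical sketch: perplexity-based filtering with a KenLM model from this repo.
# Model paths and thresholds are illustrative assumptions, not part of this commit.
import kenlm
import sentencepiece as spm

KENLM_MODEL = "wikipedia/de.arpa.bin"  # trained KenLM binary (assumed path)
SP_MODEL = "wikipedia/de.sp.model"     # matching SentencePiece tokenizer (assumed path)

lm = kenlm.Model(KENLM_MODEL)
sp = spm.SentencePieceProcessor(model_file=SP_MODEL)

def perplexity(text: str) -> float:
    # Tokenize with the same SentencePiece model the KenLM model was trained with,
    # then score the space-joined pieces with KenLM.
    pieces = " ".join(sp.encode(text, out_type=str))
    return lm.perplexity(pieces)

# Keep samples whose perplexity is neither very high (unlike Wikipedia text)
# nor very low (trivial, repetitive sentences). Thresholds are placeholders.
samples = [
    "Berlin ist die Hauptstadt von Deutschland.",
    "asdf qwer zxcv yxcv",
]
kept = [s for s in samples if 10.0 < perplexity(s) < 10_000.0]
print(kept)
```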