Update README.md
README.md
@@ -14,15 +14,15 @@ For the bert-base models for other tasks, see [here](https://huggingface.co/coll

 ## General guidelines for how the lemmatizer works:

-Given an input text in Hebrew, it attempts to match up each word with the correct lexeme
+Given an input text in Hebrew, it attempts to match up each word with the correct lexeme from within the BERT vocabulary.

-- If the
+- If the word is split up into multiple wordpieces, that doesn't cause a problem; we still predict the lexeme with high accuracy.

 - If the lexeme of a given token doesn't appear in the vocabulary, the model will attempt to predict a special token `[BLANK]`. In that case, the word is usually a name of a person or a city, and the lexeme is probably the word after removing prefixes which can be done with the [dictabert-seg](https://huggingface.co/dicta-il/dictabert-seg) tool.

 - For verbs the lexeme is the 3rd person past singular form.

-This method is purely neural-based, so
+This method is purely neural-based, so in rare instances the predicted lexeme may not be lexically related to the input, but rather a synonym selected from the same semantic space. To handle those edge cases, one can implement a filter on top of the prediction that looks at the top K matches and uses a specific set of measures, such as edit distance, to choose the prediction that can more reasonably form a lexeme for the input word.
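
The top-K filter suggested above could be sketched roughly as follows. This is only an illustration under stated assumptions, not part of the model's code: the names `edit_distance`, `pick_lexeme`, and the `max_ratio` threshold are hypothetical, and the candidate list is assumed to be the model's top K predictions, already extracted as strings and ordered from most to least likely.

```python
from typing import Optional, Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, using one rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def pick_lexeme(word: str,
                candidates: Sequence[str],
                max_ratio: float = 0.8) -> Optional[str]:
    """Pick the candidate closest to `word` by edit distance (hypothetical helper).

    `candidates` is assumed to be ordered by model score, so ties keep the
    higher-ranked candidate. Returns None when there are no candidates or
    when even the closest one is too far from the input word.
    """
    best: Optional[str] = None
    best_dist: Optional[int] = None
    for cand in candidates:
        dist = edit_distance(word, cand)
        if best_dist is None or dist < best_dist:
            best, best_dist = cand, dist
    if best is None or best_dist is None:
        return None
    # Normalize by the longer string so the threshold is length-independent.
    if best_dist / max(len(word), len(best), 1) > max_ratio:
        return None
    return best


if __name__ == "__main__":
    # Hypothetical top-3 predictions for an inflected verb form.
    print(pick_lexeme("הלכתי", ["צעד", "הלך", "נסע"]))
```

Edit distance is only one possible measure; the `max_ratio` value of 0.8 is an arbitrary starting point that would need tuning on real data, and a stricter filter might also normalize Hebrew final-letter forms before comparing.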
Sample usage: