Shaltiel committed on
Commit
255556f
1 Parent(s): 1e63896

Update README.md

Files changed (1)
  1. README.md +3 -3
README.md CHANGED
@@ -14,15 +14,15 @@ For the bert-base models for other tasks, see [here](https://huggingface.co/coll

  ## General guidelines for how the lemmatizer works:

- Given an input text in Hebrew, it attempts to match up each word with the correct lexeme in its vocabulary.
+ Given an input text in Hebrew, it attempts to match up each word with the correct lexeme from within the BERT vocabulary.

- - If the token is split up into multiple wordpieces it doesn't cause a problem, we still predict the lexeme with a high accuracy.
+ - If the word is split up into multiple wordpieces it doesn't cause a problem, we still predict the lexeme with a high accuracy.

  - If the lexeme of a given token doesn't appear in the vocabulary, the model will attempt to predict a special token `[BLANK]`. In that case, the word is usually a name of a person or a city, and the lexeme is probably the word after removing prefixes which can be done with the [dictabert-seg](https://huggingface.co/dicta-il/dictabert-seg) tool.

  - For verbs the lexeme is the 3rd person past singular form.

- This method is purely neural-based, so sometimes the predicted lexeme may not match exactly and can be in a similar semantic space. For more accurate results, one can implement rules on top of the prediction to look at the top K matches and choose using a specific set of rules.
+ This method is purely neural-based, so in rare instances the predicted lexeme may not be lexically related to the input, but rather a synonym selected from the same semantic space. To handle those edge cases one can implement a filter on top of the prediction to look at the top K matches and choose using a specific set of measures, such as edit distance, to choose the prediction that can more reasonably form a lexeme for the input word.

  Sample usage:
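
Reviewer note: the top-K filter that the updated paragraph describes could be sketched roughly as below. This is only an illustration, not the model's own post-processing. The function names `edit_distance` and `pick_lexeme`, the `max_distance` threshold, and the candidate list are all hypothetical; the sketch assumes the caller has already obtained the top K lexeme predictions for a word, ordered from most to least likely.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def pick_lexeme(word: str, candidates: list[str], max_distance: int = 3) -> str:
    """Pick the highest-ranked candidate within max_distance edits of word.

    `candidates` is assumed to be the model's top-K predictions, best first.
    If no candidate is close enough, fall back to the top-1 prediction.
    The default threshold is an arbitrary illustrative value.
    """
    for lexeme in candidates:
        if edit_distance(word, lexeme) <= max_distance:
            return lexeme
    return candidates[0]


# Made-up candidates for illustration (not actual model output):
# a synonym ranked first, the lexically related lexeme ranked second.
chosen = pick_lexeme("הלכתי", ["צעד", "הלך"])
print(chosen)
```

A fixed threshold is the simplest possible measure; a length-normalised distance, or one that treats Hebrew final-form letters as equal to their regular forms, may be a better fit in practice.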