dicta-il/dictabert-lex

Shaltiel committed on Jan 8

Commit 255556f • 1 Parent(s): 1e63896

Update README.md

Files changed (1)

README.md +3 -3

@@ -14,15 +14,15 @@ For the bert-base models for other tasks, see [here](https://huggingface.co/coll
 ## General guidelines for how the lemmatizer works:
-Given an input text in Hebrew, it attempts to match up each word with the correct lexeme in its vocabulary.
-- If the token is split up into multiple wordpieces it doesn't cause a problem, we still predict the lexeme with a high accuracy.
 - If the lexeme of a given token doesn't appear in the vocabulary, the model will attempt to predict a special token `[BLANK]`. In that case, the word is usually a name of a person or a city, and the lexeme is probably the word after removing prefixes which can be done with the [dictabert-seg](https://huggingface.co/dicta-il/dictabert-seg) tool.
 - For verbs the lexeme is the 3rd person past singular form.
-This method is purely neural-based, so sometimes the predicted lexeme may not match exactly and can be in a similar semantic space. For more accurate results, one can implement rules on top of the prediction to look at the top K matches and choose using a specific set of rules.
 Sample usage:

 ## General guidelines for how the lemmatizer works:
+Given an input text in Hebrew, it attempts to match up each word with the correct lexeme from within the BERT vocabulary.
+- If the word is split up into multiple wordpieces it doesn't cause a problem, we still predict the lexeme with a high accuracy.
 - If the lexeme of a given token doesn't appear in the vocabulary, the model will attempt to predict a special token `[BLANK]`. In that case, the word is usually a name of a person or a city, and the lexeme is probably the word after removing prefixes which can be done with the [dictabert-seg](https://huggingface.co/dicta-il/dictabert-seg) tool.
 - For verbs the lexeme is the 3rd person past singular form.
+This method is purely neural-based, so in rare instances the predicted lexeme may not be lexically related to the input, but rather a synonym selected from the same semantic space. To handle those edge cases one can implement a filter on top of the prediction to look at the top K matches and choose using a specific set of measures, such as edit distance, to choose the prediction that can more reasonably form a lexeme for the input word.
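The edit-distance filter described above could look roughly like the following. This is a minimal sketch, not the authors' implementation: the names `edit_distance` and `pick_lexeme`, the `max_distance` threshold, and the decision to leave a top-ranked `[BLANK]` untouched are all assumptions made for illustration. It expects the top K candidate lexemes as a plain list ordered best-first, and it does not show how to obtain that list from the model.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # delete ch_a
                            curr[j - 1] + 1,      # insert ch_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def pick_lexeme(word: str, candidates: list[str], max_distance: int = 3) -> str:
    """Pick a lexeme for `word` from top-K candidates ordered best-first.

    Hypothetical rule: keep the highest-ranked candidate that is within
    `max_distance` edits of the surface word; if none qualifies, fall back
    to the model's top prediction.
    """
    if not candidates:
        raise ValueError("candidates must not be empty")
    # Assumption: a top-ranked [BLANK] means "lexeme not in vocabulary",
    # so it is passed through rather than filtered.
    if candidates[0] == "[BLANK]":
        return candidates[0]
    for cand in candidates:
        if cand != "[BLANK]" and edit_distance(word, cand) <= max_distance:
            return cand
    return candidates[0]


# Hypothetical top-2 list in which the first candidate is a synonym
# rather than a form related to the input word.
print(pick_lexeme("הלכתי", ["צעד", "הלך"]))
```

The threshold would need tuning on real data, since Hebrew prefixes and suffixes can add several characters to the lexeme.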
 Sample usage: