---
license: cc-by-4.0
language:
- he
inference: false
---

# DictaBERT: A State-of-the-Art BERT Suite for Modern Hebrew

State-of-the-art language model for Hebrew, released [here](https://arxiv.org/abs/2403.06970).

This is the model fine-tuned for the lemmatization task. For the bert-base models fine-tuned on other tasks, see [here](https://huggingface.co./collections/dicta-il/dictabert-6588e7cc08f83845fc42a18b).

## General guidelines for how the lemmatizer works

Given an input text in Hebrew, the model attempts to match each word to the correct lexeme from within the BERT vocabulary.

- If a word is split into multiple wordpieces, that causes no problem; the lexeme is still predicted with high accuracy.
- If the lexeme of a given token doesn't appear in the vocabulary, the model will attempt to predict the special token `[BLANK]`. In that case the word is usually the name of a person or a city, and the lexeme is usually the word itself after removing any prefixes, which can be done with the [dictabert-seg](https://huggingface.co./dicta-il/dictabert-seg) tool.
- For verbs, the lexeme is the 3rd-person singular past form.

This method is purely neural, so in rare instances the predicted lexeme may not be lexically related to the input word, but rather a synonym from the same semantic space. To handle these edge cases, one can add a filter on top of the prediction that examines the top-k matches and uses a set of measures, such as edit distance, to choose the candidate that can more reasonably serve as a lexeme for the input word.
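The top-k rescoring idea described above can be sketched as follows. This is a minimal illustration, not part of the released model: `choose_lexeme` and its `top_k_candidates` argument are hypothetical names, and obtaining the top-k candidates from the model itself is left to the reader (the bundled `predict` helper returns only the single best match).

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                  # deletion
                curr[j - 1] + 1,              # insertion
                prev[j - 1] + (ca != cb),     # substitution (free if chars match)
            ))
        prev = curr
    return prev[-1]


def choose_lexeme(word: str, top_k_candidates: list[str]) -> str:
    """Re-rank hypothetical top-k lexeme candidates for a surface word.

    Prefers the candidate closest to the surface form by edit distance;
    ties are broken by the original model ranking (list order).
    """
    return min(top_k_candidates, key=lambda cand: edit_distance(word, cand))
```

For example, given the surface word לימודיו and candidate lexemes [שיעור, לימוד], the filter prefers לימוד, which differs from the input only by its suffix. Edit distance is just one possible measure; the point is that a cheap string-similarity check can veto a semantically plausible but lexically unrelated prediction.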
## Sample usage

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('dicta-il/dictabert-lex')
model = AutoModel.from_pretrained('dicta-il/dictabert-lex', trust_remote_code=True)

model.eval()

sentence = 'בשנת 1948 השלים אפרים קישון את לימודיו בפיסול מתכת ובתולדות האמנות והחל לפרסם מאמרים הומוריסטיים'
print(model.predict([sentence], tokenizer))
```

Output:

```json
[
  [
    ["בשנת", "שנה"],
    ["1948", "1948"],
    ["השלים", "השלים"],
    ["אפרים", "אפרים"],
    ["קישון", "קישון"],
    ["את", "את"],
    ["לימודיו", "לימוד"],
    ["בפיסול", "פיסול"],
    ["מתכת", "מתכת"],
    ["ובתולדות", "תולדה"],
    ["האמנות", "אומנות"],
    ["והחל", "החל"],
    ["לפרסם", "פרסם"],
    ["מאמרים", "מאמר"],
    ["הומוריסטיים", "הומוריסטי"]
  ]
]
```

## Citation

If you use DictaBERT-lex in your research, please cite ```MRL Parsing Without Tears: The Case of Hebrew```

**BibTeX:**

```bibtex
@misc{shmidman2024mrl,
      title={MRL Parsing Without Tears: The Case of Hebrew},
      author={Shaltiel Shmidman and Avi Shmidman and Moshe Koppel and Reut Tsarfaty},
      year={2024},
      eprint={2403.06970},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

## License

Shield: [![CC BY 4.0][cc-by-shield]][cc-by]

This work is licensed under a
[Creative Commons Attribution 4.0 International License][cc-by].

[![CC BY 4.0][cc-by-image]][cc-by]

[cc-by]: http://creativecommons.org/licenses/by/4.0/
[cc-by-image]: https://i.creativecommons.org/l/by/4.0/88x31.png
[cc-by-shield]: https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg