---
datasets:
- oscar-corpus/OSCAR-2301
language:
- az
library_name: transformers
---

A RoBERTa-based model trained on the Azerbaijani subset of the OSCAR corpus as part of [research](https://peerj.com/articles/cs-1974/) on the application of text augmentation for low-resource languages. It was developed to enhance text classification tasks in Azerbaijani, a low-resource language in the NLP domain. The model was pre-trained on the Azerbaijani subset of the OSCAR corpus and further fine-tuned on a labeled news dataset.

## Training Data

The model was pre-trained on the Azerbaijani subset of the OSCAR corpus and fine-tuned on approximately 3 million sentences from the Azertag News Agency covering diverse topics such as politics, economy, culture, sports, technology, and health.

## Citation

```bibtex
@article{ziyaden2024augmentation,
  title   = {Text data augmentation and pre-trained Language Model for enhancing text classification of low-resource languages},
  author  = {Ziyaden, Atabay and Yelenov, Amir and Hajiyev, Fuad and Rustamov, Samir and Pak, Alexandr},
  year    = 2024,
  journal = {PeerJ Computer Science},
  doi     = {10.7717/peerj-cs.1974},
  url     = {https://doi.org/10.7717/peerj-cs.1974}
}
```

## Usage

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# AutoModelWithLMHead is deprecated; RoBERTa is a masked language model,
# so AutoModelForMaskedLM is the appropriate auto class.
tokenizer = AutoTokenizer.from_pretrained("iamdenay/roberta-azerbaijani")
model = AutoModelForMaskedLM.from_pretrained("iamdenay/roberta-azerbaijani")
```

```python
from transformers import pipeline

model_mask = pipeline('fill-mask', model='iamdenay/roberta-azerbaijani')
# Input sentence matching the sample output below; <mask> marks the token to predict.
model_mask("azərtac xəbər <mask> ki")
```

## Output

```python
[{'sequence': 'azərtac xəbər verir ki',
  'score': 0.9791,
  'token': 1053,
  'token_str': 'verir'},
 {'sequence': 'azərtac xəbər verib ki',
  'score': 0.0044,
  'token': 2313,
  'token_str': 'verib'},
 ...
]
```

## Limitations

- Language Specificity: The model is trained exclusively on Azerbaijani and may not generalize well to other languages.
- Data Bias: The fine-tuning data is sourced from news articles, which may carry biases and a specific journalistic style.
- Agglutinative Language Challenges: Azerbaijani's agglutinative morphology produces many surface forms per word, which can lead to sparsity in the word space.

## Ethical Considerations

- Content Sensitivity: The dataset may include sensitive topics. Users should ensure compliance with ethical standards when deploying the model.
- Bias and Fairness: Be aware of potential biases in the training data that could affect model predictions.

## Config

```json
{
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.10.0",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 52000
}
```
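
The configuration above can also be read programmatically. A minimal sketch using the standard `AutoConfig` API; the printed values simply reflect the config shown above:

```python
from transformers import AutoConfig

# Load the model configuration from the Hub and inspect key architecture fields.
config = AutoConfig.from_pretrained("iamdenay/roberta-azerbaijani")
print(config.model_type)         # "roberta"
print(config.num_hidden_layers)  # 6
print(config.vocab_size)         # 52000
```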
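
## Fine-tuning for Text Classification

Since the model targets text classification, the sketch below shows one way to attach a classification head with `AutoModelForSequenceClassification`. This is a minimal illustration, not the training recipe from the paper: the label set and example sentence are hypothetical, and the newly added head is randomly initialized, so it must be fine-tuned on labeled data before its predictions are meaningful.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical 6-way topic labels mirroring the news domains listed above.
labels = ["politics", "economy", "culture", "sports", "technology", "health"]

tokenizer = AutoTokenizer.from_pretrained("iamdenay/roberta-azerbaijani")
model = AutoModelForSequenceClassification.from_pretrained(
    "iamdenay/roberta-azerbaijani",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={name: i for i, name in enumerate(labels)},
)

# Forward pass on a hypothetical Azerbaijani sentence ("The team won the final.").
inputs = tokenizer("Komanda finalda qalib gəldi.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(labels[logits.argmax(dim=-1).item()])  # untrained head: prediction is arbitrary
```

From here, the model can be trained with the `Trainer` API or a plain PyTorch loop, as in any standard sequence classification setup.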