Upload folder using huggingface_hub

Files changed (8)

README.md +34 -0
config.json +27 -0
merges.txt +0 -0
model.safetensors +3 -0
pytorch_model.bin +3 -0
special_tokens_map.json +51 -0
tokenizer_config.json +57 -0
vocab.json +0 -0

README.md CHANGED

@@ -1,3 +1,37 @@
 ---
 license: apache-2.0
+datasets:
+- kz-transformers/multidomain-kazakh-dataset
+language:
+- kk
+library_name: transformers
+pipeline_tag: fill-mask
 ---
+# RoBERTa-kaz-large
+## Model Description
+`roberta-kaz-large` is a RoBERTa-based language model for Kazakh, trained from scratch with the `RobertaForMaskedLM` architecture. It was trained on the `kz-transformers/multidomain-kazakh-dataset` from the Hugging Face Hub, which covers diverse domains and is intended to support broad generalization.
+## Usage
+The model can be used with the Hugging Face Transformers library:
+```python
+from transformers import RobertaTokenizerFast, RobertaForMaskedLM
+# 'roberta-kaz-large' must resolve to a local directory or be replaced with the full Hub repo id.
+tokenizer = RobertaTokenizerFast.from_pretrained('roberta-kaz-large')
+model = RobertaForMaskedLM.from_pretrained('roberta-kaz-large')
+```
+Or directly with a `fill-mask` pipeline for masked language modeling (MLM):
+```python
+from transformers import pipeline
+pipe = pipeline('fill-mask', model='roberta-kaz-large')
+predicted = pipe("Қазіргі <mask> әлемдік деңгейдегі <mask> университеттері сапалы білім, зияткерлік және мәдени <mask> беретін <mask> <mask> <mask> ғана емес, сонымен қатар мемлекет үшін <mask> қабілетті адами капиталды құратын <mask>, ғылым және өндірісті интеграциялаудың <mask> <mask> болып табылады.")
+# With several <mask> tokens, the pipeline returns one ranked candidate list per mask.
+for t in predicted:
+    print(t[0]['score'], t[0]['token_str'])
+```
+## Training procedure
+The model was trained on two NVIDIA A100 GPUs using over 5.3 million examples from the `kz-transformers/multidomain-kazakh-dataset`. Training ran for 10 epochs (208,100 optimization steps in total) with the masked language modeling objective. Gradient accumulation was used to reach a large effective batch size, and the learning rate was warmed up gradually to keep early training stable.
+## Limitations and Bias
+As with any language model, `roberta-kaz-large` may reproduce biases present in its training data. Users should evaluate the model on data representative of their use case before relying on it, especially in sensitive applications.
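
The figures quoted in the training procedure (5.3 million examples, 10 epochs, 208,100 steps) imply a rough effective batch size. The sketch below is a back-of-the-envelope check only: the README does not state the batch size, and the value 256 used in the second half is an assumption chosen because it is the nearest power of two to the estimate.

```python
# Figures taken from the "Training procedure" section of the README.
num_examples = 5_300_000   # "over 5.3 million examples", so a lower bound
num_epochs = 10
num_steps = 208_100

# Examples processed per optimizer step, i.e. the effective batch size
# after gradient accumulation across both GPUs.
examples_seen = num_examples * num_epochs
effective_batch = examples_seen / num_steps
steps_per_epoch = num_steps // num_epochs

print(f"steps per epoch: {steps_per_epoch}")
print(f"estimated effective batch size: {effective_batch:.1f}")

# ASSUMPTION (not stated in the README): an effective batch size of 256.
assumed_batch = 256
implied_examples = steps_per_epoch * assumed_batch
print(f"examples per epoch implied by batch size {assumed_batch}: {implied_examples}")
```

Under that assumption, the implied dataset size per epoch is slightly above 5.3 million, which is consistent with the README's wording but does not confirm the actual setting.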

config.json ADDED

@@ -0,0 +1,27 @@

+{
+  "_name_or_path": "model",
+  "architectures": [
+    "RobertaForMaskedLM"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "bos_token_id": 0,
+  "classifier_dropout": null,
+  "eos_token_id": 2,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 1024,
+  "initializer_range": 0.02,
+  "intermediate_size": 4096,
+  "layer_norm_eps": 1e-05,
+  "max_position_embeddings": 514,
+  "model_type": "roberta",
+  "num_attention_heads": 16,
+  "num_hidden_layers": 24,
+  "pad_token_id": 1,
+  "position_embedding_type": "absolute",
+  "torch_dtype": "float32",
+  "transformers_version": "4.43.1",
+  "type_vocab_size": 1,
+  "use_cache": true,
+  "vocab_size": 50265
+}
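
As a sanity check on this configuration, the parameter count can be estimated directly from the values above. The sketch below assumes the standard `RobertaForMaskedLM` layout (no pooler, output projection tied to the word embeddings); the helper names are illustrative and not part of any library.

```python
# Values copied from config.json.
cfg = {
    "hidden_size": 1024,
    "intermediate_size": 4096,
    "num_hidden_layers": 24,
    "max_position_embeddings": 514,
    "type_vocab_size": 1,
    "vocab_size": 50265,
}


def linear(n_in, n_out):
    """Weights plus bias of a dense layer."""
    return n_in * n_out + n_out


def layer_norm(n):
    """Scale and shift vectors of a LayerNorm."""
    return 2 * n


def count_parameters(cfg):
    h = cfg["hidden_size"]
    ff = cfg["intermediate_size"]
    embeddings = (
        cfg["vocab_size"] * h
        + cfg["max_position_embeddings"] * h
        + cfg["type_vocab_size"] * h
        + layer_norm(h)
    )
    # Query, key, value and output projections, then a LayerNorm.
    attention = 4 * linear(h, h) + layer_norm(h)
    feed_forward = linear(h, ff) + linear(ff, h) + layer_norm(h)
    encoder = cfg["num_hidden_layers"] * (attention + feed_forward)
    # MLM head: dense + LayerNorm + output bias; the decoder weight is
    # assumed to be tied to the word embeddings and is not counted again.
    lm_head = linear(h, h) + layer_norm(h) + cfg["vocab_size"]
    return embeddings + encoder + lm_head


total = count_parameters(cfg)
float32_bytes = total * 4  # "torch_dtype": "float32"
print(f"parameters: {total:,}")
print(f"float32 size in bytes: {float32_bytes:,}")
```

The estimate comes to roughly 355 million parameters, or about 1.42 GB in float32. That is within about 50 KB of the `model.safetensors` size recorded in this commit (1,421,696,540 bytes); the small remainder is presumably file metadata.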

merges.txt ADDED

The diff for this file is too large to render. See raw diff

model.safetensors ADDED

@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:1525b4630ae946c8ba6a5ad79351b710f9c710a0b54c2f7c340e51e520e6514d
+size 1421696540

pytorch_model.bin ADDED

@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:92008103ec901c94b87a4039a5f3a85927bd1fa9f5a78023e4f5057a01910489
+size 1421779250
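
The two weight files are stored as Git LFS pointers, each recording a SHA-256 digest (`oid`) and a byte count (`size`). A downloaded file can be checked against its pointer with a few lines of standard-library Python. This is a minimal sketch; the function name and the local path `model.safetensors` are illustrative assumptions.

```python
import hashlib
import os

# Pointer values for model.safetensors, copied from the diff above.
POINTER = {
    "oid": "1525b4630ae946c8ba6a5ad79351b710f9c710a0b54c2f7c340e51e520e6514d",
    "size": 1421696540,
}


def matches_lfs_pointer(path, expected_oid, expected_size, chunk_size=1 << 20):
    """Return True if the file at `path` has the given size and SHA-256 digest."""
    # Cheap check first: a size mismatch rules the file out without hashing.
    if os.path.getsize(path) != expected_size:
        return False
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Hash in chunks so a 1.4 GB file is never loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_oid


if __name__ == "__main__":
    local_path = "model.safetensors"  # hypothetical local download location
    if os.path.exists(local_path):
        print(matches_lfs_pointer(local_path, POINTER["oid"], POINTER["size"]))
```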

special_tokens_map.json ADDED

@@ -0,0 +1,51 @@

+{
+  "bos_token": {
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "cls_token": {
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "mask_token": {
+    "content": "<mask>",
+    "lstrip": true,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "<pad>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "sep_token": {
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "<unk>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  }
+}

tokenizer_config.json ADDED

@@ -0,0 +1,57 @@

+{
+  "add_prefix_space": false,
+  "added_tokens_decoder": {
+    "0": {
+      "content": "<s>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "<pad>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "</s>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "3": {
+      "content": "<unk>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "4": {
+      "content": "<mask>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "bos_token": "<s>",
+  "clean_up_tokenization_spaces": true,
+  "cls_token": "<s>",
+  "eos_token": "</s>",
+  "errors": "replace",
+  "mask_token": "<mask>",
+  "model_max_length": 512,
+  "pad_token": "<pad>",
+  "sep_token": "</s>",
+  "tokenizer_class": "RobertaTokenizer",
+  "trim_offsets": true,
+  "unk_token": "<unk>"
+}
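
The tokenizer settings above should agree with `config.json` on two points: the special-token ids and the maximum sequence length. The sketch below restates the relevant values from both files and checks them. It relies on the RoBERTa convention that position ids start at `pad_token_id + 1`, which is why 514 position embeddings correspond to 512 usable tokens.

```python
# From config.json.
config_ids = {"bos_token_id": 0, "pad_token_id": 1, "eos_token_id": 2}
max_position_embeddings = 514

# From tokenizer_config.json ("added_tokens_decoder" and the token roles).
added_tokens = {0: "<s>", 1: "<pad>", 2: "</s>", 3: "<unk>", 4: "<mask>"}
tokenizer_roles = {"bos_token": "<s>", "pad_token": "<pad>", "eos_token": "</s>"}
model_max_length = 512

# Each id in config.json should decode to the token the tokenizer
# assigns to the same role.
mismatches = [
    role
    for role, token in tokenizer_roles.items()
    if added_tokens[config_ids[f"{role}_id"]] != token
]
print("mismatched roles:", mismatches)

# Positions 0..pad_token_id are reserved, so the usable sequence length
# is the embedding table size minus (pad_token_id + 1).
usable_positions = max_position_embeddings - (config_ids["pad_token_id"] + 1)
print("usable positions:", usable_positions)
```

Both checks pass for the files in this commit: no roles are mismatched, and the usable length equals `model_max_length`.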

vocab.json ADDED

The diff for this file is too large to render. See raw diff