makiart committed on
Commit c066437 · verified · 1 Parent(s): 2ed9035

Update README.md

Files changed (1)
  1. README.md +68 -0
README.md CHANGED
@@ -9,6 +9,74 @@ pipeline_tag: fill-mask
 ---
 # makiart/multilingual-ModernBert-large-preview
 
+ # makiart/multilingual-ModernBert-base-preview
+
+ This model was developed by the [Algomatic](https://algomatic.jp/) team using computational resources provided by the [ABCI Generative AI Hackathon](https://abci.ai/event/2024/12/23/ja_abci_3.0_genai_hackathon.html).
+
+ - **Context Length:** 8192
+ - **Vocabulary Size:** 151,680
+ - **Total Training Tokens:** Approximately 250B tokens
+ - **Parameter Count:** 228M
+ - **Non-embedding Parameter Count:** 110M (a quick sanity check is sketched below)
+ - **Datasets:** fineweb and fineweb2
+
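The parameter figures above can be sanity-checked directly from the checkpoint. The snippet below is a minimal sketch, not part of the original card; it assumes the same `makiart/multilingual-ModernBert-large` repo id used in the usage example further down, and it counts only the input embedding matrix as "embedding" parameters, so the split is approximate.

```python
import torch
from transformers import AutoModelForMaskedLM

# Load the checkpoint in bf16 (same repo id as the usage example below).
model = AutoModelForMaskedLM.from_pretrained(
    "makiart/multilingual-ModernBert-large", torch_dtype=torch.bfloat16
)

total = sum(p.numel() for p in model.parameters())
embedding = model.get_input_embeddings().weight.numel()  # vocab_size x hidden_size

print(f"total parameters:         {total / 1e6:.0f}M")
print(f"non-embedding parameters: {(total - embedding) / 1e6:.0f}M")
```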
+ ## How to Use
+
+ Install the required package using:
+
+ ```bash
+ pip install -U "transformers>=4.48.0"
+ ```
+
+ If your GPU supports FlashAttention, you can achieve more efficient inference by installing:
+
+ ```bash
+ pip install flash-attn --no-build-isolation
+ ```
+
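With flash-attn available, you can ask transformers to use it when loading the model. This is a generic transformers option rather than something stated in this card; `attn_implementation="flash_attention_2"` is the standard argument, and omitting it falls back to the default attention backend.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Requires flash-attn and a supported GPU; drop attn_implementation otherwise.
model = AutoModelForMaskedLM.from_pretrained(
    "makiart/multilingual-ModernBert-large",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("makiart/multilingual-ModernBert-large")
```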
+ ## Example Usage
+
+ ```python
+ import torch
+ from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline
+
+ model = AutoModelForMaskedLM.from_pretrained("makiart/multilingual-ModernBert-large", torch_dtype=torch.bfloat16)
+ tokenizer = AutoTokenizer.from_pretrained("makiart/multilingual-ModernBert-large")
+ fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
+
+ results = fill_mask("We must learn to [MASK] that we can be nothing other than who we are here and now.")
+
+ for result in results:
+     print(result)
+
+ # {'score': 0.1328125, 'token': 4193, 'token_str': ' accept', 'sequence': 'We must learn to accept that we can be nothing other than who we are here and now.'}
+ # {'score': 0.1171875, 'token': 4411, 'token_str': ' believe', 'sequence': 'We must learn to believe that we can be nothing other than who we are here and now.'}
+ # {'score': 0.09130859375, 'token': 3535, 'token_str': ' understand', 'sequence': 'We must learn to understand that we can be nothing other than who we are here and now.'}
+ # {'score': 0.0712890625, 'token': 15282, 'token_str': ' recognize', 'sequence': 'We must learn to recognize that we can be nothing other than who we are here and now.'}
+ # {'score': 0.06298828125, 'token': 6099, 'token_str': ' remember', 'sequence': 'We must learn to remember that we can be nothing other than who we are here and now.'}
+ ```
+
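If you prefer not to use the pipeline, the same top-5 predictions can be reproduced from the raw logits. The following is a rough sketch of what `fill-mask` does internally, assuming the same repo id as above and that `[MASK]` in the prompt corresponds to the tokenizer's mask token (as the example above implies).

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model = AutoModelForMaskedLM.from_pretrained("makiart/multilingual-ModernBert-large", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("makiart/multilingual-ModernBert-large")
model.eval()

text = "We must learn to [MASK] that we can be nothing other than who we are here and now."
# Swap in the tokenizer's real mask token (a no-op if it is already "[MASK]").
inputs = tokenizer(text.replace("[MASK]", tokenizer.mask_token), return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the mask position and take the 5 most likely replacements.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
probs = logits[0, mask_pos].float().softmax(dim=-1)
top = probs.topk(5)

for score, token_id in zip(top.values[0], top.indices[0]):
    print(f"{score.item():.4f}  {tokenizer.decode(int(token_id))!r}")
```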
+ ## Model Description
+
+ - **Training Approach:** The model was trained with a two-stage Masked Language Modeling (MLM) process:
+   - **Masking Rate:** 30% (illustrated in the sketch after this list)
+   - **Training Data:** Approximately 200B tokens with a context length of 1024, followed by 50B tokens with a context length of 8192.
+ - **Tokenizer:** Based on Qwen2.5, the tokenizer features:
+   - A vocabulary size of 151,680 tokens.
+   - Customizations that distinguish indentation in code, enabling better handling of programming text.
+ - **Dataset:**
+   - Uses the fineweb and fineweb2 datasets.
+   - Data volume was reduced for high-resource languages.
+ - **Computational Resources:** Training was conducted on a single node (H200 x 8) provided by ABCI over approximately 3 days.
+
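The 30% masking rate is higher than the 15% used by the original BERT recipe. As a rough illustration only (this is not the team's actual training code), the same rate can be set through `mlm_probability` in transformers' `DataCollatorForLanguageModeling`:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("makiart/multilingual-ModernBert-large")

# Mask 30% of tokens per example, matching the rate stated above.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.3,
)

batch = collator([tokenizer("ModernBERT-style models handle contexts up to 8192 tokens.")])
print(batch["input_ids"])  # some ids replaced by the mask token (or random tokens)
print(batch["labels"])     # original ids at masked positions, -100 elsewhere
```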
+ ## Evaluation
+
+ A comprehensive evaluation has not been performed yet 😭.
+
+ Given the relatively small total training token count, the model may be less competitive than existing models.
+
+ ---
+
 This model was created by the [Algomatic](https://algomatic.jp/) team using computational resources provided at the [ABCI Generative AI Hackathon](https://abci.ai/event/2024/12/23/ja_abci_3.0_genai_hackathon.html).
 
 - Context length: 8192