sagorsarker committed on
Commit
91f838a
·
verified ·
1 Parent(s): 62fba91

Update README.md

Files changed (1)
  1. README.md +100 -41
README.md CHANGED
@@ -1,61 +1,120 @@
1
  ---
2
- library_name: transformers
3
- license: gemma
4
- base_model: google/gemma-2-2b
5
  tags:
6
- - llama-factory
7
- - full
8
- - generated_from_trainer
9
- model-index:
10
- - name: gemma-2-2B-4096-sample-2-22GB
11
- results: []
 
 
12
  ---
13
 
14
- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
15
- should probably proofread and complete it, then remove this comment. -->
16
 
17
- # gemma-2-2B-4096-sample-2-22GB
 
 
 
 
 
 
 
 
18
 
19
- This model is a fine-tuned version of [google/gemma-2-2b](https://huggingface.co/google/gemma-2-2b) on the sample_2 dataset.
20
 
21
- ## Model description
22
 
23
- More information needed
24
 
25
- ## Intended uses & limitations
 
26
 
27
- More information needed
 
 
 
 
 
 
28
 
29
- ## Training and evaluation data
 
 
 
 
 
30
 
31
- More information needed
32
 
33
- ## Training procedure
 
 
 
 
34
 
35
- ### Training hyperparameters
 
 
 
 
 
36
 
37
- The following hyperparameters were used during training:
38
- - learning_rate: 4e-05
39
- - train_batch_size: 3
40
- - eval_batch_size: 8
41
- - seed: 42
42
- - distributed_type: multi-GPU
43
- - num_devices: 8
44
- - gradient_accumulation_steps: 8
45
- - total_train_batch_size: 192
46
- - total_eval_batch_size: 64
47
- - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
48
- - lr_scheduler_type: cosine
49
- - lr_scheduler_warmup_ratio: 0.01
50
- - num_epochs: 1.0
51
 
52
- ### Training results
53
 
 
54
 
55
 
56
- ### Framework versions
 
 
 
57
 
58
- - Transformers 4.44.2
59
- - Pytorch 2.4.1+cu121
60
- - Datasets 2.21.0
61
- - Tokenizers 0.19.1
 
1
  ---
2
+ language:
3
+ - bn
 
4
  tags:
5
+ - hishab
6
+ - titulm
7
+ - pytorch
8
+ - gemma
9
+ - gemma-2
10
+ license: gemma
11
+ library_name: transformers
12
+ pipeline_tag: text-generation
13
  ---
14
 
15
+ ## Model Information
16
+
17
+ This model is a continually pretrained version of [google/gemma-2-2b](https://huggingface.co/google/gemma-2-2b), trained further on a large Bangla text corpus. The goal of the continual pretraining was to improve the model's ability to generate high-quality Bangla text. After this additional pretraining on Bangla data, the model outperforms the base model on several Bangla language-understanding benchmarks, as reported in the evaluation results.
18
+
19
+ **Model Architecture:** Gemma 2 is an auto-regressive language model that uses an optimized transformer architecture.
20
+
21
+ | | Training Data | Params | Input modalities | Output modalities | Context Length | Token count |
22
+ | :---- | :---- | :---- | :---- | :---- | :---- | :---- |
23
+ | Gemma 2 | Hishab curated Bangla text corpus | 2B | Monolingual Text (Bangla) | Monolingual Text (Bangla) | 4096 | 3B tokens |
24
+
25
+ ### How To Use
26
+
27
+ Below are code snippets to help you get started quickly with running the model. First, install the Transformers library:
28
+ ```sh
29
+ pip install -U transformers
30
+ ```
31
+
32
+ Then, copy the snippet from the section that is relevant to your use case.
33
+
34
+ #### Running with the `pipeline` API
35
+
36
+ ```python
37
+ import torch
38
+ from transformers import pipeline
39
+
40
+ pipe = pipeline(
41
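+     "text-generation",
42
+     model="titulm-gemma-2-2b-v1.0",  # local path or Hub repo id; add the owner namespace when loading from the Hub
43
+     device="cuda",  # requires a CUDA-capable GPU; use "cpu" otherwise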
44
+ )
45
+
46
+ text = "আমাদের দেশের নাম"
47
+ outputs = pipe(text, max_new_tokens=2048)
48
+ response = outputs[0]["generated_text"]
49
+ print(response)
50
+ ```
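The model table lists a context length of 4096, while the snippet above requests `max_new_tokens=2048`. The following is a minimal sketch of how the generation budget could be capped for long prompts, assuming the 4096 figure bounds prompt tokens plus generated tokens. The helper name `clamp_new_tokens` is hypothetical and is not part of Transformers.

```python
def clamp_new_tokens(prompt_tokens: int, requested_new_tokens: int,
                     context_length: int = 4096) -> int:
    """Return how many new tokens fit after the prompt within the context window.

    Assumption: context_length counts prompt tokens plus generated tokens.
    """
    if prompt_tokens < 0 or requested_new_tokens < 0:
        raise ValueError("token counts must be non-negative")
    # Room left in the window once the prompt is accounted for (never negative).
    remaining = max(context_length - prompt_tokens, 0)
    return min(requested_new_tokens, remaining)


# A 3000-token prompt leaves less room than the 2048 tokens requested.
print(clamp_new_tokens(prompt_tokens=3000, requested_new_tokens=2048))
```

The returned value can be passed as `max_new_tokens` to the pipeline call once the prompt length has been measured with the model's tokenizer.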
51
+
52
+
53
+ ## Hardware and Software
54
+
55
+ **Training Factors:** We used the [llama-factory]() training library, a cloud GPU cluster, and production infrastructure for pretraining. Fine-tuning, annotation, and evaluation were also performed on cloud infrastructure.
56
+
57
+
58
+ ## Training Data
59
+
60
+ **Overview:** We collected a large raw Bangla text dataset from a wide variety of sources. The data collected so far includes web documents, books, translated text, transliterated text, transcribed text, code-mixed text, conversations, and open-source raw data. The dataset is cleaned and filtered using several filtering criteria to ensure data quality. The collected data totals roughly 268 GB. From this we separated __22 GB__, sampled according to the ratio of each source's actual data size. The model was trained on __3B__ tokens in total.
61
 
62
+ Data sources summary:
63
+ - Web documents: Extracted, cleaned, and filtered Common Crawl data
64
+ - Books: Extracted, cleaned, and filtered book data
65
+ - Transcribed text: Bangla audio data transcribed with an in-house Bangla ASR model
66
+ - Translation data: We trained a Bangla-English translation LLM and used it to translate English data into Bangla
67
+ - Code-mixed data: We trained a Bangla-English code-mixed LLM and used it to generate code-mixed data
68
+ - Transliteration data: We trained a Bangla-English transliteration LLM and used it to generate transliterated data
69
+ - Synthetic data: We generated synthetic data using a Bangla LLM
70
+ - Others: We scraped data from selected websites, used open-source data, and drew on some other data sources
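The overview says the 22 GB training subset was drawn from the roughly 268 GB corpus using a ratio of the actual data size. One way such a split could be computed is sketched below. The per-source sizes are hypothetical placeholders (the source does not report them), and `proportional_budget` is an illustrative helper, not the authors' tooling.

```python
def proportional_budget(source_sizes_gb: dict[str, float],
                        target_gb: float) -> dict[str, float]:
    """Split target_gb across sources in proportion to each source's size."""
    total = sum(source_sizes_gb.values())
    if total <= 0:
        raise ValueError("total source size must be positive")
    return {name: target_gb * size / total
            for name, size in source_sizes_gb.items()}


# Hypothetical per-source sizes that sum to 268 GB (not reported in the source).
sizes = {
    "web": 150.0,
    "books": 40.0,
    "transcribed": 30.0,
    "synthetic_and_other": 48.0,
}
budget = proportional_budget(sizes, target_gb=22.0)
for name, gb in budget.items():
    print(f"{name}: {gb:.2f} GB")
```

Proportional sampling keeps the source mix of the subset the same as that of the full corpus, which is the simplest reading of the ratio-based split described above.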
71
 
 
72
 
73
+ ## Benchmarks \- Bangla Text
74
 
75
+ In this section, we report results for the __titulm-gemma-2-2b-v1.0__ model on standard automatic benchmarks. All evaluations were run with the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) library.
76
 
77
+ ### Evaluation Datasets
78
+ We evaluated our pretrained models on both Bangla and English benchmark datasets. Although the model is trained on Bangla data, its English capability is also evaluated on English benchmark datasets. The evaluation datasets are as follows:
79
 
80
+ #### Bangla Benchmark datasets
81
+ We evaluated the models on the following datasets:
82
+ - [Bangla MMLU](): A private multiple-choice question dataset developed by Hishab and curated from various sources.
83
+ - [CommonsenseQa Bangla](https://huggingface.co/datasets/hishab/commonsenseqa-bn): A Bangla translation of the CommonsenseQA dataset. The dataset was translated using a new method called Expressive Semantic Translation (EST), which combines Google Machine Translation with LLM-based rewriting modifications.
84
+ - [OpenbookQA Bangla](https://huggingface.co/datasets/hishab/openbookqa-bn): A Bangla translation of the OpenbookQA dataset. The dataset was translated using a new method called Expressive Semantic Translation (EST), which combines Google Machine Translation with LLM-based rewriting modifications.
85
+ - [Piqa Bangla](https://huggingface.co/datasets/hishab/piqa-bn): A Bangla translation of the Piqa dataset. The dataset was translated using a new method called Expressive Semantic Translation (EST), which combines Google Machine Translation with LLM-based rewriting modifications.
86
+ - [BoolQ Bangla](https://huggingface.co/datasets/hishab/boolq_bn): The dataset contains 15,942 examples, with each entry consisting of a triplet: (question, passage, answer). The questions are naturally occurring, generated from unprompted and unconstrained settings. Input passages were sourced from Bangla Wikipedia, Banglapedia, and News Articles, and GPT-4 was used to generate corresponding yes/no questions with answers.
87
 
88
+ #### English Benchmark datasets
89
+ - [MMLU](https://huggingface.co/datasets/cais/mmlu): This is a massive multitask test consisting of multiple-choice questions from various branches of knowledge.
90
+ - [CommonsenseQA](https://huggingface.co/datasets/tau/commonsense_qa): CommonsenseQA is a multiple-choice question answering dataset that requires different types of commonsense knowledge to predict the correct answers.
91
+ - [OpenbookQA](https://huggingface.co/datasets/allenai/openbookqa): OpenBookQA aims to promote research in advanced question-answering, probing a deeper understanding of both the topic (with salient facts summarized as an open book, also provided with the dataset) and the language it is expressed in.
92
+ - [Piqa](https://huggingface.co/datasets/ybisk/piqa): The PIQA dataset focuses on physical commonsense reasoning, challenging AI to handle everyday situations requiring practical knowledge and unconventional solutions. Inspired by instructables.com, it aims to enhance AI's ability to understand and reason about physical interactions.
93
+ - [BoolQ](https://huggingface.co/datasets/google/boolq): BoolQ is a question answering dataset for yes/no questions containing 15,942 examples. These questions are naturally occurring: they are generated in unprompted and unconstrained settings. Each example is a triplet of (question, passage, answer), with the title of the page as optional additional context. The text-pair classification setup is similar to existing natural language inference tasks.
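As a rough illustration of how the English benchmarks listed above could be run with lm-evaluation-harness in 0-shot and 5-shot settings, the sketch below assembles the `lm_eval` command line. The task names, batch size, and model path are assumptions; the source does not state the exact configuration, and the task names used for the Bangla datasets are not given, so they are left out.

```python
import shlex


def build_lm_eval_command(model_path: str, tasks: list[str],
                          num_fewshot: int, batch_size: int = 8) -> list[str]:
    """Assemble an lm-evaluation-harness CLI invocation as an argument list."""
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={model_path}",
        "--tasks", ",".join(tasks),
        "--num_fewshot", str(num_fewshot),
        "--batch_size", str(batch_size),
    ]


# Assumed harness task names for the English benchmarks.
english_tasks = ["mmlu", "commonsense_qa", "openbookqa", "piqa", "boolq"]

for shots in (0, 5):
    # The model path is a placeholder; point it at a local copy or Hub repo id.
    cmd = build_lm_eval_command("titulm-gemma-2-2b-v1.0", english_tasks, shots)
    print(shlex.join(cmd))
```

Each printed line can be pasted into a shell, or the argument list can be passed directly to `subprocess.run`.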
94
 
95
+ ### Evaluation Results
96
 
97
+ #### Evaluation on Bangla Benchmark datasets
98
+ - **gemma-2-2b** performs better in **Bangla MMLU** and **BoolQ BN** in the 0-shot setting.
99
+ - **titulm-gemma-2-2b-v1.0** outperforms in **Commonsense QA BN**, **OpenBook QA BN**, and **PIQA BN** across both 0-shot and 5-shot settings.
100
+ - In the 5-shot setting, **titulm-gemma-2-2b-v1.0** achieves the highest scores in **BoolQ BN**, **Commonsense QA BN**, and **OpenBook QA BN**.
101
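+ - **PIQA BN** scores are stable between the 0-shot and 5-shot settings for each model (0.56 and 0.56 for **gemma-2-2b**; 0.63 and 0.62 for **titulm-gemma-2-2b-v1.0**), with **titulm-gemma-2-2b-v1.0** leading in both settings.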
102
 
103
+ | Model | Shots | Bangla MMLU | BoolQ BN | Commonsense QA BN | OpenBook QA BN | PIQA BN |
104
+ |--------------------------|---------|-------------|----------|-------------------|----------------|---------|
105
+ | gemma-2-2b | 0-shot | **0.32** | **0.63** | 0.26 | 0.34 | 0.56 |
106
+ | | 5-shot | **0.35** | 0.46 | 0.28 | 0.33 | 0.56 |
107
+ | titulm-gemma-2-2b-v1.0 | 0-shot | 0.31 | 0.59 | **0.31** | **0.36** | **0.63**|
108
+ | | 5-shot | 0.35 | **0.59** | **0.41** | **0.37** | **0.62**|
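To compare the two models at a glance, the table values can be reduced to one unweighted mean per model and shot setting. This is only a sketch: the aggregate is not reported by the authors, and it weights all five benchmarks equally regardless of dataset size.

```python
# Scores copied from the table, in column order:
# Bangla MMLU, BoolQ BN, Commonsense QA BN, OpenBook QA BN, PIQA BN.
scores = {
    ("gemma-2-2b", 0): [0.32, 0.63, 0.26, 0.34, 0.56],
    ("gemma-2-2b", 5): [0.35, 0.46, 0.28, 0.33, 0.56],
    ("titulm-gemma-2-2b-v1.0", 0): [0.31, 0.59, 0.31, 0.36, 0.63],
    ("titulm-gemma-2-2b-v1.0", 5): [0.35, 0.59, 0.41, 0.37, 0.62],
}


def mean(values: list[float]) -> float:
    """Unweighted arithmetic mean."""
    return sum(values) / len(values)


averages = {key: mean(vals) for key, vals in scores.items()}
for (model, shots), avg in averages.items():
    print(f"{model} ({shots}-shot): {avg:.3f}")
```

Under this equal weighting, the means are 0.422 and 0.440 in the 0-shot setting and 0.396 and 0.468 in the 5-shot setting, for gemma-2-2b and titulm-gemma-2-2b-v1.0 respectively.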
109
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
110
 
111
+ #### Evaluation on English Benchmark datasets
112
 
113
+ ### Instruction Tuned Models
114
 
115
 
116
+ ### Intended Use
117
+ - Bangla text generation
118
+ - Bangla language understanding tasks
119
+ - Bangla instruction fine-tuning tasks
120