zicsx committed on
Commit 7474839
1 Parent(s): af57bb0

Update README.md

Files changed (1): README.md (+369 -102)

---
license: osl-3.0
model-index:
- name: indus_1.175B
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: AI2 Reasoning Challenge (25-Shot)
      type: ai2_arc
      config: ARC-Challenge
      split: test
      args:
        num_few_shot: 25
    metrics:
    - type: acc_norm
      value: 22.7
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=nickmalhotra/ProjectIndus
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: HellaSwag (10-Shot)
      type: hellaswag
      split: validation
      args:
        num_few_shot: 10
    metrics:
    - type: acc_norm
      value: 25.04
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=nickmalhotra/indus_1.175B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU (5-Shot)
      type: cais/mmlu
      config: all
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 23.12
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=nickmalhotra/indus_1.175B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: TruthfulQA (0-shot)
      type: truthful_qa
      config: multiple_choice
      split: validation
      args:
        num_few_shot: 0
    metrics:
    - type: mc2
      value: 0.0
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=nickmalhotra/indus_1.175B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: Winogrande (5-shot)
      type: winogrande
      config: winogrande_xl
      split: validation
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 49.57
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=nickmalhotra/indus_1.175B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GSM8k (5-shot)
      type: gsm8k
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 0.0
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=nickmalhotra/indus_1.175B
      name: Open LLM Leaderboard
widget:
- example_title: वर्तमान प्रधानमंत्री
  messages:

    होली का महत्व क्या है?
---

# Project Indus

<!-- Provide a quick summary of what the model is/does. -->
Project Indus LLM is a groundbreaking open-source language model tailored for Hindi and its dialects, designed to enhance natural language processing and generation across diverse Indian linguistic applications.

# Table of Contents

- [Table of Contents](#table-of-contents)
- [Model Details](#model-details)
  - [Model Description](#model-description)
- [Uses](#uses)
  - [Direct Use](#direct-use)
  - [Downstream Use](#downstream-use)
  - [Out-of-Scope Use](#out-of-scope-use)
- [Bias, Risks, and Limitations](#bias-risks-and-limitations)
  - [Recommendations](#recommendations)
- [Training Details](#training-details)
  - [Training Data](#training-data)
  - [Training Procedure](#training-procedure)
    - [Preprocessing](#preprocessing)
- [Evaluation](#evaluation)
  - [Results](#results)
- [Technical Specifications](#technical-specifications)
  - [Model Architecture and Objective](#model-architecture-and-objective)
  - [Compute Infrastructure](#compute-infrastructure)
    - [Hardware](#hardware)
    - [Software](#software)
- [Citation](#citation)
- [Glossary](#glossary)
- [More Information](#more-information)
- [Authors](#authors)
- [Model Card Contact](#model-card-contact)
- [How to Get Started with the Model](#how-to-get-started-with-the-model)

# Model Details

## Model Description

Project Indus LLM aims to provide a robust language model for Indian languages, starting with Hindi and its dialects. This open-source foundational model, hosted on Hugging Face, is tailored for easy integration and further development by researchers and developers focusing on Indian linguistic diversity.

<!-- Provide a longer summary of what this model is/does. -->
The model is pretrained on Hindi and its dialects and then instruction-tuned.

- **Developed by:** Nikhil Malhotra, Nilesh Brahme, Satish Mishra, Vinay Sharma (Makers Lab, Tech Mahindra)
- **Model type:** Foundational language model
- **Language(s) (NLP):** hin, bho, mai, doi
- **License:** other
- **Parent Model:** None; it is a ground-up model built on the GPT-2 architecture, from the tokenizer through the decoder
- **Resources for more information:** <https://www.techmahindra.com/en-in/innovation/the-indus-project/>

# Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
Uses include question answering and conversation in Hindi and its dialects. The model will be reward-tuned for use across various industries:

1. Call center
2. Healthcare
3. Automotive
4. Telecom

## Direct Use

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
Project Indus can be used directly for generating text, simulating conversation, and other text-generation tasks without additional training.
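
For quick experimentation, a minimal text-generation call might look like the sketch below. It assumes the public `nickmalhotra/ProjectIndus` checkpoint and uses illustrative generation settings; the full chat-template example in [How to Get Started with the Model](#how-to-get-started-with-the-model) remains the reference usage.

```python
# Minimal sketch: plain text generation with the transformers pipeline.
# Generation settings are illustrative, not the model's recommended defaults.
from transformers import pipeline

generator = pipeline("text-generation", model="nickmalhotra/ProjectIndus")

prompt = "भारत के वर्तमान प्रधानमंत्री कौन हैं?"
outputs = generator(prompt, max_new_tokens=100, do_sample=True, temperature=0.7)
print(outputs[0]["generated_text"])
```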

## Downstream Use

<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->

Uses include question answering and conversation in Hindi and its dialects. The model will be reward-tuned for use across various industries such as the following (a fine-tuning sketch follows this list):

1. Call center
2. Healthcare
3. Automotive
4. Telecom
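
As one illustration of downstream adaptation, the sketch below fine-tunes the checkpoint on a hypothetical domain corpus with the Hugging Face `Trainer`. The dataset name, text column, and hyperparameters are placeholders, not part of the released training setup.

```python
# Hedged sketch: supervised fine-tuning on a hypothetical domain corpus.
# "my_org/call_center_hi" and its "text" column are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "nickmalhotra/ProjectIndus"
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # GPT-2-style tokenizers often lack a pad token
model = AutoModelForCausalLM.from_pretrained(model_id)

dataset = load_dataset("my_org/call_center_hi", split="train")  # hypothetical dataset

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="indus-callcenter",
    per_device_train_batch_size=4,
    num_train_epochs=1,
    learning_rate=2e-5,
)

Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator).train()
```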

## Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
Project Indus is not designed for high-stakes decision-making tasks such as medical diagnosis or legal advice, nor is it currently suited to fill-in-the-blank exercises, multiple-choice Q&A, and similar applications.

# Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

Significant research has explored bias and fairness issues with language models
(see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)).
Predictions generated by the model may include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups.
We have tried to mitigate various biases by removing them from the training data; however, because this is a generative model, it can still produce hallucinations.
Any disturbing or harmful stereotype produced by the model is purely unintentional and coincidental.

## Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

Users (both direct and downstream) should be made aware of the model's biases, risks, and limitations. Regular updates, along with community feedback, are crucial for addressing any emergent bias or misuse scenarios.

# Training Details

The model was trained on a curated dataset comprising various sources of Hindi text, including literature, news articles, and web content.

## Infrastructure

- **Training Infrastructure:** High-performance computing resources provided by CDAC, featuring NVIDIA A100 GPUs.
- **Running Infrastructure:** Tested in both GPU (NVIDIA GeForce RTX 3070 or higher) and CPU (Intel Xeon Platinum 8580) environments.

## Training Data

<!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
Project Indus LLM was trained on a diverse and extensive dataset comprising various sources of Hindi text and its dialects. The data collection and curation process was designed to address the linguistic diversity and complexity of Indian languages, with a particular focus on Hindi and its 37 dialects.

### Data Sources and Collection

Data was collected in three main buckets (a small scraping-and-extraction sketch follows this list):

1. **Open-Source Hindi Data**: Publicly available sources from the internet across categories such as news and non-news. Automated scripts were used to scrape and extract text from web pages. Some of the sources:
   - **News**: Articles from major Hindi news portals like Amar Ujala, Bhaskar, India TV, Jagran, Live Hindustan, and Patrika.
   - **Non-News**: Diverse sources including Wikipedia, commoncrawl.org, and other culturally significant content like 'Man ki Baat' from AIR.

2. **Translated Data**: A portion of the Pile dataset, a large English dataset used for training AI models, was translated into Hindi using three different translation models. IndicTrans2 (AI4Bharat) was selected as the best model for this purpose based on its accuracy and efficiency.

3. **Dialects**: Data collection for dialects presented a unique challenge due to the limited material available on the internet. Data for major dialects like Maithili, Bhojpuri, Magahi, and Braj Bhasha was collected from multiple sources, including fieldwork in which representatives collected old books and other texts that were then digitized and converted into text data.
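
The sketch below illustrates the kind of automated extraction script mentioned above. The URL is a placeholder and the actual collection pipeline is not published, so treat this purely as an example of pulling visible text from a news page.

```python
# Illustrative sketch only: fetch a page and keep its visible article text.
# The URL is a placeholder; the real collection scripts are not part of this card.
import requests
from bs4 import BeautifulSoup

def extract_text(url: str) -> str:
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    # Drop scripts, styles, and navigation chrome before extracting text.
    for tag in soup(["script", "style", "nav", "header", "footer"]):
        tag.decompose()
    paragraphs = [p.get_text(" ", strip=True) for p in soup.find_all("p")]
    return "\n".join(p for p in paragraphs if p)

if __name__ == "__main__":
    print(extract_text("https://example.com/hindi-news-article")[:500])
```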

## Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

Training involved extensive preprocessing to clean and standardize the text, followed by supervised learning on a high-performance computing setup.

- **Pre-training:** Conducted on a dataset of 22 billion tokens using advanced tokenization techniques.
- **Fine-tuning:** Supervised fine-tuning performed with a focus on Indian languages, utilizing datasets specifically tailored for cultural, political, and social contexts.

Below is a table summarizing the datasets used for pre-training and fine-tuning the model:

| Phase        | Data Source                                   | Tokens     | Notes                                              |
|--------------|-----------------------------------------------|------------|----------------------------------------------------|
| Pre-training | Cleaned dataset of Hindi and dialects         | 22 billion | Utilized advanced tokenization                     |
| Fine-tuning  | Custom datasets tailored for Indian languages | Varied     | Focus on cultural, political, and social contexts  |

Training and inference infrastructure are described under [Infrastructure](#infrastructure) above.

### Preprocessing

The collected data underwent several stages of cleaning and preprocessing to ensure high quality and usability for training:

- **Cleaning**: The data was cleaned of unwanted text, characters, and personal information such as mobile numbers. Transliteration was performed where necessary, and unwanted tags from scraped web pages were removed.
- **Bias Removal**: A Bias Removal Toolkit was developed to detect and remove biased language from the training data, helping to ensure that the text used for training the model was ethical, correct, and socially responsible.
- **Tokenization**: The data was tokenized using a custom tokenizer developed specifically for Hindi and its dialects, based on Byte Pair Encoding (BPE) with additional mechanisms such as byte fallback to handle the peculiarities of Hindi script efficiently. A cleaning-and-tokenizer-training sketch follows this list.
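
A rough sketch of the cleaning and tokenizer-training steps is shown below, assuming SentencePiece as the BPE implementation with byte fallback. The regular expressions, file names, and the vocabulary size of 32,300 (taken from the architecture section) are illustrative; the actual cleaning rules and tokenizer training setup are not published.

```python
# Hedged sketch of the cleaning and tokenizer-training steps described above.
# File names and regexes are placeholders; only the overall approach is illustrated.
import re
import sentencepiece as spm

PHONE_RE = re.compile(r"(\+91[-\s]?)?\d{10}")  # strip Indian-style mobile numbers
TAG_RE = re.compile(r"<[^>]+>")                # strip leftover HTML tags

def clean_line(line: str) -> str:
    line = TAG_RE.sub(" ", line)
    line = PHONE_RE.sub(" ", line)
    return re.sub(r"\s+", " ", line).strip()

with open("raw_hindi.txt", encoding="utf-8") as src, \
     open("clean_hindi.txt", "w", encoding="utf-8") as dst:
    for line in src:
        cleaned = clean_line(line)
        if cleaned:
            dst.write(cleaned + "\n")

# BPE tokenizer with byte fallback, so rare Devanagari sequences never map to <unk>.
spm.SentencePieceTrainer.train(
    input="clean_hindi.txt",
    model_prefix="indus_bpe",
    vocab_size=32300,
    model_type="bpe",
    byte_fallback=True,
    character_coverage=0.9995,
)
```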

#### Summary

The final dataset used for training consisted of:

- **Raw Data Size**: Over 500 GB of raw data collected.
- **Cleaned and Curated Data**: Approximately 200 GB of clean Hindi and dialect text data.
- **Tokenization**: 22 billion tokens created from the cleaned data for pre-training.

This diverse and extensive training data foundation allowed Project Indus LLM to develop robust capabilities for understanding and generating Hindi text, making it a powerful tool for applications requiring Indian language processing.

# Evaluation

## Indic LLM Leaderboard Results

Project Indus LLM has been evaluated using the Indic LLM Leaderboard, which employs the `indic_eval` evaluation framework specifically designed for assessing models on Indian language tasks. This framework provides a comprehensive view of model performance across a variety of benchmarks tailored to Indian languages.

Detailed results from the Indic LLM Leaderboard (α), accessible at [Hugging Face Indic LLM Leaderboard](https://huggingface.co/spaces/Cognitive-Lab/indic_llm_leaderboard), are shown below:

| Task                             | Version | Metric   | Value  |   | Stderr |
|----------------------------------|---------|----------|--------|---|--------|
| All                              |         | acc      | 0.2891 | ± | 0.0109 |
|                                  |         | acc_norm | 0.3013 | ± | 0.0112 |
| indiceval:ARC-Challenge:hindi:10 | 0       | acc      | 0.2167 | ± | 0.0120 |
|                                  |         | acc_norm | 0.2474 | ± | 0.0126 |
| indiceval:ARC-Easy:hindi:5       | 0       | acc      | 0.3615 | ± | 0.0099 |
|                                  |         | acc_norm | 0.3552 | ± | 0.0098 |

These results highlight the model's capabilities in understanding and generating Hindi language text under controlled testing conditions. The standard error values indicate the variance observed during the evaluation, providing insight into the consistency of the model's performance across evaluation runs.

## Open LLM Leaderboard Evaluation Results

Additionally, Project Indus LLM has been evaluated on the Open LLM Leaderboard, which provides another layer of benchmarking by comparing the model's performance against other state-of-the-art language models. Below are the summarized results from the Open LLM Leaderboard:

| Metric                            | Value |
|-----------------------------------|------:|
| Avg.                              | 20.07 |
| AI2 Reasoning Challenge (25-Shot) | 22.70 |
| HellaSwag (10-Shot)               | 25.04 |
| MMLU (5-Shot)                     | 23.12 |
| TruthfulQA (0-shot)               |  0.00 |
| Winogrande (5-shot)               | 49.57 |
| GSM8k (5-shot)                    |  0.00 |

These benchmark results can be explored further on the [Hugging Face Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard). A sketch of how such scores can be reproduced locally follows.
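
The Open LLM Leaderboard runs the EleutherAI LM Evaluation Harness; a minimal local reproduction for one task might look like the sketch below. Exact task names, harness version, and few-shot settings can differ from the leaderboard's configuration, so treat this as an assumption-laden starting point rather than the official evaluation recipe.

```python
# Hedged sketch: score the checkpoint on ARC-Challenge (25-shot) with lm-eval (v0.4+).
# Task naming and defaults may not match the leaderboard's exact configuration.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=nickmalhotra/ProjectIndus",
    tasks=["arc_challenge"],
    num_fewshot=25,
    batch_size=8,
)
print(results["results"]["arc_challenge"])
```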

## Evaluation Context

The evaluation metrics `acc` (accuracy) and `acc_norm` (normalized accuracy) quantify the model's performance. Tasks are differentiated by their difficulty and the specific dataset used, such as the ARC-Challenge and ARC-Easy sets, both adapted to Hindi to ensure relevant assessment. This structured evaluation checks that Indus LLM not only performs well on generalized text-generation tasks but also in more specialized, context-specific scenarios pertinent to the Indian linguistic landscape.

## Results

Project Indus demonstrates competitive performance, particularly in text generation tasks, as evidenced by its scores on standardized benchmarks.

# Technical Specifications

## Model Architecture and Objective

Project Indus LLM is based on a GPT-2-like architecture, tailored to handle the complexities of the Hindi language and its dialects. It was designed to serve as a foundational model that can be fine-tuned for various applications, making it highly versatile and adaptable to different domains within the Indian context.

- **Architecture Details** (see the configuration sketch after this list):
  - **Layers**: 22 transformer layers, providing a deep network capable of modelling complex language patterns.
  - **Heads**: 32 attention heads per layer, giving a broad attention mechanism across different parts of the input.
  - **Embedding Size**: 2048, allowing the model to represent a wide variety of information and nuances in the data.
  - **Vocabulary Size**: 32,300, tailored to include a comprehensive set of Hindi words and common phrases found in the training data.
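
For orientation, these dimensions map onto a standard `transformers` GPT-2 configuration roughly as sketched below; the context length (`n_positions`) is an assumption, and the released checkpoint's actual configuration may differ in detail.

```python
# Rough sketch of a GPT-2-style config with the dimensions listed above.
# n_positions (block size) is an assumption, not a published figure.
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=32_300,   # custom Hindi/dialect BPE vocabulary
    n_layer=22,          # transformer layers
    n_head=32,           # attention heads per layer
    n_embd=2048,         # embedding size (head dim = 2048 / 32 = 64)
    n_positions=1024,    # assumed block size
)
model = GPT2LMHeadModel(config)
print(f"{model.num_parameters() / 1e9:.2f}B parameters")
```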

The objective of this model is to provide a robust tool for text generation and understanding in Hindi and its dialects, supporting the development of applications that require natural language processing in these languages. It also aims to bridge the gap where Indian languages are underrepresented in technology, providing a platform for further linguistic research and technological inclusion.

## Compute Infrastructure

### Hardware

The pre-training and fine-tuning of Project Indus LLM were conducted on high-performance computing infrastructure provided by the Centre for Development of Advanced Computing (CDAC). This setup included:

- **Nodes and GPUs**: Six nodes, each equipped with eight NVIDIA A100 GPUs. These GPUs are state-of-the-art for machine learning tasks and provide the computational power needed to handle large volumes of data and complex model architectures.
- **Memory and Storage**: Each node had ample memory and storage to handle the datasets and model parameters efficiently; configurations included 40 GB of GPU memory per card, essential for training large models.

Inference performance was tested on both GPU and CPU (a simple throughput-measurement sketch follows):

- **GPU**: On an NVIDIA GeForce RTX 3070, generating 250-350 tokens takes roughly 5-10 seconds.
- **CPU**: On an Intel Xeon Platinum 8580, performance is comparable to GPU, with throughput above 30 tokens/second.
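
Throughput on a given machine can be estimated with a small timing loop like the one below; the prompt, token budget, and device handling are illustrative, and measured numbers will vary with hardware, batch size, and generation settings.

```python
# Illustrative sketch for measuring generation throughput (tokens/second).
# Prompt and generation length are arbitrary; results depend heavily on hardware.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("nickmalhotra/ProjectIndus")
model = AutoModelForCausalLM.from_pretrained("nickmalhotra/ProjectIndus").to(device)

inputs = tokenizer("भारत के वर्तमान प्रधानमंत्री कौन हैं?", return_tensors="pt").to(device)

start = time.perf_counter()
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=300, do_sample=False)
elapsed = time.perf_counter() - start

new_tokens = output.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens} tokens in {elapsed:.1f}s -> {new_tokens / elapsed:.1f} tokens/s")
```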

### Software

The software environment was crucial for efficiently training and running the model. Key components included:

- **Operating System**: Linux, chosen for its stability and support for high-performance computing tasks.
- **Machine Learning Frameworks**: PyTorch, used for its flexibility and efficiency in training deep learning models. It supports extensive parallel processing and GPU acceleration, which are critical for training large models like Project Indus LLM.
- **Job Scheduler**: SLURM (Simple Linux Utility for Resource Management), used to manage and allocate resources effectively across the distributed system, ensuring optimal scheduling of training jobs without resource contention.

# Citation

<!-- If there is a paper or blog post introducing the model, the APA and BibTeX information for that should go in this section. -->

Please use the following citation when referencing Project Indus LLM in academic or professional settings, acknowledging the work of the team behind it.

```bibtex
@article{malhotra2024projectindus,
  title   = {Project Indus: A Foundational Model for Indian Languages},
  author  = {Malhotra, Nikhil and Brahme, Nilesh and Mishra, Satish and Sharma, Vinay},
  journal = {Tech Mahindra Makers Lab},
  year    = {2024},
  url     = {https://www.techmahindra.com/en-in/innovation/the-indus-project/}
}
```

**APA:**
Malhotra, N., Brahme, N., Mishra, S., & Sharma, V. (2024). Project Indus: A Foundational Model for Indian Languages. *Tech Mahindra Makers Lab*. Available at <https://www.techmahindra.com/en-in/innovation/the-indus-project/>

# Glossary

<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->

This glossary explains key terms used throughout the model documentation and technical details, helping users unfamiliar with certain concepts to better understand the content.

- **Transformer Layers**: Part of a neural network architecture that uses self-attention mechanisms to process sequential data such as text. Essential for NLP tasks.
- **Attention Heads**: Sub-units of a model layer that allow the model to focus on different parts of the input sequence when making predictions.
- **Embedding Size**: The size of the vector used to represent each token or word in a dense numerical form. Larger embeddings can capture more detailed information.
- **Block Size**: The maximum length of the input tokens the model can process in one operation.
- **Vocabulary Size**: The total number of unique words or tokens that the model can understand and generate.

# More Information

For further details on Project Indus LLM, including additional documentation, tutorials, and community discussions, visit the following resources:

- **Project Repository**: [Hugging Face Repository](https://huggingface.co/nickmalhotra/ProjectIndus)
- **Tech Mahindra Makers Lab**: Insights into the research and development behind Project Indus can be found on the [Tech Mahindra Innovation page](https://www.techmahindra.com/en-in/innovation/makers-lab/).
- **Community Forums**: Engage with the community on the [model's discussion board](https://huggingface.co/nickmalhotra/ProjectIndus/discussions?status=open&type=discussion) for support, brainstorming, and sharing of new ideas related to Project Indus.

# Authors

<!-- This section provides another layer of transparency and accountability. Whose views is this model card representing? How many voices were included in its construction? Etc. -->

The model card and documentation for Project Indus LLM were collaboratively authored by:

- [**Nikhil Malhotra**](https://huggingface.co/nickmalhotra): Chief Innovation Officer at Tech Mahindra.
- [**Nilesh Brahme**](https://huggingface.co/nbrahme): Senior AI Research Scientist and one of the primary contributors to the Project Indus development.
- [**Satish Mishra**](https://huggingface.co/zicsx): AI Architect, whose insights have significantly shaped the model's capabilities.
- **Vinay Sharma**: LLM Engineer focused on the linguistic data processing and model training aspects of Project Indus.

# Model Card Contact

For inquiries, support, or further information regarding Project Indus LLM, please reach out through the following channels:

- **Email**: [[email protected]](mailto:[email protected]) - For direct queries and professional engagements.
- **GitHub Issues**: For technical issues, feature requests, or contributions, please use the Issues section of the [Project Indus GitHub repository](https://github.com/Tech-Mahindra-Makers-Lab/Indus-1.1B).
- **Hugging Face Spaces**: Questions and discussions related to model implementation and community projects can be posted in our dedicated space on Hugging Face.

# How to Get Started with the Model

To begin using Project Indus LLM in your projects, follow these steps to set up and run the model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("nickmalhotra/ProjectIndus")
tokenizer = AutoTokenizer.from_pretrained("nickmalhotra/ProjectIndus")

# Example inference
def format_template(user_prompt):
    messages = [
        {"role": "user", "content": user_prompt},
    ]
    response = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
    return response

user_prompt = """भारत के वर्तमान प्रधानमंत्री कौन हैं?"""

input_ids = format_template(user_prompt)

# Generate text using the model
output = model.generate(input_ids,
                        eos_token_id=tokenizer.eos_token_id,
                        pad_token_id=tokenizer.eos_token_id,
                        max_length=1024,
                        num_beams=5,
                        do_sample=True,
                        early_stopping=True,
                        temperature=0.7,
                        top_k=50,
                        top_p=0.95,
                        repetition_penalty=1.2,
                        no_repeat_ngram_size=3,
                        num_return_sequences=1,
                        )
print(tokenizer.decode(output[0], skip_special_tokens=False))
```

## Disclaimer

### Model Limitations

Project Indus LLM is trained with single instruction tuning, which may result in hallucinations: instances where the model generates plausible but inaccurate information. Users should exercise caution, especially in scenarios requiring high factual accuracy.

### Adaptation for Specific Use Cases

Project Indus LLM is designed as a foundational model suitable for further development and fine-tuning. Users are encouraged to adapt and refine the model to meet the specific requirements of their applications.

### Recommendations for Fine-Tuning

- **Identify Specific Needs**: Clearly define the requirements of your use case to guide the fine-tuning process.
- **Curate Targeted Data**: Ensure the training data is relevant and of high quality to improve model performance.
- **Continuous Evaluation**: Regularly assess the model's performance during and after fine-tuning to maintain accuracy and reduce biases.

This disclaimer is intended to give users a clear understanding of the model's capabilities and limitations, facilitating its effective application and development.