zicsx committed on
Commit
82d5c4e
1 Parent(s): 0fce1f0

Update README.md

Files changed (1)
  1. README.md +181 -116
README.md CHANGED
@@ -105,24 +105,20 @@ model-index:
105
  name: Open LLM Leaderboard
106
  ---
107
  ---
108
- # Model Card for Project Indus
109
-
110
- <!-- Provide a quick summary of what the model is/does. [Optional] -->
111
- The model is a pretrained model in Hindi and dialects which is instruct tuned .
112
-
113
 
 
114
 
 
 
115
 
116
- # Table of Contents
117
 
118
- - [Model Card for Indus](#model-card-for--model_id-)
119
  - [Table of Contents](#table-of-contents)
120
- - [Table of Contents](#table-of-contents-1)
121
  - [Model Details](#model-details)
122
  - [Model Description](#model-description)
123
  - [Uses](#uses)
124
  - [Direct Use](#direct-use)
125
- - [Downstream Use [Optional]](#downstream-use-optional)
126
  - [Out-of-Scope Use](#out-of-scope-use)
127
  - [Bias, Risks, and Limitations](#bias-risks-and-limitations)
128
  - [Recommendations](#recommendations)
@@ -130,7 +126,6 @@ The model is a pretrained model in Hindi and dialects which is instruct tuned .
130
  - [Training Data](#training-data)
131
  - [Training Procedure](#training-procedure)
132
  - [Preprocessing](#preprocessing)
133
- - [Speeds, Sizes, Times](#speeds-sizes-times)
134
  - [Evaluation](#evaluation)
135
  - [Testing Data, Factors & Metrics](#testing-data-factors--metrics)
136
  - [Testing Data](#testing-data)
@@ -138,41 +133,39 @@ The model is a pretrained model in Hindi and dialects which is instruct tuned .
138
  - [Metrics](#metrics)
139
  - [Results](#results)
140
  - [Model Examination](#model-examination)
141
- - [Environmental Impact](#environmental-impact)
142
- - [Technical Specifications [optional]](#technical-specifications-optional)
143
  - [Model Architecture and Objective](#model-architecture-and-objective)
144
  - [Compute Infrastructure](#compute-infrastructure)
145
  - [Hardware](#hardware)
146
  - [Software](#software)
147
  - [Citation](#citation)
148
- - [Glossary [optional]](#glossary-optional)
149
- - [More Information [optional]](#more-information-optional)
150
- - [Model Card Authors [optional]](#model-card-authors-optional)
151
  - [Model Card Contact](#model-card-contact)
152
  - [How to Get Started with the Model](#how-to-get-started-with-the-model)
153
 
154
-
155
  # Model Details
156
 
157
  ## Model Description
158
 
 
 
159
  <!-- Provide a longer summary of what this model is/does. -->
160
- TThe model is a pretrained model in Hindi and dialects which is instruct tuned.
161
 
162
  - **Developed by:** Nikhil Malhotra, Nilesh Brahme, Satish Mishra, Vinay Sharma (Makers Lab, TechMahindra)
163
  - **Model type:** Foundational Language model
164
  - **Language(s) (NLP):** hin, bho, mai, doi
165
  - **License:** other
166
  - **Parent Model:** Built from the ground up on the GPT-2 architecture, from tokenizer to decoder
167
- - **Resources for more information:** https://www.techmahindra.com/en-in/innovation/the-indus-project/
168
 
169
-
170
-
171
-
172
  # Uses
173
 
174
  <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
175
  Uses include question answering and conversation in Hindi and its dialects. The model would be reward-tuned for use across various industries:
 
176
  1. Call center
177
  2. Healthcare
178
  3. Automotive
@@ -182,200 +175,272 @@ Uses include question and answeting and conversation in Hindi and Dialects. The
182
 
183
  <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
184
  <!-- If the user enters content, print that. If not, but they enter a task in the list, use that. If neither, say "more info needed." -->
185
- Direct use is as a foundationla model on Hindi and dialects
186
-
187
-
188
 
189
- ## Downstream Use [Optional]
190
 
191
  <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
192
  <!-- If the user enters content, print that. If not, but they enter a task in the list, use that. If neither, say "more info needed." -->
193
- Uses include question and answeting and conversation in Hindi and Dialects. The model would be reward tuned to be used across various industries
 
 
194
  1. Call center
195
  2. Healthcare
196
  3. Automotive
197
  4. Telecom
198
 
199
-
200
-
201
-
202
  ## Out-of-Scope Use
203
 
204
  <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
205
  <!-- If the user enters content, print that. If not, but they enter a task in the list, use that. If neither, say "more info needed." -->
206
- Cannot be used for fill in the blanks, Multiple Q&A etc. at the moment
207
-
208
-
209
 
210
  # Bias, Risks, and Limitations
211
 
212
  <!-- This section is meant to convey both technical and sociotechnical limitations. -->
213
 
214
  Significant research has explored bias and fairness issues with language models
215
- (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)).
216
  Predictions generated by the model may include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups.
217
  We have tried to remove various biases from the training data. However, since the model is generative, it may still produce hallucinations.
218
- Any disturbing or harmful sterotype produced by the model is purely un-intentional and coincidental.
219
-
220
 
221
  ## Recommendations
222
 
223
  <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
224
- Recommendation is to not use biases and negative connotation for the model
225
 
 
226
 
227
- ##Indic EVAL Leaderboard
228
 
229
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/63a336a6b5fc9ab9f63e0805/flXcUdwLvi4YH3lfx7Sir.png)
230
 
231
- # Training Details
 
 
 
232
 
233
  ## Training Data
234
 
235
  <!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
 
 
 
 
 
 
 
 
 
236
 
237
- More information on training data needed
238
 
 
239
 
240
  ## Training Procedure
241
 
242
  <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
243
 
244
- ### Preprocessing
245
 
246
- More information needed
 
247
 
248
- ### Speeds, Sizes, Times
249
 
250
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
 
 
 
251
 
252
- More information needed
253
-
254
- # Evaluation
 
255
 
256
- <!-- This section describes the evaluation protocols and provides the results. -->
257
 
258
- ## Testing Data, Factors & Metrics
 
 
259
 
260
- ### Testing Data
261
 
262
- <!-- This should link to a Data Card if possible. -->
263
 
264
- More information needed
 
 
265
 
 
266
 
267
- ### Factors
268
 
269
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
270
 
271
- More information needed
272
 
273
- ### Metrics
274
 
275
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
 
 
 
 
 
 
 
276
 
277
- More information needed
278
 
279
- ## Results
280
 
281
- More information needed
282
 
283
- # Model Examination
 
 
 
 
 
 
284
 
285
- More information needed
286
 
287
- # Environmental Impact
288
 
289
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
290
 
291
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
292
 
293
- - **Hardware Type:** More information needed
294
- - **Hours used:** More information needed
295
- - **Cloud Provider:** More information needed
296
- - **Compute Region:** More information needed
297
- - **Carbon Emitted:** More information needed
298
 
299
- # Technical Specifications [optional]
300
 
301
  ## Model Architecture and Objective
302
 
303
- More information needed
 
 
 
 
 
 
 
 
304
 
305
  ## Compute Infrastructure
306
 
307
- More information needed
 
 
308
 
309
- ### Hardware
 
310
 
311
- More information needed
312
 
313
- ### Software
314
 
315
- More information needed
 
 
316
 
317
  # Citation
318
 
319
  <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
320
 
321
- **BibTeX:**
322
 
323
- More information needed
 
 
 
 
 
 
 
 
324
 
325
  **APA:**
 
326
 
327
- More information needed
328
-
329
- # Glossary [optional]
330
 
331
  <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
332
 
333
- More information needed
334
-
335
- # More Information [optional]
336
-
337
- More information needed
338
 
339
- # Model Card Authors [optional]
 
 
 
 
340
 
341
- <!-- This section provides another layer of transparency and accountability. Whose views is this model card representing? How many voices were included in its construction? Etc. -->
342
 
343
- Nikhil Malhotra, Nilesh Brahme, Vinay Sharma, Satish Mishra
344
 
345
- # Model Card Contact
 
 
346
 
347
- More information needed
348
 
349
- # How to Get Started with the Model
350
 
351
- Use the code below to get started with the model.
352
- # Use a pipeline as a high-level helper
353
- from transformers import pipeline
354
- pipe = pipeline("text-generation", model="nickmalhotra/Indus_1.175B")
355
 
356
- # Load model directly
357
- from transformers import AutoTokenizer, AutoModelForCausalLM
 
 
358
 
359
- tokenizer = AutoTokenizer.from_pretrained("nickmalhotra/Indus_1.175B")
360
 
361
- model = AutoModelForCausalLM.from_pretrained("nickmalhotra/Indus_1.175B")
362
 
363
- <details>
364
- <summary> Click to expand </summary>
 
365
 
366
- More information needed
367
 
368
- </details>
369
- # [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
370
- Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_nickmalhotra__indus_1.175B)
371
 
372
- | Metric |Value|
373
- |---------------------------------|----:|
374
- |Avg. |20.07|
375
- |AI2 Reasoning Challenge (25-Shot)|22.70|
376
- |HellaSwag (10-Shot) |25.04|
377
- |MMLU (5-Shot) |23.12|
378
- |TruthfulQA (0-shot) | 0.00|
379
- |Winogrande (5-shot) |49.57|
380
- |GSM8k (5-shot) | 0.00|
381
 
105
  name: Open LLM Leaderboard
106
  ---
107
  ---
 
 
 
 
 
108
 
109
+ # Model Card for Project Indus
110
 
111
+ <!-- Provide a quick summary of what the model is/does. -->
112
+ Project Indus LLM is an open-source language model for Hindi and its dialects, designed for natural language processing and generation across diverse Indian linguistic applications.
113
 
114
+ # Table of Contents
115
 
 
116
  - [Table of Contents](#table-of-contents)
 
117
  - [Model Details](#model-details)
118
  - [Model Description](#model-description)
119
  - [Uses](#uses)
120
  - [Direct Use](#direct-use)
121
+ - [Downstream Use](#downstream-use)
122
  - [Out-of-Scope Use](#out-of-scope-use)
123
  - [Bias, Risks, and Limitations](#bias-risks-and-limitations)
124
  - [Recommendations](#recommendations)
 
126
  - [Training Data](#training-data)
127
  - [Training Procedure](#training-procedure)
128
  - [Preprocessing](#preprocessing)
 
129
  - [Evaluation](#evaluation)
130
  - [Testing Data, Factors & Metrics](#testing-data-factors--metrics)
131
  - [Testing Data](#testing-data)
 
133
  - [Metrics](#metrics)
134
  - [Results](#results)
135
  - [Model Examination](#model-examination)
136
+ - [Technical Specifications](#technical-specifications)
 
137
  - [Model Architecture and Objective](#model-architecture-and-objective)
138
  - [Compute Infrastructure](#compute-infrastructure)
139
  - [Hardware](#hardware)
140
  - [Software](#software)
141
  - [Citation](#citation)
142
+ - [Glossary](#glossary)
143
+ - [More Information](#more-information)
144
+ - [Model Card Authors](#model-card-authors)
145
  - [Model Card Contact](#model-card-contact)
146
  - [How to Get Started with the Model](#how-to-get-started-with-the-model)
147
 
 
148
  # Model Details
149
 
150
  ## Model Description
151
 
152
+ Project Indus LLM aims to provide a robust language model for Indian languages, starting with Hindi and its dialects. This open-source foundational model, hosted on Hugging Face, is tailored for easy integration and further development by researchers and developers focusing on Indian linguistic diversity.
153
+
154
  <!-- Provide a longer summary of what this model is/does. -->
155
+ The model is pretrained on Hindi and its dialects and is instruction-tuned.
156
 
157
  - **Developed by:** Nikhil Malhotra, Nilesh Brahme, Satish Mishra, Vinay Sharma (Makers Lab, TechMahindra)
158
  - **Model type:** Foundational Language model
159
  - **Language(s) (NLP):** hin, bho, mai, doi
160
  - **License:** other
161
  - **Parent Model:** Built from the ground up on the GPT-2 architecture, from tokenizer to decoder
162
+ - **Resources for more information:** <https://www.techmahindra.com/en-in/innovation/the-indus-project/>
163
 
 
 
 
164
  # Uses
165
 
166
  <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
167
  Uses include question answering and conversation in Hindi and its dialects. The model would be reward-tuned for use across various industries:
168
+
169
  1. Call center
170
  2. Healthcare
171
  3. Automotive
 
175
 
176
  <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
177
  <!-- If the user enters content, print that. If not, but they enter a task in the list, use that. If neither, say "more info needed." -->
178
+ Project Indus can be directly used for generating text, simulating conversation, and other text generation tasks without additional training.
 
 
179
 
180
+ ## Downstream Use
181
 
182
  <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
183
  <!-- If the user enters content, print that. If not, but they enter a task in the list, use that. If neither, say "more info needed." -->
184
+
185
+ Uses include question answering and conversation in Hindi and its dialects. The model would be reward-tuned for use across various industries:
186
+
187
  1. Call center
188
  2. Healthcare
189
  3. Automotive
190
  4. Telecom
191
 
 
 
 
192
  ## Out-of-Scope Use
193
 
194
  <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
195
  <!-- If the user enters content, print that. If not, but they enter a task in the list, use that. If neither, say "more info needed." -->
196
+ Project Indus is not designed for high-stakes decision-making tasks such as medical diagnosis or legal advice, nor can it be used for fill-in-the-blank exercises, multiple Q&A, and similar applications at the moment.
 
 
197
 
198
  # Bias, Risks, and Limitations
199
 
200
  <!-- This section is meant to convey both technical and sociotechnical limitations. -->
201
 
202
  Significant research has explored bias and fairness issues with language models
203
+ (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)).
204
  Predictions generated by the model may include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups.
205
  We have tried to remove various biases from the training data. However, since the model is generative, it may still produce hallucinations.
206
+ Any disturbing or harmful stereotype produced by the model is purely unintentional and coincidental.
 
207
 
208
  ## Recommendations
209
 
210
  <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
 
211
 
212
+ We recommend avoiding biased or negatively connoted prompts when using the model. Regular updates and community feedback are crucial for addressing any emergent bias or misuse scenarios.
213
 
214
+ # Training Details
215
 
216
+ The model was trained on a curated dataset comprising various sources of Hindi text, including literature, news articles, and web content.
217
 
218
+ ## Infrastructure
219
+
220
+ - **Training Infrastructure:** Utilized high-performance computing resources provided by CDAC, featuring NVIDIA A100 GPUs.
221
+ - **Running Infrastructure:** Tested for both GPU (NVIDIA GeForce RTX 3070 or higher) and CPU (Intel Xeon Platinum 8580) environments.
222
 
223
  ## Training Data
224
 
225
  <!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
226
+ The Project Indus LLM was trained on a diverse and extensive dataset comprising various sources of Hindi text and its dialects. The data collection and curation process was meticulously designed to cater to the linguistic diversity and complexity of Indian languages, particularly focusing on Hindi and its 37 dialects.
227
+
228
+ ### Data Sources and Collection
229
+
230
+ Data was collected in three main buckets:
231
+
232
+ 1. **Open-Source Hindi Data**: This included publicly available sources from the internet across two categories, news and non-news. Automated scripts were used to scrape and extract text from web pages. Here are some of the sources:
233
+ - **News**: Articles from major Hindi news portals like Amar Ujala, Bhaskar, India TV, Jagran, Live Hindustan, and Patrika.
234
+ - **Non-News**: Diverse sources including Wikipedia, commoncrawl.org, and other culturally significant content like 'Man ki Baat' from AIR.
235
 
236
+ 2. **Translated Data**: A portion of the Pile dataset, which is a large English dataset used for training AI models, was translated into Hindi using three different translation models. IndicTrans2 (AI4Bharat) was selected as the best model for this purpose based on its accuracy and efficiency.
237
 
238
+ 3. **Dialects**: Data collection for dialects presented a unique challenge due to the limited material available on the internet. Data for major dialects like Maithili, Bhojpuri, Magahi, and Braj Bhasha was collected from multiple sources, including fieldwork where representatives collected old books and other texts, which were then digitized and converted into text data.
239
 
240
  ## Training Procedure
241
 
242
  <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
243
 
244
+ Training involved extensive preprocessing to clean and standardize the text, followed by supervised learning on a high-performance computing setup.
245
 
246
+ - **Pre-training:** Conducted on a dataset of 22 billion tokens produced by a custom BPE tokenizer with byte fallback.
247
+ - **Fine-Tuning:** Supervised fine-tuning performed with a focus on Indian languages, utilizing datasets specifically tailored for cultural, political, and social contexts.
248
 
249
+ Below is a table summarizing the datasets used for pre-training and fine-tuning the model:
250
 
251
+ | Phase | Data Source | Tokens | Notes |
252
+ |---------------|-----------------------------------------|-----------|-----------------------------------------------------|
253
+ | Pre-training | Cleaned dataset of Hindi and dialects | 22 billion| Utilized advanced tokenization |
254
+ | Fine-tuning | Custom datasets tailored for Indian languages | Varied | Focus on cultural, political, and social contexts |
255
 
256
+ - **Training Infrastructure:** Utilized high-performance computing resources provided by CDAC, featuring NVIDIA A100 GPUs.
257
+ - **Running Infrastructure:** Tested for both GPU (NVIDIA GeForce RTX 3070 or higher) and CPU (Intel Xeon Platinum 8580) environments.
258
+
259
+ ### Preprocessing
260
 
261
+ The collected data underwent several stages of cleaning and preprocessing to ensure high quality and usability for training:
262
 
263
+ - **Cleaning**: The data was cleaned of unwanted text, characters, and personal information like mobile numbers. Transliteration was performed where necessary, and unwanted tags from scraped web pages were removed.
264
+ - **Bias Removal**: A Bias Removal Toolkit was developed to detect and remove biased language from the training data. This toolkit helped in ensuring that the text used for training the model was ethical, correct, and socially responsible.
265
+ - **Tokenization**: The data was tokenized using a custom tokenizer developed specifically for Hindi and its dialects. This tokenizer was based on Byte Pair Encoding (BPE) with additional mechanisms like byte fallback to handle the peculiarities of Hindi script efficiently.
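The byte-fallback mechanism mentioned in the tokenization step can be illustrated with a minimal sketch. It is not the project's tokenizer: the toy vocabulary, the greedy longest-match lookup (standing in for learned BPE merges), and the `<0xNN>` byte-token spelling are all assumptions made for illustration. The point is only that text not covered by the vocabulary is emitted as UTF-8 byte tokens instead of a single unknown token, so decoding stays lossless.

```python
# Toy illustration of byte fallback; vocabulary and token format are hypothetical.
TOY_VOCAB = {"नम", "न", "म"}  # hypothetical subword vocabulary


def encode(text, vocab=TOY_VOCAB):
    """Greedy longest-match tokenization with UTF-8 byte fallback."""
    max_len = max(len(piece) for piece in vocab)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest vocabulary piece that matches at position i.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            # No piece matched: fall back to one token per UTF-8 byte.
            tokens.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
    return tokens


def decode(tokens):
    """Invert encode(): byte tokens are turned back into raw bytes."""
    raw = bytearray()
    for tok in tokens:
        if len(tok) == 6 and tok.startswith("<0x") and tok.endswith(">"):
            raw.append(int(tok[3:5], 16))
        else:
            raw.extend(tok.encode("utf-8"))
    return raw.decode("utf-8")


tokens = encode("नमक")  # "क" is not in the toy vocabulary
print(tokens)
print(decode(tokens) == "नमक")
```

Because Devanagari characters occupy three bytes each in UTF-8, every character that falls back costs three tokens, which is one reason a vocabulary trained on Hindi text is more efficient than relying on fallback alone.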
266
 
267
+ #### Summary
268
 
269
+ The final dataset used for training consisted of:
270
 
271
+ - **Raw Data Size**: Over 500 GB of raw data collected.
272
+ - **Cleaned and Curated Data**: Approximately 200 GB of clean Hindi and dialect text data.
273
+ - **Tokenization**: Utilized 22 billion tokens created from the cleaned data for pre-training.
274
 
275
+ This diverse and extensive training data foundation allowed Project Indus LLM to develop robust capabilities for understanding and generating Hindi text, making it a powerful tool for applications requiring Indian language processing.
276
 
277
+ # Evaluation
278
 
279
+ ### Indic LLM Leaderboard Results
280
 
281
+ Project Indus LLM has been evaluated using the Indic LLM Leaderboard, which employs the `indic_eval` evaluation framework specifically designed for assessing models on Indian language tasks. This framework provides a comprehensive view of model performance across a variety of benchmarks tailored to Indian languages.
282
 
283
+ Detailed results from the Indic LLM Leaderboard (α), accessible at [Hugging Face Indic LLM Leaderboard](https://huggingface.co/spaces/Cognitive-Lab/indic_llm_leaderboard), are shown below:
284
 
285
+ | Task | Version | Metric | Value | | Stderr |
286
+ |--------------------------------|---------|----------|-------|---|--------|
287
+ | All | | acc | 0.2891| ± | 0.0109 |
288
+ | | | acc_norm | 0.3013| ± | 0.0112 |
289
+ | indiceval:ARC-Challenge:hindi:10 | 0 | acc | 0.2167| ± | 0.0120 |
290
+ | | | acc_norm | 0.2474| ± | 0.0126 |
291
+ | indiceval:ARC-Easy:hindi:5 | 0 | acc | 0.3615| ± | 0.0099 |
292
+ | | | acc_norm | 0.3552| ± | 0.0098 |
293
 
294
+ These results report the model's accuracy on Hindi multiple-choice benchmarks under controlled testing conditions. The standard error values quantify the sampling uncertainty of each accuracy estimate over the benchmark's test questions, rather than variation between repeated evaluation runs.
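As a rough consistency check on the table above, the reported `Stderr` values are close to the binomial standard error of an accuracy estimate, sqrt(p(1 - p) / n). The sketch below assumes the Hindi tasks keep the test-set sizes of the original English ARC benchmarks (1,172 questions for ARC-Challenge and 2,376 for ARC-Easy). Those sizes are not stated in this card, and the function name is illustrative.

```python
import math


def binomial_stderr(accuracy, n_examples):
    """Standard error of an accuracy estimated from n independent examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_examples)


# (reported accuracy, reported stderr, assumed number of test examples)
reported = {
    "ARC-Challenge (Hindi)": (0.2167, 0.0120, 1172),
    "ARC-Easy (Hindi)": (0.3615, 0.0099, 2376),
}

for task, (acc, stderr, n) in reported.items():
    estimate = binomial_stderr(acc, n)
    print(f"{task}: reported {stderr:.4f}, binomial estimate {estimate:.4f}")
```

Under that assumption the estimates agree closely with the reported values, which is what one would expect if the standard error reflects sampling over test questions.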
295
 
296
+ ### Open LLM Leaderboard Evaluation Results
297
 
298
+ Additionally, Project Indus LLM has been evaluated on the Open LLM Leaderboard, which provides another layer of benchmarking by comparing the model's performance against other state-of-the-art language models. Below are the summarized results from the Open LLM Leaderboard:
299
 
300
+ | Metric |Value|
301
+ |---------------------------------|----:|
302
+ |Avg. |20.07|
303
+ |AI2 Reasoning Challenge (25-Shot)|22.70|
304
+ |HellaSwag (10-Shot) |25.04|
305
+ |MMLU (5-Shot) |23.12|
306
+ |Winogrande (5-shot) |49.57|
307
 
308
+ These benchmark results can be explored further on [Hugging Face Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).
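The `Avg.` row can be checked against the per-task scores. A short sketch, assuming the average is an unweighted mean over all six leaderboard tasks, including the TruthfulQA (0-shot) and GSM8k (5-shot) scores of 0.00 that appear in the earlier version of this table:

```python
# Per-task Open LLM Leaderboard scores as listed in this model card.
scores = {
    "AI2 Reasoning Challenge (25-Shot)": 22.70,
    "HellaSwag (10-Shot)": 25.04,
    "MMLU (5-Shot)": 23.12,
    "TruthfulQA (0-shot)": 0.00,
    "Winogrande (5-shot)": 49.57,
    "GSM8k (5-shot)": 0.00,
}

# Unweighted mean over all six tasks (an assumption about how Avg. is computed).
average = sum(scores.values()) / len(scores)
print(f"{average:.2f}")
```

The reported 20.07 is therefore a six-task average, not an average of only the four rows currently shown.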
309
 
310
+ ### Evaluation Context
311
 
312
+ The evaluation metrics `acc` (accuracy) and `acc_norm` (normalized accuracy, which in harnesses of this kind typically means accuracy after length-normalizing each answer choice's log-likelihood) are used to quantify the model's performance. The tasks differ in difficulty and in the dataset used: the ARC-Challenge and ARC-Easy sets, both adapted to Hindi. This structured evaluation is intended to assess the Indus LLM not only on generalized text generation tasks but also on more specialized, context-specific scenarios relevant to Indian languages.
313
 
314
+ ## Results
315
 
316
+ Project Indus scores 0.2891 accuracy (0.3013 normalized) averaged over the Indic LLM Leaderboard tasks and 20.07 on average on the Open LLM Leaderboard, as reported in the tables above.
 
 
 
 
317
 
318
+ # Technical Specifications
319
 
320
  ## Model Architecture and Objective
321
 
322
+ Project Indus LLM is based on a GPT-2-like architecture, tailored to handle the complexities of the Hindi language and its dialects. It is designed as a foundational model that can be fine-tuned for various applications and domains within the Indian context.
323
+
324
+ - **Architecture Details**:
325
+ - **Layers**: 22 transformer layers, which provide a deep neural network capable of understanding complex language patterns.
326
+ - **Heads**: 32 attention heads per layer, facilitating a broad attention mechanism across different parts of the input data.
327
+ - **Embedding Size**: 2048, which allows the model to represent a wide variety of information and nuances in the data.
328
+ - **Vocabulary Size**: 32,300, tailored to include a comprehensive set of Hindi words and common phrases found in the training data.
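These hyperparameters are enough for a back-of-the-envelope parameter count. The sketch below assumes the standard GPT-2 layout (fused QKV projection, 4x MLP expansion, biases, two LayerNorms per block, tied input/output embeddings) and a context length of 1,024 positions. The context length is not stated in this card and is a placeholder, as is the function name.

```python
def gpt2_param_count(n_layer, d_model, vocab_size, n_positions):
    """Count parameters of a GPT-2-style decoder with tied embeddings."""
    # Attention: fused QKV projection plus output projection, each with bias.
    attention = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # MLP: expand to 4 * d_model and project back, each with bias.
    mlp = (d_model * 4 * d_model + 4 * d_model) + (4 * d_model * d_model + d_model)
    layer_norms = 2 * (2 * d_model)  # weight and bias for each of two LayerNorms
    per_block = attention + mlp + layer_norms
    embeddings = vocab_size * d_model + n_positions * d_model
    final_layer_norm = 2 * d_model
    return n_layer * per_block + embeddings + final_layer_norm


N_POSITIONS = 1024  # assumption: not given in the model card

total = gpt2_param_count(n_layer=22, d_model=2048, vocab_size=32300, n_positions=N_POSITIONS)
head_dim = 2048 // 32  # embedding size split across 32 attention heads
print(f"{total:,} parameters, {head_dim} dimensions per head")
```

Under these assumptions the total comes to roughly 1.18 billion parameters, which is consistent with the 1.1B and 1.175B figures in the repository names used in this card.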
329
+
330
+ The objective of this model is to provide a robust tool for text generation and understanding in Hindi and its dialects, supporting the development of applications that require natural language processing in these languages. It also aims to bridge the gap in technology where Indian languages are underrepresented, providing a platform for further linguistic research and technological inclusion.
331
 
332
  ## Compute Infrastructure
333
 
334
+ ### Hardware
335
+
336
+ The pre-training and fine-tuning of Project Indus LLM were conducted on high-performance computing infrastructure provided by the Centre for Development of Advanced Computing (CDAC). This setup included:
337
 
338
+ - **Nodes and GPUs**: Utilization of six nodes, each equipped with eight NVIDIA A100 GPUs. These GPUs are state-of-the-art for machine learning tasks and provide the necessary computational power to handle the large volumes of data and complex model architectures.
339
+ - **Memory and Storage**: Each node was equipped with ample memory and storage to handle the datasets and model parameters efficiently. Specific configurations included 40 GB of GPU memory per card, essential for training large models.
340
 
341
+ ### Software
342
 
343
+ The software environment was crucial for efficiently training and running the model. Key components included:
344
 
345
+ - **Operating System**: Linux, chosen for its stability and support for high-performance computing tasks.
346
+ - **Machine Learning Frameworks**: PyTorch, used for its flexibility and efficiency in training deep learning models. It supports extensive parallel processing and GPU acceleration, which are critical for training large models like Project Indus LLM.
347
+ - **Job Scheduler**: SLURM (Simple Linux Utility for Resource Management) was used to manage and allocate resources effectively across the distributed system. This ensured optimal scheduling of training jobs without resource contention.
348
 
349
  # Citation
350
 
351
  <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
352
 
353
+ The detailed citation information will help in acknowledging the work and efforts of the team behind Project Indus LLM when it is used or referenced in academic or professional settings.
354
 
355
+ ```bibtex
356
+ @article{malhotra2024projectindus,
357
+ title={Project Indus: A Foundational Model for Indian Languages},
358
+ author={Malhotra, Nikhil and Brahme, Nilesh and Mishra, Satish and Sharma, Vinay},
359
+ journal={Tech Mahindra Makers Lab},
360
+ year={2024},
361
+ url={https://www.techmahindra.com/en-in/innovation/the-indus-project/}
362
+ }
363
+ ```
364
 
365
  **APA:**
366
+ Malhotra, N., Brahme, N., Mishra, S., & Sharma, V. (2024). Project Indus: A Foundational Model for Indian Languages. *Tech Mahindra Makers Lab*. Available at <https://www.techmahindra.com/en-in/innovation/the-indus-project/>
367
 
368
+ # Glossary
 
 
369
 
370
  <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
371
 
372
+ This glossary section explains key terms used throughout the model documentation and technical details, helping users unfamiliar with certain concepts to better understand the content.
 
 
 
 
373
 
374
+ - **Transformer Layers**: Part of a neural network architecture that uses self-attention mechanisms to process sequential data such as text. Essential for NLP tasks.
375
+ - **Attention Heads**: Sub-units of a model layer that allow the model to focus on different parts of the input sequence when making predictions.
376
+ - **Embedding Size**: The size of the vector used to represent each token or word in a dense numerical form. Larger embeddings can capture more detailed information.
377
+ - **Block Size**: The maximum length of the input tokens the model can process in one operation.
378
+ - **Vocabulary Size**: The total number of unique words or tokens that the model can understand and generate.
379
 
380
+ # More Information
381
 
382
+ For further details on Project Indus LLM, including additional documentation, tutorials, and community discussions, visit the following resources:
383
 
384
+ - **Project Repository**: [Hugging Face Repository](https://huggingface.co/nickmalhotra/ProjectIndus)
385
+ - **Tech Mahindra Makers Lab**: Insights into the research and development behind Project Indus can be found on the [Tech Mahindra Innovation page](https://www.techmahindra.com/en-in/innovation/makers-lab/).
386
+ - **Community Forums**: Engage with the community on [Hugging Face Forums](https://huggingface.co/nickmalhotra/ProjectIndus/discussions?status=open&type=discussion) for support, brainstorming, and sharing of new ideas related to Project Indus.
387
 
388
+ # Model Card Authors
389
 
390
+ <!-- This section provides another layer of transparency and accountability. Whose views is this model card representing? How many voices were included in its construction? Etc. -->
391
 
392
+ The model card and documentation for Project Indus LLM were collaboratively authored by:
 
 
 
393
 
394
+ - **Nikhil Malhotra**: Chief Innovation Officer at Tech Mahindra.
395
+ - **Nilesh Brahme**: Senior AI Research Scientist and one of the primary contributors to the Project Indus development.
396
+ - [**Satish Mishra**](https://huggingface.co/zicsx): AI Architect, whose insights have significantly shaped the model's capabilities.
397
+ - **Vinay Sharma**: LLM Engineer focused on the linguistic data processing and model training aspects of Project Indus.
398
 
399
+ # Model Card Contact
400
 
401
+ For inquiries, support, or further information regarding Project Indus LLM, please reach out through the following channels:
402
 
403
+ - **Email**: [[email protected]](mailto:[email protected]) - For direct queries and professional engagements.
404
+ - **GitHub Issues**: For technical issues, feature requests, or contributions, please use the Issues section of the [Project Indus GitHub repository](https://github.com/Tech-Mahindra-Makers-Lab/Indus-1.1B).
405
+ - **Hugging Face Spaces**: Questions and discussions related to model implementation and community projects can be posted in our dedicated space on Hugging Face.
406
 
407
+ # How to Get Started with the Model
408
 
409
+ To begin using Project Indus LLM for your projects, follow these steps to set up and run the model:
 
 
410
 
411
+ Load the model directly:
 
 
 
 
 
 
 
 
412
 
413
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ # AutoModelForCausalLM attaches the language-modeling head that generate() needs
+ model = AutoModelForCausalLM.from_pretrained("makers-lab/Indus-1.1B-IT")
+ tokenizer = AutoTokenizer.from_pretrained("makers-lab/Indus-1.1B-IT")
+
+ # Example inference
+ def format_template(user_prompt):
+     # Wrap the prompt in the chat format and return the token IDs as a tensor
+     messages = [
+         {"role": "user", "content": user_prompt},
+     ]
+     response = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
+     return response
+
+ user_prompt = """भारत के वर्तमान प्रधानमंत्री कौन हैं?"""
+
+ input_ids = format_template(user_prompt)
+
+ # Generate text using the model (beam search combined with sampling)
+ output = model.generate(input_ids,
+     eos_token_id=tokenizer.eos_token_id,
+     pad_token_id=tokenizer.eos_token_id,
+     max_length=1024,
+     num_beams=5,
+     do_sample=True,
+     early_stopping=True,
+     temperature=0.7,
+     top_k=50,
+     top_p=0.95,
+     repetition_penalty=1.2,
+     no_repeat_ngram_size=3,
+     num_return_sequences=1,
+ )
+ # Special tokens are kept so the chat template markers stay visible in the output
+ print(tokenizer.decode(output[0], skip_special_tokens=False))