nickmalhotra
committed on
Commit a510b5a
1 Parent(s): 309a82d
Update README.md
README.md
CHANGED
@@ -240,7 +240,7 @@ The Project Indus LLM was trained on a diverse and extensive dataset comprising
 Data was collected in three main buckets:
 
 1. **Open-Source Hindi Data**: This included publicly available sources from the internet across different categories such as news, and non-news. Automated scripts were used to scrape and extract text from web pages. Here are some of the sources:
-- **News**: Articles from
+- **News**: Articles from news portals.
 - **Non-News**: Diverse sources including Wikipedia, commoncrawl.org, and other culturally significant content like 'Man ki Baat' from AIR.
 
 2. **Translated Data**: A portion of the Pile dataset, which is a large English dataset used for training AI models, was translated into Hindi using three different translation models. IndicTrans2 (AI4Bharat) was selected as the best model for this purpose based on its accuracy and efficiency.