nickmalhotra commited on
Commit
a510b5a
1 Parent(s): 309a82d

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +1 -1
README.md CHANGED
@@ -240,7 +240,7 @@ The Project Indus LLM was trained on a diverse and extensive dataset comprising
240
  Data was collected in three main buckets:
241
 
242
  1. **Open-Source Hindi Data**: This included publicly available sources from the internet across different categories such as news, and non-news. Automated scripts were used to scrape and extract text from web pages. Here are some of the sources:
243
- - **News**: Articles from major Hindi news portals like Amar Ujala, Bhaskar, India TV, Jagran, Live Hindustan, and Patrika.
244
  - **Non-News**: Diverse sources including Wikipedia, commoncrawl.org, and other culturally significant content like 'Man ki Baat' from AIR.
245
 
246
  2. **Translated Data**: A portion of the Pile dataset, which is a large English dataset used for training AI models, was translated into Hindi using three different translation models. IndicTrans2 (AI4Bharat) was selected as the best model for this purpose based on its accuracy and efficiency.
 
240
  Data was collected in three main buckets:
241
 
242
  1. **Open-Source Hindi Data**: This included publicly available sources from the internet across different categories such as news, and non-news. Automated scripts were used to scrape and extract text from web pages. Here are some of the sources:
243
+ - **News**: Articles from news portals.
244
  - **Non-News**: Diverse sources including Wikipedia, commoncrawl.org, and other culturally significant content like 'Man ki Baat' from AIR.
245
 
246
  2. **Translated Data**: A portion of the Pile dataset, which is a large English dataset used for training AI models, was translated into Hindi using three different translation models. IndicTrans2 (AI4Bharat) was selected as the best model for this purpose based on its accuracy and efficiency.