nickmalhotra
/

ProjectIndus

Text Generation

text-generation-inference

Inference Endpoints

Model card Files Files and versions Community

nickmalhotra commited on Jun 30

Commit

a510b5a

•

1 Parent(s): 309a82d

Update README.md

Files changed (1) hide show

README.md +1 -1

README.md CHANGED Viewed

@@ -240,7 +240,7 @@ The Project Indus LLM was trained on a diverse and extensive dataset comprising
 Data was collected in three main buckets:
 1. **Open-Source Hindi Data**: This included publicly available sources from the internet across different categories such as news, and non-news. Automated scripts were used to scrape and extract text from web pages. Here are some of the sources:
-   - **News**: Articles from major Hindi news portals like Amar Ujala, Bhaskar, India TV, Jagran, Live Hindustan, and Patrika.
    - **Non-News**: Diverse sources including Wikipedia, commoncrawl.org, and other culturally significant content like 'Man ki Baat' from AIR.
 2. **Translated Data**: A portion of the Pile dataset, which is a large English dataset used for training AI models, was translated into Hindi using three different translation models. IndicTrans2 (AI4Bharat) was selected as the best model for this purpose based on its accuracy and efficiency.

 Data was collected in three main buckets:
 1. **Open-Source Hindi Data**: This included publicly available sources from the internet across different categories such as news, and non-news. Automated scripts were used to scrape and extract text from web pages. Here are some of the sources:
+   - **News**: Articles from news portals.
    - **Non-News**: Diverse sources including Wikipedia, commoncrawl.org, and other culturally significant content like 'Man ki Baat' from AIR.
 2. **Translated Data**: A portion of the Pile dataset, which is a large English dataset used for training AI models, was translated into Hindi using three different translation models. IndicTrans2 (AI4Bharat) was selected as the best model for this purpose based on its accuracy and efficiency.