arxiv:2406.09559

Decoding the Diversity: A Review of the Indic AI Research Landscape

Published on Jun 13
Submitted by amanchadha on Jun 17

Abstract

This review paper provides a comprehensive overview of large language model (LLM) research directions within Indic languages. Indic languages are those spoken in the Indian subcontinent, including India, Pakistan, Bangladesh, Sri Lanka, Nepal, and Bhutan, among others. These languages have a rich cultural and linguistic heritage and are spoken by over 1.5 billion people worldwide. Given the tremendous market potential and growing demand for natural language processing (NLP) applications in diverse languages, generative applications for Indic languages pose unique challenges and opportunities for research. Our paper takes a deep dive into recent advancements in Indic generative modeling, contributing a taxonomy of research directions and tabulating 84 recent publications. Research directions surveyed in this paper include LLM development, fine-tuning of existing LLMs, development of corpora, benchmarking and evaluation, as well as publications on specific techniques, tools, and applications. We found that researchers across these publications emphasize the challenges associated with limited data availability, lack of standardization, and the particular linguistic complexities of Indic languages. This work aims to serve as a valuable resource for researchers and practitioners working in the field of NLP, particularly those focused on Indic languages, and to contribute to the development of more accurate and efficient LLM applications for these languages.

Community

The paper provides a comprehensive overview of recent advancements in Indic language large language models (LLMs), presenting a detailed taxonomy, highlighting challenges, and identifying future research directions.

  • Taxonomy and Overview: The paper categorizes and summarizes 84 recent studies on Indic LLMs into five broad categories: LLMs, Corpora, Benchmarks and Evaluation, Techniques, and Tools and Applications.
  • Identified Challenges: It highlights key challenges such as limited high-quality datasets, complex linguistic diversity, code-mixing, standardization issues, resource constraints, and lack of comprehensive evaluation frameworks.
  • Future Research Directions: The paper suggests focusing on scalable methods for low-resource settings, exploring transfer learning and cross-lingual approaches, and developing detailed evaluation frameworks to advance NLP applications for Indic languages.
