1024m committed on
Commit 098eb06 · verified · 1 parent: b8c6e4e

Update README.md

Files changed (1): README.md (+9 −6)
README.md CHANGED
@@ -162,7 +162,7 @@ While Phi-4-Hindi is a powerful bilingual model designed for Hindi and English,
  <!-- This section is meant to convey both technical and sociotechnical limitations. -->

  <!-- The model is trained on publicly available data which was in part curated by Inception. -->
- ~~While efforts have been made to minimize biases, it is likely that the model, as with all LLM models, will exhibit some bias.
+ While efforts have been made to minimize biases, it is likely that the model, as with all LLM models, will exhibit some bias.

  The model is trained as an AI assistant for Hindi and English speakers. The model is limited to produce responses for queries in these two languages
  and may not produce appropriate responses to other language queries.
@@ -175,17 +175,16 @@ We are continuously working to develop models with greater capabilities, and as
  We evaluated our models on multiple well-known benchmarks to measure their effectiveness against other leading models, and the results are as follows:


- | Model | ARC-C | ARC-E | BoolQ | CMCQ | MMLU | Average* | MMLU-Pro | GPQA | MuSR | BBH | MATH |
+ | Model | ARC-C | ARC-E | BoolQ | CMCQ | MMLU | Average* | MMLU-Pro | GPQA | MuSR | BBH | MATH-Hard |
  |---------------------------------|-------|-------|-------|-------|-------|----------|----------|------|-------|-------|-------|
  | AryaBhatta-GemmaUltra-8.5B | 22.70 | 25.04 | 22.95 | 62.23 | 23.70 | 31.32 | 22.66 | 25.34| 42.72 | 41.12 | 2.95 |
  | Airavata-7B | 25.09 | 30.47 | 25.31 | 62.17 | 33.20 | 35.25 | 16.35 | 27.43| 37.57 | 36.00 | 13.60 |
  | sarvam-1-2B | 30.03 | 33.25 | 62.17 | 42.80 | 27.90 | 39.23 | - | - | - | - | - |
  | Nemotron-4-Mini-Hindi-Instruct | 55.80 | 71.63 | 62.11 | 68.10 | 43.20 | 60.17 | 25.95 | 30.87| 41.53 | 40.11 | 2.04 |
- | Llama-3-Nanda-10B-Chat | 65.36 | 80.64 | 82.29 | 67.60 | 50.61 | 69.30 | - | - | - | - | - |
+ | Llama-3-Nanda-10B-Chat | 65.36 | 80.64 | 82.29 | 67.60 | 50.61 | 69.30 | 31.57 | 30.12| 43.52 | 49.38 | 5.59 |
  | Krutrim-2-12b-instruct | 67.32 | 81.10 | 84.74 | 76.30 | 56.10 | 73.11 | - | - | - | - | - |
  | aya-expanse-8b | 74.06 | 87.08 | 86.45 | 83.30 | 56.89 | 77.56 | 30.04 | 30.29| 37.17 | 49.42 | 7.02 |
  | aya-expanse-32B | 85.41 | **95.08** | **90.43** | 89.80 | 69.71 | 86.08 | 41.30 | 32.55| 38.62 | 56.29 | 13.37 |
- |---------------------------------|-------|-------|-------|-------|-------|----------|----------|------|-------|-------|-------|
  | **Our Qwen Model (14b)** | 90.61 | 94.82 | 88.53 | **90.70** | 75.00 | 87.93 | **52.63** | 36.24 | 44.84 | 64.97 | **25.08** |
  | **Our Phi Model (14b)** | **97.39** | 92.24 | 87.65 | 87.40 | **75.59** | **88.05** | 52.39 | **39.77** | **49.07** | **66.97** | 23.11 |

@@ -203,8 +202,7 @@ We evaluated our models on multiple well-known benchmarks to measure their effec
  | Krutrim-2-12b-instruct | 56.83 | 70.66 | 78.86 | 64.10 | 46.51 | 63.39 |
  | aya-expanse-8b | 57.42 | 72.90 | 80.42 | 69.00 | 43.39 | 64.63 |
  | aya-expanse-32B | 73.29 | 85.48 | **87.73** | **79.70** | **56.96** | 76.63 |
- |------------------------------------|-------|-------|-------|-------|-------|---------|
- | **Our Qwen Model (14b)** | 74.06 | 81.23 | 84.07 | 78.20 | 53.85 | **74.82** |
+ | **Our Qwen Model (14b)** | 74.06 | 81.23 | 84.07 | 78.20 | 53.85 | 74.82 |
  | **Our Phi Model (14b)** | **81.74** | **89.06** | 86.02 | 78.70 | 56.39 | **78.38** |

  **Table 2: Metrics (.2f) of our models and other LLMs over several Hindi benchmarks**
@@ -248,3 +246,8 @@ It is advisable for users to:
  - Continuously assess the model to ensure compliance with ethical standards.
  - Be mindful of potential biases and unintended outputs, especially in critical applications.

+ ### Team
+
+ - Ram Mohan Rao Kadiyala (@1024m)
+ - Siddartha Pullakhandam
+ -