Yingxu He
committed on
Update README.md
README.md
CHANGED
@@ -13,12 +13,11 @@ tags:
|
|
13 |
|
14 |
# MERaLiON
|
15 |
|
16 |
-
MERaLiON-AudioLLM is a Speech-Text Large Language Model tailored for Singapore’s multilingual and multicultural landscape. Integrating a localised [Whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) speech encoder and [SEA-LION V3](https://huggingface.co/aisingapore/gemma2-9b-cpt-sea-lionv3-instruct) text decoder, MERaLiON-AudioLLM is finetuned on **260,000 hours of speech and audio data**, **
|
17 |
|
18 |
MERaLiON stands for **M**ultimodal **E**mpathetic **R**easoning **a**nd **L**earning **i**n **O**ne **N**etwork.
|
19 |
|
20 |
- **Developed by:** I<sup>2</sup>R, A\*STAR
|
21 |
-
- **Funded by:** Singapore NRF
|
22 |
- **Model type:** MultiModal LLM
|
23 |
- **Language(s) (Speech):** English (Global & Singapore)
|
24 |
- **Language(s) (NLP):** English, Chinese, Vietnamese, Indonesian, Thai, Filipino, Tamil, Malay, Khmer, Lao, Burmese, Javanese, Sundanese
|
@@ -28,7 +27,7 @@ For more details, please refer to our [report]().
|
|
28 |
|
29 |
## Model Description
|
30 |
|
31 |
-
MERaLiON-AudioLLM is designed to take in an **audio-text pair** as input and
|
32 |
|
33 |
The architecture comprises three key components: an **audio encoder** that transforms speech or audio inputs into sequences of vector representations, a **text decoder** that interprets and responds to natural language instructions, and an **adaptor module** that compresses the encoder representations while aligning the encoder’s hidden dimension with the text decoder’s embedding size.
|
34 |
|
@@ -38,15 +37,349 @@ Specifically, we fine-tuned the **MERaLiON-Whisper** encoder from Whisper-large-
|
|
38 |
|
39 |
## Capabilities
|
40 |
|
41 |
-
MERaLiON-AudioLLM is trained to address
|
42 |
-
|
43 |
-
|
44 |
|
45 |
## Uses
|
46 |
|
47 |
Here we provide a code snippet illustrating the process of loading both the processor and model, alongside detailed instructions on executing the MERaLiON-AudioLLM model for content generation.
|
48 |
|
49 |
-
|
|
|
50 |
|
51 |
### Inference
|
52 |
|
|
|
13 |
|
14 |
# MERaLiON
|
15 |
|
16 |
+
MERaLiON-AudioLLM is a Speech-Text Large Language Model tailored for Singapore’s multilingual and multicultural landscape. Integrating a localised [Whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) speech encoder and [SEA-LION V3](https://huggingface.co/aisingapore/gemma2-9b-cpt-sea-lionv3-instruct) text decoder, MERaLiON-AudioLLM is fine-tuned on **260,000 hours of speech and audio data** across **6 different tasks** to address the diverse linguistic nuances of Singapore's local accents and dialects.
|
17 |
|
18 |
MERaLiON stands for **M**ultimodal **E**mpathetic **R**easoning **a**nd **L**earning **i**n **O**ne **N**etwork.
|
19 |
|
20 |
- **Developed by:** I<sup>2</sup>R, A\*STAR
|
|
|
21 |
- **Model type:** MultiModal LLM
|
22 |
- **Language(s) (Speech):** English (Global & Singapore)
|
23 |
- **Language(s) (NLP):** English, Chinese, Vietnamese, Indonesian, Thai, Filipino, Tamil, Malay, Khmer, Lao, Burmese, Javanese, Sundanese
|
|
|
27 |
|
28 |
## Model Description
|
29 |
|
30 |
+
MERaLiON-AudioLLM is designed to take in an **audio-text pair** as input and generate a **text output**.
|
31 |
|
32 |
The architecture comprises three key components: an **audio encoder** that transforms speech or audio inputs into sequences of vector representations, a **text decoder** that interprets and responds to natural language instructions, and an **adaptor module** that compresses the encoder representations while aligning the encoder’s hidden dimension with the text decoder’s embedding size.
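As a rough illustration of the adaptor's role described above, the sketch below stacks every `k` consecutive encoder frames and applies a linear projection to the decoder's embedding size. This is not MERaLiON's actual implementation: the stacking strategy, the factor `k`, the toy dimensions, and the function name `adapt` are all assumptions chosen for illustration.

```python
import random

def adapt(frames, weight, k):
    """Compress a frame sequence by a factor of k and project to the decoder width.

    frames: list of T vectors, each of length d_enc (encoder outputs).
    weight: matrix of shape (k * d_enc) x d_dec, as nested lists.
    Returns a list of T // k vectors, each of length d_dec.
    """
    d_dec = len(weight[0])
    out = []
    # Trailing frames that do not fill a complete group of k are dropped.
    for start in range(0, len(frames) - k + 1, k):
        # Concatenate k consecutive frames into one long vector.
        stacked = [x for frame in frames[start:start + k] for x in frame]
        # Linear projection: (1 x k*d_enc) times (k*d_enc x d_dec).
        out.append([
            sum(s * weight[i][j] for i, s in enumerate(stacked))
            for j in range(d_dec)
        ])
    return out

# Toy sizes only (assumed); real models use far larger dimensions.
T, d_enc, d_dec, k = 10, 4, 6, 5
rng = random.Random(0)
frames = [[rng.random() for _ in range(d_enc)] for _ in range(T)]
weight = [[rng.random() for _ in range(d_dec)] for _ in range(k * d_enc)]
compressed = adapt(frames, weight, k)
print(len(compressed), len(compressed[0]))  # prints: 2 6
```

With these toy sizes, ten encoder frames become two decoder-sized vectors, which shows both effects named in the paragraph: a shorter sequence and a matched hidden dimension.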
|
33 |
|
|
|
37 |
|
38 |
## Capabilities
|
39 |
|
40 |
+
MERaLiON-AudioLLM is trained mainly to address 6 tasks: `Automatic Speech Recognition` (ASR),
|
41 |
+
`Speech Translation` (ST), `Spoken Question Answering` (SQA),
|
42 |
+
`Spoken Dialogue Summarization` (SDS), `Speech Instruction` (SI), and `Paralinguistics` (PARA).
|
43 |
+
|
44 |
+
We benchmark MERaLiON-AudioLLM with a series of test sets from the [AudioBench benchmark](https://github.com/AudioLLMs/AudioBench)
|
45 |
+
against three well-known AudioLLMs: `Qwen2-Audio 7B`, `WavLLM`, and `SALMONN`. We also compare against a cascaded model,
|
46 |
+
which feeds the transcriptions recognized by Whisper-large-v2 and the instruction prompts to a Gemma2 9B CPT SEA-LIONv3 Instruct model to
|
47 |
+
get the responses. We tuned its hyperparameters and prompt template to optimise performance across
|
48 |
+
various speech-to-text tasks. As shown in the following table, MERaLiON-AudioLLM performs better in the Singapore local context,
|
49 |
+
as evidenced by evaluation results on Singapore's [Multitask National Speech Corpus](MERaLiON/MNSC) (MNSC) datasets.
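The cascaded baseline described above can be pictured as two stages chained together. The sketch below uses stand-in callables rather than real models: `transcribe`, `generate`, the `cascade` function, and the prompt template are hypothetical placeholders, not the evaluation code or the tuned template used for the benchmark.

```python
from typing import Callable

def cascade(audio: bytes,
            instruction: str,
            transcribe: Callable[[bytes], str],
            generate: Callable[[str], str]) -> str:
    """Run speech recognition first, then pass transcript plus instruction to a text LLM."""
    transcript = transcribe(audio)
    # Hypothetical prompt template; the tuned template is not given in the source.
    prompt = f"Transcript: {transcript}\nInstruction: {instruction}"
    return generate(prompt)

# Stand-ins for the Whisper-large-v2 and SEA-LION stages (assumed behaviour).
def fake_asr(audio: bytes) -> str:
    return "the meeting is at three pm"

def fake_llm(prompt: str) -> str:
    return f"[response to {len(prompt)} prompt characters]"

print(cascade(b"\x00\x01", "When is the meeting?", fake_asr, fake_llm))
```

In this design the text model only ever sees the transcript, so anything that is not written down in it, such as voice characteristics, never reaches the second stage.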
|
50 |
+
|
51 |
+
> [!NOTE]
|
52 |
+
> MNSC is a multitask speech understanding dataset derived from, and further annotated on, the [IMDA NSC Corpus](https://www.imda.gov.sg/how-we-can-help/national-speech-corpus).
|
53 |
+
> It focuses on the knowledge of Singapore's local accent, localised terms, and code-switching.
|
54 |
+
|
55 |
+
> [!NOTE]
|
56 |
+
> We assess ASR and ST tasks using Word Error Rate (WER) and BLEU scores, respectively.
|
57 |
+
> For other tasks, we employ the LLM-as-a-Judge framework,
|
58 |
+
> which uses a pre-trained large language model to evaluate task performance
|
59 |
+
> by generating and scoring responses based on criteria such as relevance, coherence, and accuracy.
|
60 |
+
|
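For reference, WER is conventionally the word-level edit distance between hypothesis and reference, divided by the number of reference words. The sketch below is a minimal pure-Python version of that definition. The function name `wer` and the whitespace tokenisation are assumptions for illustration, and AudioBench's own text normalisation and scoring may differ.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

# One deleted word and one substituted word against six reference words.
print(wer("can you pass me the kopi", "can you pass the coffee"))
```

Lower values are better, and because insertions count as errors the score can exceed 1.0 when the hypothesis is much longer than the reference.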
61 |
+
<div class="table*">
|
62 |
+
<table>
|
63 |
+
<thead>
|
64 |
+
<tr>
|
65 |
+
<th style="text-align: center;"><strong>Task</strong></th>
|
66 |
+
<th style="text-align: center;"><strong>Dataset</strong></th>
|
67 |
+
<th style="text-align: center;"><strong>MERaLiON</strong></th>
|
68 |
+
<th style="text-align: center;"><strong>Qwen2-Audio 7B</strong></th>
|
69 |
+
<th style="text-align: center;"><strong>WavLLM</strong></th>
|
70 |
+
<th style="text-align: center;"><strong>SALMONN-7B</strong></th>
|
71 |
+
<th style="text-align: center;"><strong>Cascaded Model</strong></th>
|
72 |
+
</tr>
|
73 |
+
</thead>
|
74 |
+
<tbody>
|
75 |
+
<tr>
|
76 |
+
<td style="text-align: center;" rowspan="11"><em>Automatic Speech Recognition<br>WER (<span
|
77 |
+
class="math inline">↓</span>)</em></td>
|
78 |
+
<td style="text-align: center;">LibriSpeech-Test-Clean</td>
|
79 |
+
<td style="text-align: center;">0.03</td>
|
80 |
+
<td style="text-align: center;">0.03</td>
|
81 |
+
<td style="text-align: center;"><strong><u>0.02</u></strong></td>
|
82 |
+
<td style="text-align: center;">0.10</td>
|
83 |
+
<td style="text-align: center;">0.03</td>
|
84 |
+
</tr>
|
85 |
+
<tr>
|
86 |
+
<td style="text-align: center;">LibriSpeech-Test-Other</td>
|
87 |
+
<td style="text-align: center;"><strong><u>0.05</u></strong></td>
|
88 |
+
<td style="text-align: center;">0.06</td>
|
89 |
+
<td style="text-align: center;"><strong><u>0.05</u></strong></td>
|
90 |
+
<td style="text-align: center;">0.10</td>
|
91 |
+
<td style="text-align: center;"><u>0.05</u></td>
|
92 |
+
</tr>
|
93 |
+
<tr>
|
94 |
+
<td style="text-align: center;">Common-Voice-15-En-Test</td>
|
95 |
+
<td style="text-align: center;"><strong><u>0.10</u></strong></td>
|
96 |
+
<td style="text-align: center;">0.11</td>
|
97 |
+
<td style="text-align: center;">0.15</td>
|
98 |
+
<td style="text-align: center;">0.31</td>
|
99 |
+
<td style="text-align: center;">0.11</td>
|
100 |
+
</tr>
|
101 |
+
<tr>
|
102 |
+
<td style="text-align: center;">Earnings21-Test</td>
|
103 |
+
<td style="text-align: center;"><strong>0.17</strong></td>
|
104 |
+
<td style="text-align: center;">0.19</td>
|
105 |
+
<td style="text-align: center;">0.65</td>
|
106 |
+
<td style="text-align: center;">0.26</td>
|
107 |
+
<td style="text-align: center;"><u>0.11</u></td>
|
108 |
+
</tr>
|
109 |
+
<tr>
|
110 |
+
<td style="text-align: center;">Earnings22-Test</td>
|
111 |
+
<td style="text-align: center;"><strong>0.20</strong></td>
|
112 |
+
<td style="text-align: center;">0.24</td>
|
113 |
+
<td style="text-align: center;">0.67</td>
|
114 |
+
<td style="text-align: center;">0.36</td>
|
115 |
+
<td style="text-align: center;"><u>0.14</u></td>
|
116 |
+
</tr>
|
117 |
+
<tr>
|
118 |
+
<td style="text-align: center;">MNSC-ASR-Part 1</td>
|
119 |
+
<td style="text-align: center;"><u><strong>0.05</strong></u></td>
|
120 |
+
<td style="text-align: center;">0.07</td>
|
121 |
+
<td style="text-align: center;">-</td>
|
122 |
+
<td style="text-align: center;">0.09</td>
|
123 |
+
<td style="text-align: center;">0.07</td>
|
124 |
+
</tr>
|
125 |
+
<tr>
|
126 |
+
<td style="text-align: center;">MNSC-ASR-Part 2</td>
|
127 |
+
<td style="text-align: center;"><u><strong>0.05</strong></u></td>
|
128 |
+
<td style="text-align: center;">0.19</td>
|
129 |
+
<td style="text-align: center;">-</td>
|
130 |
+
<td style="text-align: center;">0.42</td>
|
131 |
+
<td style="text-align: center;">0.33</td>
|
132 |
+
</tr>
|
133 |
+
<tr>
|
134 |
+
<td style="text-align: center;">MNSC-ASR-Part 3</td>
|
135 |
+
<td style="text-align: center;"><u><strong>0.28</strong></u></td>
|
136 |
+
<td style="text-align: center;">0.35</td>
|
137 |
+
<td style="text-align: center;">-</td>
|
138 |
+
<td style="text-align: center;">0.66</td>
|
139 |
+
<td style="text-align: center;">0.30</td>
|
140 |
+
</tr>
|
141 |
+
<tr>
|
142 |
+
<td style="text-align: center;">MNSC-ASR-Part 4</td>
|
143 |
+
<td style="text-align: center;"><u><strong>0.40</strong></u></td>
|
144 |
+
<td style="text-align: center;">0.56</td>
|
145 |
+
<td style="text-align: center;">-</td>
|
146 |
+
<td style="text-align: center;">0.76</td>
|
147 |
+
<td style="text-align: center;">0.48</td>
|
148 |
+
</tr>
|
149 |
+
<tr>
|
150 |
+
<td style="text-align: center;">MNSC-ASR-Part 5</td>
|
151 |
+
<td style="text-align: center;"><u><strong>0.21</strong></u></td>
|
152 |
+
<td style="text-align: center;">0.28</td>
|
153 |
+
<td style="text-align: center;">-</td>
|
154 |
+
<td style="text-align: center;">0.35</td>
|
155 |
+
<td style="text-align: center;">0.23</td>
|
156 |
+
</tr>
|
157 |
+
<tr>
|
158 |
+
<td style="text-align: center;">MNSC-ASR-Part 6</td>
|
159 |
+
<td style="text-align: center;"><u><strong>0.15</strong></u></td>
|
160 |
+
<td style="text-align: center;">0.22</td>
|
161 |
+
<td style="text-align: center;">-</td>
|
162 |
+
<td style="text-align: center;">0.25</td>
|
163 |
+
<td style="text-align: center;">0.18</td>
|
164 |
+
</tr>
|
165 |
+
<tr>
|
166 |
+
<td style="text-align: center;" rowspan="6"><em>Speech Translation<br>BLEU (<span
|
167 |
+
class="math inline">↑</span>)</em></td>
|
168 |
+
<td style="text-align: center;">CoVoST 2 En <span
|
169 |
+
class="math inline">→</span> Id</td>
|
170 |
+
<td style="text-align: center;"><strong><u>32.62</u></strong></td>
|
171 |
+
<td style="text-align: center;">16.33</td>
|
172 |
+
<td style="text-align: center;">13.84</td>
|
173 |
+
<td style="text-align: center;">14.14</td>
|
174 |
+
<td style="text-align: center;">27.62</td>
|
175 |
+
</tr>
|
176 |
+
<tr>
|
177 |
+
<td style="text-align: center;">CoVoST 2 En <span
|
178 |
+
class="math inline">→</span> Zh</td>
|
179 |
+
<td style="text-align: center;"><strong><u>37.98</u></strong></td>
|
180 |
+
<td style="text-align: center;">25.77</td>
|
181 |
+
<td style="text-align: center;">31.96</td>
|
182 |
+
<td style="text-align: center;">33.89</td>
|
183 |
+
<td style="text-align: center;">35.27</td>
|
184 |
+
</tr>
|
185 |
+
<tr>
|
186 |
+
<td style="text-align: center;">CoVoST 2 En <span
|
187 |
+
class="math inline">→</span> Ta</td>
|
188 |
+
<td style="text-align: center;"><strong><u>8.50</u></strong></td>
|
189 |
+
<td style="text-align: center;">0.03</td>
|
190 |
+
<td style="text-align: center;">0.00</td>
|
191 |
+
<td style="text-align: center;">0.00</td>
|
192 |
+
<td style="text-align: center;">8.46</td>
|
193 |
+
</tr>
|
194 |
+
<tr>
|
195 |
+
<td style="text-align: center;">CoVoST 2 Id <span
|
196 |
+
class="math inline">→</span> En</td>
|
197 |
+
<td style="text-align: center;"><strong>37.07</strong></td>
|
198 |
+
<td style="text-align: center;">6.33</td>
|
199 |
+
<td style="text-align: center;">5.93</td>
|
200 |
+
<td style="text-align: center;">26.89</td>
|
201 |
+
<td style="text-align: center;"><u>46.80</u></td>
|
202 |
+
</tr>
|
203 |
+
<tr>
|
204 |
+
<td style="text-align: center;">CoVoST 2 Zh <span
|
205 |
+
class="math inline">→</span> En</td>
|
206 |
+
<td style="text-align: center;">15.01</td>
|
207 |
+
<td style="text-align: center;"><strong><u>16.47</u></strong></td>
|
208 |
+
<td style="text-align: center;">2.37</td>
|
209 |
+
<td style="text-align: center;">5.30</td>
|
210 |
+
<td style="text-align: center;">15.21</td>
|
211 |
+
</tr>
|
212 |
+
<tr>
|
213 |
+
<td style="text-align: center;">CoVoST 2 Ta <span
|
214 |
+
class="math inline">→</span> En</td>
|
215 |
+
<td style="text-align: center;"><strong><u>3.97</u></strong></td>
|
216 |
+
<td style="text-align: center;">0.04</td>
|
217 |
+
<td style="text-align: center;">0.17</td>
|
218 |
+
<td style="text-align: center;">0.36</td>
|
219 |
+
<td style="text-align: center;">2.83</td>
|
220 |
+
</tr>
|
221 |
+
<tr>
|
222 |
+
<td style="text-align: center;" rowspan="8"><em>Spoken Question Answering<br>LLM-as-a-Judge (<span
|
223 |
+
class="math inline">↑</span>)</em></td>
|
224 |
+
<td style="text-align: center;">SLUE-SQA-5</td>
|
225 |
+
<td style="text-align: center;">82.94</td>
|
226 |
+
<td style="text-align: center;">80.05</td>
|
227 |
+
<td style="text-align: center;"><strong>83.92</strong></td>
|
228 |
+
<td style="text-align: center;">83.48</td>
|
229 |
+
<td style="text-align: center;"><u>88.58</u></td>
|
230 |
+
</tr>
|
231 |
+
<tr>
|
232 |
+
<td style="text-align: center;">Spoken-SQuAD</td>
|
233 |
+
<td style="text-align: center;">70.33</td>
|
234 |
+
<td style="text-align: center;">64.86</td>
|
235 |
+
<td style="text-align: center;"><strong>77.65</strong></td>
|
236 |
+
<td style="text-align: center;">66.40</td>
|
237 |
+
<td style="text-align: center;"><u>88.62</u></td>
|
238 |
+
</tr>
|
239 |
+
<tr>
|
240 |
+
<td style="text-align: center;">CN-College-Listen-Test</td>
|
241 |
+
<td style="text-align: center;"><strong>85.03</strong></td>
|
242 |
+
<td style="text-align: center;">74.51</td>
|
243 |
+
<td style="text-align: center;">65.43</td>
|
244 |
+
<td style="text-align: center;">50.90</td>
|
245 |
+
<td style="text-align: center;"><u>91.85</u></td>
|
246 |
+
</tr>
|
247 |
+
<tr>
|
248 |
+
<td style="text-align: center;">Singapore-Public-Speech-SQA</td>
|
249 |
+
<td style="text-align: center;"><strong>60.32</strong></td>
|
250 |
+
<td style="text-align: center;">58.31</td>
|
251 |
+
<td style="text-align: center;">58.55</td>
|
252 |
+
<td style="text-align: center;">59.24</td>
|
253 |
+
<td style="text-align: center;"><u>73.11</u></td>
|
254 |
+
</tr>
|
255 |
+
<tr>
|
256 |
+
<td style="text-align: center;">MNSC-SQA-Part 3</td>
|
257 |
+
<td style="text-align: center;"><strong>51.4</strong></td>
|
258 |
+
<td style="text-align: center;">42.0</td>
|
259 |
+
<td style="text-align: center;">-</td>
|
260 |
+
<td style="text-align: center;">40.60</td>
|
261 |
+
<td style="text-align: center;"><u>53.20</u></td>
|
262 |
+
</tr>
|
263 |
+
<tr>
|
264 |
+
<td style="text-align: center;">MNSC-SQA-Part 4</td>
|
265 |
+
<td style="text-align: center;"><strong>49.0</strong></td>
|
266 |
+
<td style="text-align: center;">39.6</td>
|
267 |
+
<td style="text-align: center;">-</td>
|
268 |
+
<td style="text-align: center;">36.60</td>
|
269 |
+
<td style="text-align: center;"><u>60.20</u></td>
|
270 |
+
</tr>
|
271 |
+
<tr>
|
272 |
+
<td style="text-align: center;">MNSC-SQA-Part 5</td>
|
273 |
+
<td style="text-align: center;"><strong>58.2</strong></td>
|
274 |
+
<td style="text-align: center;">51.6</td>
|
275 |
+
<td style="text-align: center;">-</td>
|
276 |
+
<td style="text-align: center;">44.60</td>
|
277 |
+
<td style="text-align: center;"><u>67.20</u></td>
|
278 |
+
</tr>
|
279 |
+
<tr>
|
280 |
+
<td style="text-align: center;">MNSC-SQA-Part 6</td>
|
281 |
+
<td style="text-align: center;"><strong>65.2</strong></td>
|
282 |
+
<td style="text-align: center;">53.6</td>
|
283 |
+
<td style="text-align: center;">-</td>
|
284 |
+
<td style="text-align: center;">46.80</td>
|
285 |
+
<td style="text-align: center;"><u>71.60</u></td>
|
286 |
+
</tr>
|
287 |
+
<tr>
|
288 |
+
<td style="text-align: center;" rowspan="4"><em>Spoken Dialogue Summarization<br>LLM-as-a-Judge (<span
|
289 |
+
class="math inline">↑</span>)</em></td>
|
290 |
+
<td style="text-align: center;">MNSC-SDS-Part 3</td>
|
291 |
+
<td style="text-align: center;"><u><strong>46.80</strong></u></td>
|
292 |
+
<td style="text-align: center;">33.80</td>
|
293 |
+
<td style="text-align: center;">-</td>
|
294 |
+
<td style="text-align: center;">9.0</td>
|
295 |
+
<td style="text-align: center;">45.40</td>
|
296 |
+
</tr>
|
297 |
+
<tr>
|
298 |
+
<td style="text-align: center;">MNSC-SDS-Part 4</td>
|
299 |
+
<td style="text-align: center;"><u><strong>45.80</strong></u></td>
|
300 |
+
<td style="text-align: center;">24.80</td>
|
301 |
+
<td style="text-align: center;">-</td>
|
302 |
+
<td style="text-align: center;">7.0</td>
|
303 |
+
<td style="text-align: center;">44.00</td>
|
304 |
+
</tr>
|
305 |
+
<tr>
|
306 |
+
<td style="text-align: center;">MNSC-SDS-Part 5</td>
|
307 |
+
<td style="text-align: center;"><strong>55.2</strong></td>
|
308 |
+
<td style="text-align: center;">40.4</td>
|
309 |
+
<td style="text-align: center;">-</td>
|
310 |
+
<td style="text-align: center;">17.2</td>
|
311 |
+
<td style="text-align: center;"><u>58.00</u></td>
|
312 |
+
</tr>
|
313 |
+
<tr>
|
314 |
+
<td style="text-align: center;">MNSC-SDS-Part 6</td>
|
315 |
+
<td style="text-align: center;"><strong>61.8</strong></td>
|
316 |
+
<td style="text-align: center;">46.2</td>
|
317 |
+
<td style="text-align: center;">-</td>
|
318 |
+
<td style="text-align: center;">24.2</td>
|
319 |
+
<td style="text-align: center;"><u>65.40</u></td>
|
320 |
+
</tr>
|
321 |
+
<tr>
|
322 |
+
<td style="text-align: center;" rowspan="2"><em>Speech Instruction<br>LLM-as-a-Judge (<span
|
323 |
+
class="math inline">↑</span>)</em></td>
|
324 |
+
<td style="text-align: center;">OpenHermes-Audio</td>
|
325 |
+
<td style="text-align: center;"><strong>71.4</strong></td>
|
326 |
+
<td style="text-align: center;">44.8</td>
|
327 |
+
<td style="text-align: center;">22.40</td>
|
328 |
+
<td style="text-align: center;">15.80</td>
|
329 |
+
<td style="text-align: center;"><u>72.20</u></td>
|
330 |
+
</tr>
|
331 |
+
<tr>
|
332 |
+
<td style="text-align: center;">Alpaca-GPT4-Audio</td>
|
333 |
+
<td style="text-align: center;"><strong>73.4</strong></td>
|
334 |
+
<td style="text-align: center;">52.6</td>
|
335 |
+
<td style="text-align: center;">21.60</td>
|
336 |
+
<td style="text-align: center;">17.20</td>
|
337 |
+
<td style="text-align: center;"><u>73.80</u></td>
|
338 |
+
</tr>
|
339 |
+
<tr>
|
340 |
+
<td style="text-align: center;" rowspan="4"><em>Paralinguistics<br>LLM-as-a-Judge (<span
|
341 |
+
class="math inline">↑</span>)</em></td>
|
342 |
+
<td style="text-align: center;">VoxCeleb-Gender-Test</td>
|
343 |
+
<td style="text-align: center;"><strong><u>99.53</u></strong></td>
|
344 |
+
<td style="text-align: center;">99.12</td>
|
345 |
+
<td style="text-align: center;">69.68</td>
|
346 |
+
<td style="text-align: center;">88.81</td>
|
347 |
+
<td style="text-align: center;">35.25</td>
|
348 |
+
</tr>
|
349 |
+
<tr>
|
350 |
+
<td style="text-align: center;">VoxCeleb-Accent-Test</td>
|
351 |
+
<td style="text-align: center;"><strong><u>46.35</u></strong></td>
|
352 |
+
<td style="text-align: center;">29.18</td>
|
353 |
+
<td style="text-align: center;">-</td>
|
354 |
+
<td style="text-align: center;">34.22</td>
|
355 |
+
<td style="text-align: center;">24.64</td>
|
356 |
+
</tr>
|
357 |
+
<tr>
|
358 |
+
<td style="text-align: center;">MELD-Sentiment-Test</td>
|
359 |
+
<td style="text-align: center;">42.26</td>
|
360 |
+
<td style="text-align: center;"><strong>53.49</strong></td>
|
361 |
+
<td style="text-align: center;">50.08</td>
|
362 |
+
<td style="text-align: center;">42.07</td>
|
363 |
+
<td style="text-align: center;"><u>56.67</u></td>
|
364 |
+
</tr>
|
365 |
+
<tr>
|
366 |
+
<td style="text-align: center;">MELD-Emotion-Test</td>
|
367 |
+
<td style="text-align: center;">30.15</td>
|
368 |
+
<td style="text-align: center;">40.54</td>
|
369 |
+
<td style="text-align: center;"><strong>41.07</strong></td>
|
370 |
+
<td style="text-align: center;">30.73</td>
|
371 |
+
<td style="text-align: center;"><u>47.39</u></td>
|
372 |
+
</tr>
|
373 |
+
</tbody>
|
374 |
+
</table>
|
375 |
+
</div>
|
376 |
|
377 |
## Uses
|
378 |
|
379 |
Here we provide a code snippet illustrating the process of loading both the processor and model, alongside detailed instructions on executing the MERaLiON-AudioLLM model for content generation.
|
380 |
|
381 |
+
> [!WARNING]
|
382 |
+
> This model has not been trained to use a system prompt or to use tool calling.
|
383 |
|
384 |
### Inference
|
385 |
|