Yingxu He
committed on
Update README.md
README.md
CHANGED
@@ -10,8 +10,11 @@ tags:
 - chat
 - audio
 - safetensors
+- vllm
 datasets:
 - MERaLiON/MNSC
+base_model:
+- openai/whisper-large-v2
 ---
 
 # MERaLiON
@@ -26,7 +29,7 @@ MERaLiON stands for **M**ultimodal **E**mpathetic **R**easoning **a**nd **L**ear
 - **Language(s) (NLP):** English, Chinese, Vietnamese, Indonesian, Thai, Filipino, Tamil, Malay, Khmer, Lao, Burmese, Javanese, Sundanese
 - **License:** MIT
 
-We support model inference using the [Huggingface](#inference) and [
+We support model inference using the [Huggingface](#inference) and [vLLM](#vllm-inference) frameworks. For more technical details, please refer to our [report]().
 
 ## Model Description
 
@@ -42,7 +45,7 @@ Specifically, we fine-tuned the **MERaLiON-Whisper** encoder from Whisper-large-
 
 MERaLiON-AudioLLM is trained to mainly address 6 tasks, namely `Automatic Speech Recognition` (ASR),
 `Speech Translation` (ST), `Spoken Question Answering` (SQA),
-`Spoken Dialogue Summarization` (SDS), `Speech Instruction` (SI), `Paralinguistics` (PARA).
+`Spoken Dialogue Summarization` (SDS), `Speech Instruction` (SI), and `Paralinguistics` (PARA).
 
 We benchmark MERaLiON-AudioLLM with a series of test sets from the [AudioBench benchmark](https://github.com/AudioLLMs/AudioBench)
 against three well-known AudioLLMs: `Qwen2-Audio 7B`, `WavLLM`, and `SALMONN`. We also compared with a cascaded model,
@@ -59,7 +62,7 @@ as evidenced by evaluation results on Singapore's [Multitask National Speech Cor
 > We assess ASR and ST tasks using Word Error Rate (WER) and BLEU scores, respectively.
 > For other tasks, we employ the LLM-as-a-Judge framework,
 > which uses a pre-trained large language model to evaluate task performance
-> by generating and scoring responses based on
+> by generating and scoring responses based on relevance, coherence, and accuracy criteria.
 > Refer to the [AudioBench paper](https://arxiv.org/abs/2406.16020) for more details.
 
 <div class="table*">
@@ -417,7 +420,7 @@ chat_prompt = processor.tokenizer.apply_chat_template(
 
 libri_data = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
 audio_array = libri_data[0]["audio"]["array"]
-inputs = processor(text=chat_prompt, audios=audio_array
+inputs = processor(text=chat_prompt, audios=audio_array)
 
 outputs = model.generate(**inputs, max_new_tokens=128)
 generated_ids = outputs[:, inputs['input_ids'].size(1):]
@@ -461,22 +464,22 @@ chat_prompt = processor.tokenizer.apply_chat_template(
 
 libri_data = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
 audio_array = [libri_data[0]["audio"]["array"]]*2
-inputs = processor(text=chat_prompt, audios=audio_array
+inputs = processor(text=chat_prompt, audios=audio_array)
 
 outputs = model.generate(**inputs, max_new_tokens=128)
 generated_ids = outputs[:, inputs['input_ids'].size(1):]
 response = processor.batch_decode(generated_ids, skip_special_tokens=True)
 ```
 
-###
+### vLLM Inference
 
-MERaLiON-AudioLLM requires
+MERaLiON-AudioLLM requires vLLM version `0.6.4.post1`.
 
 ```
 pip install vllm==0.6.4.post1
 ```
 
-Here is an example of offline inference using our custom
+Here is an example of offline inference using our custom vLLM class.
 
 ```python
 import torch
@@ -536,7 +539,7 @@ for o in outputs:
 
 The current MERaLiON-AudioLLM has not been aligned for safety. Developers and users should perform their own safety fine-tuning and related security measures. In no event shall the authors be held liable for any claim, damages, or other liability arising from the use of the released weights and codes.
 
-This research is supported by the National Research Foundation, Singapore and Infocomm Media Development Authority, Singapore under its National Large Language Models Funding Initiative. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of National Research Foundation, Singapore and Infocomm Media Development Authority, Singapore.
+This research is supported by the National Research Foundation, Singapore, and Infocomm Media Development Authority, Singapore under its National Large Language Models Funding Initiative. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of National Research Foundation, Singapore and Infocomm Media Development Authority, Singapore.
 
 ## Technical Specifications
 
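The Hugging Face inference snippets patched by this commit appear only as fragments in the hunks above (the change closes the previously truncated `processor(...)` call). For orientation, here is a minimal sketch of the surrounding flow; the repository id, the Auto classes, the dtype, and the instruction text are assumptions not taken from the hunks, so adapt them to the actual model card.

```python
# Minimal sketch of the Hugging Face inference flow around the fixed
# `processor(...)` call. The repo id, Auto classes, dtype, and instruction
# text are assumptions; only the lines matching the hunks come from the diff.
import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

repo_id = "MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION"  # assumed repo id
processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    repo_id, trust_remote_code=True, torch_dtype=torch.bfloat16
)

# Build the text prompt with the model's chat template (the model card's real
# template may expect an audio placeholder token inside the user turn).
conversation = [{"role": "user", "content": "Please transcribe this speech."}]
chat_prompt = processor.tokenizer.apply_chat_template(
    conversation, tokenize=False, add_generation_prompt=True
)

# Load a sample clip and run the corrected processor call from the diff.
libri_data = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
audio_array = libri_data[0]["audio"]["array"]
inputs = processor(text=chat_prompt, audios=audio_array)

outputs = model.generate(**inputs, max_new_tokens=128)
generated_ids = outputs[:, inputs["input_ids"].size(1):]
response = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(response[0])
```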
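The vLLM example introduced in the final hunks is cut off at `import torch` in this view, and it relies on a custom model class that is not shown here. As a rough, non-authoritative illustration of the offline-inference pattern it plugs into (vLLM `0.6.4.post1`-era API), the sketch below uses vLLM's generic `LLM` entry point; the repository id, prompt text, and sampling settings are assumptions, and the real example should be taken from the full README.

```python
# Generic vLLM offline-inference sketch, NOT the README's custom-class example
# (which is truncated in the hunks above). Repo id, prompt, and sampling
# settings are assumptions.
from datasets import load_dataset
from vllm import LLM, SamplingParams

libri_data = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
audio = libri_data[0]["audio"]

llm = LLM(
    model="MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION",  # assumed repo id
    trust_remote_code=True,
)
sampling_params = SamplingParams(temperature=0.1, max_tokens=128)

# vLLM 0.6.x accepts raw audio as a (waveform, sampling_rate) tuple
# under multi_modal_data.
outputs = llm.generate(
    {
        "prompt": "Please transcribe this speech.",
        "multi_modal_data": {"audio": (audio["array"], audio["sampling_rate"])},
    },
    sampling_params,
)
for o in outputs:
    print(o.outputs[0].text)
```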