Yingxu He
committed on
Update README.md
README.md
CHANGED
@@ -10,8 +10,11 @@ tags:
 - chat
 - audio
 - safetensors
+- vllm
 datasets:
 - MERaLiON/MNSC
+base_model:
+- openai/whisper-large-v2
 ---
 
 # MERaLiON
@@ -26,7 +29,7 @@ MERaLiON stands for **M**ultimodal **E**mpathetic **R**easoning **a**nd **L**ear
 - **Language(s) (NLP):** English, Chinese, Vietnamese, Indonesian, Thai, Filipino, Tamil, Malay, Khmer, Lao, Burmese, Javanese, Sundanese
 - **License:** MIT
 
-We support model inference using the [Huggingface](#inference) and [
+We support model inference using the [Huggingface](#inference) and [vLLM](#vllm-inference) frameworks. For more technical details, please refer to our [report]().
 
 ## Model Description
 
@@ -42,7 +45,7 @@ Specifically, we fine-tuned the **MERaLiON-Whisper** encoder from Whisper-large-
 
 MERaLiON-AudioLLM is trained to mainly address 6 tasks, namely `Automatic Speech Recognition` (ASR),
 `Speech Translation` (ST), `Spoken Question Answering` (SQA),
-`Spoken Dialogue Summarization` (SDS), `Speech Instruction` (SI), `Paralinguistics` (PARA).
+`Spoken Dialogue Summarization` (SDS), `Speech Instruction` (SI), and `Paralinguistics` (PARA).
 
 We benchmark MERaLiON-AudioLLM with a series of test sets from the [AudioBench benchmark](https://github.com/AudioLLMs/AudioBench)
 against three well-known AudioLLMs: `Qwen2-Audio 7B`, `WavLLM`, and `SALMONN`. We also compared with a cascaded model,
@@ -59,7 +62,7 @@ as evidenced by evaluation results on Singapore's [Multitask National Speech Cor
 > We assess ASR and ST tasks using Word Error Rate (WER) and BLEU scores, respectively.
 > For other tasks, we employ the LLM-as-a-Judge framework,
 > which uses a pre-trained large language model to evaluate task performance
-> by generating and scoring responses based on
+> by generating and scoring responses based on relevance, coherence, and accuracy criteria.
 > Refer to the [AudioBench paper](https://arxiv.org/abs/2406.16020) for more details.
 
 <div class="table*">
@@ -417,7 +420,7 @@ chat_prompt = processor.tokenizer.apply_chat_template(
 
 libri_data = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
 audio_array = libri_data[0]["audio"]["array"]
-inputs = processor(text=chat_prompt, audios=audio_array
+inputs = processor(text=chat_prompt, audios=audio_array)
 
 outputs = model.generate(**inputs, max_new_tokens=128)
 generated_ids = outputs[:, inputs['input_ids'].size(1):]
@@ -461,22 +464,22 @@ chat_prompt = processor.tokenizer.apply_chat_template(
 
 libri_data = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
 audio_array = [libri_data[0]["audio"]["array"]]*2
-inputs = processor(text=chat_prompt, audios=audio_array
+inputs = processor(text=chat_prompt, audios=audio_array)
 
 outputs = model.generate(**inputs, max_new_tokens=128)
 generated_ids = outputs[:, inputs['input_ids'].size(1):]
 response = processor.batch_decode(generated_ids, skip_special_tokens=True)
 ```
 
-###
+### vLLM Inference
 
-MERaLiON-AudioLLM requires
+MERaLiON-AudioLLM requires vLLM version `0.6.4.post1`.
 
 ```
 pip install vllm==0.6.4.post1
 ```
 
-Here is an example of offline inference using our custom
+Here is an example of offline inference using our custom vLLM class.
 
 ```python
 import torch
@@ -536,7 +539,7 @@ for o in outputs:
 
 The current MERaLiON-AudioLLM has not been aligned for safety. Developers and users should perform their own safety fine-tuning and related security measures. In no event shall the authors be held liable for any claim, damages, or other liability arising from the use of the released weights and codes.
 
-This research is supported by the National Research Foundation, Singapore and Infocomm Media Development Authority, Singapore under its National Large Language Models Funding Initiative. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of National Research Foundation, Singapore and Infocomm Media Development Authority, Singapore.
+This research is supported by the National Research Foundation, Singapore, and Infocomm Media Development Authority, Singapore under its National Large Language Models Funding Initiative. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of National Research Foundation, Singapore and Infocomm Media Development Authority, Singapore.
 
 ## Technical Specifications
 
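The Hugging Face inference snippets patched by this commit appear only as fragments in the hunks above (the change closes the previously truncated `processor(...)` call). For orientation, here is a minimal sketch of the surrounding flow; the repository id, the Auto classes, the dtype, and the instruction text are assumptions not taken from the hunks, so adapt them to the actual model card.

```python
# Minimal sketch of the Hugging Face inference flow around the fixed
# `processor(...)` call. The repo id, Auto classes, dtype, and instruction
# text are assumptions; only the lines matching the hunks come from the diff.
import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

repo_id = "MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION"  # assumed repo id
processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    repo_id, trust_remote_code=True, torch_dtype=torch.bfloat16
)

# Build the text prompt with the model's chat template (the model card's real
# template may expect an audio placeholder token inside the user turn).
conversation = [{"role": "user", "content": "Please transcribe this speech."}]
chat_prompt = processor.tokenizer.apply_chat_template(
    conversation, tokenize=False, add_generation_prompt=True
)

# Load a sample clip and run the corrected processor call from the diff.
libri_data = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
audio_array = libri_data[0]["audio"]["array"]
inputs = processor(text=chat_prompt, audios=audio_array)

outputs = model.generate(**inputs, max_new_tokens=128)
generated_ids = outputs[:, inputs["input_ids"].size(1):]
response = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(response[0])
```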
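The vLLM example introduced in the final hunks is cut off at `import torch` in this view, and it relies on a custom model class that is not shown here. As a rough, non-authoritative illustration of the offline-inference pattern it plugs into (vLLM `0.6.4.post1`-era API), the sketch below uses vLLM's generic `LLM` entry point; the repository id, prompt text, and sampling settings are assumptions, and the real example should be taken from the full README.

```python
# Generic vLLM offline-inference sketch, NOT the README's custom-class example
# (which is truncated in the hunks above). Repo id, prompt, and sampling
# settings are assumptions.
from datasets import load_dataset
from vllm import LLM, SamplingParams

libri_data = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
audio = libri_data[0]["audio"]

llm = LLM(
    model="MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION",  # assumed repo id
    trust_remote_code=True,
)
sampling_params = SamplingParams(temperature=0.1, max_tokens=128)

# vLLM 0.6.x accepts raw audio as a (waveform, sampling_rate) tuple
# under multi_modal_data.
outputs = llm.generate(
    {
        "prompt": "Please transcribe this speech.",
        "multi_modal_data": {"audio": (audio["array"], audio["sampling_rate"])},
    },
    sampling_params,
)
for o in outputs:
    print(o.outputs[0].text)
```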