Yingxu He committed
Commit a5b451f · verified · 1 parent: bd6d721

Update README.md

Files changed (1): README.md (+12 -9)
README.md CHANGED
@@ -10,8 +10,11 @@ tags:
 - chat
 - audio
 - safetensors
+- vllm
 datasets:
 - MERaLiON/MNSC
+base_model:
+- openai/whisper-large-v2
 ---

 # MERaLiON
@@ -26,7 +29,7 @@ MERaLiON stands for **M**ultimodal **E**mpathetic **R**easoning **a**nd **L**ear
 - **Language(s) (NLP):** English, Chinese, Vietnamese, Indonesian, Thai, Filipino, Tamil, Malay, Khmer, Lao, Burmese, Javanese, Sundanese
 - **License:** MIT

-We support model inference using the [Huggingface](#inference) and [VLLM](#vllm-inference) frameworks. For more technical details, please refer to our [report]().
+We support model inference using the [Huggingface](#inference) and [vLLM](#vllm-inference) frameworks. For more technical details, please refer to our [report]().

 ## Model Description

@@ -42,7 +45,7 @@ Specifically, we fine-tuned the **MERaLiON-Whisper** encoder from Whisper-large-

 MERaLiON-AudioLLM is trained to mainly address 6 tasks, namely `Automatic Speech Recognition` (ASR),
 `Speech Translation` (ST), `Spoken Question Answering` (SQA),
-`Spoken Dialogue Summarization` (SDS), `Speech Instruction` (SI), `Paralinguistics` (PARA).
+`Spoken Dialogue Summarization` (SDS), `Speech Instruction` (SI), and `Paralinguistics` (PARA).

 We benchmark MERaLiON-AudioLLM with a series of test sets from the [AudioBench benchmark](https://github.com/AudioLLMs/AudioBench)
 against three well-known AudioLLMs: `Qwen2-Audio 7B`, `WavLLM`, and `SALMONN`. We also compared with a cascaded model,
@@ -59,7 +62,7 @@ as evidenced by evaluation results on Singapore's [Multitask National Speech Cor
 > We assess ASR and ST tasks using Word Error Rate (WER) and BLEU scores, respectively.
 > For other tasks, we employ the LLM-as-a-Judge framework,
 > which uses a pre-trained large language model to evaluate task performance
-> by generating and scoring responses based on criteria such as relevance, coherence, and accuracy.
+> by generating and scoring responses based on relevance, coherence, and accuracy criteria.
 > Refer to the [AudioBench paper](https://arxiv.org/abs/2406.16020) for more details.

 <div class="table*">
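The LLM-as-a-Judge step quoted in the hunk above boils down to building a rubric prompt and parsing a score out of the judge model's reply. The sketch below shows that shape only; the rubric wording and the 0 to 5 scale are illustrative assumptions, not AudioBench's actual implementation (see the linked paper for that).

```python
import re

# Hypothetical rubric prompt -- AudioBench's real template differs (see the paper).
def build_judge_prompt(question: str, reference: str, candidate: str) -> str:
    return (
        "Score the candidate answer from 0 to 5 on relevance, coherence, "
        "and accuracy against the reference.\n"
        f"Question: {question}\n"
        f"Reference: {reference}\n"
        f"Candidate: {candidate}\n"
        "Reply with a single integer."
    )

# Pull the first integer out of the judge's free-text reply, clamped to the 0-5 scale.
def parse_score(judge_reply: str) -> int:
    match = re.search(r"\d+", judge_reply)
    return min(int(match.group()), 5) if match else 0
```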
@@ -417,7 +420,7 @@ chat_prompt = processor.tokenizer.apply_chat_template(

 libri_data = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
 audio_array = libri_data[0]["audio"]["array"]
-inputs = processor(text=chat_prompt, audios=audio_array, time_duration_limit=30)
+inputs = processor(text=chat_prompt, audios=audio_array)

 outputs = model.generate(**inputs, max_new_tokens=128)
 generated_ids = outputs[:, inputs['input_ids'].size(1):]
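The Huggingface hunks above and below show only the middle of the README's example. For orientation, here is a minimal end-to-end sketch of the same inference path; the repo id and the `<SpeechHere>` placeholder are assumptions inferred from context, while the processor and generate calls mirror the snippet in the diff.

```python
# Minimal sketch of the Huggingface inference path from the surrounding hunks.
# Assumed: the repo id and the <SpeechHere> audio placeholder in the prompt.
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

repo_id = "MERaLiON/MERaLiON-AudioLLM-Whisper-large-v2"  # assumed repo id

processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForSpeechSeq2Seq.from_pretrained(repo_id, trust_remote_code=True)

conversation = [{"role": "user", "content":
                 "Given the following audio context: <SpeechHere>\n\n"
                 "Text instruction: Please transcribe this speech."}]
chat_prompt = processor.tokenizer.apply_chat_template(
    conversation=conversation, tokenize=False, add_generation_prompt=True
)

libri_data = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
audio_array = libri_data[0]["audio"]["array"]
inputs = processor(text=chat_prompt, audios=audio_array)

outputs = model.generate(**inputs, max_new_tokens=128)
# Drop the prompt tokens; decode only the newly generated continuation.
generated_ids = outputs[:, inputs["input_ids"].size(1):]
response = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```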
@@ -461,22 +464,22 @@ chat_prompt = processor.tokenizer.apply_chat_template(

 libri_data = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
 audio_array = [libri_data[0]["audio"]["array"]]*2
-inputs = processor(text=chat_prompt, audios=audio_array, time_duration_limit=30)
+inputs = processor(text=chat_prompt, audios=audio_array)

 outputs = model.generate(**inputs, max_new_tokens=128)
 generated_ids = outputs[:, inputs['input_ids'].size(1):]
 response = processor.batch_decode(generated_ids, skip_special_tokens=True)
 ```

-### VLLM Inference
+### vLLM Inference

-MERaLiON-AudioLLM requires vllm version `0.6.4.post1`.
+MERaLiON-AudioLLM requires vLLM version `0.6.4.post1`.

 ```
 pip install vllm==0.6.4.post1
 ```

-Here is an example of offline inference using our custom vllm class.
+Here is an example of offline inference using our custom vLLM class.

 ```python
 import torch
@@ -536,7 +539,7 @@ for o in outputs:

 The current MERaLiON-AudioLLM has not been aligned for safety. Developers and users should perform their own safety fine-tuning and related security measures. In no event shall the authors be held liable for any claim, damages, or other liability arising from the use of the released weights and codes.

-This research is supported by the National Research Foundation, Singapore and Infocomm Media Development Authority, Singapore under its National Large Language Models Funding Initiative. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of National Research Foundation, Singapore and Infocomm Media Development Authority, Singapore.
+This research is supported by the National Research Foundation, Singapore, and Infocomm Media Development Authority, Singapore under its National Large Language Models Funding Initiative. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of National Research Foundation, Singapore and Infocomm Media Development Authority, Singapore.

 ## Technical Specifications

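The vLLM example is truncated across the hunks above: it opens at `import torch` and ends with the `for o in outputs:` loop. As a rough bridge between those two fragments, here is a minimal offline-inference sketch under vLLM `0.6.4.post1`; it assumes the custom MERaLiON model class is already registered with vLLM's multimodal audio interface, and the repo id, prompt format, and sampling settings are illustrative rather than authoritative.

```python
# Rough sketch of offline vLLM inference, assuming the custom MERaLiON class
# mentioned in the README is registered with vLLM's multimodal audio interface.
# The repo id, prompt text, and sampling settings are illustrative assumptions.
from datasets import load_dataset
from vllm import LLM, SamplingParams

repo_id = "MERaLiON/MERaLiON-AudioLLM-Whisper-large-v2"  # assumed repo id

llm = LLM(model=repo_id, trust_remote_code=True)
sampling_params = SamplingParams(temperature=0.1, max_tokens=128)

libri_data = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
audio = libri_data[0]["audio"]

# vLLM 0.6.x accepts audio as an (array, sampling_rate) tuple in multi_modal_data.
outputs = llm.generate(
    {
        "prompt": "Text instruction: Please transcribe this speech.",
        "multi_modal_data": {"audio": (audio["array"], audio["sampling_rate"])},
    },
    sampling_params=sampling_params,
)
for o in outputs:
    print(o.outputs[0].text)
```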
 
 