LLaSE-G1: Incentivizing Generalization Capability for LLaMA-based Speech Enhancement

Introduction

LLaSE-G1 is a unified speech enhancement model capable of handling multiple tasks without extra task prompts, including:

Noise Suppression (SE)
Target Speaker Extraction (TSE)
Packet Loss Concealment (PLC)
Acoustic Echo Cancellation (AEC)
Speech Separation (SS)

To mitigate acoustic inconsistency, LLaSE-G1 takes continuous representations from WavLM as input and predicts X-Codec2 speech tokens, which helps preserve acoustic detail. The model surpasses prior task-specific discriminative and generative speech enhancement models, shows scaling effects at test time, and exhibits emergent capabilities on unseen speech enhancement tasks.

For more details, refer to our paper: LLaSE-G1 Paper (https://arxiv.org/abs/2503.00493)

Demo

You can listen to the enhancement results on our Demo Page.

Installation

1. Clone the repository

git clone https://github.com/your-repo/LLaSE-G1.git
cd LLaSE-G1

2. Create a Conda environment and install dependencies

conda create -n llase python=3.10
conda activate llase
pip install -r requirements.txt

3. Download pretrained models

LLaSE-G1 requires three additional pretrained models to function properly. You can download them using the provided shell script:

bash ./ckpt/download.sh

Alternatively, you can download them manually and place them in the ./ckpt/ directory.

Inference

The main inference script is inference.py. The inference process consists of two stages:

1. Extract the 6th-layer features from WavLM.
2. Use the language model (LM) to predict speech tokens, then decode them into audio with X-Codec2.
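
The first stage can be sketched roughly as follows. This is a minimal illustration, not the code in inference.py: it assumes the Hugging Face `transformers` implementation of WavLM, the `microsoft/wavlm-large` checkpoint, and that "6th layer" means the output of the sixth transformer layer. The helper names `pick_layer` and `extract_wavlm_features` are hypothetical. The second stage (LM token prediction and X-Codec2 decoding) depends on the repository's own model code and is not shown.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def pick_layer(hidden_states: Sequence[T], layer: int = 6) -> T:
    """Return the output of transformer layer `layer` (1-based).

    Hugging Face models return `hidden_states` with the embedding output at
    index 0, so the N-th transformer layer sits at index N.
    """
    if not 1 <= layer < len(hidden_states):
        raise ValueError(
            f"layer must be in [1, {len(hidden_states) - 1}], got {layer}"
        )
    return hidden_states[layer]


def extract_wavlm_features(waveform, model_name="microsoft/wavlm-large", layer=6):
    """Stage 1 sketch: 16 kHz mono waveform (1-D float tensor) -> layer features.

    The checkpoint name is an assumption, not confirmed by this README.
    """
    # Heavy dependencies are imported lazily so pick_layer works without them.
    import torch
    from transformers import WavLMModel

    model = WavLMModel.from_pretrained(model_name).eval()
    # Zero-mean, unit-variance scaling, roughly what the WavLM feature
    # extractor applies before the model sees the audio.
    x = (waveform - waveform.mean()) / (waveform.std() + 1e-7)
    with torch.no_grad():
        out = model(x.unsqueeze(0), output_hidden_states=True)
    # Resulting shape: (1, frames, hidden_dim)
    return pick_layer(out.hidden_states, layer)
```

Keeping the layer selection in its own function makes the off-by-one between "layer number" and tuple index explicit, which is an easy place to pick the wrong features.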

Running Inference

To run inference, configure the parameters in ./config/test.yml:

| Parameter | Description |
| --- | --- |
| `infer_feat_too` | Whether to extract WavLM features during inference. |
| `inference_time` | Number of inference iterations. |
| `feat_dir` | Directory containing extracted features. |
| `wav_dir` | Directory of processed audio files. |
| `task` | Task type: `SE` (Noise Suppression), `TSE` (Target Speaker Extraction), `PLC` (Packet Loss Concealment), `AEC` (Acoustic Echo Cancellation), `SS` (Speech Separation). |
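
Before launching a long run, it can help to sanity-check these values. The sketch below validates a config that has already been loaded into a Python dict (for example with a YAML parser). The value types (boolean, integer, string) and the example values are assumptions for illustration; the actual ./config/test.yml may use different types or contain additional keys. `validate_config` is a hypothetical helper, not part of the repository.

```python
VALID_TASKS = {"SE", "TSE", "PLC", "AEC", "SS"}


def validate_config(cfg: dict) -> list[str]:
    """Return a list of problems; an empty list means the config looks sane."""
    problems = []

    task = cfg.get("task")
    if not isinstance(task, str) or task not in VALID_TASKS:
        problems.append(f"task must be one of {sorted(VALID_TASKS)}, got {task!r}")

    # Assumed to be a boolean flag.
    if not isinstance(cfg.get("infer_feat_too"), bool):
        problems.append("infer_feat_too must be true or false")

    # Assumed to be a positive integer; bool is rejected explicitly because
    # it is a subclass of int in Python.
    n = cfg.get("inference_time")
    if isinstance(n, bool) or not isinstance(n, int) or n < 1:
        problems.append("inference_time must be a positive integer")

    for key in ("feat_dir", "wav_dir"):
        value = cfg.get(key)
        if not isinstance(value, str) or not value:
            problems.append(f"{key} must be a non-empty path string")

    return problems


# Hypothetical values, for illustration only.
example = {
    "infer_feat_too": True,
    "inference_time": 1,
    "feat_dir": "./feats",
    "wav_dir": "./wavs",
    "task": "SE",
}

for problem in validate_config(example):
    print(problem)
```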

Command to run inference:

bash inference.sh

Results

Samples processed by LLaSE-G1 can be found on our Demo Page.

Model Checkpoints

Our pretrained model is available on Hugging Face.

Hints

Our approach relies on the LLM's comprehension capabilities to determine the task type on its own, which can be unstable in certain scenarios. A more stable and robust version will be released in an upcoming update.

Citation

If you find this work useful, please cite our paper:

@misc{kang2025llaseg1incentivizinggeneralizationcapability,
      title={LLaSE-G1: Incentivizing Generalization Capability for LLaMA-based Speech Enhancement}, 
      author={Boyi Kang and Xinfa Zhu and Zihan Zhang and Zhen Ye and Mingshuai Liu and Ziqian Wang and Yike Zhu and Guobin Ma and Jun Chen and Longshuai Xiao and Chao Weng and Wei Xue and Lei Xie},
      year={2025},
      eprint={2503.00493},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2503.00493}, 
}

License

This project is released under the Apache-2.0 license.

Contact

For any questions, please contact: [email protected]
