LLaSE-G1: Incentivizing Generalization Capability for LLaMA-based Speech Enhancement
Introduction
LLaSE-G1 is a unified speech enhancement model capable of handling multiple tasks without extra task prompts, including:
- Noise Suppression (SE)
- Target Speaker Extraction (TSE)
- Packet Loss Concealment (PLC)
- Acoustic Echo Cancellation (AEC)
- Speech Separation (SS)
To mitigate acoustic inconsistency, LLaSE-G1 employs continuous representations from WavLM as input and predicts speech tokens using X-Codec2, maximizing acoustic preservation. The model surpasses prior task-specific discriminative and generative speech enhancement models, demonstrating scaling effects at test time and emerging capabilities for unseen speech enhancement tasks.
For more details, refer to our paper: LLaSE-G1 Paper
Demo
You can listen to the enhancement results on our Demo Page.
Installation
1. Clone the repository
git clone https://github.com/your-repo/LLaSE-G1.git
cd LLaSE-G1
2. Create a Conda environment and install dependencies
conda create -n llase python=3.10
conda activate llase
pip install -r requirements.txt
3. Download Pretrained Models
LLaSE-G1 requires three additional pre-trained models to function properly. You can download them using the provided shell script:
bash ./ckpt/download.sh
Alternatively, you can download them manually and place them in the ./ckpt/
directory.
Inference
The main inference script is inference.py
. The inference process consists of two stages:
- Extract the 6th-layer features from WavLM.
- Use the language model (LM) to predict speech tokens, and then decode them into audio using X-Codec2.
Running Inference
To run inference, configure the parameters in ./config/test.yml
:
Parameter | Description |
---|---|
infer_feat_too |
Whether to extract WavLM features during inference. |
inference_time |
Number of inference iterations. |
feat_dir |
Directory containing extracted features. |
wav_dir |
Directory of processed audio files. |
task |
Task type: SE (Noise Suppression), TSE (Target Speaker Extraction), PLC (Packet Loss Concealment), AEC (Acoustic Echo Cancellation), SS (Speech Separation). |
Command to run inference:
bash inference.sh
Results
Samples processed by LLaSE-G1 can be found on our Demo Page.
Model Checkpoints
Our pretrained model is available on Hugging Face.
Hints
Our approach focuses on leveraging the LLM's comprehension capabilities to enable autonomous determination of task types, though this may exhibit instability in certain scenarios. A more stable and robust iteration will be released in the upcoming version.
Citation
If you find this work useful, please cite our paper:
@misc{kang2025llaseg1incentivizinggeneralizationcapability,
title={LLaSE-G1: Incentivizing Generalization Capability for LLaMA-based Speech Enhancement},
author={Boyi Kang and Xinfa Zhu and Zihan Zhang and Zhen Ye and Mingshuai Liu and Ziqian Wang and Yike Zhu and Guobin Ma and Jun Chen and Longshuai Xiao and Chao Weng and Wei Xue and Lei Xie},
year={2025},
eprint={2503.00493},
archivePrefix={arXiv},
primaryClass={eess.AS},
url={https://arxiv.org/abs/2503.00493},
}
License
This project is released under the Apache-2.0.
Contact
For any questions, please contact: [email protected]