# Amphion Evaluation Recipe

## Supported Evaluation Metrics

Amphion Evaluation currently supports the following objective metrics:

- **F0 Modeling**:
  - F0 Pearson Coefficients (FPC)
  - F0 Periodicity Root Mean Square Error (PeriodicityRMSE)
  - F0 Root Mean Square Error (F0RMSE)
  - Voiced/Unvoiced F1 Score (V/UV F1)
- **Energy Modeling**:
  - Energy Root Mean Square Error (EnergyRMSE)
  - Energy Pearson Coefficients (EnergyPC)
- **Intelligibility**:
  - Character Error Rate (CER) based on [Whisper](https://github.com/openai/whisper)
  - Word Error Rate (WER) based on [Whisper](https://github.com/openai/whisper)
- **Spectrogram Distortion**:
  - Frechet Audio Distance (FAD)
  - Mel Cepstral Distortion (MCD)
  - Multi-Resolution STFT Distance (MSTFT)
  - Perceptual Evaluation of Speech Quality (PESQ)
  - Short Time Objective Intelligibility (STOI)
  - Scale Invariant Signal to Distortion Ratio (SISDR)
  - Scale Invariant Signal to Noise Ratio (SISNR)
- **Speaker Similarity**:
  - Cosine similarity based on:
    - [Rawnet3](https://github.com/Jungjee/RawNet)
    - [Resemblyzer](https://github.com/resemble-ai/Resemblyzer)
    - [WavLM](https://huggingface.co./microsoft/wavlm-base-plus-sv)

We provide a recipe that demonstrates how to objectively evaluate your generated audio. There are three steps in total:

1. Pretrained Models Preparation
2. Audio Data Preparation
3. Evaluation

## 1. Pretrained Models Preparation

If you want to calculate `RawNet3`-based speaker similarity, you need to download the pretrained model first, as illustrated [here](../../pretrained/README.md).

## 2. Audio Data Preparation

Prepare the reference audio and the generated audio in two separate folders: `ref_dir` contains the reference audio, and `gen_dir` contains the generated audio. Here is an example:

```plaintext
 ┣ {ref_dir}
 ┃ ┣ sample1.wav
 ┃ ┣ sample2.wav
 ┣ {gen_dir}
 ┃ ┣ sample1.wav
 ┃ ┣ sample2.wav
```

Make sure that each **reference audio file and its generated counterpart share the same name**, as illustrated above (sample1 pairs with sample1, sample2 with sample2).
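To catch naming mismatches before running the evaluation, a quick sanity check like the following may help (a minimal sketch; the two variables are placeholders for your actual paths):

```bash
ref_dir=[Your path to the reference audios]
gen_dir=[Your path to the generated audios]

# Report any reference file that lacks a same-named generated file.
for f in "$ref_dir"/*.wav; do
    name=$(basename "$f")
    [ -f "$gen_dir/$name" ] || echo "Missing generated audio for $name"
done
```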

## 3. Evaluation

Run `run.sh` with the reference folder, generated folder, dump folder, and metrics specified:

```bash
cd Amphion
sh egs/metrics/run.sh \
	--reference_folder [Your path to the reference audios] \
	--generated_folder [Your path to the generated audios] \
	--dump_folder [Your path to dump the objective results] \
	--metrics [The metrics you need] \
	--fs [Optional. To calculate all metrics at the specified sampling rate] \
	--similarity_model [Optional. The model used for calculating speaker similarity. Currently "rawnet", "wavlm" and "resemblyzer" are available. Defaults to "wavlm"] \
	--similarity_mode [Optional. The mode for calculating speaker similarity. "pairwith" computes similarity over a series of ground-truth / prediction audio pairs, while "overall" computes the average score over all possible pairs between the reference folder and the generated folder. Defaults to "pairwith"] \
	--intelligibility_mode [Optional. The mode for computing CER and WER. "gt_audio" uses the recognized content of the reference audio as the target, while "gt_content" uses the provided transcription as the target. Defaults to "gt_audio"] \
	--ltr_path [Optional. Path to the transcription file] \
	--language [Optional. Language for computing CER and WER. Defaults to "english"]
```

As for the metrics, an example is provided below:

```bash
--metrics "mcd pesq fad"
```
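Putting it all together, a complete invocation might look like this (the paths and sampling rate here are illustrative):

```bash
cd Amphion
sh egs/metrics/run.sh \
	--reference_folder data/eval/ref_wavs \
	--generated_folder data/eval/gen_wavs \
	--dump_folder results/metrics \
	--metrics "mcd pesq fad" \
	--fs 16000
```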

All currently available metrics keywords are listed below:

| Keys                      | Description                                |
| ------------------------- | ------------------------------------------ |
| `fpc`                     | F0 Pearson Coefficients                    |
| `f0_periodicity_rmse`     | F0 Periodicity Root Mean Square Error      |
| `f0rmse`                  | F0 Root Mean Square Error                  |
| `v_uv_f1`                 | Voiced/Unvoiced F1 Score                   |
| `energy_rmse`             | Energy Root Mean Square Error              |
| `energy_pc`               | Energy Pearson Coefficients                |
| `cer`                     | Character Error Rate                       |
| `wer`                     | Word Error Rate                            |
| `similarity`              | Speaker Similarity                         |
| `fad`                     | Frechet Audio Distance                     |
| `mcd`                     | Mel Cepstral Distortion                    |
| `mstft`                   | Multi-Resolution STFT Distance             |
| `pesq`                    | Perceptual Evaluation of Speech Quality    |
| `si_sdr`                  | Scale Invariant Signal to Distortion Ratio |
| `si_snr`                  | Scale Invariant Signal to Noise Ratio      |
| `stoi`                    | Short Time Objective Intelligibility       |

For example, if you want to calculate the speaker similarity between the synthesized audio and the reference audio with the same content, run:

```bash
sh egs/metrics/run.sh \
	--reference_folder [Your path to the reference audios] \
	--generated_folder [Your path to the generated audios] \
	--dump_folder [Your path to dump the objective results] \
	--metrics "similarity" \
	--similarity_model [Optional. The model used for calculating speaker similarity. Currently "rawnet", "wavlm" and "resemblyzer" are available. Defaults to "wavlm"] \
	--similarity_mode "pairwith"
```

If you don't have reference audio with the same content, run the following to get a content-free similarity score:

```bash
sh egs/metrics/run.sh \
	--reference_folder [Your path to the reference audios] \
	--generated_folder [Your path to the generated audios] \
	--dump_folder [Your path to dump the objective results] \
	--metrics "similarity" \
	--similarity_model [Optional. The model used for calculating speaker similarity. Currently "rawnet", "wavlm" and "resemblyzer" are available. Defaults to "wavlm"] \
	--similarity_mode "overall"
```
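Similarly, to compute the intelligibility metrics (CER and WER) against ground-truth transcriptions rather than the recognized content of the reference audio, combine the flags described above (paths are placeholders):

```bash
sh egs/metrics/run.sh \
	--reference_folder [Your path to the reference audios] \
	--generated_folder [Your path to the generated audios] \
	--dump_folder [Your path to dump the objective results] \
	--metrics "cer wer" \
	--intelligibility_mode "gt_content" \
	--ltr_path [Your path to the transcription file] \
	--language "english"
```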

## Troubleshooting
### FAD (Using Offline Models)
If your system is unable to access huggingface.co from the terminal, you might run into an error like "OSError: Can't load tokenizer for ...". To work around this, follow these steps to use local models:

1. Download the [bert-base-uncased](https://huggingface.co./bert-base-uncased), [roberta-base](https://huggingface.co./roberta-base), and [facebook/bart-base](https://huggingface.co./facebook/bart-base) models from `huggingface.co`. Ensure that the models are complete and uncorrupted. Place these directories within `Amphion/pretrained`. For a detailed file structure reference, see [This README](../../pretrained/README.md#optional-model-dependencies-for-evaluation) under `Amphion/pretrained`.
2. Inside the `Amphion/pretrained` directory, create a bash script with the content outlined below. This script will automatically update the tokenizer paths used by your system:
  ```bash
  #!/bin/bash

  BERT_DIR="bert-base-uncased"
  ROBERTA_DIR="roberta-base"
  BART_DIR="facebook/bart-base"
  PYTHON_SCRIPT="[YOUR ENV PATH]/lib/python3.9/site-packages/laion_clap/training/data.py"

  # Rewrite the hard-coded from_pretrained() path for the given tokenizer
  # class in laion_clap's data.py to point at the local model directory.
  update_tokenizer_path() {
      local dir_name=$1
      local tokenizer_variable=$2
      local full_path

      if [ -d "$dir_name" ]; then
          full_path=$(realpath "$dir_name")
          if [ -f "$PYTHON_SCRIPT" ]; then
              sed -i "s|${tokenizer_variable}.from_pretrained(\".*\")|${tokenizer_variable}.from_pretrained(\"$full_path\")|" "$PYTHON_SCRIPT"
              echo "Updated ${tokenizer_variable} path to $full_path."
          else
              echo "Error: The specified Python script does not exist."
              exit 1
          fi
      else
          echo "Error: The directory $dir_name does not exist in the current directory."
          exit 1
      fi
  }

  update_tokenizer_path "$BERT_DIR" "BertTokenizer"
  update_tokenizer_path "$ROBERTA_DIR" "RobertaTokenizer"
  update_tokenizer_path "$BART_DIR" "BartTokenizer"

  echo "BERT, BART and RoBERTa Python script paths have been updated."

  ```

3. The script adjusts the tokenizer paths in the `data.py` file located under `/lib/python3.9/site-packages/laion_clap/training/` in your environment. If you use conda, you can find your environment path by running `conda info --envs`, then substitute `[YOUR ENV PATH]` in the script with it. If your environment is configured differently, update the `PYTHON_SCRIPT` variable to point to the correct `data.py` file.
4. Run the script. If it executes successfully, the tokenizer paths will be updated so that the models can be loaded locally; see the sketch after this list for one way to locate `data.py` and run the script.
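If you are unsure where `data.py` lives, the snippet below shows one way to locate it and then run the script (a minimal sketch, assuming `laion_clap` is importable in the active environment; the script name `update_tokenizer_paths.sh` is hypothetical, so use whatever name you chose in step 2):

```bash
cd Amphion/pretrained

# Locate laion_clap's data.py in the active environment
# (assumes the laion_clap package is importable):
python -c "import laion_clap, os; print(os.path.join(os.path.dirname(laion_clap.__file__), 'training', 'data.py'))"

# Run the script created in step 2 (the filename is hypothetical):
bash update_tokenizer_paths.sh
```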

### WavLM-based Speaker Similarity (Using Offline Models)

If your system is unable to access huggingface.co from the terminal and you want to calculate `WavLM`-based speaker similarity, you need to download the pretrained model first, as illustrated [here](../../pretrained/README.md).