adefossez committed
Commit 188a77f
2 Parent(s): 4c6424c 925b7f8

Merge branch 'main' into our_hf2

CHANGELOG.md CHANGED
@@ -4,6 +4,15 @@ All notable changes to this project will be documented in this file.
 
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
 
-## [0.0.1a] - TBD
+## [0.0.2a] - TBD
 
-Initial release, with model evaluation only.
+Improved demo, fixed top-p sampling (thanks @jnordberg).
+
+Added a tanh compressor on the output to avoid clipping with some styles (especially piano).
+Now repeating the conditioning periodically if it is too short.
+
+More options when launching the Gradio app locally (thanks @ashleykleynhans).
+
+## [0.0.1] - 2023-06-09
+
+Initial release, with model evaluation only.
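For readers unfamiliar with the "top-p" mentioned above: it refers to nucleus sampling, where only the smallest set of tokens whose cumulative probability exceeds `p` is kept before sampling. The snippet below is a generic, self-contained illustration of the technique, not the audiocraft code that was fixed in this release:

```python
import torch

def sample_top_p(probs: torch.Tensor, p: float = 0.9) -> torch.Tensor:
    """Nucleus (top-p) sampling: keep the smallest set of tokens whose cumulative
    probability exceeds p, renormalize, then sample one token per row."""
    sorted_probs, sorted_idx = probs.sort(dim=-1, descending=True)
    cum = sorted_probs.cumsum(dim=-1)
    # Drop a token once the mass accumulated before it already reaches p.
    mask = (cum - sorted_probs) >= p
    sorted_probs[mask] = 0.0
    sorted_probs /= sorted_probs.sum(dim=-1, keepdim=True)
    next_sorted = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_idx.gather(-1, next_sorted)

probs = torch.softmax(torch.randn(2, 10), dim=-1)   # toy batch of distributions
print(sample_top_p(probs, p=0.8))                   # one sampled token id per row
```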
MODEL_CARD.md CHANGED
@@ -52,7 +52,7 @@ The model was evaluated on the [MusicCaps benchmark](https://www.kaggle.com/data
 
 ## Training datasets
 
-The model was trained using the following sources: the [Meta Music Initiative Sound Collection](https://www.fb.com/sound), [Shutterstock music collection](https://www.shutterstock.com/music) and the [Pond5 music collection](https://www.pond5.com/). See the paper for more details about the training set and corresponding preprocessing.
+The model was trained on licensed data using the following sources: the [Meta Music Initiative Sound Collection](https://www.fb.com/sound), the [Shutterstock music collection](https://www.shutterstock.com/music) and the [Pond5 music collection](https://www.pond5.com/). See the paper for more details about the training set and corresponding preprocessing.
 
 ## Quantitative analysis
 
@@ -62,7 +62,7 @@ More information can be found in the paper [Simple and Controllable Music Genera
 
 **Data:** The data sources used to train the model are created by music professionals and covered by legal agreements with the right holders. The model is trained on 20K hours of data, we believe that scaling the model on larger datasets can further improve the performance of the model.
 
-**Mitigations:** All vocals have been removed from the data source using a state-of-the-art music source separation method, namely using the open source [Hybrid Transformer for Music Source Separation](https://github.com/facebookresearch/demucs) (HT-Demucs). The model is therefore not able to produce vocals.
+**Mitigations:** Vocals have been removed from the data source using the corresponding tags, and then using a state-of-the-art music source separation method, namely the open-source [Hybrid Transformer for Music Source Separation](https://github.com/facebookresearch/demucs) (HT-Demucs).
 
 **Limitations:**
 
README.md CHANGED
@@ -24,12 +24,12 @@ Audiocraft is a PyTorch library for deep learning research on audio generation.
 ## MusicGen
 
 Audiocraft provides the code and models for MusicGen, [a simple and controllable model for music generation][arxiv]. MusicGen is a single stage auto-regressive
-Transformer model trained over a 32kHz <a href="https://github.com/facebookresearch/encodec">EnCodec tokenizer</a> with 4 codebooks sampled at 50 Hz. Unlike existing methods like [MusicLM](https://arxiv.org/abs/2301.11325), MusicGen doesn't not require a self-supervised semantic representation, and it generates
+Transformer model trained over a 32kHz <a href="https://github.com/facebookresearch/encodec">EnCodec tokenizer</a> with 4 codebooks sampled at 50 Hz. Unlike existing methods like [MusicLM](https://arxiv.org/abs/2301.11325), MusicGen doesn't require a self-supervised semantic representation, and it generates
 all 4 codebooks in one pass. By introducing a small delay between the codebooks, we show we can predict
 them in parallel, thus having only 50 auto-regressive steps per second of audio.
 Check out our [sample page][musicgen_samples] or test the available demo!
 
-<a target="_blank" href="https://colab.research.google.com/drive/1fxGqfg96RBUvGxZ1XXN07s3DthrKUl4-?usp=sharing">
+<a target="_blank" href="https://colab.research.google.com/drive/1-Xe9NCdIs2sCUbiSmwHXozK6AAhMm7_i?usp=sharing">
   <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
 </a>
 <a target="_blank" href="https://huggingface.co/spaces/facebook/MusicGen">
@@ -37,6 +37,8 @@ Check out our [sample page][musicgen_samples] or test the available demo!
 </a>
 <br>
 
+We use 20K hours of licensed music to train MusicGen. Specifically, we rely on an internal dataset of 10K high-quality music tracks, and on the Shutterstock and Pond5 music data.
+
 ## Installation
 Audiocraft requires Python 3.9, PyTorch 2.0.0, and a GPU with at least 16 GB of memory (for the medium-sized model). To install Audiocraft, you can run the following:
@@ -51,7 +53,12 @@ pip install -e . # or if you cloned the repo locally
 ```
 
 ## Usage
-You can play with MusicGen by running the jupyter notebook at [`demo.ipynb`](./demo.ipynb) locally, or use the provided [colab notebook](https://colab.research.google.com/drive/1fxGqfg96RBUvGxZ1XXN07s3DthrKUl4-?usp=sharing). Finally, a demo is also available on the [`facebook/MusiGen` HugginFace Space](https://huggingface.co/spaces/facebook/MusicGen) (huge thanks to all the HF team for their support).
+We offer a number of ways to interact with MusicGen:
+1. You can play with MusicGen by running the Jupyter notebook at [`demo.ipynb`](./demo.ipynb) locally, or use the provided [colab notebook](https://colab.research.google.com/drive/1fxGqfg96RBUvGxZ1XXN07s3DthrKUl4-?usp=sharing).
+2. You can use the Gradio demo locally by running `python app.py`.
+3. A demo is also available on the [`facebook/MusicGen` HuggingFace Space](https://huggingface.co/spaces/facebook/MusicGen) (huge thanks to all the HF team for their support).
+4. Finally, you can run the [Gradio demo with a Colab GPU](https://colab.research.google.com/drive/1-Xe9NCdIs2sCUbiSmwHXozK6AAhMm7_i?usp=sharing),
+as adapted from the [@camenduru Colab](https://github.com/camenduru/MusicGen-colab).
 
 ## API
 
@@ -68,7 +75,7 @@ GPUs will be able to generate short sequences, or longer sequences with the `sma
 **Note**: Please make sure to have [ffmpeg](https://ffmpeg.org/download.html) installed when using newer version of `torchaudio`.
 You can install it with:
 ```
-apt get install ffmpeg
+apt-get install ffmpeg
 ```
 
 See after a quick example for using the API.
@@ -90,7 +97,7 @@ wav = model.generate_with_chroma(descriptions, melody[None].expand(3, -1, -1), s
 
 for idx, one_wav in enumerate(wav):
     # Will save under {idx}.wav, with loudness normalization at -14 db LUFS.
-    audio_write(f'{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness")
+    audio_write(f'{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness", loudness_compressor=True)
 ```
 
 
@@ -105,6 +112,11 @@ See [the model card page](./MODEL_CARD.md).
 Yes. We will soon release the training code for MusicGen and EnCodec.
 
 
+#### I need help on Windows
+
+@FurkanGozukara made a complete tutorial for [Audiocraft/MusicGen on Windows](https://youtu.be/v-YpvPkhdO4).
+
+
 ## Citation
 ```
 @article{copet2023simple,
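The README's claim that a small inter-codebook delay lets the four EnCodec token streams be predicted in a single 50 Hz auto-regressive pass is easy to picture on a toy tensor. The sketch below is purely illustrative, not audiocraft's internal delay-pattern code; the names `codes`, `delayed` and the `PAD` value are made up:

```python
import torch

# 4 codebooks at 50 Hz: shift codebook k by k steps so that, at step t, the model
# emits codebook 0 for frame t, codebook 1 for frame t-1, ..., codebook 3 for frame t-3.
# One auto-regressive step per 20 ms frame is then enough to cover all 4 streams.
n_q, T, PAD = 4, 8, -1                      # hypothetical sizes and padding token
codes = torch.arange(n_q * T).reshape(n_q, T)

delayed = torch.full((n_q, T + n_q - 1), PAD)
for k in range(n_q):
    delayed[k, k:k + T] = codes[k]          # codebook k lags k steps behind codebook 0

print(delayed)                              # each column = tokens produced at one step
```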
app.py CHANGED
@@ -7,14 +7,15 @@ LICENSE file in the root directory of this source tree.
 """
 
 from tempfile import NamedTemporaryFile
+import argparse
 import torch
 import gradio as gr
+import os
 from audiocraft.models import MusicGen
-
 from audiocraft.data.audio import audio_write
 
-
 MODEL = None
+IS_SHARED_SPACE = "musicgen/MusicGen" in os.environ['SPACE_ID']
 
 
 def load_model(version):
@@ -56,95 +57,160 @@ def predict(model, text, melody, duration, topk, topp, temperature, cfg_coef):
 
     output = output.detach().cpu().float()[0]
     with NamedTemporaryFile("wb", suffix=".wav", delete=False) as file:
-        audio_write(file.name, output, MODEL.sample_rate, strategy="loudness", add_suffix=False)
+        audio_write(
+            file.name, output, MODEL.sample_rate, strategy="loudness",
+            loudness_headroom_db=16, loudness_compressor=True, add_suffix=False)
         waveform_video = gr.make_waveform(file.name)
         return waveform_video
 
 
-with gr.Blocks() as demo:
-    gr.Markdown(
-        """
-        # MusicGen
+def ui(**kwargs):
+    with gr.Blocks() as interface:
+        gr.Markdown(
+            """
+            # MusicGen
+            This is your private demo for [MusicGen](https://github.com/facebookresearch/audiocraft), a simple and controllable model for music generation
+            presented at: ["Simple and Controllable Music Generation"](https://huggingface.co/papers/2306.05284)
+            """
+        )
+        if IS_SHARED_SPACE:
+            gr.Markdown("""
+                ⚠ This Space doesn't work in this shared UI ⚠
+
+                <a href="https://huggingface.co/spaces/musicgen/MusicGen?duplicate=true" style="display: inline-block;margin-top: .5em;margin-right: .25em;" target="_blank">
+                <img style="margin-bottom: 0em;display: inline;margin-top: -.25em;" src="https://bit.ly/3gLdBN6" alt="Duplicate Space"></a>
+                to use it privately, or use the <a href="https://huggingface.co/spaces/facebook/MusicGen">public demo</a>
+                """)
+        with gr.Row():
+            with gr.Column():
+                with gr.Row():
+                    text = gr.Text(label="Input Text", interactive=True)
+                    melody = gr.Audio(source="upload", type="numpy", label="Melody Condition (optional)", interactive=True)
+                with gr.Row():
+                    submit = gr.Button("Submit")
+                with gr.Row():
+                    model = gr.Radio(["melody", "medium", "small", "large"], label="Model", value="melody", interactive=True)
+                with gr.Row():
+                    duration = gr.Slider(minimum=1, maximum=30, value=10, label="Duration", interactive=True)
+                with gr.Row():
+                    topk = gr.Number(label="Top-k", value=250, interactive=True)
+                    topp = gr.Number(label="Top-p", value=0, interactive=True)
+                    temperature = gr.Number(label="Temperature", value=1.0, interactive=True)
+                    cfg_coef = gr.Number(label="Classifier Free Guidance", value=3.0, interactive=True)
+            with gr.Column():
+                output = gr.Video(label="Generated Music")
+        submit.click(predict, inputs=[model, text, melody, duration, topk, topp, temperature, cfg_coef], outputs=[output])
+        gr.Examples(
+            fn=predict,
+            examples=[
+                [
+                    "An 80s driving pop song with heavy drums and synth pads in the background",
+                    "./assets/bach.mp3",
+                    "melody"
+                ],
+                [
+                    "A cheerful country song with acoustic guitars",
+                    "./assets/bolero_ravel.mp3",
+                    "melody"
+                ],
+                [
+                    "90s rock song with electric guitar and heavy drums",
+                    None,
+                    "medium"
+                ],
+                [
+                    "a light and cheerly EDM track, with syncopated drums, aery pads, and strong emotions",
+                    "./assets/bach.mp3",
+                    "melody"
+                ],
+                [
+                    "lofi slow bpm electro chill with organic samples",
+                    None,
+                    "medium",
+                ],
+            ],
+            inputs=[text, melody, model],
+            outputs=[output]
+        )
+        gr.Markdown(
+            """
+            ### More details
+
+            The model will generate a short music extract based on the description you provided.
+            You can generate up to 30 seconds of audio.
+
+            We present 4 model variations:
+            1. Melody -- a music generation model capable of generating music conditioned on text and melody inputs. **Note**, you can also use text only.
+            2. Small -- a 300M transformer decoder conditioned on text only.
+            3. Medium -- a 1.5B transformer decoder conditioned on text only.
+            4. Large -- a 3.3B transformer decoder conditioned on text only (might OOM for the longest sequences).
+
+            When using `melody`, you can optionally provide a reference audio from
+            which a broad melody will be extracted. The model will then try to follow both the description and melody provided.
+
+            You can also use your own GPU or a Google Colab by following the instructions on our repo.
+            See [github.com/facebookresearch/audiocraft](https://github.com/facebookresearch/audiocraft)
+            for more details.
+            """
+        )
 
-        This is the demo for [MusicGen](https://github.com/facebookresearch/audiocraft), a simple and controllable model for music generation
-        presented at: ["Simple and Controllable Music Generation"](https://huggingface.co/papers/2306.05284).
-        <br/>
-        <a href="https://huggingface.co/spaces/musicgen/MusicGen?duplicate=true" style="display: inline-block;margin-top: .5em;margin-right: .25em;" target="_blank">
-        <img style="margin-bottom: 0em;display: inline;margin-top: -.25em;" src="https://bit.ly/3gLdBN6" alt="Duplicate Space"></a>
-        for longer sequences, more control and no queue.</p>
-        """
+    # Show the interface
+    launch_kwargs = {}
+    username = kwargs.get('username')
+    password = kwargs.get('password')
+    server_port = kwargs.get('server_port', 0)
+    inbrowser = kwargs.get('inbrowser', False)
+    share = kwargs.get('share', False)
+    server_name = kwargs.get('listen')
+
+    launch_kwargs['server_name'] = server_name
+
+    if username and password:
+        launch_kwargs['auth'] = (username, password)
+    if server_port > 0:
+        launch_kwargs['server_port'] = server_port
+    if inbrowser:
+        launch_kwargs['inbrowser'] = inbrowser
+    if share:
+        launch_kwargs['share'] = share
+
+    interface.queue().launch(**launch_kwargs, max_threads=1)
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        '--listen',
+        type=str,
+        default='127.0.0.1',
+        help='IP to listen on for connections to Gradio',
     )
-    with gr.Row():
-        with gr.Column():
-            with gr.Row():
-                text = gr.Text(label="Input Text", interactive=True)
-                melody = gr.Audio(source="upload", type="numpy", label="Melody Condition (optional)", interactive=True)
-            with gr.Row():
-                submit = gr.Button("Submit")
-            with gr.Row():
-                model = gr.Radio(["melody", "medium", "small", "large"], label="Model", value="melody", interactive=True)
-            with gr.Row():
-                duration = gr.Slider(minimum=1, maximum=30, value=10, label="Duration", interactive=True)
-            with gr.Row():
-                topk = gr.Number(label="Top-k", value=250, interactive=True)
-                topp = gr.Number(label="Top-p", value=0, interactive=True)
-                temperature = gr.Number(label="Temperature", value=1.0, interactive=True)
-                cfg_coef = gr.Number(label="Classifier Free Guidance", value=3.0, interactive=True)
-        with gr.Column():
-            output = gr.Video(label="Generated Music")
-    submit.click(predict, inputs=[model, text, melody, duration, topk, topp, temperature, cfg_coef], outputs=[output])
-    gr.Examples(
-        fn=predict,
-        examples=[
-            [
-                "An 80s driving pop song with heavy drums and synth pads in the background",
-                "./assets/bach.mp3",
-                "melody"
-            ],
-            [
-                "A cheerful country song with acoustic guitars",
-                "./assets/bolero_ravel.mp3",
-                "melody"
-            ],
-            [
-                "90s rock song with electric guitar and heavy drums",
-                None,
-                "medium"
-            ],
-            [
-                "a light and cheerly EDM track, with syncopated drums, aery pads, and strong emotions",
-                "./assets/bach.mp3",
-                "melody"
-            ],
-            [
-                "lofi slow bpm electro chill with organic samples",
-                None,
-                "medium",
-            ],
-        ],
-        inputs=[text, melody, model],
-        outputs=[output]
+    parser.add_argument(
+        '--username', type=str, default='', help='Username for authentication'
+    )
+    parser.add_argument(
+        '--password', type=str, default='', help='Password for authentication'
     )
-    gr.Markdown(
-        """
-        ### More details
-
-        The model will generate a short music extract based on the description you provided.
-        You can generate up to 30 seconds of audio.
-
-        We present 4 model variations:
-        1. Melody -- a music generation model capable of generating music condition on text and melody inputs. **Note**, you can also use text only.
-        2. Small -- a 300M transformer decoder conditioned on text only.
-        3. Medium -- a 1.5B transformer decoder conditioned on text only.
-        4. Large -- a 3.3B transformer decoder conditioned on text only (might OOM for the longest sequences.)
-
-        When using `melody`, ou can optionaly provide a reference audio from
-        which a broad melody will be extracted. The model will then try to follow both the description and melody provided.
-
-        You can also use your own GPU or a Google Colab by following the instructions on our repo.
-        See [github.com/facebookresearch/audiocraft](https://github.com/facebookresearch/audiocraft)
-        for more details.
-        """
+    parser.add_argument(
+        '--server_port',
+        type=int,
+        default=0,
+        help='Port to run the server listener on',
+    )
+    parser.add_argument(
+        '--inbrowser', action='store_true', help='Open in browser'
+    )
+    parser.add_argument(
+        '--share', action='store_true', help='Share the gradio UI'
     )
 
-demo.launch()
+    args = parser.parse_args()
+
+    ui(
+        username=args.username,
+        password=args.password,
+        inbrowser=args.inbrowser,
+        server_port=args.server_port,
+        share=args.share,
+        listen=args.listen
+    )
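If you would rather script the demo than use the CLI flags above, the new `ui()` helper can be called directly with the same keyword arguments the argparse block forwards. A minimal, hypothetical sketch (it assumes you run it from the repository root so `app.py` is importable; note that the module-level `IS_SHARED_SPACE` check reads `os.environ['SPACE_ID']` unconditionally, so the variable needs a value outside a Hugging Face Space):

```python
import os

# Give SPACE_ID a harmless value when running outside a Hugging Face Space,
# since app.py reads it at import time for the IS_SHARED_SPACE check.
os.environ.setdefault('SPACE_ID', '')

from app import ui  # assumes the current working directory is the repository root

# Same effect as `python app.py --listen 0.0.0.0 --server_port 7860 --inbrowser`.
ui(
    username='',        # empty username/password disables the auth prompt
    password='',
    inbrowser=True,     # open a browser tab once the server is ready
    server_port=7860,   # 0 would let Gradio pick a free port
    share=False,        # True would create a public gradio.live link
    listen='0.0.0.0',   # listen on all interfaces instead of the 127.0.0.1 default
)
```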
app_batched.py CHANGED
@@ -57,7 +57,9 @@ def predict(texts, melodies):
     out_files = []
     for output in outputs:
         with NamedTemporaryFile("wb", suffix=".wav", delete=False) as file:
-            audio_write(file.name, output, MODEL.sample_rate, strategy="loudness", add_suffix=False)
+            audio_write(
+                file.name, output, MODEL.sample_rate, strategy="loudness",
+                loudness_headroom_db=16, loudness_compressor=True, add_suffix=False)
             waveform_video = gr.make_waveform(file.name)
             out_files.append(waveform_video)
     return [out_files]
audiocraft/__init__.py CHANGED
@@ -7,4 +7,4 @@
 # flake8: noqa
 from . import data, modules, models
 
-__version__ = '0.0.1'
+__version__ = '0.0.2a1'
audiocraft/data/audio.py CHANGED
@@ -155,6 +155,7 @@ def audio_write(stem_name: tp.Union[str, Path],
                 format: str = 'wav', mp3_rate: int = 320, normalize: bool = True,
                 strategy: str = 'peak', peak_clip_headroom_db: float = 1,
                 rms_headroom_db: float = 18, loudness_headroom_db: float = 14,
+                loudness_compressor: bool = False,
                 log_clipping: bool = True, make_parent_dir: bool = True,
                 add_suffix: bool = True) -> Path:
     """Convenience function for saving audio to disk. Returns the filename the audio was written to.
@@ -173,7 +174,8 @@ def audio_write(stem_name: tp.Union[str, Path],
         rms_headroom_db (float): Headroom in dB when doing 'rms' strategy. This must be much larger
             than the `peak_clip` one to avoid further clipping.
         loudness_headroom_db (float): Target loudness for loudness normalization.
-        log_clipping (bool): If True, basic logging on stderr when clipping still
+        loudness_compressor (bool): Uses tanh for soft clipping when strategy is 'loudness'.
+        log_clipping (bool): If True, basic logging on stderr when clipping still
            occurs despite strategy (only for 'rms').
        make_parent_dir (bool): Make parent directory if it doesn't exist.
    Returns:
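To make the new flag concrete, here is a small, self-contained way to exercise `audio_write` with the compressor enabled, mirroring the settings the Gradio apps now pass. The 440 Hz test tone and the output stem `example` are placeholders, not anything from the repository:

```python
import math
import torch
from audiocraft.data.audio import audio_write

# One second of a 440 Hz sine at 32 kHz, shaped [channels, samples] as audio_write expects.
sr = 32000
t = torch.arange(sr) / sr
wav = 0.5 * torch.sin(2 * math.pi * 440 * t).unsqueeze(0)

# Loudness normalization with the tanh soft-clipper: -16 dB LUFS target, compressor on.
audio_write('example', wav, sr, strategy="loudness",
            loudness_headroom_db=16, loudness_compressor=True)
```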
audiocraft/data/audio_utils.py CHANGED
@@ -54,8 +54,8 @@ def convert_audio(wav: torch.Tensor, from_rate: float,
     return wav
 
 
-def normalize_loudness(wav: torch.Tensor, sample_rate: int, loudness_headroom_db: float = 12,
-                       energy_floor: float = 2e-3):
+def normalize_loudness(wav: torch.Tensor, sample_rate: int, loudness_headroom_db: float = 14,
+                       loudness_compressor: bool = False, energy_floor: float = 2e-3):
     """Normalize an input signal to a user loudness in dB LKFS.
     Audio loudness is defined according to the ITU-R BS.1770-4 recommendation.
 
@@ -63,6 +63,7 @@ def normalize_loudness(wav: torch.Tensor, sample_rate: int, loudness_headroom_db
         wav (torch.Tensor): Input multichannel audio data.
         sample_rate (int): Sample rate.
         loudness_headroom_db (float): Target loudness of the output in dB LUFS.
+        loudness_compressor (bool): Uses tanh for soft clipping.
         energy_floor (float): anything below that RMS level will not be rescaled.
     Returns:
         output (torch.Tensor): Loudness normalized output data.
@@ -76,6 +77,8 @@ def normalize_loudness(wav: torch.Tensor, sample_rate: int, loudness_headroom_db
     delta_loudness = -loudness_headroom_db - input_loudness_db
     gain = 10.0 ** (delta_loudness / 20.0)
     output = gain * wav
+    if loudness_compressor:
+        output = torch.tanh(output)
     assert output.isfinite().all(), (input_loudness_db, wav.pow(2).mean().sqrt())
     return output
 
@@ -93,7 +96,8 @@ def _clip_wav(wav: torch.Tensor, log_clipping: bool = False, stem_name: tp.Optio
 def normalize_audio(wav: torch.Tensor, normalize: bool = True,
                     strategy: str = 'peak', peak_clip_headroom_db: float = 1,
                     rms_headroom_db: float = 18, loudness_headroom_db: float = 14,
-                    log_clipping: bool = False, sample_rate: tp.Optional[int] = None,
+                    loudness_compressor: bool = False, log_clipping: bool = False,
+                    sample_rate: tp.Optional[int] = None,
                     stem_name: tp.Optional[str] = None) -> torch.Tensor:
     """Normalize the audio according to the prescribed strategy (see after).
 
@@ -109,6 +113,7 @@ def normalize_audio(wav: torch.Tensor, normalize: bool = True,
         rms_headroom_db (float): Headroom in dB when doing 'rms' strategy. This must be much larger
             than the `peak_clip` one to avoid further clipping.
         loudness_headroom_db (float): Target loudness for loudness normalization.
+        loudness_compressor (bool): If True, uses tanh based soft clipping.
         log_clipping (bool): If True, basic logging on stderr when clipping still
             occurs despite strategy (only for 'rms').
         sample_rate (int): Sample rate for the audio data (required for loudness).
@@ -132,7 +137,7 @@ def normalize_audio(wav: torch.Tensor, normalize: bool = True,
         _clip_wav(wav, log_clipping=log_clipping, stem_name=stem_name)
     elif strategy == 'loudness':
         assert sample_rate is not None, "Loudness normalization requires sample rate."
-        wav = normalize_loudness(wav, sample_rate, loudness_headroom_db)
+        wav = normalize_loudness(wav, sample_rate, loudness_headroom_db, loudness_compressor)
         _clip_wav(wav, log_clipping=log_clipping, stem_name=stem_name)
     else:
         assert wav.abs().max() < 1
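The compressor itself is just `torch.tanh` applied after the gain stage, which is why it avoids hard clipping: anything the gain pushes past ±1 is squashed back into (-1, 1), while small samples are left nearly unchanged. A quick numeric check:

```python
import torch

# tanh(x) ~ x for small x, but saturates smoothly instead of clipping for large x.
x = torch.tensor([-1.8, -0.9, -0.05, 0.05, 0.9, 1.8])
print(torch.tanh(x))
# tensor([-0.9468, -0.7163, -0.0500,  0.0500,  0.7163,  0.9468])
```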
audiocraft/models/musicgen.py CHANGED
@@ -88,6 +88,8 @@ class MusicGen:
         cache_dir = os.environ.get('MUSICGEN_ROOT', None)
         compression_model = load_compression_model(name, device=device, cache_dir=cache_dir)
         lm = load_lm_model(name, device=device, cache_dir=cache_dir)
+        if name == 'melody':
+            lm.condition_provider.conditioners['self_wav'].match_len_on_eval = True
 
         return MusicGen(name, compression_model, lm)
 
audiocraft/modules/conditioners.py CHANGED
@@ -9,6 +9,7 @@ from copy import deepcopy
 from dataclasses import dataclass, field
 from itertools import chain
 import logging
+import math
 import random
 import re
 import typing as tp
@@ -484,7 +485,7 @@ class ChromaStemConditioner(WaveformConditioner):
         **kwargs: Additional parameters for the chroma extractor.
     """
     def __init__(self, output_dim: int, sample_rate: int, n_chroma: int, radix2_exp: int,
-                 duration: float, match_len_on_eval: bool = False, eval_wavs: tp.Optional[str] = None,
+                 duration: float, match_len_on_eval: bool = True, eval_wavs: tp.Optional[str] = None,
                  n_eval_wavs: int = 0, device: tp.Union[torch.device, str] = "cpu", **kwargs):
         from demucs import pretrained
         super().__init__(dim=n_chroma, output_dim=output_dim, device=device)
@@ -535,7 +536,10 @@ class ChromaStemConditioner(WaveformConditioner):
             chroma = chroma[:, :self.chroma_len]
             logger.debug(f'chroma was truncated! ({t} -> {chroma.shape[1]})')
         elif t < self.chroma_len:
-            chroma = F.pad(chroma, (0, 0, 0, self.chroma_len - t))
+            # chroma = F.pad(chroma, (0, 0, 0, self.chroma_len - t))
+            n_repeat = int(math.ceil(self.chroma_len / t))
+            chroma = chroma.repeat(1, n_repeat, 1)
+            chroma = chroma[:, :self.chroma_len]
             logger.debug(f'chroma was zero-padded! ({t} -> {chroma.shape[1]})')
         return chroma
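This is the change behind "repeating the conditioning periodically" in the changelog: a chroma sequence shorter than the target length is now tiled and truncated instead of zero-padded (note the debug message still says "zero-padded" even though the chroma is now repeated). The toy sketch below uses made-up shapes; only the repeat-and-truncate logic mirrors the diff:

```python
import math
import torch

chroma_len = 10                       # target number of chroma frames (made-up value)
chroma = torch.randn(1, 4, 12)        # [batch, time, n_chroma] with only 4 frames

t = chroma.shape[1]
n_repeat = int(math.ceil(chroma_len / t))              # 3 copies give 12 >= 10 frames
chroma = chroma.repeat(1, n_repeat, 1)[:, :chroma_len]  # tile along time, then truncate
print(chroma.shape)                   # torch.Size([1, 10, 12])
```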
 
requirements.txt CHANGED
@@ -17,3 +17,4 @@ transformers
 xformers
 demucs
 librosa
+gradio