Advanced long-form generation

#12, opened by skroed

Is there a way to do advanced long-form generation, similar to what is described in https://github.com/suno-ai/bark/blob/main/notebooks/long_form_generation.ipynb, by passing a temperature and min_eos_p?

You can use the same logic as in the original notebook you linked:

import nltk  # we'll use this to split into sentences
import numpy as np
from transformers import BarkModel, AutoProcessor
import torch

nltk.download('punkt')
device = "cuda"

# Load model with optimization
model = BarkModel.from_pretrained("suno/bark", torch_dtype=torch.float16).to(device)
# BetterTransformer uses PyTorch's fast scaled-dot-product attention kernels
model = model.to_bettertransformer()

processor = AutoProcessor.from_pretrained("suno/bark")

sampling_rate = model.generation_config.sample_rate
silence = np.zeros(int(0.25 * sampling_rate))  # quarter second of silence
voice_preset = "v2/en_speaker_6"

BATCH_SIZE = 12

# split the long text into sentences
TEXT_TO_GENERATE = "Replace this with the long text you want to synthesize."
model_input = nltk.sent_tokenize(TEXT_TO_GENERATE)

pieces = []
for i in range(0, len(model_input), BATCH_SIZE):
    # range() already steps by BATCH_SIZE, so slice from i to i + BATCH_SIZE
    sentences = model_input[i : i + BATCH_SIZE]

    if sentences:
        inputs = processor(sentences, voice_preset=voice_preset)

        speech_output, output_lengths = model.generate(
            **inputs.to(device), return_output_lengths=True, min_eos_p=0.2
        )

        # trim each sample to its real length and move it off the GPU
        speech_output = [
            output[:length].cpu().numpy()
            for output, length in zip(speech_output, output_lengths)
        ]

        print(f"sentences {i} to {i + len(sentences) - 1} generated")
        # you can already play `speech_output` or wait for the whole generation
        pieces += [*speech_output, silence.copy()]

whole_output = np.concatenate(pieces)
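The notebook you linked also passes a temperature. With the transformers integration you can forward sampling arguments to generate as well; here is a minimal sketch of a drop-in replacement for the generate call above, assuming a transformers version that routes the coarse_temperature and fine_temperature kwargs to the Bark sub-models (treat the exact kwarg names as an assumption and check the Bark docs for your version):

speech_output, output_lengths = model.generate(
    **inputs.to(device),
    return_output_lengths=True,
    do_sample=True,          # enable sampling so the temperatures take effect
    coarse_temperature=0.8,  # assumed kwarg name for the coarse sub-model
    fine_temperature=0.4,    # assumed kwarg name for the fine sub-model
    min_eos_p=0.2,           # probability threshold at which a sentence may end
)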

This batched version, however, is faster than the original notebook because it generates a whole batch of sentences at once!
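Once you have whole_output you can listen to it in a notebook or write it to a WAV file, for example like this (the file name is just a placeholder):

from scipy.io import wavfile
from IPython.display import Audio

# play directly in a notebook
Audio(whole_output, rate=sampling_rate)

# or save to disk; scipy expects int16 or float32 samples, not float64
wavfile.write("bark_long_form.wav", sampling_rate, whole_output.astype(np.float32))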

Thanks, I will give this a try.
