This repo contains a model that generates music from images. The generated music is returned in ABC notation and can be played back in any ABC player. Note that you will need to adjust the BPM (the tempo) to make the music sound more logical and natural. The model is a fine-tuned combination of two pre-trained models: google/vit-base-patch16-224 as the encoder and sander-wood/text-to-music as the decoder. To use the model, you can write the following:

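from PIL import Image
import requests
import torch
from transformers import AutoTokenizer, VisionEncoderDecoderModel, ViTImageProcessor

# Run on the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")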

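def generate_music(model, image, tokenizer):
    # Relies on the module-level `feature_extractor` and `device` defined outside this function.
    # Convert the PIL image into the normalized pixel tensor the ViT encoder expects.
    pixel_values = feature_extractor(images=image, return_tensors="pt").pixel_values
    pixel_values = pixel_values.to(device)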

    generated_tokens = model.generate(
        pixel_values,
        max_length=300,
        num_beams=5,
        top_p=0.8,
        temperature=2.0,
        do_sample=True,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

    generated_music = tokenizer.decode(generated_tokens[0], skip_special_tokens=True)
    return generated_music

path = 'AnyaSchen/image2music'
fine_tuned_model = VisionEncoderDecoderModel.from_pretrained(path).to(device)
feature_extractor = ViTImageProcessor.from_pretrained(path)
tokenizer = AutoTokenizer.from_pretrained(path)

url = 'https://anandaindia.org/wp-content/uploads/2018/12/happy-man.jpg'
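# Convert to RGB so grayscale or RGBA inputs still give the three channels ViT expects.
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")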

generated_music = generate_music(fine_tuned_model, image, tokenizer)
print(generated_music)
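
The note above about correcting the BPM can be sketched as a small post-processing step. In ABC notation the tempo is set by the `Q:` header field (for example `Q:1/4=120` means 120 quarter-note beats per minute), and the `K:` field closes the tune header. The helper below is only an illustration, not part of this repo: the name `set_abc_tempo` and its parameters are hypothetical, and it assumes the generated text has one header field per line.

```python
def set_abc_tempo(abc_text: str, bpm: int, beat: str = "1/4") -> str:
    """Return abc_text with its Q: (tempo) header field set to `beat=bpm`.

    Hypothetical helper: assumes one ABC header field per line.
    """
    tempo_line = f"Q:{beat}={bpm}"
    lines = abc_text.splitlines()

    # Replace an existing tempo field if the tune already has one.
    for i, line in enumerate(lines):
        if line.startswith("Q:"):
            lines[i] = tempo_line
            return "\n".join(lines)

    # Otherwise insert the tempo just before K:, which ends the tune header.
    for i, line in enumerate(lines):
        if line.startswith("K:"):
            lines.insert(i, tempo_line)
            return "\n".join(lines)

    # No header fields found: put the tempo at the very top.
    return "\n".join([tempo_line] + lines)


# Example with a hand-written tune (not real model output).
sample_tune = "X:1\nL:1/8\nM:4/4\nK:D\nA2 B2 | c4 |"
print(set_abc_tempo(sample_tune, bpm=90))
```

With the usage code above, the same call would be `set_abc_tempo(generated_music, bpm=90)`. The value 90 is arbitrary; try a few tempos and keep the one that sounds most natural.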
