A Multimodal Symphony: Integrating Taste and Sound through Generative AI
Abstract
In recent decades, neuroscientific and psychological research has traced direct relationships between taste and auditory perception. Building on this foundational work, this article explores multimodal generative models capable of converting taste information into music. We provide a brief review of the state of the art in this field, highlighting key findings and methodologies. We then present an experiment in which a fine-tuned version of a generative music model (MusicGEN) is used to generate music based on detailed taste descriptions provided for each musical piece. The results are promising: according to the participants' (n = 111) evaluations, the fine-tuned model produces music that more coherently reflects the input taste descriptions than the non-fine-tuned model. This study represents a significant step towards understanding and developing embodied interactions between AI, sound, and taste, opening new possibilities in the field of generative AI. We release our dataset, code, and pre-trained model at: https://osf.io/xs5jy/.
Community
Generative AI has been making waves in creative domains, from text and image generation to music composition. However, one sensory modality has remained largely unexplored in the realm of AI-driven creativity: taste. In A Multimodal Symphony: Integrating Taste and Sound through Generative AI, we investigate how AI can bridge the gap between taste and sound, generating music that embodies the essence of different flavors.
The Science Behind Taste-Sound Associations
Neuroscientific and psychological research has shown that certain auditory characteristics influence how we perceive taste. High-pitched sounds, for instance, are often linked to sweetness, while low-pitched, resonant tones can evoke bitterness. These crossmodal correspondences form the foundation for our study, where we fine-tuned a generative music model to produce compositions aligned with specific taste descriptions.
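As a toy illustration, such correspondences can be encoded as a simple lookup table used to assemble text prompts for a music model. The attribute lists below are illustrative choices drawn from the general crossmodal literature, not values taken from our dataset or training prompts:

```python
# Illustrative mapping of taste labels to auditory attributes reported in
# crossmodal correspondence research (the exact lists are assumptions).
CROSSMODAL_MAP = {
    "sweet":  ["high-pitched", "consonant", "soft", "legato"],
    "bitter": ["low-pitched", "resonant", "dark", "brassy"],
    "sour":   ["high-pitched", "dissonant", "fast", "sharp"],
    "salty":  ["staccato", "percussive", "mid-register"],
}

def taste_to_prompt(taste: str) -> str:
    """Turn a taste label into a text prompt for a music generation model."""
    attrs = ", ".join(CROSSMODAL_MAP[taste])
    return f"a {taste}-sounding piece of music: {attrs}"

print(taste_to_prompt("sweet"))
# -> "a sweet-sounding piece of music: high-pitched, consonant, soft, legato"
```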
Fine-Tuning MusicGEN for Taste-Based Composition
For our experiment, we fine-tuned MusicGEN, an open-source music generation model, on a dataset enriched with taste and emotional descriptors. Using the Taste & Affect Music Database, we trained the model to associate musical elements—such as tempo, timbre, and harmony—with specific taste profiles (sweet, sour, bitter, and salty). The goal was to determine whether this fine-tuned model could generate music that listeners perceive as more representative of the given taste prompts compared to its non-fine-tuned counterpart.
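The fine-tuning pipeline itself is beyond the scope of this post (see the released code for details), but the sketch below shows how taste-conditioned prompts can drive generation through audiocraft's public MusicGen API. The base checkpoint, prompt wording, and clip duration are illustrative assumptions, not our exact configuration:

```python
# Minimal sketch: generating taste-conditioned clips with audiocraft's
# MusicGen API. Swap in a fine-tuned checkpoint where available; the
# base model and prompt wording here are placeholders.
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

# Load a pretrained checkpoint (a fine-tuned one would be loaded the same way).
model = MusicGen.get_pretrained("facebook/musicgen-small")

# Short clips are typical for online listening studies.
model.set_generation_params(duration=10)  # seconds

# Taste-conditioned text prompts (illustrative wording, not our exact prompts).
prompts = [
    "a sweet, soft, high-pitched consonant melody",
    "a bitter, low-pitched, dissonant and resonant piece",
]

wavs = model.generate(prompts)  # tensor of shape (batch, channels, samples)

for i, wav in enumerate(wavs):
    # Writes e.g. taste_0.wav with loudness normalization.
    audio_write(f"taste_{i}", wav.cpu(), model.sample_rate, strategy="loudness")
```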
Evaluating the Generated Music
To validate our approach, we conducted an online survey with 111 participants, who listened to the generated audio clips and rated how coherently each clip matched its corresponding taste description. The results were promising: the fine-tuned model generated music that was significantly more aligned with the intended taste attributes, particularly for sweet, bitter, and sour prompts. Saltiness, however, proved more challenging to represent, indicating a need for further refinement of the dataset composition and model training.
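For readers curious about how such a comparison might look in code, the sketch below runs a one-sided Mann-Whitney U test on coherence ratings from the two models. The rating values are made-up stand-ins, and the paper's actual statistical analysis may use a different test:

```python
# Illustrative comparison of coherence ratings between the fine-tuned and
# base models. The arrays are hypothetical stand-ins, not the study's data.
import numpy as np
from scipy.stats import mannwhitneyu

# Hypothetical 7-point coherence ratings for one taste prompt.
finetuned_ratings = np.array([6, 5, 7, 6, 5, 6, 7, 4, 6, 5])
baseline_ratings  = np.array([4, 3, 5, 4, 2, 5, 4, 3, 4, 3])

# One-sided test: are fine-tuned ratings stochastically greater?
stat, p_value = mannwhitneyu(finetuned_ratings, baseline_ratings,
                             alternative="greater")
print(f"U = {stat:.1f}, p = {p_value:.4f}")
```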