SoundCrafter: Bridging text and Sound with a diffusion model

Haitham ALHAJI

haithamtalhaji@yahoo.com

Alaa Yaseen TAQA

alaa.taha@uomosul.edu.iq

Abstract

Text-to-sound systems have recently attracted interest for their ability to synthesize common sounds from textual descriptions. However, previous research on sound generation has shown limited generation quality and high computational complexity. We present SoundCrafter, a text-to-sound generation framework based on diffusion models. Unlike previous methods, SoundCrafter operates in a compressed mel-spectrogram domain and is driven by semantic embeddings from the CLAP (Contrastive Language-Audio Pretraining) model. By learning the sound signal from these embeddings rather than explicitly modeling cross-modal interaction, SoundCrafter improves both generation quality and computational efficiency. In addition, we employ a curriculum learning technique that progressively increases spectrogram resolution to stabilize training and improve output fidelity. SoundCrafter thus distinguishes itself by coupling CLAP-conditioned semantic embeddings with a diffusion model operating in the compressed mel-spectrogram domain. On the AudioCaps dataset, it achieves superior text-to-sound synthesis with a Fréchet Distance (FD) of 23.45 and an Inception Score (IS) of 7.57, exceeding the performance of previous models while requiring significantly fewer computational resources and training on a single GPU.
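
The abstract outlines the core recipe: a diffusion model denoises compressed mel spectrograms while conditioned on CLAP text embeddings, with a curriculum that grows spectrogram resolution during training. The minimal PyTorch sketch below illustrates that recipe only in outline; the embedding and spectrogram dimensions, the FiLM-style conditioning, the standard DDPM noise-prediction objective, the three-stage resolution schedule, and the random tensors standing in for CLAP embeddings and AudioCaps mel spectrograms are all assumptions made for illustration, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical dimensions; the paper does not specify them here.
CLAP_DIM = 512   # CLAP text/audio embeddings are commonly 512-dimensional
MEL_BINS = 64    # mel-frequency bins of the compressed spectrogram
T_STEPS = 1000   # number of diffusion timesteps

class CondDenoiser(nn.Module):
    """Toy denoiser: predicts the noise added to a mel spectrogram,
    conditioned on a CLAP-style semantic embedding and the timestep."""
    def __init__(self, cond_dim=CLAP_DIM, hidden=256):
        super().__init__()
        self.cond_proj = nn.Linear(cond_dim, hidden)
        self.t_embed = nn.Embedding(T_STEPS, hidden)
        self.film = nn.Linear(hidden, 2)  # FiLM-style scale/shift from the condition
        self.net = nn.Sequential(
            nn.Conv2d(1, hidden // 4, 3, padding=1), nn.SiLU(),
            nn.Conv2d(hidden // 4, 1, 3, padding=1),
        )

    def forward(self, x_t, t, cond):
        h = self.cond_proj(cond) + self.t_embed(t)        # (B, hidden)
        scale, shift = self.film(h).chunk(2, dim=-1)       # (B, 1) each
        x = x_t * (1 + scale[:, :, None, None]) + shift[:, :, None, None]
        return self.net(x)                                 # predicted noise

# Linear beta schedule, as in standard DDPM training.
betas = torch.linspace(1e-4, 0.02, T_STEPS)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def ddpm_loss(model, mel, cond):
    """One DDPM training step: noise the spectrogram at a random timestep,
    then train the model to predict that noise."""
    b = mel.size(0)
    t = torch.randint(0, T_STEPS, (b,))
    noise = torch.randn_like(mel)
    a = alphas_cumprod[t].view(b, 1, 1, 1)
    x_t = a.sqrt() * mel + (1 - a).sqrt() * noise
    return F.mse_loss(model(x_t, t, cond), noise)

# Illustrative curriculum over spectrogram resolution (not the paper's schedule).
model = CondDenoiser()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
for frames in (64, 128, 256):                  # progressively longer spectrograms
    mel = torch.randn(4, 1, MEL_BINS, frames)  # stand-in for real mel batches
    cond = torch.randn(4, CLAP_DIM)            # stand-in for CLAP text embeddings
    loss = ddpm_loss(model, mel, cond)
    opt.zero_grad(); loss.backward(); opt.step()
    print(f"frames={frames}, loss={loss.item():.3f}")
```

In the paper's actual setting, the random stand-ins would be replaced by precomputed CLAP text embeddings of AudioCaps captions and the corresponding compressed mel spectrograms, and the denoiser would be a substantially larger network.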

Keywords:

text-to-sound generation, diffusion model, curriculum learning, mel-spectrogram tokens

Article Details

ALHAJI, H., & Yaseen TAQA, A. (2025). SoundCrafter: Bridging text and Sound with a diffusion model. Applied Computer Science, 21(4), 13–20. https://doi.org/10.35784/acs_7549