SoundCrafter: Bridging text and sound with a diffusion model
Issue: Vol. 21 No. 4 (2025), pp. 13–20
Authors
Haitham ALHAJI, Alaa Yaseen TAQA
Abstract
Text-to-sound systems have recently attracted interest for their ability to synthesize common sounds from textual descriptions. However, previous work on sound generation has suffered from limited generation quality and high computational cost. We present SoundCrafter, a text-to-sound generation framework based on diffusion models. Unlike previous methods, SoundCrafter operates in a compressed mel-spectrogram domain and is driven by semantic embeddings derived from the CLAP (contrastive language-audio pretraining) model. SoundCrafter improves generation quality and computational efficiency by learning the sound signals without explicitly modeling cross-modal interaction. In addition, we employ a curriculum learning technique that progressively increases spectrogram resolution to stabilize training and improve output fidelity. SoundCrafter thus distinguishes itself by integrating CLAP-conditioned semantic embeddings with a diffusion model operating in the compressed mel-spectrogram domain. On the AudioCaps dataset, it achieves superior text-to-sound synthesis with a Fréchet Distance (FD) of 23.45 and an Inception Score (IS) of 7.57, exceeding the performance of previous models while requiring significantly fewer computational resources and training on a single GPU.
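To make the approach concrete, the following is a minimal, illustrative PyTorch sketch of the two ideas described above: a denoising-diffusion training step on mel spectrograms conditioned on a text embedding (standing in for a CLAP embedding), and a curriculum schedule that raises the spectrogram resolution over training. All module and function names, the noise schedule, and the resolution schedule are assumptions made for illustration; they are not taken from the SoundCrafter implementation.

# Illustrative sketch only: a toy text-conditioned diffusion training step on mel
# spectrograms, plus a resolution curriculum. Names and hyperparameters are assumed,
# not taken from the SoundCrafter paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionedDenoiser(nn.Module):
    """Toy stand-in for a U-Net denoiser: predicts the noise added to a mel
    spectrogram, conditioned on a text embedding via a FiLM-like additive bias
    (timestep embedding omitted for brevity)."""
    def __init__(self, cond_dim=512, hidden=64):
        super().__init__()
        self.cond_proj = nn.Linear(cond_dim, hidden)
        self.conv_in = nn.Conv2d(1, hidden, 3, padding=1)
        self.conv_out = nn.Conv2d(hidden, 1, 3, padding=1)

    def forward(self, noisy_mel, text_emb):
        h = self.conv_in(noisy_mel) + self.cond_proj(text_emb)[:, :, None, None]
        return self.conv_out(F.silu(h))

def diffusion_training_step(denoiser, mel, text_emb, n_steps=1000):
    """One DDPM-style step: sample a timestep, corrupt the mel with Gaussian noise,
    predict that noise from the corrupted input and the text condition, return MSE loss."""
    b = mel.size(0)
    t = torch.randint(0, n_steps, (b,), device=mel.device)
    betas = torch.linspace(1e-4, 0.02, n_steps, device=mel.device)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)[t].view(b, 1, 1, 1)
    noise = torch.randn_like(mel)
    noisy_mel = alpha_bar.sqrt() * mel + (1.0 - alpha_bar).sqrt() * noise
    return F.mse_loss(denoiser(noisy_mel, text_emb), noise)

def curriculum_resolution(epoch, schedule=((0, 32), (5, 64), (10, 128))):
    """Assumed curriculum: start at a coarse spectrogram resolution and increase it
    at fixed epochs; the schedule used in the paper may differ."""
    resolution = schedule[0][1]
    for start_epoch, res in schedule:
        if epoch >= start_epoch:
            resolution = res
    return resolution

if __name__ == "__main__":
    denoiser = ConditionedDenoiser()
    optimizer = torch.optim.Adam(denoiser.parameters(), lr=1e-4)
    for epoch in range(12):
        res = curriculum_resolution(epoch)
        mel = torch.randn(4, 1, res, res)    # placeholder mel-spectrogram batch
        text_emb = torch.randn(4, 512)       # placeholder CLAP-style text embeddings
        loss = diffusion_training_step(denoiser, mel, text_emb)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        print(f"epoch {epoch}: resolution {res}x{res}, loss {loss.item():.4f}")

In the actual system, the conditioning vector would come from a frozen CLAP text encoder and the denoiser would be a full U-Net with timestep embeddings; the sketch only conveys the conditioning and curriculum structure.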
Keywords:
References
Al Kateeb, Z. N., & Abdullah, D. B. (2024a). AdaBoost-powered cloud of things framework for low-latency, energy-efficient chronic kidney disease prediction. Transactions on Emerging Telecommunications Technologies, 35(6). https://doi.org/10.1002/ett.5007
Al Kateeb, Z. N., & Abdullah, D. B. (2024b). Unlocking the potential: Synergizing IoT, cloud computing, and big data for a bright future. Iraqi Journal for Computer Science and Mathematics, 5(3). https://doi.org/10.52866/ijcsm.2024.05.03.001
Barahona-Ríos, A., & Collins, T. (2022). SpecSinGAN: Sound effect variation synthesis using single-image GANs. ArXiv, abs/2110.07311. https://doi.org/10.48550/arXiv.2110.07311
Berahmand, K., Daneshfar, F., Salehi, E. S., Li, Y., & Xu, Y. (2024). Autoencoders and their applications in machine learning: A survey. Artificial Intelligence Review, 57, 28. https://doi.org/10.1007/s10462-023-10662-6
Chen, K., Du, X., Zhu, B., Ma, Z., Berg-Kirkpatrick, T., & Dubnov, S. (2022). HTS-AT: A hierarchical token-semantic audio transformer for sound classification and detection. ArXiv, abs/2202.00874. https://doi.org/10.48550/arXiv.2202.00874
Cherep, M., Singh, N., & Shand, J. (2024). Creative text-to-audio generation via synthesizer programming. ArXiv, abs/2405.18698. https://doi.org/10.48550/arXiv.2406.00294
Ghosal, D., Majumder, N., Mehrish, A., & Poria, S. (2023). Text-to-audio generation using instruction-tuned LLM and latent diffusion model. ArXiv, abs/2304.13731. https://doi.org/10.48550/arXiv.2304.13731
Guan, W., Wang, K., Zhou, W., Wang, Y., Deng, F., Wang, H., Li, L., Hong, Q., & Qin, Y. (2024). LAFMA: A latent flow matching model for text-to-audio generation. ArXiv, abs/2406.08203. https://doi.org/10.48550/arXiv.2406.08203
Hasoon, S. O., & Al Hashimi, M. M. (2022). Hybrid deep neural network and long short-term memory network for predicting of sunspot time series. International Journal of Mathematics and Computer Science, 17(3), 955–967.
Huang, J., Ren, Y., Huang, R., Yang, D., Ye, Z., Zhang, C., Liu, J., Yin, X., Ma, Z., & Zhao, Z. (2023). Make-An-Audio 2: Temporal-enhanced text-to-audio generation. ArXiv, abs/2305.18474. https://doi.org/10.48550/arXiv.2305.18474
Huang, R., Huang, J., Yang, D., Ren, Y., Liu, L., Li, M., Ye, Z., Liu, J., Yin, X., & Zhao, Z. (2023). Make-An-Audio: Text-to-audio generation with prompt-enhanced diffusion models. ArXiv, abs/2301.12661. https://doi.org/10.48550/arXiv.2301.12661
Issa, R. J., & Al Irhaym, Y. F. (2021). Audio source separation using supervised deep neural network. Journal of Physics: Conference Series, 1879, 022077. https://doi.org/10.1088/1742-6596/1879/2/022077
Karchkhadze, T., Kavaki, H. S., Izadi, M. R., Irvin, B., Kegler, M., Hertz, A., Zhang, S., & Stamenovic, M. (2024). Latent CLAP loss for better Foley sound synthesis. ArXiv, abs/2403.12182. https://doi.org/10.48550/arXiv.2403.12182
Kim, C. D., Kim, B., Lee, H., & Kim, G. (2019). AudioCaps: Generating captions for audios in the wild. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 119–132). Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1011
Koonce, B. (2021). Convolutional neural networks with Swift for TensorFlow. Apress.
Kreuk, F., Synnaeve, G., Polyak, A., Singer, U., Défossez, A., Copet, J., Parikh, D., Taigman, Y., & Adi, Y. (2022). AudioGen: Textually guided audio generation. ArXiv, abs/2209.15352. https://doi.org/10.48550/arXiv.2209.15352
Liu, H., Huang, R., Liu, Y., Cao, H., Wang, J., Cheng, X., Zheng, S., & Zhao, Z. (2024). AudioLCM: Text-to-audio generation with latent consistency models. ArXiv, abs/2406.00356. https://doi.org/10.48550/arXiv.2406.00356
Perez, E., Strub, F., de Vries, H., Dumoulin, V., & Courville, A. (2017). FiLM: Visual reasoning with a general conditioning layer. ArXiv, abs/1709.07871. https://doi.org/10.48550/arXiv.1709.07871
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning transferable visual models from natural language supervision. ArXiv, abs/2103.00020. https://doi.org/10.48550/arXiv.2103.00020
Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional networks for biomedical image segmentation. ArXiv, abs/1505.04597. https://doi.org/10.48550/arXiv.1505.04597
Talal, R., & Anas, H. (2025). Prediction of drug risks consumption by using artificial intelligence techniques. International Journal of Computing and Digital Systems, 17(1), 1–11.
Wu, Y., Chen, K., Zhang, T., Hui, Y., Nezhurina, M., Berg-Kirkpatrick, T., & Dubnov, S. (2022). Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. ArXiv, abs/2211.06687. https://doi.org/10.48550/arXiv.2211.06687
Yang, D., Yu, J., Wang, H., Wang, W., Weng, C., Zou, Y., & Yu, D. (2022). Diffsound: Discrete diffusion model for text-to-sound generation. ArXiv, abs/2207.09983. https://doi.org/10.48550/arXiv.2207.09983
Yuan, Y., Liu, H., Liu, X., Kang, X., Wu, P., Plumbley, M. D., & Wang, W. (2023). Text-driven Foley sound generation with latent diffusion model. ArXiv, abs/2306.10359. https://doi.org/10.48550/arXiv.2306.10359
Zhang, C., Zhang, C., Zheng, S., Zhang, M., Qamar, M., Bae, S. H., & Kweon, I. S. (2023). A survey on audio diffusion models: Text-to-speech synthesis and enhancement in generative AI. ArXiv, abs/2303.13336. https://doi.org/10.48550/arXiv.2303.13336
License

This work is licensed under a Creative Commons Attribution 4.0 International License.
All articles published in Applied Computer Science are open-access and distributed under the terms of the Creative Commons Attribution 4.0 International License.
