K4F-Net: Lightweight multi-view speech emotion recognition with Kronecker convolution and cross-language robustness
Issue: Vol. 21 No. 4 (2025)
- Real-time detection of seat belt usage in overhead traffic surveillance using YOLOv7. Catur Edi WIDODO, Kusworo ADI, Priyono PRIYONO, Aji SETIAWAN (pp. 1-12)
- SoundCrafter: Bridging text and sound with a diffusion model. Haitham ALHAJI, Alaa Yaseen TAQA (pp. 13-20)
- Application of encoder-based motion analysis and machine learning for knee osteoarthritis detection: A pilot study. Robert KARPIŃSKI, Arkadiusz SYTA (pp. 21-31)
- SSAtt-SolNet: An efficient model for dusty solar panel classification with Sparse Shuffle and Attention mechanisms. An CONG TRAN, Nghi CONG TRAN (pp. 32-46)
- IoT-driven environmental optimization for hydroponic lettuce: A data-centric approach to smart agriculture. Okky Putra BARUS, Ade MAULANA, Pujianto YUGOPUSPITO, Achmad Nizar HIDAYANTO, Winar Joko ALEXANDER (pp. 47-58)
- Computer-aided system with machine learning components for generating medical recommendations for type 1 diabetes patients. Tomasz NOWICKI (pp. 59-75)
- Interpretable VAE-based predictive modeling for enhanced complex industrial systems dependability in developing countries. Richard NASSO TOUMBA, Maxime MOAMISSOAL SAMUEL, Achille EBOKE, Wangkaké TAIWE, Timothée KOMBE (pp. 76-97)
- Measuring comparative eco-efficiency in the Eurasian Economic Union using MaxDEA X 12.2 software. Bella GABRIELYAN, Narek KESOYAN, Armen GHAZARYAN, Argam ARTASHYAN (pp. 98-109)
- K4F-Net: Lightweight multi-view speech emotion recognition with Kronecker convolution and cross-language robustness. Paweł POWROŹNIK, Maria SKUBLEWSKA-PASZKOWSKA (pp. 110-126)
- The modelling of NiTi shape memory alloy functional properties by machine learning methods. Volodymyr HUTSAYLYUK, Vladyslav DEMCHYK, Oleh YASNIY, Nadiia LUTSYK, Andrii FIIALKA (pp. 127-135)
- Application of machine learning algorithms for forecasting labour demand in the metallurgical industry of the East Kazakhstan region. Oxana DENISSOVA, Aman ISMUKHAMEDOV, Zhadyra KONURBAYEVA, Saule RAKHMETULLINA, Yelena SAMUSSENKO, Monika KULISZ (pp. 136-158)
- Evaluating the impact of residual learning and feature fusion on soil moisture prediction accuracy. Pascal YAMAKILI, Mrindoko Rashid NICHOLAUS, Kenedy Aliila GREYSON (pp. 159-168)
Main Article Content
Authors: Paweł POWROŹNIK, Maria SKUBLEWSKA-PASZKOWSKA
Abstract
Speech emotion recognition has been gaining importance for years, yet most existing models rely on a single signal representation or on conventional convolutional layers with a large number of parameters. In this study, we propose a compact multi-representation architecture that combines four image representations of the speech signal: a spectrogram, MFCC features, a wavelet scalogram, and fuzzy transform maps. We also apply Kronecker convolution for efficient feature extraction with an extended receptive field. A further novelty is cross-fusion, a mechanism that models interactions between branches without a significant increase in complexity. The core of the network is complemented by a transformer-based block and language-independent adversarial learning. The model is evaluated in four leave-one-language-out cross-lingual tests covering four corpora in four languages: English, German, Polish, and Danish. In each test, it is trained on three languages and tested on the fourth, achieving a weighted accuracy of 96.3%. In addition, the influence of selected activation functions on classification quality is investigated. Ablation analysis shows that removing the Kronecker convolution reduces performance by 5.6%, and removing the fuzzy transform representation reduces it by 4.7%. These results indicate that the combination of Kronecker convolution, multi-channel fusion, and adversarial learning is a promising direction for building universal, language-independent emotion recognition systems.
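The abstract names Kronecker convolution as the key efficiency mechanism. Below is a minimal PyTorch sketch of one common formulation of a Kronecker convolution layer (a small learnable kernel expanded by a fixed binary pattern matrix, in the spirit of the Kronecker-convolution literature, e.g. Wu et al., 2019). It is not the authors' K4F-Net layer: the class name KroneckerConv2d, the pattern parameters r1 and r2, the channel sizes, and the choice of PyTorch are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class KroneckerConv2d(nn.Module):
    """Sketch of a Kronecker convolution: a small learnable kernel is expanded
    by a fixed binary pattern, enlarging the receptive field at no extra
    parameter cost (hypothetical layer, not the paper's exact design)."""

    def __init__(self, in_ch, out_ch, kernel_size=3, r1=1, r2=2):
        super().__init__()
        # Learnable base kernel of shape (out_ch, in_ch, k, k).
        self.weight = nn.Parameter(
            torch.empty(out_ch, in_ch, kernel_size, kernel_size))
        nn.init.kaiming_normal_(self.weight)
        # Fixed pattern: an r1 x r1 block of ones inside an r2 x r2 zero matrix;
        # r1 = 1 reduces to an ordinary dilated convolution with rate r2.
        pattern = torch.zeros(r2, r2)
        pattern[:r1, :r1] = 1.0
        self.register_buffer("pattern", pattern)

    def forward(self, x):
        # The Kronecker product turns each k x k kernel into a sparse
        # (k * r2) x (k * r2) kernel; 'same' padding keeps the spatial size.
        expanded = torch.kron(self.weight, self.pattern)
        return F.conv2d(x, expanded, padding="same")


if __name__ == "__main__":
    # A batch of two single-channel 128 x 128 inputs (e.g. spectrogram images).
    x = torch.randn(2, 1, 128, 128)
    y = KroneckerConv2d(in_ch=1, out_ch=16)(x)
    print(y.shape)  # torch.Size([2, 16, 128, 128])
```

With r1 = 1 the expanded kernel places the learned weights on a stride-r2 grid, which is why the parameter count stays that of a plain k x k convolution while the receptive field grows by a factor of r2.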
License
This work is licensed under a Creative Commons Attribution 4.0 International License. All articles published in Applied Computer Science are open access and distributed under the terms of this license.