Deep learning architectures for multiclass clothing recognition as the semantic core of automated virtual try-on systems

Roman Chekhmestruk

chekhroma@gmail.com

https://orcid.org/0000-0002-5362-8796

Olena Voitsekhovska

vojcexovska.o.v@vntu.edu.ua

https://orcid.org/0000-0001-8755-1574

Svitlana Kyrylashchuk

kyrylashchuk@vntu.edu.ua

https://orcid.org/0000-0002-8972-3541

Abstract

This article examines and substantiates the choice of deep learning architectures for multiclass clothing classification in virtual try-on (VTO) systems. ResNet-50, EfficientNet-B4, and Vision Transformer (ViT-B/16) were systematically compared on the DeepFashion2 and ModaNet datasets. ViT-B/16 achieved the highest Top-1 accuracy, 92.4% on DeepFashion2 and 88.9% on ModaNet, and showed the smallest average cross-dataset accuracy drop among the evaluated models, 3.9 percentage points. Preliminary segmentation with U2-Net produced a statistically significant improvement in macro-F1 for all architectures (p < 0.001), with an average gain of 3.2 percentage points, and reduced the studio-to-street domain gap from 11 to 6 percentage points. EfficientNet-B4 offered the best accuracy-to-latency trade-off, reaching 87% Top-1 accuracy at 60 FPS on consumer hardware (RTX 3060), whereas ViT-B/16 required optimization to sustain 45 FPS. The recommended strategy for industrial VTO systems combines U2-Net segmentation with architecture selection based on the capabilities of the target platform, balancing visual fidelity and computational efficiency.
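
The macro-F1 metric reported above is the unweighted mean of per-class F1 scores, so rare clothing categories count as much as frequent ones. The following is a minimal sketch of that calculation in plain Python, not the authors' evaluation code; the function name `macro_f1` and the toy labels are hypothetical and chosen only for illustration.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class received a false positive
            fn[truth] += 1   # true class received a false negative
    scores = []
    for c in classes:
        denom = 2 * tp[c] + fp[c] + fn[c]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels (assumption: three garment classes, six images).
y_true = ["shirt", "shirt", "dress", "dress", "skirt", "skirt"]
y_pred = ["shirt", "dress", "dress", "dress", "skirt", "shirt"]
score = macro_f1(y_true, y_pred)
print(f"macro-F1 = {score:.4f}")
```

Because every class contributes equally to the mean, a gain in macro-F1 reflects improvement across categories rather than only on the most common ones.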

Keywords:

VTO-systems, segmentation, CNN, DeepFashion, vision transformer, online shopping

Article Details

Chekhmestruk, R., Voitsekhovska, O., & Kyrylashchuk, S. (2026). Deep learning architectures for multiclass clothing recognition as the semantic core of automated virtual try-on systems. Informatyka, Automatyka, Pomiary W Gospodarce I Ochronie Środowiska, 16(2), 162–172. https://doi.org/10.35784/iapgos.7957