A text-guided vision model for enhanced recognition of small instances
Issue: Vol. 22 No. 1 (2026)
Author: Hyun-Ki JUNG (pp. 35-46)
Abstract
As drone-based object detection technology evolves, demand is shifting from simply detecting objects to allowing users to accurately identify specific targets, for example by entering a target as a text prompt so that only the desired objects are detected. To address this need, an efficient text-guided object recognition model is developed to improve the recognition of small objects. Specifically, an improved version of the existing YOLO-World model is presented. The proposed method replaces the C2f layer in the YOLOv8 backbone with a C3k2 layer, allowing a more accurate representation of local features, especially for small objects or objects with well-defined boundaries. The proposed architecture also improves processing speed and efficiency by optimizing parallel processing, while contributing to a more lightweight model design. Comparative experiments on the VisDrone dataset show that the proposed model outperforms the original YOLO-World model, with precision increasing from 40.6% to 41.6%, recall from 30.8% to 31.0%, F1 score from 35.0% to 35.5%, and mAP@0.5 from 30.4% to 30.7%, confirming its improved accuracy. The model is also lighter, with the number of parameters reduced from 4.0 million to 3.8 million and FLOPs reduced from 15.7 billion to 15.2 billion. These results indicate that the proposed approach provides a practical and effective solution for accurate object detection in drone-based applications.
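The metric improvements quoted above are internally consistent: F1 is the harmonic mean of precision and recall, which can be checked with a few lines of Python (an illustrative sanity check, not part of the published evaluation code):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Baseline YOLO-World on VisDrone (figures from the abstract).
baseline = f1_score(0.406, 0.308)  # ~0.350
# Proposed C3k2-based variant.
proposed = f1_score(0.416, 0.310)  # ~0.355

print(f"baseline F1: {baseline:.3f}, proposed F1: {proposed:.3f}")

# Relative reduction in model size: 4.0M -> 3.8M parameters.
param_reduction = (4.0 - 3.8) / 4.0  # 5% fewer parameters
```

Both computed values round to the F1 scores reported in the abstract (35.0% and 35.5%), and the parameter change corresponds to a 5% lighter model.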
References
Abu-Khadrah, A., Al-Qerem, A., Hassan, M. R., Ali, A. M., & Jarrah, M. (2025). Drone-assisted adaptive object detection and privacy-preserving surveillance in smart cities using whale-optimized deep reinforcement learning techniques. Scientific Reports, 15, 9931. https://doi.org/10.1038/s41598-025-94796-3
AISkyEye. (n.d.). AISkyEye – low altitude intelligent platform. http://aiskyeye.com
Bochkovskiy, A., Wang, C. Y., & Liao, H. Y. M. (2020). YOLOv4: Optimal speed and accuracy of object detection. ArXiv, abs/2004.10934. https://doi.org/10.48550/arXiv.2004.10934
Chen, J., Zhang, T., Zheng, W. S., & Wang, R. (2024). TagFog: Textual anchor guidance and fake outlier generation for visual out-of-distribution detection. AAAI Technical Track on Computer Vision I, 38(2), 1100-1109. https://doi.org/10.1609/aaai.v38i2.27871
Cheng, T., Song, L., Ge, Y., Liu, W., Wang, X., & Shan, Y. (2024). YOLO-World: Real-time open-vocabulary object detection. IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 16901-16911). IEEE. https://doi.org/10.1109/CVPR52733.2024.01599
Colpaert, A., Raes, M., & Vinogradov, E. (2022). Drone delivery: Reliable cellular UAV Communication using multi-operator diversity. ICC 2022-IEEE International Conference on Communications. (pp. 1-6). IEEE. https://doi.org/10.1109/ICC45855.2022.9839125
Girshick, R. (2015). Fast R-CNN. 2015 IEEE International Conference on Computer Vision (ICCV) (pp. 1440–1448). IEEE. https://doi.org/10.1109/ICCV.2015.169
Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. IEEE Conference on Computer Vision and Pattern Recognition (pp. 580–587). IEEE. https://doi.org/10.1109/CVPR.2014.81
Hasan, M. J., Nalwan, A., Ong, K. L., Jahani, H., Boo, Y. L., Nguyen, K. C., & Hasan, M. (2024). GroundingCarDD: text-guided multimodal phrase grounding for car damage detection. IEEE Access, 12, 179464-179477. https://doi.org/10.1109/ACCESS.2024.3506563
He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask R-CNN. IEEE International Conference on Computer Vision (pp. 2961–2969). IEEE. https://doi.org/10.1109/ICCV.2017.322
Huang, P. H., Lee, H. H., Chen, H. T., & Liu, T. L. (2021). Text-guided graph neural networks for referring 3d instance segmentation. AAAI Technical Track on Computer Vision I, 35(2), 1610-1618. https://doi.org/10.1609/aaai.v35i2.16253
Jocher, G. (2022, November 22). Ultralytics YOLOv5. Retrieved June 18, 2025 from https://github.com/ultralytics/yolov5
Jocher, G. (2023, November 12). Explore Ultralytics YOLOv8. https://docs.ultralytics.com/models/yolov8
Jocher, G. (2024). Ultralytics YOLO11. Retrieved September 30, 2024, from https://docs.ultralytics.com/ko/models/yolo11
Jung, H. K. (2025). YOLO-Drone: An efficient object detection approach using the ghosthead network for drone images. Journal of Information Systems Engineering and Management, 10(26s), 2468-4376. https://doi.org/10.52783/jisem.v10i26s.4216
Li, C., Li, L., Jiang, H., Weng, K., Geng, Y., Li, L., Ke, Z., Li, Q., Cheng, M., Nie, W., Li, Y., Zhang, B., Liang, Y., Zhou, L., Xu, X., Chu, X., Wei, X., & Wei, X. (2022). YOLOv6: A single-stage object detection framework for industrial applications. ArXiv, abs/2209.02976. https://doi.org/10.48550/ARXIV.2209.02976
Liang, M., Su, J. C., Schulter, S., Garg, S., Zhao, S., Wu, Y., & Chandraker, M. (2024). AIDE: An automatic data engine for object detection in autonomous driving. IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 14695-14706). IEEE. https://doi.org/10.1109/CVPR52733.2024.01392
Lin, T. Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal loss for dense object detection. 2017 IEEE International Conference on Computer Vision (ICCV) (pp. 2999-3007). IEEE. https://doi.org/10.1109/ICCV.2017.324
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., & Berg, A. C. (2016). SSD: Single shot multiBox detector. In B. Leibe, J. Matas, N. Sebe, & M. Welling (Eds), Computer Vision – ECCV 2016 (Vol. 9905, pp. 21–37). Springer International Publishing. https://doi.org/10.1007/978-3-319-46448-0_2
Niu, Y., Lin, C., Jiang, X., & Qu, Z. (2025). VSTDet: A lightweight small object detection network inspired by the ventral visual pathway. Applied Soft Computing, 171, 112775. https://doi.org/10.1016/j.asoc.2025.112775
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning transferable visual models from natural language supervision. ArXiv, abs/2103.00020. https://doi.org/10.48550/arXiv.2103.00020
Redmon, J., & Farhadi, A. (2018). YOLOv3: An incremental improvement. ArXiv, abs/1804.02767. https://doi.org/10.48550/arXiv.1804.02767
Redmon, J., & Farhadi, A. (2017). YOLO9000: Better, faster, stronger. IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 6517-6525). IEEE. https://doi.org/10.1109/CVPR.2017.690
Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified real-time object detection. IEEE/CVF Conference on computer vision and pattern recognition (pp. 779–788). IEEE. https://doi.org/10.1109/CVPR.2016.91
Ren, S., He, K., Girshick, R., & Sun, J. (2016). Faster R-CNN: Towards real-time object detection with region proposal networks. ArXiv, abs/1506.01497. https://doi.org/10.48550/arXiv.1506.01497
Shah, I. A., Jhanjhi, N. Z., & Ujjan, R. M. (2024). Use of AI Applications for the Drone Industry. In I. Shah & N. Jhanjhi (Eds.), Cybersecurity Issues and Challenges in the Drone Industry (pp. 27-41). IGI Global Scientific Publishing. https://doi.org/10.4018/979-8-3693-0774-8.ch002
Shen, R., Inoue, N., & Shinoda, K. (2023). Text-guided object detector for multi-modal video question answering. 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) (pp. 1032-1042). https://doi.org/10.1109/WACV56688.2023.00109
Song, Y., Chen, Z., Yang, H., & Liao, J. (2025). GS-LinYOLOv10: A drone-based model for real-time construction site safety monitoring. Alexandria Engineering Journal, 120, 62-73. https://doi.org/10.1016/j.aej.2025.01.021
Tao, S., Shengqi, Y., Haiying, L., Jason, G., Lixia, D., & Lida, L. (2025). MIS-YOLOv8: An algorithm for detecting small objects in UAV aerial photography based on YOLOv8. IEEE Transactions on Instrumentation and Measurement, 74, 5020212. https://doi.org/10.1109/TIM.2025.3551917
Vuong, T., Chang, M., Palaparthi, M., Howell, L. G., Bonti, A., Abdelrazek, M., & Nguyen, D. T. (2025). An empirical study of automatic wildlife detection using drone-derived imagery and object detection. Multimedia Tools and Applications, 84, 24487–24514. https://doi.org/10.1007/s11042-024-20522-2
Wang, A., Chen, H., Liu, L., Chen, K., Lin, Z., Han, J., & Ding, G. (2024). YOLOv10: Real-time end-to-end object detection. ArXiv, abs/2405.14458. https://doi.org/10.48550/arXiv.2405.14458
Wang, C. Y., Bochkovskiy, A., & Liao, H. Y. M. (2023). YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. IEEE/CVF Conference on computer vision and pattern recognition (pp. 7464–7475). IEEE. https://doi.org/10.1109/CVPR52729.2023.00721
Wang, C. Y., Liao, H. Y. M., Wu, Y. H., Chen, P. Y., Hsieh, J. W., & Yeh, I. H. (2020). CSPNet: A new backbone that can enhance learning capability of CNN. IEEE/CVF Conference on computer vision and pattern recognition workshops (pp. 390–391). IEEE. https://doi.org/10.1109/CVPRW50498.2020.00203
Wang, C.-Y., Yeh, I.-H., & Mark Liao, H.-Y. (2025). YOLOv9: Learning what you want to learn using programmable gradient information. In A. Leonardis, E. Ricci, S. Roth, O. Russakovsky, T. Sattler, & G. Varol (Eds), Computer Vision – ECCV 2024 (Vol. 15089, pp. 1–21). Springer Nature Switzerland. https://doi.org/10.1007/978-3-031-72751-1_1
Wei, G., Yuan, X., Liu, Y., Shang, Z., Yao, K., Li, C., & Xiao, R. (2024). OVA-Det: Open vocabulary aerial object detection with image-text collaboration. ArXiv, abs/2408.12246v2.
Xie, J., & Zheng, S. (2022). Zero-shot object detection through vision-language embedding alignment. IEEE international conference on data mining workshops (pp. 1-15). IEEE. https://doi.org/10.1109/ICDMW58026.2022.00121
Xu, L., Zhao, Y., Zhai, Y., Huang, L., & Ruan, C. (2024). Small object detection in UAV images based on YOLOv8n. International Journal of Computational Intelligence Systems, 17, 223. https://doi.org/10.1007/s44196-024-00632-3
Yang, C., Cao, Y., & Lu, X. (2024). Towards better small object detection in UAV scenes: Aggregating more object-oriented information. Pattern Recognition Letters, 182, 24-30. https://doi.org/10.1016/j.patrec.2024.04.002
Yi, X., Xu, H., Zhang, H., Tang, L., & Ma, J. (2024). Text-IF: Leveraging semantic text guidance for degradation-aware and interactive image fusion. IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 27026-27035). IEEE. https://doi.org/10.1109/CVPR52733.2024.02552
Yuan, Y., Wu, Y., Zhao, L., Liu, Y., & Pang, Y. (2025). TLSH-MOT: Drone-view video multiple object tracking via transformer-based locally sensitive hash. IEEE Transactions on Geoscience and Remote Sensing, 63, 1-16. https://doi.org/10.1109/TGRS.2025.3545081
Zhang, J., Yang, X., He, W., Ren, J., Zhang, Q., & Zhao, Y. (2024). Scale optimization using evolutionary reinforcement learning for object detection on drone imagery. AAAI Technical Track on Application Domains, 38(1), 410-418. https://doi.org/10.1609/aaai.v38i1.27795
Zhang, S., Wen, L., Bian, X., Lei, Z., & Li, S. Z. (2018). Single-shot refinement neural network for object detection. IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 4203–4212). IEEE. https://doi.org/10.1109/CVPR.2018.00442
License

This work is licensed under a Creative Commons Attribution 4.0 International License.
All articles published in Applied Computer Science are open-access and distributed under the terms of the Creative Commons Attribution 4.0 International License.
