A text-guided vision model for enhanced recognition of small instances

Hyun-Ki JUNG

stillhk3@uos.ac.kr

Abstract

As drone-based object detection technology continues to evolve, demand is shifting from simply detecting objects to letting users identify specific targets; for example, a user can enter a target as a text prompt and the model detects the corresponding objects. To address this need, an efficient text-guided object recognition model has been developed to improve the recognition of small objects. Specifically, an improved version of the existing YOLO-World model is presented. The proposed method replaces the C2f layer in the YOLOv8 backbone with a C3k2 layer, allowing a more accurate representation of local features, especially for small objects or those with well-defined boundaries. The proposed architecture also improves processing speed and efficiency by optimizing parallel processing, while contributing to a more lightweight design. Comparative experiments on the VisDrone dataset show that the proposed model outperforms the original YOLO-World model, with precision increasing from 40.6% to 41.6%, recall from 30.8% to 31.0%, F1 score from 35.0% to 35.5%, and mAP@0.5 from 30.4% to 30.7%, confirming its improved accuracy. The model is also lighter, with the number of parameters reduced from 4.0 million to 3.8 million and FLOPs reduced from 15.7 billion to 15.2 billion. These results indicate that the proposed approach provides a practical and effective solution for accurate object detection in drone-based applications.
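The architectural change described above, swapping the C2f blocks of the YOLOv8 backbone for C3k2 blocks, can be illustrated with a short PyTorch sketch. The sketch below is modeled on the publicly available Ultralytics definitions of these blocks, not on the paper's code; the class names, channel splits, and kernel sizes are simplified assumptions.

import torch
import torch.nn as nn


class ConvBNSiLU(nn.Module):
    """Conv -> BatchNorm -> SiLU, the basic YOLO building block."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))


class Bottleneck(nn.Module):
    """Two 3x3 convs with a residual connection; channels are preserved."""
    def __init__(self, c, shortcut=True):
        super().__init__()
        self.cv1 = ConvBNSiLU(c, c, 3)
        self.cv2 = ConvBNSiLU(c, c, 3)
        self.add = shortcut

    def forward(self, x):
        y = self.cv2(self.cv1(x))
        return x + y if self.add else y


class C3k(nn.Module):
    """Small CSP block: two parallel 1x1 projections, bottlenecks on one path."""
    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        c_ = c_out // 2
        self.cv1 = ConvBNSiLU(c_in, c_, 1)
        self.cv2 = ConvBNSiLU(c_in, c_, 1)
        self.cv3 = ConvBNSiLU(2 * c_, c_out, 1)
        self.m = nn.Sequential(*(Bottleneck(c_) for _ in range(n)))

    def forward(self, x):
        return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), 1))


class C2f(nn.Module):
    """YOLOv8 C2f: split features, run n bottlenecks, concatenate all branches."""
    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        self.c = c_out // 2
        self.cv1 = ConvBNSiLU(c_in, 2 * self.c, 1)
        self.cv2 = ConvBNSiLU((2 + n) * self.c, c_out, 1)
        self.m = nn.ModuleList(Bottleneck(self.c) for _ in range(n))

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, 1))
        y.extend(m(y[-1]) for m in self.m)  # each stage feeds the next branch
        return self.cv2(torch.cat(y, 1))


class C3k2(C2f):
    """C3k2: a C2f whose inner bottlenecks are replaced by small CSP (C3k) blocks."""
    def __init__(self, c_in, c_out, n=1, c3k=True):
        super().__init__(c_in, c_out, n)
        if c3k:
            self.m = nn.ModuleList(C3k(self.c, self.c) for _ in range(n))


if __name__ == "__main__":
    x = torch.randn(1, 64, 80, 80)
    print(C2f(64, 64, n=2)(x).shape)   # torch.Size([1, 64, 80, 80])
    print(C3k2(64, 64, n=2)(x).shape)  # torch.Size([1, 64, 80, 80])

Under these assumptions, the swap is a drop-in replacement: both blocks preserve the input shape, while C3k2 nests a small CSP structure inside each branch, which is one plausible way to deepen local feature extraction without increasing the channel budget.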

Keywords:

object detection, image processing, artificial intelligence

Article Details

JUNG, H.-K. (2026). A text-guided vision model for enhanced recognition of small instances. Applied Computer Science, 22(1), 35–46. https://doi.org/10.35784/acs_7850