Fine-grained detection method for special vehicles based on self-supervised vision models

Li Xiying; Wu Hao; Pan Huayan; Li Jin; Zhu Yiyang; Hu Weipeng

doi:10.11714/acta.snus.ZR20260045

Updated: 2026-05-15

- Acta Scientiarum Naturalium Universitatis Sunyatseni Pages: 1-11(2026)
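- Author affiliations:
  1. School of Intelligent Systems Engineering, Sun Yat-sen University, Shenzhen 518107, Guangdong, China
  2. Guangdong Provincial Key Laboratory of Intelligent Transportation Systems, Shenzhen 518107, Guangdong, China
  3. Science and Technology Informatization Corps, Guangdong Provincial Public Security Department, Guangzhou 510050, Guangdong, China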
- DOI: 10.11714/acta.snus.ZR20260045
  CLC: U491
- Received: 09 February 2026
  Revised: 26 March 2026
  Accepted: 01 April 2026
  Online First: 15 May 2026
Li Xiying, Wu Hao, Pan Huayan, et al. Fine-grained detection method for special vehicles based on self-supervised vision models[J/OL]. Acta Scientiarum Naturalium Universitatis Sunyatseni, 2026, 1-11. DOI: 10.11714/acta.snus.ZR20260045.

Abstract

This paper proposes a domain-adaptation detection framework based on self-supervised visual representations that combines general object detection with prototype-library matching. The method uses pre-trained DINOv3 features for matching and screening, so that target vehicles can be localized and distinguished with few labeled samples and without fine-tuning. First, a multi-view synthesis module is built as a data-augmentation front end; it generates multi-angle samples aligned with overhead surveillance scenes, compensating for the viewpoints that are missing in cross-domain settings. Second, an inter-class clustering and prototype matching method is designed: clustering algorithms mine the modes of the data, and a real-world prototype library covering multiple vehicle morphologies is built to address large intra-class variation. On this basis, a joint global and local representation is introduced, combining semantic and textural details from different network layers to achieve fine-grained discrimination of target vehicles. Experiments show that, under few-shot conditions, the method effectively overcomes the loss of detection performance caused by viewpoint domain shift. Compared with traditional approaches, it improves detection recall and significantly reduces false positives caused by non-target vehicles, confirming the effectiveness and robustness of the framework in special-vehicle monitoring scenarios.
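The clustering and prototype-matching idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The k-means procedure, the cosine-similarity score, the 0.8 rejection threshold, the class names, and every function name below are assumptions made for illustration. The toy 3-D vectors stand in for embeddings that would, in the paper's setting, come from a frozen DINOv3 backbone.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)


def cosine(a, b):
    """Cosine similarity of two unit-norm vectors, i.e. their dot product."""
    return sum(x * y for x, y in zip(a, b))


def kmeans_prototypes(features, k, iters=20, seed=0):
    """Cluster the unit-norm features of ONE class into k unit-norm centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(features, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for f in features:
            best = max(range(k), key=lambda j: cosine(f, centroids[j]))
            groups[best].append(f)
        updated = []
        for j, g in enumerate(groups):
            if g:
                mean = [sum(col) / len(g) for col in zip(*g)]
                updated.append(normalize(mean))
            else:
                updated.append(centroids[j])  # empty cluster keeps its old centroid
        centroids = updated
    return centroids


def build_library(support, k):
    """Map {class_name: [feature, ...]} to {class_name: [prototype, ...]}."""
    library = {}
    for name, feats in support.items():
        unit = [normalize(f) for f in feats]
        library[name] = kmeans_prototypes(unit, min(k, len(unit)))
    return library


def match(feature, library, threshold=0.8):
    """Return (class_name, score) for the best prototype, or (None, score) if rejected."""
    q = normalize(feature)
    name, score = max(
        ((n, cosine(q, p)) for n, protos in library.items() for p in protos),
        key=lambda t: t[1],
    )
    return (name, score) if score >= threshold else (None, score)


# Hypothetical few-shot support set: "fire_truck" has two distinct appearances.
support = {
    "fire_truck": [[1.0, 0.1, 0.0], [0.9, 0.0, 0.1], [0.1, 1.0, 0.0], [0.0, 0.9, 0.1]],
    "ambulance": [[0.0, 0.1, 1.0], [0.1, 0.0, 0.9]],
}
library = build_library(support, k=2)
print(match([0.95, 0.05, 0.0], library))  # close to one fire_truck prototype
print(match([1.0, 1.0, 1.0], library))    # far from every prototype, so rejected
```

Keeping several prototypes per class is what lets a class with visually different variants match on whichever variant is closest, while the threshold rejects detections that resemble none of them. That rejection step is the mechanism by which a scheme of this kind could suppress false positives from non-target vehicles.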
