Efficient and Effective Object Detection and Recognition: from Convolutions to Transformers


Student Name: Tianxiao Zhang
Defense Date:
Location: Eaton Hall, Room 2001B
Chair: Bo Luo
Prasad Kulkarni
Fengjun Li
Cuncong Zhong
Guanghui Wang
Huazhen Fang

Abstract:

With the development of Convolutional Neural Networks (CNNs), computer vision has entered a new era, with significantly improved performance on tasks such as image classification, object detection, segmentation, and recognition. The introduction of Transformer architectures has further advanced the field by bringing the attention mechanism and a global perspective to computer vision. The inductive bias inherent in CNNs makes convolutional models particularly well-suited for processing images and videos, while the attention mechanism in Transformer models allows them to capture global relationships between tokens. Although Transformers often require more data and longer training than their convolutional counterparts, they can achieve comparable or even superior performance when the constraints of data availability and training time are relaxed.

In this work, we propose more efficient and effective CNNs and Transformers to improve the performance of object detection and recognition. (1) A novel approach is proposed for real-time detection and tracking of small golf balls by combining object detection with a Kalman filter. Several classical object detection models are implemented and compared in terms of detection precision and speed. (2) To address the domain shift problem in object detection, we employ generative adversarial networks (GANs) to generate images from different domains. The original RGB images are concatenated with the corresponding GAN-generated images to form a 6-channel representation, improving model performance across domains. (3) A dynamic strategy is proposed to improve label assignment in modern object detection models. Rather than relying on fixed or statistics-based adaptive thresholds, a dynamic paradigm is introduced to define positive and negative samples. This allows more high-quality samples to be selected as positives, narrowing the gap between classification and IoU scores and producing more accurate bounding boxes. (4) An efficient hybrid architecture combining Vision Transformers and convolutional layers is introduced for object recognition, particularly on small datasets. Lightweight depth-wise convolution modules bypass the entire Transformer block to capture local details that the Transformer backbone might overlook. The majority of the computation and parameters remain within the Transformer architecture, resulting in significantly improved performance with minimal overhead. (5) An innovative Multi-Overlapped-Head Self-Attention mechanism is introduced to enhance information exchange between heads in the Multi-Head Self-Attention of Vision Transformers. By overlapping adjacent heads during self-attention computation, information can flow between heads, leading to further improvements in visual recognition.
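For contribution (1), the announcement gives no implementation details; as a minimal NumPy sketch of the detection-plus-Kalman-filter idea, each ball center reported by the detector can be fed through a predict/update cycle. The constant-velocity state model, function names, and noise parameters here are illustrative assumptions, not the dissertation's actual design:

```python
import numpy as np

def make_cv_kalman(dt=1.0):
    # State [x, y, vx, vy]: constant-velocity motion, position-only measurements.
    F = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)
    H = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0]], dtype=float)
    return F, H

def kalman_step(state, P, z, F, H, q=1e-2, r=1.0):
    """One predict/update cycle: predict where the ball moves next,
    then correct with the detector's measured center z = (x, y)."""
    # Predict: propagate state and covariance through the motion model.
    state = F @ state
    P = F @ P @ F.T + q * np.eye(4)
    # Update: blend the prediction with the measurement via the Kalman gain.
    S = H @ P @ H.T + r * np.eye(2)
    K = P @ H.T @ np.linalg.inv(S)
    state = state + K @ (np.asarray(z, dtype=float) - H @ state)
    P = (np.eye(4) - K @ H) @ P
    return state, P
```

In a real tracker, the predicted position can also gate the detector's search region on the next frame and bridge frames where the small ball is missed entirely.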
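For contribution (5), one plausible reading of "overlapping adjacent heads" is that each head's channel slice is widened so that neighboring heads share channels, letting information flow between them. The NumPy sketch below illustrates only that idea; the identity Q/K/V projections and the averaging of overlapped outputs are simplifying assumptions, not the dissertation's actual formulation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def overlapped_head_attention(x, num_heads=4, overlap=4):
    """Self-attention over (n_tokens, d) where each head's channel slice is
    widened by `overlap` channels on each side, so adjacent heads overlap."""
    n, d = x.shape
    head_dim = d // num_heads
    outputs = np.zeros((n, d), dtype=float)
    counts = np.zeros(d)
    for h in range(num_heads):
        lo = max(0, h * head_dim - overlap)
        hi = min(d, (h + 1) * head_dim + overlap)
        q = k = v = x[:, lo:hi]              # identity projections for simplicity
        attn = softmax(q @ k.T / np.sqrt(hi - lo))
        outputs[:, lo:hi] += attn @ v        # overlapped channels are summed...
        counts[lo:hi] += 1
    return outputs / counts                  # ...then averaged where heads overlap
```

With `overlap=0` this reduces to standard disjoint-head self-attention, which is one way to see the mechanism as a drop-in generalization of Multi-Head Self-Attention.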

Degree: PhD
Degree Type: Dissertation Defense
Degree Field: Electrical Engineering