Toward Efficient Deep Learning for Computer Vision Applications


Student Name: Xiangyu Chen
Defense Date:
Location: Nichols Hall, Room 246
Chair: Cuncong Zhong

Prasad Kulkarni

Bo Luo

Fengjun Li

Honguo Xu

Guanghui Wang

Abstract:

Deep learning leads the performance in many areas of computer vision. However, after a decade of research, it tends to require larger datasets and more complex models, leading to heightened resource consumption across all fronts. Regrettably, meeting these requirements proves challenging in many real-life scenarios. First, both data collection and labeling processes entail substantial labor and time investments. This challenge becomes especially pronounced in domains such as medicine, where identifying rare diseases demands meticulous data curation. Secondly, the large size of state-of-the-art models, such as ViT, Stable Diffusion, and ConvNext, hinders their deployment on resource-constrained platforms like mobile devices. Research indicates pervasive redundancies within current neural network structures, exacerbating the issue. Lastly, even with ample datasets and optimized models, the time required for training and inference remains prohibitive in certain contexts. Consequently, there is a burgeoning interest among researchers in exploring avenues for efficient artificial intelligence.

This study endeavors to delve into various facets of efficiency within computer vision, including data efficiency, model efficiency, as well as training and inference efficiency. The data efficiency is improved from the perspective of increasing information brought by given image inputs and reducing redundancies of RGB image formats. To achieve this, we propose to integrate both spatial and frequency representations to finetune the classifier. Additionally, we propose explicitly increasing the input information density in the frequency domain by deleting unimportant frequency channels. For model efficiency, we scrutinize the redundancies present in widely used vision transformers. Our investigation reveals that trivial attention in their attention modules covers useful non-trivial attention due to its large amount. We propose mitigating the impact of accumulated trivial attention weights. To increase training efficiency, we propose SuperLoRA, a generation of LoRA adapter, to fine-tune pretrained models with few iterations and extremely-low parameters. Finally, a model simplification pipeline is proposed to further reduce inference time on mobile devices. By addressing these challenges, we aim to advance the practicality and performance of computer vision systems in real-world applications.

Degree: PhD Dissertation Defense (EE)
Degree Type: PhD Dissertation Defense
Degree Field: Electrical Engineering