Comparison of the potential between transformer and CNN in image classification

Conference: ICMLCA 2021 - 2nd International Conference on Machine Learning and Computer Application
17-19 December 2021 in Shenyang, China

Proceedings: ICMLCA 2021

Pages: 6
Language: English
Type: PDF


Authors:
Lu, Kangrui (Fuqua School of Business, Duke University, Durham, NC, USA)
Xu, Yuanrun (College of Computer Science, Sichuan University, Chengdu, Sichuan, China)
Yang, Yige (Department of Computer Science, University of Surrey, Guildford, Surrey, UK)

Abstract:
Convolutional Neural Network (CNN) based algorithms have long dominated image classification tasks, while Transformer-based methods have gained popularity and usage in recent years. To obtain a clear view and understanding of the two types of methods, this study compares the efficiency of the CNN-based Inception-ResNetV2 model and the Vision Transformer (ViT) on a butterfly dataset of 10,000 data points. For each method, we also compare performance internally across different dataset sizes. By examining the experimental results for both validation accuracy and training time, we conclude that the ViT model's accuracy is much more sensitive to large-scale datasets, and that ViT training requires a relatively higher expense and longer duration. Meanwhile, the ViT model displays a relatively stable loss throughout the training process, suggesting feasible industry-level applications and opportunities for further refinement.
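The comparison protocol the abstract describes (train each model on subsets of increasing size, record validation accuracy and training time per run) can be sketched as follows. This is a minimal illustration, not the authors' code: the model names are placeholders, and `train_and_evaluate` is a hypothetical stub where a real study would plug in e.g. a Keras or PyTorch implementation of Inception-ResNetV2 and ViT.

```python
import time

def train_and_evaluate(model_name, train_subset, val_set):
    """Hypothetical stub: train `model_name` on `train_subset` and
    return validation accuracy. Real code would run a full training
    loop over images and labels; here we only return a placeholder
    score so the protocol's structure is visible."""
    correct = sum(1 for _, label in val_set if label == 0)  # placeholder
    return correct / len(val_set)

def compare_models(dataset, val_set, subset_sizes, model_names):
    """Sketch of the study's protocol: for each model and each nested
    training-subset size, record (validation accuracy, wall-clock
    training time)."""
    results = {}
    for name in model_names:
        for n in subset_sizes:
            subset = dataset[:n]  # nested subsets of the full dataset
            start = time.perf_counter()
            acc = train_and_evaluate(name, subset, val_set)
            elapsed = time.perf_counter() - start
            results[(name, n)] = (acc, elapsed)
    return results
```

Recording wall-clock time alongside accuracy is what lets the study weigh ViT's accuracy gains on larger subsets against its higher training cost.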