MNTN: Deep Modular N-shape Transformer Networks for Image Captioning

Conference: ISCTT 2021 - 6th International Conference on Information Science, Computer Technology and Transportation
26.11.2021 - 28.11.2021 in Xishuangbanna, China

Proceedings: ISCTT 2021

Pages: 6
Language: English
Type: PDF


Yang, You (National Center for Applied Mathematics in Chongqing, Chongqing, China)
Fang, Xiaolong; Deng, Yi; Wu, Chunyan (School of Computer and Information Science, Chongqing Normal University, Chongqing, China)

Image captioning requires a computer to automatically generate natural-language captions for an input image. Recent work on image captioning uses multiple features as model inputs to improve performance. Nevertheless, these features have not been sufficiently utilized. In this paper, we introduce a Modular N-shape Transformer (MNT), composed of two basic attention transformer units, to fully exploit the high-order intra-interaction of a single feature and the high-order guided interaction of multiple features. Furthermore, we present a deep Modular N-shape Transformer Network (MNTN) that integrates MNT into the image encoder of an image captioning model in a novel way, not only to fully leverage the spatial and location information of the image but also to help the features better locate content in the image. Experiments show that MNTN outperforms most previously published methods and can express the semantic content of an image with high accuracy.
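The abstract does not specify the two attention units, so the following is only a generic sketch of the two kinds of interaction it names, not the authors' MNT. It assumes plain scaled dot-product attention without learned projections, residual connections, or normalization. The function names (`self_attention_unit`, `guided_attention_unit`) and the inputs `feats_a` and `feats_b` are hypothetical and chosen for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


def self_attention_unit(x):
    """Intra-interaction (assumed form): one feature set attends to itself."""
    return attention(x, x, x)


def guided_attention_unit(x, guide):
    """Guided interaction (assumed form): x supplies the queries,
    while a second feature set supplies the keys and values."""
    return attention(x, guide, guide)


# Hypothetical toy inputs: two feature sets with the same vector dimension.
feats_a = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
feats_b = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]

intra = self_attention_unit(feats_a)              # 3 vectors of length 4
guided = guided_attention_unit(feats_a, feats_b)  # 3 vectors of length 4
```

In this sketch each output vector is a convex combination of the value vectors, so the guided unit re-expresses every element of `feats_a` in terms of `feats_b`. How the paper stacks and connects such units into the N-shape is not described in the abstract and is not modeled here.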