Image Captioning Based on Convolutional Neural Network and Transformer

Conference: CAIBDA 2022 - 2nd International Conference on Artificial Intelligence, Big Data and Algorithms
17.06.2022 - 19.06.2022 in Nanjing, China

Proceedings: CAIBDA 2022

Pages: 9 · Language: English · Type: PDF

Authors:
Haidarh, Mosa; Zhang, Shuang (College of Software, Northeastern University, Shenyang, Liaoning, China)

Abstract:
Image captioning refers to automatically describing images in words, a task that has attracted researchers from the fields of computer vision and NLP. In recent years, most work on image captioning has used the encoder-decoder framework, where the encoder extracts image feature vectors and the decoder takes these feature vectors to generate a description of the image. By combining a CNN and an LSTM, the encoder-decoder framework for image captioning has achieved significant progress; additionally, incorporating attention mechanisms into this framework has substantially improved the performance of captioning models. Furthermore, the Transformer, whose operation is built on attention, outperforms the LSTM in both quality and efficiency on NLP tasks. Based on these ideas, we propose a model built on a CNN and a Transformer that produces accurate image captions. In the proposed model, the Transformer-Encoder computes a new representation of the image features to help the Transformer-Decoder focus on the most relevant parts of the image when generating each new word. We also employ adaptive attention in the Transformer-Decoder to determine when and where the decoder uses the image information. Experiments were conducted on the combined Flickr30K and Flickr8K datasets together with 1K images collected by us, and the test results show that the approach is effective and valuable.
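To make the described pipeline concrete, the sketch below wires a CNN feature extractor into PyTorch's built-in Transformer encoder and decoder. This is a minimal sketch under stated assumptions, not the authors' implementation: the ResNet-50 backbone, all dimensions, the vocabulary size, and every name in the code are illustrative choices, and the paper's adaptive-attention gate in the decoder is omitted.

```python
# Minimal CNN + Transformer captioning sketch. All hyperparameters and the
# backbone choice are illustrative assumptions; the adaptive-attention gate
# described in the paper is not implemented here.
import torch
import torch.nn as nn
import torchvision.models as models

class CaptioningModel(nn.Module):
    def __init__(self, vocab_size=10000, d_model=512, nhead=8,
                 num_encoder_layers=3, num_decoder_layers=3, max_len=50):
        super().__init__()
        # CNN encoder: ResNet-50 without its pooling/classification head,
        # yielding a 7x7 grid of 2048-d region features for a 224x224 image.
        backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])
        self.proj = nn.Linear(2048, d_model)  # map CNN features to model width
        # Transformer-Encoder: re-encodes the region features so the decoder
        # can attend to the most relevant parts of the image.
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_encoder_layers)
        # Transformer-Decoder: generates the caption token by token.
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_decoder_layers)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, images, captions):
        # images: (B, 3, 224, 224); captions: (B, T) token ids
        feats = self.cnn(images)                  # (B, 2048, 7, 7)
        feats = feats.flatten(2).transpose(1, 2)  # (B, 49, 2048)
        memory = self.encoder(self.proj(feats))   # (B, 49, d_model)
        tgt = self.embed(captions) + self.pos[:, :captions.size(1)]
        # Causal mask so each position attends only to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(
            captions.size(1)).to(captions.device)
        hidden = self.decoder(tgt, memory, tgt_mask=mask)
        return self.out(hidden)                   # (B, T, vocab_size)

model = CaptioningModel()
images = torch.randn(2, 3, 224, 224)
captions = torch.randint(0, 10000, (2, 12))
logits = model(images, captions)  # (2, 12, 10000)
```

As written, the decoder is teacher-forced on ground-truth caption tokens; at inference time a caption would instead be generated one token at a time with greedy or beam search.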