Publications

Enhancing Image Caption Performance with Improved Visual Attention Mechanism

  • Authority: ICIC Express Letters
  • Category: Journal Publication

Image captioning analyzes images and translates them into text, a task that requires extensive data and often struggles to comprehend the diverse contents of images during text generation. This research enhances image captioning with a visual attention mechanism to improve image-to-text translation performance. We propose a neural network architecture comprising an encoder, a decoder, and beam search. The encoder uses either dual convolutional neural networks (Dual-CNN) or a single CNN to extract visual features, which are then passed to the decoder. The decoder employs long short-term memory (LSTM) to learn temporal and sequential patterns, converting visual features into output probabilities. The resulting outputs are then processed by the beam search algorithm to generate the best captions. Three experiments were conducted. First, single-CNN architectures (ResNet-101, EfficientNet-B0, and ResNeXt-101) were evaluated with visual attention mechanisms on the Flickr8K dataset using BLEU scores; ResNet-101 achieved the highest performance. Second, three Dual-CNNs combined with attention mechanisms were tested, with the ResNet-101 and EfficientNet-B0 combination outperforming the others. Third, early stopping was used to determine the optimal training epoch, revealing that the Dual-CNN with the visual attention mechanism yielded the best results. The proposed framework, tested on the Flickr8K dataset, achieved BLEU scores of 68.76%, 49.15%, 35.46%, and 24.71% across the evaluation scenarios, demonstrating superior performance compared to other approaches.
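To make the pipeline concrete, below is a minimal PyTorch-style sketch of the encoder-decoder idea described in the abstract: two CNN backbones (ResNet-101 and EfficientNet-B0, the paper's best Dual-CNN pair) extract spatial feature grids that are fused by concatenation, and an LSTM decoder with additive visual attention produces word logits for beam search. Class names, layer dimensions, and the concatenation-based fusion are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torchvision.models as models


class DualCNNEncoder(nn.Module):
    """Extract spatial features from two CNN backbones and fuse them.
    Sketch only: fusion by channel concatenation is an assumption."""

    def __init__(self):
        super().__init__()
        resnet = models.resnet101(weights=None)
        # Drop the average pool and classifier to keep the spatial grid.
        self.resnet = nn.Sequential(*list(resnet.children())[:-2])   # (B, 2048, 7, 7) for 224x224 input
        self.effnet = models.efficientnet_b0(weights=None).features  # (B, 1280, 7, 7)

    def forward(self, images):
        a = self.resnet(images)
        b = self.effnet(images)
        feats = torch.cat([a, b], dim=1)              # (B, 3328, 7, 7)
        B, C, H, W = feats.shape
        # Flatten to a sequence of H*W region vectors the decoder can attend over.
        return feats.view(B, C, H * W).permute(0, 2, 1)


class AttentionLSTMDecoder(nn.Module):
    """One decoding step: additive visual attention over the feature grid,
    then an LSTM cell that emits word logits (to be scored by beam search)."""

    def __init__(self, feat_dim=3328, embed_dim=256, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.att_feat = nn.Linear(feat_dim, hidden_dim)
        self.att_hid = nn.Linear(hidden_dim, hidden_dim)
        self.att_out = nn.Linear(hidden_dim, 1)
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def step(self, prev_word, feats, h, c):
        # Attention weights over the H*W spatial regions.
        e = self.att_out(torch.tanh(self.att_feat(feats) + self.att_hid(h).unsqueeze(1)))
        alpha = torch.softmax(e, dim=1)               # (B, H*W, 1)
        context = (alpha * feats).sum(dim=1)          # (B, feat_dim) attended visual context
        x = torch.cat([self.embed(prev_word), context], dim=1)
        h, c = self.lstm(x, (h, c))
        return self.fc(h), h, c                       # logits over the vocabulary
```

At inference time, beam search would call `step` repeatedly, keeping the top-k partial captions by cumulative log-probability until an end token is generated.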