Bi-SAN-CAP: Bi-Directional Self-Attention for Image Captioning

Md Zakir Hossain, Ferdous Sohel, Mohd Fairuz Shiratuddin, Hamid Laga, Mohammed Bennamoun

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

10 Citations (Scopus)


In a typical image captioning pipeline, a Convolutional Neural Network (CNN) is used as the image encoder and a Long Short-Term Memory (LSTM) network as the language decoder. LSTMs with an attention mechanism have shown remarkable performance on sequential data, including image captioning, and can retain long-range dependencies in such data. However, LSTM computations are hard to parallelize because they are inherently sequential. To address this issue, recent works have shown the benefits of self-attention, which is highly parallelizable because it has no temporal dependencies. However, existing techniques apply attention in only one direction to compute the context of the words. We propose an attention mechanism called Bi-directional Self-Attention (Bi-SAN) for image captioning. It computes attention in both the forward and backward directions, and it achieves performance comparable to state-of-the-art methods.
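The abstract does not give the exact Bi-SAN formulation, so the following is only a minimal sketch of what attention "in forward and backward directions" could look like. It assumes scaled dot-product self-attention, a forward mask (each word attends to itself and earlier words), a backward mask (each word attends to itself and later words), and concatenation of the two outputs. The function names, the masking scheme, and the concatenation step are illustrative assumptions, not the authors' implementation.

```python
import math
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax; entries equal to -inf get weight 0."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def masked_self_attention(x, w_q, w_k, w_v, allowed):
    """Scaled dot-product self-attention restricted by allowed(i, j)."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(k[0]))
    weights = []
    for i, q_i in enumerate(q):
        scores = [
            sum(a * b for a, b in zip(q_i, k_j)) / scale if allowed(i, j) else -math.inf
            for j, k_j in enumerate(k)
        ]
        weights.append(softmax(scores))
    return matmul(weights, v), weights


def bidirectional_self_attention(x, params_fw, params_bw):
    """Assumed scheme: run a forward-masked and a backward-masked pass,
    then concatenate the two context vectors for each word."""
    out_fw, a_fw = masked_self_attention(x, *params_fw, lambda i, j: j <= i)
    out_bw, a_bw = masked_self_attention(x, *params_bw, lambda i, j: j >= i)
    out = [f + b for f, b in zip(out_fw, out_bw)]  # per-row list concatenation
    return out, a_fw, a_bw


def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
n_words, d_model = 5, 4  # toy sizes chosen for illustration only
x = rand_matrix(n_words, d_model)  # stand-in for word embeddings
params_fw = [rand_matrix(d_model, d_model) for _ in range(3)]  # W_q, W_k, W_v
params_bw = [rand_matrix(d_model, d_model) for _ in range(3)]
out, a_fw, a_bw = bidirectional_self_attention(x, params_fw, params_bw)
print(len(out), len(out[0]))  # 5 8
```

Both passes are computed for all word positions at once, with no step depending on the result of a previous step, which is the parallelization advantage over an LSTM that the abstract describes.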

Original language: English
Title of host publication: 2019 Digital Image Computing
Subtitle of host publication: Techniques and Applications, DICTA 2019
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781728138572
Publication status: Published - Dec 2019
Event: 2019 International Conference on Digital Image Computing: Techniques and Applications, DICTA 2019 - Perth, Australia
Duration: Dec 2 2019 to Dec 4 2019

Publication series

Name: 2019 Digital Image Computing: Techniques and Applications, DICTA 2019


Conference: 2019 International Conference on Digital Image Computing: Techniques and Applications, DICTA 2019


Keywords

  • Bi-directional Self-Attention
  • Deep Learning
  • Image Captioning
  • Self-Attention

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computer Science Applications
  • Computer Vision and Pattern Recognition
  • Signal Processing
  • Media Technology

