Bi-SAN-CAP: Bi-Directional Self-Attention for Image Captioning

Md Zakir Hossain; Ferdous Sohel; Mohd Fairuz Shiratuddin; Hamid Laga; Mohammed Bennamoun

doi:10.1109/DICTA47822.2019.8946003

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

10 Citations (Scopus)

Abstract

In a typical image captioning pipeline, a Convolutional Neural Network (CNN) is used as the image encoder and a Long Short-Term Memory (LSTM) network as the language decoder. LSTMs with an attention mechanism have shown remarkable performance on sequential data tasks, including image captioning, and can retain long-range dependencies in sequential data. However, LSTM computations are hard to parallelize because of their inherently sequential nature. To address this issue, recent works have shown the benefits of self-attention, which is highly parallelizable without requiring any temporal dependencies. However, existing techniques apply attention in only one direction to compute the context of the words. We propose an attention mechanism called Bi-directional Self-Attention (Bi-SAN) for image captioning. It computes attention in both the forward and backward directions and achieves performance comparable to state-of-the-art methods.
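
The idea of computing attention in both directions can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes plain scaled dot-product self-attention with no learned projections, a forward pass in which each position attends to itself and earlier positions, a backward pass in which it attends to itself and later positions, and concatenation of the two outputs. The function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax; entries equal to -inf receive weight 0."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def directional_self_attention(x, direction):
    """Masked scaled dot-product self-attention over a list of vectors.

    direction="forward":  position i attends to positions j <= i.
    direction="backward": position i attends to positions j >= i.
    Queries, keys and values are the inputs themselves (an assumption
    made to keep the sketch short; real models use learned projections).
    """
    n, d = len(x), len(x[0])
    out = []
    for i in range(n):
        scores = []
        for j in range(n):
            allowed = j <= i if direction == "forward" else j >= i
            if allowed:
                dot = sum(a * b for a, b in zip(x[i], x[j]))
                scores.append(dot / math.sqrt(d))
            else:
                scores.append(float("-inf"))  # masked-out position
        weights = softmax(scores)
        out.append(
            [sum(w * x[j][k] for j, w in enumerate(weights)) for k in range(d)]
        )
    return out


def bidirectional_self_attention(x):
    """Concatenate forward and backward context for every position.

    Concatenation is an assumed way to combine the two directions.
    """
    fwd = directional_self_attention(x, "forward")
    bwd = directional_self_attention(x, "backward")
    return [f + b for f, b in zip(fwd, bwd)]


# Three toy "word" vectors of dimension 2.
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context = bidirectional_self_attention(words)
```

Each row of `context` has twice the input dimension: the first half summarizes the word and everything before it, the second half the word and everything after it. Every position is computed independently of the others, which is the property that makes self-attention easy to parallelize.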

Original language	English
Title of host publication	2019 Digital Image Computing
Subtitle of host publication	Techniques and Applications, DICTA 2019
Publisher	Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic)	9781728138572
DOIs	https://doi.org/10.1109/DICTA47822.2019.8946003
Publication status	Published - Dec 2019
Externally published	Yes
Event	2019 International Conference on Digital Image Computing: Techniques and Applications, DICTA 2019 - Perth, Australia; Duration: Dec 2 2019 → Dec 4 2019

Publication series

Name	2019 Digital Image Computing: Techniques and Applications, DICTA 2019

Conference

Conference	2019 International Conference on Digital Image Computing: Techniques and Applications, DICTA 2019
Country/Territory	Australia
City	Perth
Period	Dec 2 2019 → Dec 4 2019

Keywords

Bi-directional Self-Attention
Deep Learning
Image Captioning
Self-Attention

ASJC Scopus subject areas

Artificial Intelligence
Computer Science Applications
Computer Vision and Pattern Recognition
Signal Processing
Media Technology

Access to Document

10.1109/DICTA47822.2019.8946003

Cite this

Hossain, M. Z., Sohel, F., Shiratuddin, M. F., Laga, H., & Bennamoun, M. (2019). Bi-SAN-CAP: Bi-Directional Self-Attention for Image Captioning. In 2019 Digital Image Computing: Techniques and Applications, DICTA 2019, Article 8946003 (2019 Digital Image Computing: Techniques and Applications, DICTA 2019). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/DICTA47822.2019.8946003

Bi-SAN-CAP: Bi-Directional Self-Attention for Image Captioning. / Hossain, Md Zakir; Sohel, Ferdous; Shiratuddin, Mohd Fairuz et al.
2019 Digital Image Computing: Techniques and Applications, DICTA 2019. Institute of Electrical and Electronics Engineers Inc., 2019. 8946003 (2019 Digital Image Computing: Techniques and Applications, DICTA 2019).

Hossain, MZ, Sohel, F, Shiratuddin, MF, Laga, H & Bennamoun, M 2019, Bi-SAN-CAP: Bi-Directional Self-Attention for Image Captioning. in 2019 Digital Image Computing: Techniques and Applications, DICTA 2019, 8946003, 2019 Digital Image Computing: Techniques and Applications, DICTA 2019, Institute of Electrical and Electronics Engineers Inc., 2019 International Conference on Digital Image Computing: Techniques and Applications, DICTA 2019, Perth, Australia, 12/2/19. https://doi.org/10.1109/DICTA47822.2019.8946003

Hossain MZ, Sohel F, Shiratuddin MF, Laga H, Bennamoun M. Bi-SAN-CAP: Bi-Directional Self-Attention for Image Captioning. In 2019 Digital Image Computing: Techniques and Applications, DICTA 2019. Institute of Electrical and Electronics Engineers Inc. 2019. 8946003. (2019 Digital Image Computing: Techniques and Applications, DICTA 2019). doi: 10.1109/DICTA47822.2019.8946003

@inproceedings{9479a4fdad1e4abfa7b4d404fef2c606,

title = "Bi-SAN-CAP: Bi-Directional Self-Attention for Image Captioning",

abstract = "In a typical image captioning pipeline, a Convolutional Neural Network (CNN) is used as the image encoder and Long Short-Term Memory (LSTM) as the language decoder. LSTM with attention mechanism has shown remarkable performance on sequential data including image captioning. LSTM can retain long-range dependency of sequential data. However, it is hard to parallelize the computations of LSTM because of its inherent sequential characteristics. In order to address this issue, recent works have shown benefits in using self-attention, which is highly parallelizable without requiring any temporal dependencies. However, existing techniques apply attention only in one direction to compute the context of the words. We propose an attention mechanism called Bi-directional Self-Attention (Bi-SAN) for image captioning. It computes attention both in forward and backward directions. It achieves high performance comparable to state-of-the-art methods.",

keywords = "Bi-directional Self-Attention, Deep Learning, Image Captioning, Self-Attention",

author = "Hossain, {Md Zakir} and Ferdous Sohel and Shiratuddin, {Mohd Fairuz} and Hamid Laga and Mohammed Bennamoun",

note = "Publisher Copyright: {\textcopyright} 2019 IEEE.; 2019 International Conference on Digital Image Computing: Techniques and Applications, DICTA 2019 ; Conference date: 02-12-2019 Through 04-12-2019",

year = "2019",

month = dec,

doi = "10.1109/DICTA47822.2019.8946003",

language = "English",

series = "2019 Digital Image Computing: Techniques and Applications, DICTA 2019",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

booktitle = "2019 Digital Image Computing: Techniques and Applications, DICTA 2019",

}

TY - CONF

T1 - Bi-SAN-CAP: Bi-Directional Self-Attention for Image Captioning

T2 - 2019 International Conference on Digital Image Computing: Techniques and Applications, DICTA 2019

AU - Hossain, Md Zakir

AU - Sohel, Ferdous

AU - Shiratuddin, Mohd Fairuz

AU - Laga, Hamid

AU - Bennamoun, Mohammed

PY - 2019/12

Y1 - 2019/12

N2 - In a typical image captioning pipeline, a Convolutional Neural Network (CNN) is used as the image encoder and Long Short-Term Memory (LSTM) as the language decoder. LSTM with attention mechanism has shown remarkable performance on sequential data including image captioning. LSTM can retain long-range dependency of sequential data. However, it is hard to parallelize the computations of LSTM because of its inherent sequential characteristics. In order to address this issue, recent works have shown benefits in using self-attention, which is highly parallelizable without requiring any temporal dependencies. However, existing techniques apply attention only in one direction to compute the context of the words. We propose an attention mechanism called Bi-directional Self-Attention (Bi-SAN) for image captioning. It computes attention both in forward and backward directions. It achieves high performance comparable to state-of-the-art methods.

KW - Bi-directional Self-Attention

KW - Deep Learning

KW - Image Captioning

KW - Self-Attention

UR - http://www.scopus.com/inward/record.url?scp=85078486323&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85078486323&partnerID=8YFLogxK

U2 - 10.1109/DICTA47822.2019.8946003

DO - 10.1109/DICTA47822.2019.8946003

M3 - Conference contribution

AN - SCOPUS:85078486323

T3 - 2019 Digital Image Computing: Techniques and Applications, DICTA 2019

BT - 2019 Digital Image Computing: Techniques and Applications, DICTA 2019

PB - Institute of Electrical and Electronics Engineers Inc.

Y2 - 2 December 2019 through 4 December 2019

ER -
