TY - JOUR
T1 - Deep convolutional neural networks-based Hardware–Software on-chip system for computer vision application
AU - Messaoud, Seifeddine
AU - Bouaafia, Soulef
AU - Maraoui, Amna
AU - Ammari, Ahmed Chiheb
AU - Khriji, Lazhar
AU - Machhout, Mohsen
N1 - Publisher Copyright:
© 2022 Elsevier Ltd
PY - 2022/3
Y1 - 2022/3
N2 - Embedded vision systems are well suited to high-performance, low-latency inspection tasks. As everyday life evolves, it becomes almost imperative to harness artificial intelligence (AI) in vision applications so that these systems become intelligent and able to make decisions close to or similar to those of humans. In this context, integrating AI into embedded systems poses many challenges, since its performance depends on the volume and quality of the data assimilated for learning and improvement. This is constrained by the energy-consumption and cost limits of FPGA-SoCs, which offer limited processing, memory, and communication capacity. Nevertheless, implementing AI algorithms on embedded systems can drastically reduce energy consumption and processing time, while also reducing the costs and risks associated with data transmission. Their efficiency and reliability, however, depend on the designed prototypes. In this scope, this work proposes two different designs for a Traffic Sign Recognition (TSR) application based on a convolutional neural network (CNN) model, followed by three implementations on the PYNQ-Z1. First, the CNN-based TSR application is implemented on the PYNQ-Z1 processor. Given its runtime of around 3.55 s, there is room for improvement by combining the programmable logic (PL) and the processing system (PS) in a hybrid architecture. Therefore, we propose a streaming architecture in which each CNN layer is accelerated by a dedicated hardware accelerator connected through a direct memory access (DMA) interface. This design achieves efficient power consumption, reduced hardware cost, and an optimized execution time of 2.13 s, but still leaves room for design optimization. Finally, we propose a second co-design in which the CNN is accelerated as a single computation engine using a BRAM interface. The implementation results show that this embedded TSR design achieves the best performance of the proposed architectures, with an execution time of about 0.03 s, a computation roof of about 36.6 GFLOPS, and a bandwidth roof of about 3.2 GByte/s.
AB - Embedded vision systems are well suited to high-performance, low-latency inspection tasks. As everyday life evolves, it becomes almost imperative to harness artificial intelligence (AI) in vision applications so that these systems become intelligent and able to make decisions close to or similar to those of humans. In this context, integrating AI into embedded systems poses many challenges, since its performance depends on the volume and quality of the data assimilated for learning and improvement. This is constrained by the energy-consumption and cost limits of FPGA-SoCs, which offer limited processing, memory, and communication capacity. Nevertheless, implementing AI algorithms on embedded systems can drastically reduce energy consumption and processing time, while also reducing the costs and risks associated with data transmission. Their efficiency and reliability, however, depend on the designed prototypes. In this scope, this work proposes two different designs for a Traffic Sign Recognition (TSR) application based on a convolutional neural network (CNN) model, followed by three implementations on the PYNQ-Z1. First, the CNN-based TSR application is implemented on the PYNQ-Z1 processor. Given its runtime of around 3.55 s, there is room for improvement by combining the programmable logic (PL) and the processing system (PS) in a hybrid architecture. Therefore, we propose a streaming architecture in which each CNN layer is accelerated by a dedicated hardware accelerator connected through a direct memory access (DMA) interface. This design achieves efficient power consumption, reduced hardware cost, and an optimized execution time of 2.13 s, but still leaves room for design optimization. Finally, we propose a second co-design in which the CNN is accelerated as a single computation engine using a BRAM interface. The implementation results show that this embedded TSR design achieves the best performance of the proposed architectures, with an execution time of about 0.03 s, a computation roof of about 36.6 GFLOPS, and a bandwidth roof of about 3.2 GByte/s.
KW - Acceleration
KW - CNN
KW - Co-design
KW - FPGA
KW - PYNQ-Z1
UR - http://www.scopus.com/inward/record.url?scp=85123030584&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85123030584&partnerID=8YFLogxK
U2 - 10.1016/j.compeleceng.2021.107671
DO - 10.1016/j.compeleceng.2021.107671
M3 - Article
AN - SCOPUS:85123030584
SN - 0045-7906
VL - 98
JO - Computers and Electrical Engineering
JF - Computers and Electrical Engineering
M1 - 107671
ER -