Embedded vision systems are the best solutions for high-performance and lightning-fast inspection tasks. As everyday life evolves, it becomes almost imperative to harness artificial intelligence (AI) in vision applications that make these systems intelligent and able to make decisions close to or similar to humans. In this context, the AI's integration on embedded systems poses many challenges, given that its performance depends on data volume and quality they assimilate to learn and improve. This returns to the energy consumption and cost constraints of the FPGA-SoC that have limited processing, memory, and communication capacity. Despite this, the AI algorithm implementation on embedded systems can drastically reduce energy consumption and processing times, while reducing the costs and risks associated with data transmission. Therefore, its efficiency and reliability always depend on the designed prototypes. Within this range, this work proposes two different designs for the Traffic Sign Recognition (TSR) application based on the convolutional neural network (CNN) model, followed by three implantations on PYNQ-Z1. Firstly, we propose to implement the CNN-based TSR application on the PYNQ-Z1 processor. Considering its runtime result of around 3.55 s, there is room for improvement using programmable logic (PL) and processing system (PS) in a hybrid architecture. Therefore, we propose a streaming architecture, in which the CNN layers will be accelerated to provide a hardware accelerator for each layer where direct memory access (DMA) interface is used. Thus, we noticed efficient power consumption, decreased hardware cost, and execution time optimization of 2.13 s, but, there was still room for design optimizations. Finally, we propose a second co-design, in which the CNN will be accelerated to be a single computation engine where BRAM interface is used. The implementation results prove that our proposed embedded TSR design achieves the best performances compared to the first proposed architectures, in terms of execution time of about 0.03 s, computation roof of about 36.6 GFLOPS, and bandwidth roof of about 3.2 GByte/s.
ASJC Scopus subject areas