A Fault-Tolerant Design on Convolution Neural Networks by Applying Reconfigurable Processing Element Arrays

Chen, Li
2024-02-09; 2023; 2023-04; April 2023
https://hdl.handle.net/10388/15495

Convolutional neural networks (CNNs) implemented on field programmable gate arrays (FPGAs) have attracted significant interest because of their performance and flexibility, particularly for inference after the CNN model has been trained on another platform. The ability to customize the programmable logic (PL) section of the FPGA is the key factor behind these advantages. Recent research also shows that parallel designs built from multiple processing element (PE) groups are becoming increasingly popular for implementing complex CNNs, because they achieve higher performance than single-PE or flat implementations.

However, increasing the number of PEs in a design can raise the Single Event Upset (SEU) rate for designs operating in radiation environments, because the configuration memories of SRAM-based FPGAs are vulnerable to upsets. While memory refreshing can eliminate errors, the CNN may still produce incorrect results before the SEUs are corrected. Triple Modular Redundancy (TMR) techniques are commonly employed to ensure correct operation, but this approach incurs at least 200% resource overhead, which can make it unsuitable for complex neural networks with high resource requirements. To address the resource limitations of TMR, FPGA vendors offer Dynamic Partial Reconfiguration (DPR) methods, which repair SEUs in specific regions of the configuration memory through partial refreshing, without requiring additional hardware resources in the FPGA. DPR reconfigures a portion of the FPGA while the rest of the device continues to operate normally. This technique can also be applied to TMR-protected CNN designs to reduce refreshing time.
However, it does not alleviate the area overhead associated with TMR methods.

In this thesis, a CNN was designed and implemented in an FPGA with multiple parallel PE array groups serving as computing engines, with each group working independently. Before computation started, each PE array ran a self-test to verify its functionality. If a fault was detected, DPR was performed to correct the errors in the configuration memory of the affected PE array.

The experiments first evaluated a single PE group without any reinforcement design as a control group, using both error injection and laser experiments. More PE groups were then added to determine whether the system could withstand more SEUs or laser pulses before an error occurred. The results show that, for non-critical errors, where the CNN incorrectly estimates the percentage of a given output number, adding DPR yields a 13.8-fold improvement in cross-section. For critical errors, where the CNN predicts the input number incorrectly, adding DPR improves the cross-section by a factor of 25. Additionally, the overall accuracy of the CNN remains consistently above 99% even after a large number of laser pulses or fault injections, indicating the robustness and reliability of the model.

The key novelty of this study is the use of DPR to improve the fault tolerance of the entire CNN by exploiting the parallel processing capability of the PE arrays, so that data processing continues without the faulty PE arrays. This approach significantly reduces area overhead compared to TMR methods. Experimental results demonstrated the effectiveness of the proposed method.

application/pdf
en
Fault-tolerant, Convolution Neural Network, FPGA, Dynamic Partial Reconfiguration
Thesis
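
The self-test and repair flow described in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the thesis implementation: the class PEGroup, the function run_layer, the test vector, and the bit-flip fault model are all hypothetical, and reconfigure() merely stands in for a real DPR operation on the affected region.

```python
from dataclasses import dataclass

# Hypothetical software model; none of these names come from the thesis.
TEST_VECTOR = [1, 2, 3, 4]
TEST_WEIGHTS = [4, 3, 2, 1]
GOLDEN = sum(a * w for a, w in zip(TEST_VECTOR, TEST_WEIGHTS))  # known-good result


@dataclass
class PEGroup:
    ident: int
    upset: bool = False  # models an SEU in this group's configuration memory

    def mac(self, inputs, weights):
        """Multiply-accumulate; an upset group returns a corrupted value."""
        result = sum(a * w for a, w in zip(inputs, weights))
        return result ^ 1 if self.upset else result  # assumed fault: flipped low bit

    def self_test(self):
        """Run the known test vector and compare with the golden result."""
        return self.mac(TEST_VECTOR, TEST_WEIGHTS) == GOLDEN

    def reconfigure(self):
        """Stand-in for DPR: restore this group's configuration."""
        self.upset = False


def run_layer(groups, jobs):
    """Self-test every group, repair the failing ones, and dispatch the
    jobs round-robin to the groups that passed the self-test."""
    healthy, repaired = [], []
    for group in groups:
        if group.self_test():
            healthy.append(group)
        else:
            group.reconfigure()  # available again for the next call
            repaired.append(group.ident)
    workers = healthy or groups  # if every group failed, all were just repaired
    outputs = [workers[i % len(workers)].mac(x, w) for i, (x, w) in enumerate(jobs)]
    return outputs, repaired


groups = [PEGroup(0), PEGroup(1, upset=True), PEGroup(2)]
jobs = [([1, 2], [3, 4]), ([2, 2], [2, 2]), ([1, 0], [5, 5])]
outputs, repaired = run_layer(groups, jobs)
```

In this model the faulty group is skipped for the current batch and repaired in the background, which mirrors the idea of continuing to process data without the faulty PE arrays instead of triplicating every array as TMR would.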