Study of Radiation Effects on FPGA and GPU based Neural Networks Accelerator Designs
Date
2024-08-16
Type
Thesis
Degree Level
Doctoral
Abstract
Rapid advances in semiconductor technology and the growing complexity of integrated circuit (IC) designs have carried ICs into a wide range of fields, including radiation-affected environments such as space exploration and medical devices. The reliability of IC devices in radiation-hazard environments has therefore become an important issue. Energetic particles, such as protons, neutrons, and heavy ions, can penetrate a device and cause single-event effects (SEEs), which appear as transient voltage or current changes in the circuitry. These changes can corrupt data, produce logic errors, or even crash the system. Various radiation-hardening techniques have been proposed, such as triple modular redundancy (TMR) and error-correcting codes (ECCs), but their effectiveness gradually decreases as the technology node continues to shrink. As IC technology continues to evolve, improving its radiation resistance therefore remains a valuable research topic.
Convolutional neural networks (CNNs), the most widely used models in deep learning, perform well in image recognition and target detection. Field Programmable Gate Arrays (FPGAs), with their high degree of parallelism, programmability, and low power consumption, are ideal platforms for efficient CNN computation, and the design of FPGA-based CNN accelerators has been widely studied and applied. However, the complexity of FPGAs and their sensitivity to radiation effects make them less reliable in radiation environments, which in turn degrades the inference accuracy of CNN models running on FPGA accelerators. In this study, the radiation reliability of CNN models deployed on FPGAs is comprehensively evaluated using proton irradiation, two-photon absorption (TPA) laser scanning, and software-level fault injection. The most sensitive modules were first identified by fully evaluating a LeNet-5-based FPGA accelerator with TPA laser scanning. A register-level selective TMR hardened design was then added, achieving a 40% improvement in reliability at the cost of only 20% additional resource utilization, with the experimental results validated by both laser and proton tests.
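The voting step behind register-level TMR can be illustrated in software. The following is a minimal sketch only, not the hardware design evaluated in the thesis: it assumes each protected register is modelled as an integer held in three copies, and the names `majority_vote` and `flip_bit` are hypothetical helpers introduced for illustration.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority vote over three copies of a register value."""
    return (a & b) | (a & c) | (b & c)


def flip_bit(value: int, bit: int) -> int:
    """Model a single-event upset by inverting one bit (hypothetical helper)."""
    return value ^ (1 << bit)


if __name__ == "__main__":
    golden = 0b1011_0010
    # Three redundant copies; one of them is hit by an upset in bit 3.
    copies = [golden, flip_bit(golden, 3), golden]
    voted = majority_vote(*copies)
    print(f"voted value matches golden: {voted == golden}")
```

A single upset in any one copy is masked by the vote; two upsets in the same bit position of different copies are not, which is one reason selective protection is usually paired with periodic repair of the corrupted copy.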
In addition, the reliability of CNN models under different architectural designs is compared. Two popular CNN accelerator architectures, streaming architecture (SA) and single computation engine (SCE), are implemented on our FPGA board. Experimental results show that SA-based CNNs require more hardware resources but exhibit superior resilience against single event upsets (SEUs). Without any Radiation Hardened by Design (RHBD) protection, SCE has an error rate approximately twice as high as SA. A method combining dynamic partial reconfiguration (DPR) with the soft-error mitigation (SEM) IP core, named AutoDPR-SEM, is also proposed. It significantly improves the reliability of CNN accelerators without increasing the inference time of either model, reducing the critical error rate by a factor of approximately 17.8 in SCE and 14.8 in SA. A software-level simulation is also applied to validate the TPA experiment, showing trends similar to the test results across all models.
Transformer networks, as high-performing models in the field of natural language processing (NLP), have demonstrated excellent performance across various applications. With increasing model sizes and computational demands, GPUs have become the main platform for accelerating the training and inference of transformer networks. GPUs are popular in deep learning because of their powerful parallel processing capabilities and efficient utilization of computational resources. However, GPUs are also vulnerable to SEEs in radiation environments. In this study, the reliability of the popular transformer model DistilBERT is first evaluated using the TPA laser platform. An innovative soft-error impact assessment scheme is proposed, which computes the Euclidean (L2) distance between the output tensor of each layer under soft errors and the corresponding reference tensor unaffected by soft errors. When this L2 distance exceeds 1.0, the likelihood of an incorrect classification result increases significantly. This criterion is used to introduce a selective temporal redundancy computation method, which is enabled only when the output of an affected layer deviates from the reference by an L2 distance larger than 1.0. This approach significantly improves the reliability of running DistilBERT on GPU platforms, and laser experimental results validate its effectiveness.
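The L2-distance criterion and the selective recomputation it triggers can be sketched as follows. This is an illustrative sketch under stated assumptions, not the thesis implementation: it assumes layer outputs are flattened to plain lists of floats and that an error-free reference output is available for each layer. The names `l2_distance` and `run_layer_selectively` are hypothetical; only the 1.0 threshold comes from the abstract.

```python
import math
from typing import Callable, List, Sequence, Tuple

L2_THRESHOLD = 1.0  # threshold quoted in the abstract


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two flattened tensors of equal length."""
    if len(a) != len(b):
        raise ValueError("tensors must have the same number of elements")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def run_layer_selectively(
    layer: Callable[[Sequence[float]], List[float]],
    inputs: Sequence[float],
    reference: Sequence[float],
    threshold: float = L2_THRESHOLD,
) -> Tuple[List[float], bool]:
    """Run a layer once; recompute only if its output deviates too far.

    Returns the accepted output and a flag telling whether the redundant
    (second) computation was triggered. `reference` is an assumed
    error-free output for this layer.
    """
    out = layer(inputs)
    if l2_distance(out, reference) <= threshold:
        return out, False
    # Deviation above the threshold: spend the extra time on a recomputation.
    return layer(inputs), True


if __name__ == "__main__":
    calls = {"n": 0}

    def flaky_layer(x: Sequence[float]) -> List[float]:
        """Toy layer whose first call is corrupted, mimicking a transient fault."""
        calls["n"] += 1
        out = [2.0 * v for v in x]
        if calls["n"] == 1:
            out[0] += 5.0
        return out

    result, recomputed = run_layer_selectively(flaky_layer, [1.0, 2.0], [2.0, 4.0])
    print(result, recomputed)
```

Because the second computation runs only when the threshold is crossed, the time overhead is paid on suspicious outputs rather than on every layer, which is the point of making the redundancy selective.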
In summary, this research proposes and evaluates radiation-hardened designs for FPGAs and GPUs to enhance the reliability of neural networks in radiation-prone environments. These studies demonstrate the feasibility of achieving high-reliability computing in harsh environments and provide a reference for future radiation-hardened electronic system designs.
Keywords
CNN, Radiation tolerant, FPGA, SEU
Degree
Doctor of Philosophy (Ph.D.)
Department
Electrical and Computer Engineering
Program
Electrical Engineering