An Empirical Study on the Effectiveness of Testing Metrics to Test Deep Learning Models
dc.contributor.advisor | Roy, Chanchal K. | |
dc.contributor.advisor | Stavness, Ian | |
dc.contributor.committeeMember | Vassileva, Julita | |
dc.contributor.committeeMember | Xing, Li | |
dc.contributor.committeeMember | Mondal, Manishankar | |
dc.creator | Awal, Md. Abdul | |
dc.date.accessioned | 2022-05-02T15:45:32Z | |
dc.date.available | 2022-05-02T15:45:32Z | |
dc.date.created | 2022-01 | |
dc.date.issued | 2022-05-02 | |
dc.date.submitted | January 2022 | |
dc.date.updated | 2022-05-02T15:45:33Z | |
dc.description.abstract | In recent years, Deep Learning (DL) models have been widely applied to develop safety- and security-critical systems. The recent evolution of Deep Neural Networks (DNNs) is the key reason behind unprecedented achievements in image classification, object detection, medical image analysis, speech recognition, and autonomous driving. However, DL models often remain a black box to their practitioners due to the lack of interpretability and explainability. DL practitioners generally use standard metrics such as Precision, Recall, and F1 score to evaluate the performance of DL models on a test dataset. However, because high-quality test data is rarely accessible, high scores on these standard metrics cannot by themselves establish the testing adequacy, generality, and robustness of DL models. Techniques for assuring the quality of DL models are still in their infancy; hence, a scalable DL model testing framework is in high demand in the context of software testing. Existing techniques for testing traditional software systems are not directly applicable to DL models because of fundamental differences in programming paradigm, development methodology, and process. However, several testing metrics (e.g., Neuron Coverage (NC), Confusion and Bias error metrics, and Multi-granularity metrics) have been proposed that leverage the concept of test coverage from traditional software testing to measure the robustness of DL models and the quality of test datasets. Although test coverage is highly effective for testing traditional software systems, the effectiveness of DL coverage metrics in testing the robustness of DL models and measuring the quality of test datasets still needs to be evaluated. In addition, the selected testing metrics operate on the activated neurons of a DL model, and in our study we count the neurons of a DL model differently than existing studies do. For example, according to our calculation the LeNet-5 model has 6508 neurons, whereas other studies consider the LeNet-5 model to contain only 268 neurons. Therefore, it is also important to investigate how the notion of a neuron (i.e., what counts as a neuron in a DL model and how the number of neurons is calculated) impacts the testing metrics. In this thesis, we thus conduct an exploratory study evaluating the effectiveness of these testing metrics for testing DL models, not only in measuring their robustness but also in assessing the quality of the test datasets. Furthermore, since the selected testing metrics operate on the activated neurons of a DL model, we also investigate the impact of the neuron-counting convention on the testing metrics. To conduct our experiments, we select popular, publicly available datasets (e.g., MNIST, Fashion-MNIST, CIFAR-10, and ImageNet) and train DL models on them. We also select state-of-the-art DL models (e.g., VGG-16, VGG-19, ResNet-50, and ResNet-101) trained on the ImageNet dataset. Our experimental results demonstrate that, regardless of the neuron-counting convention, the NC and Multi-granularity testing metrics are ineffective in evaluating the robustness of DL models and in assessing the quality of the test datasets. In addition, the choice of threshold value has a negligible impact on the NC metric, and increasing the coverage values of the Multi-granularity testing metrics cannot separate regular test data from adversarial test data. Our exploratory study also shows that DL models' accurate predictions are associated with higher Multi-granularity coverage values than their false predictions; therefore, it is not always true that increasing the coverage values of the Multi-granularity testing metrics uncovers more defects in DL models. Finally, the Precision and Recall scores show that the Confusion and Bias error metrics are adequate for detecting class-level violations of DL models. | |
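The abstract's contrast between 6508 and 268 neurons for LeNet-5 hinges on whether every element of a layer's output map is counted as a neuron or only each filter/unit is. The sketch below is not taken from the thesis; it is a minimal illustration of the two counting conventions, assuming tf.keras (TensorFlow 2.x) and a common LeNet-5-style architecture, so the exact totals depend on the input size and layer variant actually used.

```python
# Minimal sketch (not from the thesis): two conventions for counting the
# "neurons" of a DL model, assuming tf.keras and a LeNet-5-style architecture.
# Exact totals depend on the input size and layer variant, so the thesis
# figures (6508 vs. 268) are not reproduced exactly here.
from tensorflow.keras import layers, models


def build_lenet5_like(input_shape=(32, 32, 1), num_classes=10):
    # A LeNet-5-style stack; the thesis model may differ in details.
    return models.Sequential([
        layers.Conv2D(6, 5, activation="relu", input_shape=input_shape),
        layers.AveragePooling2D(),
        layers.Conv2D(16, 5, activation="relu"),
        layers.AveragePooling2D(),
        layers.Flatten(),
        layers.Dense(120, activation="relu"),
        layers.Dense(84, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])


def count_neurons(model):
    """Return (per_unit, per_activation) neuron counts over Conv/Dense layers."""
    per_unit = 0         # one neuron per filter/unit (the "268-style" view)
    per_activation = 0   # one neuron per output element (the "6508-style" view)
    for layer in model.layers:
        if not isinstance(layer, (layers.Conv2D, layers.Dense)):
            continue
        shape = layer.output_shape[1:]   # drop the batch dimension
        per_unit += shape[-1]            # channels for Conv2D, units for Dense
        n = 1
        for dim in shape:
            n *= dim                     # all elements of the output map
        per_activation += n
    return per_unit, per_activation


model = build_lenet5_like()
units, activations = count_neurons(model)
print(f"per-filter/unit neurons: {units}, per-activation neurons: {activations}")
# Neuron Coverage (NC) at threshold t is then the fraction of these neurons
# whose (layer-wise scaled) activation exceeds t for at least one test input.
```

Which convention is used changes the denominator of every coverage metric, which is why the thesis examines whether the metrics' conclusions hold under either counting scheme.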
dc.format.mimetype | application/pdf | |
dc.identifier.uri | https://hdl.handle.net/10388/13932 | |
dc.subject | Testing, Deep Learning | |
dc.title | An Empirical Study on the Effectiveness of Testing Metrics to Test Deep Learning Models | |
dc.type | Thesis | |
dc.type.material | text | |
thesis.degree.department | Computer Science | |
thesis.degree.discipline | Computer Science | |
thesis.degree.grantor | University of Saskatchewan | |
thesis.degree.level | Masters | |
thesis.degree.name | Master of Science (M.Sc.) |