An Empirical Study on the Effectiveness of Testing Metrics to Test Deep Learning Models
Date
2022-05-02
Authors
Type
Thesis
Degree Level
Masters
Abstract
In recent years, Deep Learning (DL) models have been widely applied to develop safety- and security-critical systems. Recent advances in Deep Neural Networks (DNNs) are the key reason behind the unprecedented achievements in image classification, object detection, medical image analysis, speech recognition, and autonomous driving. However, DL models often remain a black box to their practitioners due to the lack of interpretability and explainability. DL practitioners generally use standard metrics such as Precision, Recall, and F1 score to evaluate the performance of DL models on a test dataset. However, since high-quality test data is not always available, the scores these standard metrics yield on test datasets cannot justify confidence in the testing adequacy, generality, and robustness of DL models.
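For reference, the standard metrics mentioned above have the usual definitions in terms of true positives (TP), false positives (FP), and false negatives (FN); the formulas below restate those textbook definitions and are not specific to this thesis.

```latex
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
```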
The way we ensure the quality of DL models is still in its infancy; hence, a scalable DL model testing framework is in high demand in the context of software testing. Existing techniques for testing traditional software systems are not directly applicable to DL models because of fundamental differences in programming paradigm, systems development methodologies, and processes. However, several testing metrics (e.g., Neuron Coverage (NC), Confusion and Bias error metrics, and Multi-granularity metrics) have been proposed that leverage the concept of test coverage in traditional software testing to measure the robustness of DL models and the quality of the test datasets.
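To make the coverage idea concrete, the sketch below shows one common way to compute Neuron Coverage for a Keras model: a neuron counts as covered if its scaled activation exceeds a threshold t on at least one test input. The scaling scheme, layer selection, and threshold value are illustrative assumptions rather than the exact implementation evaluated in this thesis.

```python
# Minimal Neuron Coverage (NC) sketch for a Keras model (illustrative only).
# Assumption: activations are min-max scaled per layer over the whole test set,
# and a neuron is "covered" if its scaled activation exceeds t on any input.
import numpy as np
import tensorflow as tf

def neuron_coverage(model, x_test, t=0.25):
    # Probe model that exposes every layer's output activations.
    layer_outputs = [layer.output for layer in model.layers]
    probe = tf.keras.Model(inputs=model.inputs, outputs=layer_outputs)
    activations = probe.predict(x_test, verbose=0)

    covered, total = 0, 0
    for act in activations:
        act = act.reshape(len(x_test), -1)               # flatten each layer's units
        scaled = (act - act.min()) / (act.max() - act.min() + 1e-8)
        covered += int(np.sum(scaled.max(axis=0) > t))   # covered at least once
        total += act.shape[1]
    return covered / total

# Usage (illustrative): nc = neuron_coverage(lenet5_model, x_test, t=0.25)
```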
Although test coverage is highly effective for testing traditional software systems, the effectiveness of DL coverage metrics in testing the robustness of DL models and in measuring the quality of the test datasets still needs to be evaluated.
In addition, the selected testing metrics work on the activated neurons of a DL model. In our study, we count the neurons of a DL model differently than the existing studies do. For example, according to our calculation the LeNet-5 model has 6508 neurons, whereas other studies consider the LeNet-5 model to contain only 268 neurons. Therefore, it is also important to investigate how the neuron concept (i.e., what is treated as a neuron in a DL model and the way the number of neurons in a DL model is calculated) impacts the testing metrics.
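The gap between the two counts comes from what is treated as a single neuron: one neuron per convolutional filter, or one neuron per unit in a layer's output feature map. The Keras-based sketch below contrasts the two counting schemes; it is a simplified illustration and may not match the exact counting rule used in this thesis.

```python
# Two ways of counting the "neurons" of a Keras model (illustrative sketch).
# Assumption: per-channel counting treats each convolutional filter as one
# neuron, while per-unit counting treats every element of a layer's output
# feature map as a neuron; dense layers contribute their units in both schemes.
import numpy as np
import tensorflow as tf

def count_neurons_per_channel(model):
    # Convention used by several prior studies: one neuron per filter/unit.
    total = 0
    for layer in model.layers:
        if isinstance(layer, tf.keras.layers.Conv2D):
            total += layer.filters
        elif isinstance(layer, tf.keras.layers.Dense):
            total += layer.units
    return total

def count_neurons_per_unit(model):
    # Count every unit of each layer's output feature map (a much larger total).
    total = 0
    for layer in model.layers:
        if isinstance(layer, (tf.keras.layers.Conv2D, tf.keras.layers.Dense)):
            total += int(np.prod(layer.output.shape[1:]))  # drop batch dimension
    return total

# Usage (illustrative): the two functions give very different totals for LeNet-5.
```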
In this thesis, we thus conduct an exploratory study to evaluate the effectiveness of the testing metrics for testing DL models, not only in measuring their robustness but also in assessing the quality of the test datasets. Furthermore, since the selected testing metrics work on the activated neurons of a DL model, we also investigate the impact of the neuron concept on the testing metrics. To conduct our experiments, we select popular publicly available datasets (e.g., MNIST, Fashion-MNIST, CIFAR-10, and ImageNet) and train DL models on them. We also select state-of-the-art DL models (e.g., VGG-16, VGG-19, ResNet-50, and ResNet-101) trained on the ImageNet dataset.
Our experimental results demonstrate that, regardless of the neuron concept used, the NC and Multi-granularity testing metrics are ineffective in evaluating the robustness of DL models and in assessing the quality of the test datasets. In addition, the selection of threshold values has a negligible impact on the NC metric. Increasing the coverage values of the Multi-granularity testing metrics cannot separate regular test data from adversarial test data. Our exploratory study also shows that the DL models still make correct predictions with higher Multi-granularity coverage values than false predictions; therefore, it is not always true that increasing the coverage values of the Multi-granularity testing metrics finds more defects in DL models. Finally, the Precision and Recall scores show that the Confusion and Bias error metrics are adequate for detecting class-level violations of DL models.
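To illustrate how such a comparison between regular and adversarial test data could be set up, the sketch below perturbs test inputs with FGSM and compares their coverage values using the neuron_coverage helper sketched earlier. FGSM, the epsilon value, and the loss function are illustrative assumptions; the thesis may rely on different adversarial generation techniques and metrics.

```python
# Illustrative sketch: comparing coverage on regular vs. adversarial test data.
# Assumptions: FGSM is used as an example attack, labels are integer class ids,
# and neuron_coverage() is the helper sketched above.
import tensorflow as tf

def fgsm_examples(model, x, y, eps=0.1):
    # One-step Fast Gradient Sign Method perturbation of the inputs.
    x = tf.convert_to_tensor(x, dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(x)
        loss = tf.keras.losses.sparse_categorical_crossentropy(y, model(x))
    grad = tape.gradient(loss, x)
    return tf.clip_by_value(x + eps * tf.sign(grad), 0.0, 1.0).numpy()

# Usage (illustrative):
# nc_clean = neuron_coverage(model, x_test)
# nc_adv   = neuron_coverage(model, fgsm_examples(model, x_test, y_test))
# Similar values for nc_clean and nc_adv would indicate that the metric cannot
# separate regular test data from adversarial test data.
```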
Description
Keywords
Testing, Deep Learning
Citation
Degree
Master of Science (M.Sc.)
Department
Computer Science
Program
Computer Science