Enhancing Software Defect Prediction: Investigating Diverse Representations of Source Code as Feature Values in Classical and Quantum Machine Learning Approaches
Date
2025-04-22
Authors
ORCID
0000-0003-1765-2462
Type
Thesis
Degree Level
Doctoral
Abstract
Software defect prediction is a prominent topic in software engineering research. Many studies have investigated how to predict and address software bugs (also known as defects or issues). Yet new reports continue to fill bug repositories daily. Despite this extensive research, critical gaps persist, potentially contributing to the continued emergence of defects in software systems. Furthermore, turning existing research findings into practical solutions for real-world software environments is challenging, because some feature values used in Machine Learning and Deep Learning (ML/DL) models can be infeasible or impractical to apply. For instance, predicting whether a software component is buggy or non-buggy based on a feature such as the "age of the component" provides limited actionable insight for addressing issues in future iterations of the software system. In contrast, feature values derived directly from the structure of source code offer more practical relevance. Such features can help practitioners understand why specific structural characteristics of the source code make a software component prone to bugs. This Ph.D. thesis presents six research studies that propose novel approaches to predicting software defects and thus reducing their occurrence in future software systems. The first study explores Bug Inducing Commit (BIC) and Just-in-Time (JIT) defect prediction by proposing novel features derived from source code syntax patterns, namely source code token structure (TS) and token pattern (TP). Unlike traditional features reliant on developer metadata or repository statistics, these syntax-based features represent source code token patterns and sequences, leading to improved accuracy. The second study extends the idea of token patterns by converting them to a Source Code Graph (SCG) representation.
The SCG approach extracts structural features from a graph representation of source code, such as the number of nodes, the number of edges, and properties derived from the connectivity among the nodes (for example, the count of incoming or outgoing nodes). It also generates additional features by combining these SCG-based features with traditional ones for use in ML/DL models that predict software defects. Building on the second study, the third study explores an image representation of the SCG, from which feature values are extracted using image processing techniques such as Histogram of Curvedness and Shape (HoCS) and Convolutional Neural Networks (CNNs), to improve defect prediction performance. In the fourth study, we use Large Language Model (LLM) based features and evaluate their performance in predicting software defects against the features proposed in the earlier studies of this thesis. To expand the domain of bug prediction studies, the fifth study explores the potential, challenges, and limitations of applying Quantum Machine Learning (QML) algorithms to software defect prediction, paving the way for integrating QML into real-world software engineering tasks. Finally, the sixth study investigates the application of Quantum Support Vector Classifiers (QSVCs) to buggy commit detection, introducing data partitioning, prediction aggregation, and an incremental testing methodology to address the computational challenges of the QSVC algorithm. The research directions of this Ph.D. dissertation contribute to the advancement of software defect prediction by proposing novel techniques that leverage source code patterns, graph-based analysis, image representation of source code, and QML. Together, they reflect the pursuit of innovative methods to minimize defect occurrences in software systems.
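As a rough illustration of the graph-based features mentioned in the abstract (node count, edge count, and connectivity properties), the following Python sketch may help. It is not the thesis's implementation: the abstract does not say how the graph is built, so the construction here (one node per distinct token, one directed edge between consecutive tokens) is an assumption, and the names `build_token_graph` and `graph_features` are hypothetical.

```python
from collections import defaultdict


def build_token_graph(tokens):
    """Assumed construction (not from the thesis): one node per distinct
    token and one directed edge from each token to the token after it."""
    nodes = set(tokens)
    edges = set(zip(tokens, tokens[1:]))  # distinct (source, target) pairs
    return nodes, edges


def graph_features(nodes, edges):
    """Compute simple structural features from a directed graph."""
    in_degree = defaultdict(int)
    out_degree = defaultdict(int)
    for source, target in edges:
        out_degree[source] += 1
        in_degree[target] += 1
    return {
        "num_nodes": len(nodes),
        "num_edges": len(edges),
        # Largest number of distinct incoming / outgoing neighbours.
        "max_in_degree": max((in_degree[n] for n in nodes), default=0),
        "max_out_degree": max((out_degree[n] for n in nodes), default=0),
    }


# Hypothetical token sequence for a one-line code fragment.
tokens = ["if", "(", "x", ")", "return", "x", ";"]
nodes, edges = build_token_graph(tokens)
features = graph_features(nodes, edges)
print(features)
```

A feature dictionary like this could then be concatenated with traditional metrics before being passed to a classifier, which is the kind of combination the abstract describes.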
Keywords
Software defect prediction, Machine Learning (ML), Deep Learning (DL), Bug Inducing Commit (BIC), Source code syntax patterns, Source Code Graphs (SCG), Histogram of Curvedness and Shape (HoCS), Large Language Model (LLM), Quantum Machine Learning (QML), Quantum Support Vector Classifiers (QSVC)
Degree
Doctor of Philosophy (Ph.D.)
Department
Computer Science
Program
Computer Science