Management Aspects of Software Clone Detection and Analysis
Copying a code fragment and reusing it by pasting with or without minor modifications is a common practice in software development for improved productivity. As a result, software systems often have similar segments of code, called software clones or code clones. Due to many reasons, unintentional clones may also appear in the source code without awareness of the developer. Studies report that significant fractions (5% to 50%) of the code in typical software systems are cloned. Although code cloning may increase initial productivity, it may cause fault propagation, inflate the code base and increase maintenance overhead. Thus, it is believed that code clones should be identified and carefully managed. This Ph.D. thesis contributes in clone management with techniques realized into tools and large-scale in-depth analyses of clones to inform clone management in devising effective techniques and strategies. To support proactive clone management, we have developed a clone detector as a plug-in to the Eclipse IDE. For clone detection, we used a hybrid approach that combines the strength of both parser-based and text-based techniques. To capture clones that are similar but not exact duplicates, we adopted a novel approach that applies a suffix-tree-based k-difference hybrid algorithm, borrowed from the area of computational biology. Instead of targeting all clones from the entire code base, our tool aids clone-aware development by allowing focused search for clones of any code fragment of the developer's interest. A good understanding on the code cloning phenomenon is a prerequisite to devise efficient clone management strategies. The second phase of the thesis includes large-scale empirical studies on the characteristics (e.g., proportion, types of similarity, change patterns) of code clones in evolving software systems. Applying statistical techniques, we also made fairly accurate forecast on the proportion of code clones in the future versions of software projects. The outcome of these studies expose useful insights into the characteristics of evolving clones and their management implications. Upon identification of the code clones, their management often necessitates careful refactoring, which is dealt with at the third phase of the thesis. Given a large number of clones, it is difficult to optimally decide what to refactor and what not, especially when there are dependencies among clones and the objective remains the minimization of refactoring efforts and risks while maximizing benefits. In this regard, we developed a novel clone refactoring scheduler that applies a constraint programming approach. We also introduced a novel effort model for the estimation of efforts needed to refactor clones in source code. We evaluated our clone detector, scheduler and effort model through comparative empirical studies and user studies. Finally, based on our experience and in-depth analysis of the present state of the art, we expose avenues for further research and development towards a versatile clone management system that we envision.
Code Clone, Clone Management, Software Maintenance, Reengineering, Software Evolution, Software Engineering
Doctor of Philosophy (Ph.D.)