
Large-Scale Clone Detection and Benchmarking









Degree Level



Code clones are pairs of code fragments that are similar. They are typically created when developers reuse code by copy and paste, although clones are known to arise for a variety of other reasons. Clones have a negative impact on software quality and development: they lead to needless duplication of software maintenance and evolution efforts, cause existing bugs to be duplicated and propagated throughout a software system, and can even introduce new bugs when duplicate code is not evolved in parallel. It is important that developers detect their clones so that they can be managed and their harm mitigated. This need has been recognized by the many clone detectors available in the literature. Additionally, clone detection in large-scale inter-project repositories has been shown to have many potential applications, such as mining for new APIs, license violation detection, similar application detection, code completion, and API recommendation and usage support. Despite this great interest in clone detection, there has been very little work on evaluating the performance of clone detection tools, including the creation of clone benchmarks. In addition, very few clone detectors have been proposed for the large-scale inter-project use cases. In particular, the existing large-scale clone detectors require extraordinary hardware and long execution times, lack support for common clone types, and cannot be adapted to target and explore the emerging large-scale inter-project use cases. Moreover, none of the existing benchmarks can evaluate clone detection for these scenarios. We address these problems in this thesis by introducing new clone benchmarks using both synthetic and real clone data, including a benchmark for evaluating the large-scale inter-project use case. We use these benchmarks to conduct comprehensive tool evaluation and comparison studies of the state-of-the-art tools.
We introduce a new clone detector for fast, scalable and user-guided detection in large inter-project datasets, which we extensively evaluate using our benchmarks and compare against the state of the art. In the first part of this thesis, we introduce a synthetic clone benchmark we call the Mutation and Injection Framework, which measures the recall of clone detection tools at a very fine granularity using artificial clones in a mutation-analysis procedure. We use the Mutation Framework to evaluate the state-of-the-art clone detectors, and compare its results against those of the previous clone benchmarks. We demonstrate that the Mutation Framework enables accurate, precise and bias-free clone benchmarking experiments, and show that the previous benchmarks are outdated and inappropriate for evaluating modern clone detection tools. We also show that the Mutation Framework can be adapted with custom mutation operators to evaluate tools for any kind of clone. In the second part of this thesis, we introduce BigCloneBench, a large benchmark of 8 million real clones in a large inter-project source dataset (IJaDataset: 25K projects, 250MLOC). We built this benchmark by mining IJaDataset for functions implementing commonly needed functionalities. This benchmark can evaluate clone detection tools for all types of clones, for intra-project vs inter-project clones, for semantic clones, and for clones across the entire spectrum of syntactical similarity. It is also the only benchmark capable of evaluating clone detectors for the emerging large-scale inter-project clone detection use case. We use this benchmark to thoroughly evaluate the state-of-the-art tools, and demonstrate why both synthetic (Mutation Framework) and real-world (BigCloneBench) benchmarks are needed. In the third part of this thesis, we explore the scaling of clone detection to large inter-project source datasets.
In our first study, we introduce the Shuffling Framework, a strategy for scaling the existing natively non-scalable clone detection tools to large-scale inter-project datasets, at the cost of reduced recall and the need for a small compute cluster. The Shuffling Framework exploits non-deterministic input partitioning, partition shuffling, inverted clone indexing and coarse-grained similarity metrics to achieve scalability. In our second study, we introduce our premier large-scale clone detection tool, CloneWorks, which enables fast, scalable and user-guided clone detection in large-scale inter-project datasets. CloneWorks achieves fast and scalable clone detection on an average personal workstation using the Jaccard similarity coefficient, the sub-block filtering heuristic, an inverted clone index, and an index-based input partitioning heuristic. CloneWorks is one of the few tools to scale to an inter-project dataset of 250MLOC on an average workstation, and has the fastest detection time at just 2-10 hours, while also achieving the best recall and precision as measured by our clone benchmarks. CloneWorks uses a user-guided approach, which gives the user full control over the transformations applied to their source code before clone detection in order to target any type or kind of clone. CloneWorks includes transformations such as tunable pretty-printing, adaptable identifier renaming, syntax abstraction and filtering, and can be extended through a plug-in architecture. Through scenarios and case studies we evaluate this user-guided aspect, and find that it is adaptable and has high precision.
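
To make the similarity measure named above concrete, the following is a minimal Python sketch of clone detection by the Jaccard coefficient over bags of code terms, using an inverted index to limit comparisons to blocks that share at least one term. It is an illustration under stated assumptions, not the CloneWorks implementation: the function names, the whitespace-free token lists, and the default threshold are hypothetical, and the sub-block filtering and input partitioning heuristics mentioned in the abstract are deliberately omitted.

```python
from collections import Counter, defaultdict
from itertools import combinations


def jaccard(a: Counter, b: Counter) -> float:
    """Multiset Jaccard coefficient: |a ∩ b| / |a ∪ b| (0.0 for two empty bags)."""
    inter = sum((a & b).values())
    union = sum((a | b).values())
    return inter / union if union else 0.0


def detect_clones(blocks: dict[str, list[str]], threshold: float = 0.7):
    """Return sorted (name_a, name_b, similarity) for pairs at or above threshold.

    `blocks` maps a hypothetical block name to its list of code terms.
    """
    bags = {name: Counter(terms) for name, terms in blocks.items()}

    # Inverted index: term -> names of the blocks that contain it.
    index = defaultdict(set)
    for name, bag in bags.items():
        for term in bag:
            index[term].add(name)

    # Only blocks that share at least one term can have non-zero similarity,
    # so candidate pairs are drawn from the index posting lists.
    candidates = set()
    for names in index.values():
        candidates.update(combinations(sorted(names), 2))

    clones = []
    for a, b in candidates:
        similarity = jaccard(bags[a], bags[b])
        if similarity >= threshold:
            clones.append((a, b, similarity))
    return sorted(clones)


if __name__ == "__main__":
    example = {
        "f": ["int", "x", "=", "0", ";"],
        "g": ["int", "y", "=", "0", ";"],
        "h": ["return", "z"],
    }
    # "f" and "g" share 4 of 6 distinct terms; "h" shares none with either.
    print(detect_clones(example, threshold=0.6))
```

In this sketch the index only prunes pairs with no terms in common; a production detector would need a much stronger filter to avoid enumerating every pair that shares a frequent term, which is the role the abstract assigns to sub-block filtering.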



code clone, clone detection, large-scale, benchmark, recall, precision, CloneWorks, BigCloneBench, Mutation and Injection Framework, Shuffling Framework, ForkSim, software clone, clone, software quality, software engineering



Doctor of Philosophy (Ph.D.)


Computer Science


Computer Science

