Understanding Spark System Performance for Image Processing in a Heterogeneous Commodity Cluster
dc.contributor.advisor | Eager, Derek
dc.contributor.advisor | Makaroff, Dwight
dc.contributor.committeeMember | Stanley, Kevin
dc.contributor.committeeMember | Stavness, Ian
dc.creator | Adekoya, Owolabi O 1982-
dc.date.accessioned | 2018-08-08T20:47:28Z
dc.date.available | 2018-08-08T20:47:28Z
dc.date.created | 2018-07
dc.date.issued | 2018-08-08
dc.date.submitted | July 2018
dc.date.updated | 2018-08-08T20:47:28Z
dc.description.abstract | In recent years, Apache Spark has seen widespread adoption in industry and research institutions, owing to its in-memory caching mechanism for faster Big Data analytics. However, the speed advantage Spark provides, especially in a heterogeneous cluster environment, is not obtainable out of the box; it requires the right combination of settings from the myriad configuration parameters that Spark exposes. Recognizing this challenge, this thesis studies Spark performance, particularly the impact of parameter settings that are critical to fast job completion and effective utilization of cluster resources. The study focuses on two example applications, flowerCounter and imageClustering, which process still-image datasets of canola plants collected from selected field plots during the summer of 2016 using time-lapse cameras, running in heterogeneous Spark cluster environments. These applications were of initial interest to the Plant Phenotyping and Imaging Research Centre (P2IRC) at the University of Saskatchewan, which develops systems to support fast analysis of large-scale seed breeding and thereby help ensure global food security. The flowerCounter application estimates the number of flowers in the images, while the imageClustering application clusters images based on physical plant attributes. Two clusters are used for the experiments, one of 12 nodes and one of 3 nodes (each including a master node), with the Hadoop Distributed File System (HDFS) as the storage medium for the image datasets. Experiments with the two case-study applications demonstrate that increasing the number of tasks does not always speed up job processing, because of increased communication overhead. Further experiments show that running numerous tasks with one core per executor and a small memory allocation limits parallelism within an executor and results in inefficient use of cluster resources; executors with many cores and large memory, on the other hand, do not speed up analytics either, owing to processing delays and thread concurrency. Additional results indicate that application processing time depends on where the input data is stored in conjunction with task locality levels, and that executor run time is largely dominated by disk I/O time, especially the read cost. With respect to horizontal node scaling, Spark scales with an increasing number of homogeneous computing nodes, but the speed-up degrades when nodes are heterogeneous. Finally, the study shows that the effectiveness of speculative task execution in mitigating the impact of slow nodes varies between the applications.
dc.format.mimetype | application/pdf
dc.identifier.uri | http://hdl.handle.net/10388/9533
dc.subject | Apache Spark
dc.subject | Big Data
dc.subject | Hadoop Distributed File System (HDFS)
dc.subject | Plant Phenotyping and Imaging Research Centre
dc.title | Understanding Spark System Performance for Image Processing in a Heterogeneous Commodity Cluster
dc.type | Thesis
dc.type.material | text
thesis.degree.department | Computer Science
thesis.degree.discipline | Computer Science
thesis.degree.grantor | University of Saskatchewan
thesis.degree.level | Masters
thesis.degree.name | Master of Science (M.Sc.)
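
The abstract's findings centre on executor sizing (cores and memory per executor), task counts, data locality, and speculative task execution. As a minimal sketch of how such settings could be supplied to a Spark application in Scala: the object name, HDFS path, partition count, and all specific values below are illustrative assumptions, not configurations taken from the thesis.

    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical driver for a flowerCounter-style image job; every value
    // here is an assumed example, not a setting reported in the thesis.
    object FlowerCounterSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("flowerCounter")
          // Executor sizing: several cores and a moderate memory allocation
          // per executor, rather than one core with little memory, so that
          // tasks can run in parallel within each executor.
          .set("spark.executor.cores", "4")
          .set("spark.executor.memory", "4g")
          // Re-launch straggling tasks on other nodes, one way to mitigate
          // slow nodes in a heterogeneous cluster.
          .set("spark.speculation", "true")
          // How long the scheduler waits for a data-local slot before
          // accepting a worse locality level.
          .set("spark.locality.wait", "3s")

        val sc = new SparkContext(conf)

        // Read image files from HDFS as (path, byte-stream) pairs; the
        // minimum partition count influences the number of tasks per stage.
        val images = sc.binaryFiles("hdfs:///p2irc/canola-2016/*.jpg", minPartitions = 48)

        println(s"Image count: ${images.count()}")
        sc.stop()
      }
    }

Whether speculation actually helps is workload-dependent, which is consistent with the abstract's finding that its effectiveness in mitigating slow nodes varies between the two applications.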