University of Saskatchewan HARVEST

      A Modular Parallel Pipeline Architecture for GWAS Applications in a Cluster Environment

      View/Open
      ABRAR-THESIS-2019.pdf (960.1Kb)
      Date
      2019-05-09
      Author
      Abrar, Faheem 1992-
      ORCID
      0000-0002-5172-5488
      Type
      Thesis
      Degree Level
      Masters
      Abstract
      A Genome-Wide Association Study (GWAS) is an important bioinformatics method used to associate genetic variants with traits, identify causes of diseases, and increase plant and crop production. There are several optimizations for improving GWAS performance, including running applications in parallel. However, existing approaches make it difficult for researchers to work with different data types and workflows. A potential solution to this problem is to model GWAS algorithms as a set of modular tasks. In this thesis, a modular pipeline architecture for GWAS applications is proposed that can leverage a parallel computing environment as well as store and retrieve data using a shared data cache. To show that the proposed architecture increases the performance of GWAS applications, two case studies are conducted in which the proposed architecture is implemented on a bioinformatics pipeline package called TASSEL and a GWAS application called FaST-LMM, using both Apache Spark and Dask as the parallel processing frameworks and Redis as the shared data cache. The case studies implement parallel processing modules and shared data cache modules according to the specifications of the proposed architecture. Based on the case studies, a number of experiments are conducted that compare the performance of the implemented architecture in a cluster environment with that of the original programs. The experiments reveal that the modified applications indeed perform faster than the original sequential programs. However, the modified applications do not scale with cluster resources, as the sequential part of the operations prevents the parallelization from achieving linear scalability. Finally, an evaluation of the architecture is conducted based on feedback from software developers and bioinformaticians. The evaluation reveals that the domain experts find the architecture useful and that the implementations offer sufficient performance improvement and are easy to use, although a GUI-based implementation would be preferable.
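The scalability limit described in the abstract can be illustrated with Amdahl's law, a standard bound on the speedup of a program that is only partly parallel. The abstract does not name this law or report a parallel fraction, so the sketch below is only an illustration: the function name `amdahl_speedup` and the 80% parallel fraction are assumptions chosen for the example, not figures from the thesis.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Upper bound on speedup when only part of a program runs in parallel.

    parallel_fraction: share of the sequential runtime that can be
        parallelized (hypothetical value, not taken from the thesis).
    workers: number of parallel workers, for example cluster cores.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part is unaffected by extra workers; only the parallel
    # part is divided among them.
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Assumed example: 80% of the runtime is parallelizable.
    for workers in (1, 2, 4, 8, 16, 32):
        print(workers, round(amdahl_speedup(0.8, workers), 2))
```

Under this assumed 80% figure, the speedup can never reach 5x regardless of how many workers are added, which is the kind of sub-linear behaviour the abstract attributes to the sequential part of the operations.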
      Degree
      Master of Science (M.Sc.)
      Department
      Computer Science
      Program
      Computer Science
      Supervisor
      Makaroff, Dwight
      Committee
      Schneider, Kevin; Links, Matthew; Osgood, Nathaniel; Robinson, Steve
      Copyright Date
      April 2019
      URI
      http://hdl.handle.net/10388/12087
      Subject
      GWAS
      Collections
      • Graduate Theses and Dissertations