A Modular Parallel Pipeline Architecture for GWAS Applications in a Cluster Environment

Abrar, Faheem 1992-

Files

ABRAR-THESIS-2019.pdf (960.12 KB)

Date

2019-05-09

Authors

Abrar, Faheem 1992-

ORCID

0000-0002-5172-5488

Type

Thesis

Degree Level

Masters

Abstract

A Genome-Wide Association Study (GWAS) is an important bioinformatics method for associating genetic variants with traits, identifying causes of diseases, and increasing plant and crop production. There are several optimizations for improving GWAS performance, including running applications in parallel. However, it can be difficult for researchers to use different data types and workflows with existing approaches. A potential solution to this problem is to model GWAS algorithms as a set of modular tasks. In this thesis, a modular pipeline architecture for GWAS applications is proposed that can leverage a parallel computing environment and store and retrieve data using a shared data cache. To show that the proposed architecture increases the performance of GWAS applications, two case studies are conducted in which the architecture is implemented on a bioinformatics pipeline package called TASSEL and a GWAS application called FaST-LMM, using both Apache Spark and Dask as the parallel processing frameworks and Redis as the shared data cache. The case studies implement parallel processing modules and shared data cache modules according to the specifications of the proposed architecture. Based on the case studies, a number of experiments are conducted that compare the performance of the implemented architecture in a cluster environment with that of the original programs. The experiments reveal that the modified applications indeed run faster than the original sequential programs. However, the modified applications do not scale linearly with cluster resources, because the sequential part of the operations limits the benefit of parallelization. Finally, the architecture is evaluated based on feedback from software developers and bioinformaticians. The evaluation reveals that the domain experts find the architecture useful; they consider the performance improvement of the implementations sufficient and the implementations easy to use, although a GUI-based implementation would be preferable.
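As a rough illustration of the idea described in the abstract (modular tasks that exchange data through a shared cache, with one module running in parallel), a minimal Python sketch follows. It is not the thesis implementation: `SharedCache` is a hypothetical in-memory stand-in for Redis, a standard-library thread pool stands in for Apache Spark or Dask, and the module names, toy data, and toy association statistic are assumptions made only for this example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


class SharedCache:
    """Hypothetical in-memory stand-in for a shared data cache such as Redis."""

    def __init__(self) -> None:
        self._store: Dict[str, object] = {}

    def put(self, key: str, value: object) -> None:
        self._store[key] = value

    def get(self, key: str) -> object:
        return self._store[key]


def association_score(genotypes: List[int], phenotypes: List[float]) -> float:
    """Toy statistic (an assumption): mean phenotype of carriers minus non-carriers."""
    carriers = [p for g, p in zip(genotypes, phenotypes) if g > 0]
    others = [p for g, p in zip(genotypes, phenotypes) if g == 0]
    if not carriers or not others:
        return 0.0
    return sum(carriers) / len(carriers) - sum(others) / len(others)


def load_module(cache: SharedCache) -> None:
    """Sequential module: places made-up input data in the shared cache."""
    cache.put("phenotypes", [1.0, 2.0, 3.0, 4.0])
    cache.put("variants", {"snp1": [0, 0, 1, 2], "snp2": [1, 0, 1, 0]})


def association_module(cache: SharedCache, workers: int = 2) -> None:
    """Parallel module: scores each variant independently, then caches the result."""
    phenotypes = cache.get("phenotypes")
    variants = cache.get("variants")
    names = list(variants)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(
            pool.map(lambda n: association_score(variants[n], phenotypes), names)
        )
    cache.put("scores", dict(zip(names, scores)))


def run_pipeline(
    modules: List[Callable[[SharedCache], None]], cache: SharedCache
) -> SharedCache:
    """Runs the modules in order; they communicate only through the cache."""
    for module in modules:
        module(cache)
    return cache


cache = run_pipeline([load_module, association_module], SharedCache())
print(cache.get("scores"))  # {'snp1': 2.0, 'snp2': -1.0}
```

In this sketch the loading step runs sequentially while only the per-variant scoring is distributed, which is the kind of sequential portion the abstract identifies as limiting scalability.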

Keywords

GWAS

Degree

Master of Science (M.Sc.)

Department

Computer Science

Program

Computer Science

Advisor

Makaroff, Dwight

Committee

Schneider, Kevin ; Links, Matthew ; Osgood, Nathaniel ; Robinson, Steve

URI

http://hdl.handle.net/10388/12087

Collections

Graduate Theses and Dissertations
