A Modular Data Analytic Pipeline for Feature Selection in High Dimensional Microbial Data Sets

Redlick, Ellen

Files

REDLICK-THESIS-2020.pdf (2.17 MB)

Ellen Redlick Thesis 2020-updated.docx (6.45 MB)

Date

2021-03-16

Authors

Redlick, Ellen

ORCID

0000-0003-1431-5516

Type

Thesis

Degree Level

Masters

Abstract

The demand on the global food supply is ever increasing. With a finite amount of land on which to grow crops, soil health is crucial to ensuring a continued, reliable food supply. Understanding how soil microbiomes affect plant growth has proven difficult, in part because of the sheer number of microbes per gram of soil. This challenge is akin to the “large p, small n” problem in statistics, in which the number of features far exceeds the number of samples. We propose a pipeline that analyzes data of this nature with the help of network analysis. Networks, commonly referred to in computer science as graphs, are sets of nodes and edges. For the experiments in this thesis, nodes represent microbes and edges represent their relationships with one another. These relationships are determined by calculating pairwise correlations on the data set. The data used to test the pipeline is an Operational Taxonomic Unit (OTU) abundance table, where columns are OTUs and rows are samples. Four types of network centrality have been implemented and are used to measure the “importance” of a microbe. Each centrality offers a different interpretation of how to quantify importance. A sensitivity analysis was performed on a smooth brome invasion data set using the pipeline. This analysis explored how varying the pipeline parameters affects performance and the consistency of results. The trade-offs among the parameters are discussed, since different users may value different features. This pipeline has been used as part of an application that successfully detected microbes that responded to externalities regardless of abundance.
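
The steps named in the abstract (pairwise correlation, network construction, centrality ranking) can be illustrated with a minimal sketch. This is not the thesis implementation: the use of Pearson correlation, the 0.8 cut-off, the choice of degree centrality, the function names, and the toy abundance table are all assumptions made for illustration, since the abstract does not specify which correlation measure or which four centralities are used.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    denom = sqrt(sum(v * v for v in dx) * sum(v * v for v in dy))
    return sum(a * b for a, b in zip(dx, dy)) / denom if denom else 0.0


def build_network(table, otu_names, threshold=0.8):
    """Link two OTUs when |correlation| >= threshold (the threshold is an assumed parameter)."""
    columns = list(zip(*table))  # rows are samples, so transpose to get one column per OTU
    adjacency = {name: set() for name in otu_names}
    for i, j in combinations(range(len(otu_names)), 2):
        if abs(pearson(columns[i], columns[j])) >= threshold:
            adjacency[otu_names[i]].add(otu_names[j])
            adjacency[otu_names[j]].add(otu_names[i])
    return adjacency


def degree_centrality(adjacency):
    """Fraction of the other nodes each node is connected to (one possible centrality)."""
    n = len(adjacency)
    return {
        node: (len(neighbours) / (n - 1) if n > 1 else 0.0)
        for node, neighbours in adjacency.items()
    }


def select_features(table, otu_names, threshold=0.8, top_k=2):
    """Return the top_k OTUs by centrality; ties are broken alphabetically."""
    scores = degree_centrality(build_network(table, otu_names, threshold))
    return sorted(scores, key=lambda name: (-scores[name], name))[:top_k]


# Made-up abundance table: 5 samples (rows) by 4 OTUs (columns).
otu_names = ["otu_a", "otu_b", "otu_c", "otu_d"]
table = [
    [1, 2, 5, 3],
    [2, 4, 4, 1],
    [3, 6, 3, 4],
    [4, 8, 2, 1],
    [5, 10, 1, 5],
]

selected = select_features(table, otu_names)
```

In this toy table the first three OTUs are perfectly correlated or anti-correlated and therefore form a triangle, while the fourth is left unconnected. Note that the ranking depends on correlation structure rather than on raw abundance, which is in the spirit of the abstract's closing remark.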

Keywords

feature selection, data analysis, microbial data sets

Degree

Master of Science (M.Sc.)

Department

Computer Science

Program

Computer Science

Advisor

Stanley, Kevin

Committee

Kusalik, Anthony; Horsch, Michael; Siciliano, Steven; Arcand, Melissa

URI

http://hdl.handle.net/10388/13284

Collections

Graduate Theses and Dissertations
