A Modular Data Analytic Pipeline for Feature Selection in High Dimensional Microbial Data Sets
Date
2021-03-16
Author
Redlick, Ellen
ORCID
0000-0003-1431-5516
Type
Thesis
Degree Level
Masters
Abstract
The demand on the global food supply is ever increasing. With a finite amount of land to grow crops, soil health is crucial to ensuring a continued reliable food supply. Understanding how soil microbiomes affect plant growth has proven difficult, in part because of the sheer number of microbes per gram of soil. This challenge is akin to the “large p, small n” problem in statistics. We have proposed a pipeline to analyze data of this nature with the help of network analysis. Networks, which are commonly referred to in computer science as graphs, are sets of nodes and edges. For the experiments in this thesis, the nodes represent microbes and the edges represent their relationships with one another. These relationships are determined by calculating pairwise correlations on the data set. The data used to test the pipeline is an Operational Taxonomic Unit (OTU) abundance table, where columns are OTUs and rows are samples. Four types of network centrality have been implemented and are used to measure the “importance” of a microbe. Each of these centralities has a different interpretation of how to quantify importance.
A sensitivity analysis was performed on a smooth brome invasion dataset using the pipeline. This analysis explored the implications of varying the pipeline parameters with respect to performance and result consistency. The trade-offs among the parameters are discussed, since different users may value different features. This pipeline has been used as part of an application that successfully detected microbes that responded to externalities regardless of abundance.
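
The steps named in the abstract (an OTU abundance table, pairwise correlations, a network, a centrality score) can be illustrated with a minimal sketch. It is not the thesis implementation. The Pearson correlation, the 0.8 edge threshold, the choice of degree centrality, the function names, and the toy abundance values are all assumptions made for illustration; the abstract does not say which correlation measure, threshold, or four centralities the pipeline uses.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def correlation_network(table, otu_names, threshold=0.8):
    """Return an adjacency dict: OTUs are nodes, with an edge where |r| >= threshold.

    `table` has one row per sample and one column per OTU.
    The threshold value is an assumption, not a value from the thesis.
    """
    columns = list(zip(*table))  # transpose so each entry is one OTU's abundances
    graph = {name: set() for name in otu_names}
    for i, j in combinations(range(len(otu_names)), 2):
        if abs(pearson(columns[i], columns[j])) >= threshold:
            graph[otu_names[i]].add(otu_names[j])
            graph[otu_names[j]].add(otu_names[i])
    return graph


def degree_centrality(graph):
    """Fraction of the other nodes each node is connected to (one assumed centrality)."""
    n = len(graph)
    if n < 2:
        return {node: 0.0 for node in graph}
    return {node: len(neighbours) / (n - 1) for node, neighbours in graph.items()}


# Hypothetical toy abundance table: 5 samples (rows) by 4 OTUs (columns).
otu_names = ["OTU_1", "OTU_2", "OTU_3", "OTU_4"]
table = [
    [10, 20, 5, 7],
    [20, 40, 4, 1],
    [30, 60, 3, 9],
    [40, 80, 2, 3],
    [50, 100, 1, 8],
]

graph = correlation_network(table, otu_names, threshold=0.8)
centrality = degree_centrality(graph)

# Rank OTUs from most to least "important" under this centrality.
ranked = sorted(centrality, key=centrality.get, reverse=True)
```

In this toy table, OTU_3 has low abundance yet is strongly (negatively) correlated with OTU_1 and OTU_2, so it scores as highly as they do. This is one way a network-based ranking can surface a microbe regardless of its abundance.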
Degree
Master of Science (M.Sc.)
Department
Computer Science
Program
Computer Science
Supervisor
Stanley, Kevin
Committee
Kusalik, Anthony; Horsch, Michael; Siciliano, Steven; Arcand, Melissa
Copyright Date
October 2020
Subject
feature selection
data analysis
microbial data sets