A Modular Data Analytic Pipeline for Feature Selection in High Dimensional Microbial Data Sets

Date

2021-03-16

ORCID

0000-0003-1431-5516

Type

Thesis

Degree Level

Masters

Abstract

The demand on the global food supply is ever increasing. With a finite amount of land on which to grow crops, soil health is crucial to ensuring a continued, reliable food supply. Understanding how soil microbiomes affect plant growth has proven difficult, in part because of the sheer number of microbes per gram of soil. This challenge is akin to the “large p, small n” problem in statistics, where the number of features far exceeds the number of samples. We propose a pipeline that analyzes data of this nature with the help of network analysis. Networks, commonly referred to in computer science as graphs, are sets of nodes and edges. For the experiments in this thesis, the nodes represent microbes and the edges represent their relationships with one another. These relationships are determined by calculating pairwise correlations on the data set. The data used to test the pipeline is an Operational Taxonomic Unit (OTU) abundance table, in which columns are OTUs and rows are samples. Four types of network centrality have been implemented and are used to measure the “importance” of a microbe. Each centrality offers a different interpretation of how to quantify importance. A sensitivity analysis was performed on a smooth brome invasion data set using the pipeline. This analysis explored how varying the pipeline parameters affects performance and the consistency of results. The trade-offs among the parameters are discussed, since different users may value different features. The pipeline has been used as part of an application that successfully detected microbes that responded to externalities regardless of abundance.
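
The steps named in the abstract (abundance table, pairwise correlations, network, centrality ranking) can be illustrated with a minimal sketch. This is not the thesis's implementation: the abstract does not say which correlation measure, edge threshold, or four centralities were used, so Pearson correlation, an absolute-correlation cutoff of 0.8, and degree centrality are assumptions chosen for illustration. The function names and the toy abundance values are likewise hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def correlation_network(table, otu_names, threshold=0.8):
    """Build an undirected graph: nodes are OTUs, edges join strongly correlated pairs.

    `table` has one row per sample and one column per OTU. The threshold on the
    absolute correlation is an assumed parameter, not one taken from the thesis.
    """
    columns = {name: [row[i] for row in table] for i, name in enumerate(otu_names)}
    graph = {name: set() for name in otu_names}
    for a, b in combinations(otu_names, 2):
        if abs(pearson(columns[a], columns[b])) >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def degree_centrality(graph):
    """Fraction of the other nodes that each node is directly connected to."""
    n = len(graph)
    if n < 2:
        return {node: 0.0 for node in graph}
    return {node: len(neighbours) / (n - 1) for node, neighbours in graph.items()}


# Toy abundance table (made-up numbers): rows are samples, columns are OTUs.
otu_names = ["otu_a", "otu_b", "otu_c", "otu_d"]
table = [
    [10, 20, 5, 7],
    [20, 40, 4, 1],
    [30, 60, 3, 9],
    [40, 80, 2, 3],
]

graph = correlation_network(table, otu_names, threshold=0.8)
centrality = degree_centrality(graph)

# Rank OTUs from most to least "important" under this one centrality.
ranked = sorted(centrality, key=centrality.get, reverse=True)
```

In this toy table, `otu_a`, `otu_b`, and `otu_c` are perfectly correlated or anti-correlated with one another and form a connected triangle, while `otu_d` is only weakly correlated with the rest and stays isolated. Note that `otu_c` scores as highly as `otu_b` despite its much lower abundance, because the ranking depends on network position rather than on counts. Swapping in a different centrality or threshold would change the ranking, which is the kind of parameter sensitivity the abstract refers to.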

Keywords

feature selection, data analysis, microbial data sets

Degree

Master of Science (M.Sc.)

Department

Computer Science

Program

Computer Science
