Numerical Comparison: Different Methods of Handling Zeros in Microbiome Data Analysis

Date

2023-09-06

ORCID

0009-0005-9147-1741

Type

Thesis

Degree Level

Masters

Abstract

With the advancement of sequencing methods, investigators now have more opportunities to understand the microbial community’s role in human and plant health. For instance, studying the biological network among microbial taxa can offer researchers insights into plant breeding, and studying the human microbiome can help us understand functions and illnesses. However, analyzing microbiome data presents significant challenges due to the structure of the data. A critical issue is the presence of a large number of zeros. Although many methods for microbiome data analysis have been published, it remains challenging for investigators to select an appropriate one. Our work therefore explores recent methods for handling zeros in microbiome data analysis and provides a detailed numerical comparison. First, we introduce four recent methods: the Bayesian-multiplicative replacement model, the gamma-normal mixture model, the zero-inflated Dirichlet tree multinomial model, and the zero-inflated probabilistic PCA model, detailing their advantages and limitations. Second, we design and implement simulation studies using our novel data generator, the zero-inflated logistic normal multinomial model, which makes use of phylogenetic tree distance. To the best of our knowledge, this is the first zero-inflated model that employs the phylogenetic tree distance. Finally, we evaluate these methods using the Frobenius norm error, the mean squared error for Simpson’s Index, and the Wasserstein distance error. The simulation results suggest that the zero-inflated Dirichlet tree multinomial model (with pseudo counts of 0.5 used as the smoothing method) outperforms the other methods, with the smallest Frobenius norm error and the smallest mean squared error for Simpson’s Index. Additionally, the square root multiplicative treatment model performs notably well, with a minimal Wasserstein distance error and an efficient running time in our simulation study. Conversely, the zero-inflated probabilistic PCA model does not perform as expected due to convergence issues in parameter estimation.
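
The three evaluation criteria named in the abstract can be illustrated with a minimal Python sketch. This is not the thesis code, and every function name below is hypothetical. It assumes that the pseudo count of 0.5 is added to every count before normalizing each sample, that Simpson’s Index is taken as the sum of squared proportions, and that the Wasserstein distance is the one-dimensional distance between two equally sized samples of values. The thesis may define each of these differently.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def pseudo_count_compositions(counts: Sequence[Sequence[float]],
                              pseudo: float = 0.5) -> Matrix:
    """Add a pseudo count to every entry, then normalize each sample (row) to sum to 1.

    Assumption: the pseudo count is added to all entries, not only to the zeros.
    """
    result = []
    for row in counts:
        shifted = [c + pseudo for c in row]
        total = sum(shifted)
        result.append([v / total for v in shifted])
    return result


def frobenius_error(estimate: Matrix, truth: Matrix) -> float:
    """Frobenius norm of the difference between two equally shaped matrices."""
    return math.sqrt(sum((e - t) ** 2
                         for e_row, t_row in zip(estimate, truth)
                         for e, t in zip(e_row, t_row)))


def simpson_index(proportions: Sequence[float]) -> float:
    """Simpson's Index as the sum of squared proportions (assumed definition)."""
    return sum(p * p for p in proportions)


def simpson_mse(estimate: Matrix, truth: Matrix) -> float:
    """Mean squared error between per-sample Simpson's Index values."""
    sq_diffs = [(simpson_index(e) - simpson_index(t)) ** 2
                for e, t in zip(estimate, truth)]
    return sum(sq_diffs) / len(sq_diffs)


def wasserstein_1d(x: Sequence[float], y: Sequence[float]) -> float:
    """1-Wasserstein distance between two empirical samples of equal size.

    For equal-size samples with uniform weights this is the mean absolute
    difference between the sorted values.
    """
    if len(x) != len(y):
        raise ValueError("this sketch only handles samples of equal size")
    return sum(abs(a - b) for a, b in zip(sorted(x), sorted(y))) / len(x)


if __name__ == "__main__":
    # Toy data, made up for illustration only: 2 samples x 3 taxa.
    counts = [[0, 3, 7], [5, 0, 5]]
    truth = [[0.1, 0.3, 0.6], [0.4, 0.1, 0.5]]
    estimate = pseudo_count_compositions(counts)
    print("Frobenius norm error:", frobenius_error(estimate, truth))
    print("MSE of Simpson's Index:", simpson_mse(estimate, truth))
    print("Wasserstein distance (taxon 1):",
          wasserstein_1d([r[0] for r in estimate], [r[0] for r in truth]))
```

In this sketch, lower values of all three quantities indicate that the estimated compositions are closer to the true ones, which is the sense in which the abstract ranks the methods.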

Keywords

Zero-Inflated, Microbiome Data, Phylogenetic Tree Distance

Citation

Degree

Master of Science (M.Sc.)

Department

Mathematics and Statistics

Program

Mathematics
