
Sensitivity And Specificity Of Gene Set Analysis












High-throughput technologies are widely used for understanding biological processes. Gene set analysis is a well-established computational approach for providing a concise biological interpretation of high-throughput gene expression data. It draws on the available knowledge about groups of genes involved in cellular processes or functions. Large collections of such groups, referred to as gene set databases, are available through online repositories to facilitate gene set analysis. Many gene set analysis methods are available, and current recommendations and guidelines about which method to choose for a given experiment are often inconsistent or contradictory. It has also been reported that some gene set analysis methods suffer from a lack of specificity. Furthermore, the sheer size of gene set databases makes it difficult to study these databases and their effect on gene set analysis.

In this thesis, we propose quantitative approaches for the study of reproducibility, sensitivity, and specificity of gene set analysis methods; characterize gene set databases; and offer guidelines for choosing an appropriate gene set database for a given experiment. We review commonly used gene set analysis methods; classify these methods based on their components; describe the underlying requirements and assumptions for each class; suggest the appropriate method for a given experiment; and explain the challenges and pitfalls in interpreting results for each class of methods. We propose a methodology and use it to evaluate the effect of sample size on the results of thirteen gene set analysis methods using real datasets. Further, to investigate the effect of method choice on the results of gene set analysis, we develop a quantitative approach and use it to evaluate ten commonly used gene set analysis methods. We also quantify and visualize gene set overlap and study its effect on the specificity of over-representation analysis. We propose Silver, a quantitative framework for simulating gene expression datasets and evaluating gene set analysis methods without relying on the oversimplifying assumptions commonly made in such evaluations. Finally, we propose a systematic approach for selecting appropriate gene set databases for a given experiment. Using this approach, we highlight the drawbacks of meta-databases such as MSigDB, a well-established gene set database built by extracting gene sets from several sources including GO, KEGG, Reactome, and BioCarta.

Our findings suggest that the results of most gene set analysis methods are not reproducible for small sample sizes. In addition, the results of gene set analysis vary significantly depending on the method used, with little to no commonality among the 20 most significant results across methods. We show that there is a significant negative correlation between gene set overlap and the specificity of over-representation analysis. This suggests that gene set overlap should be taken into account when developing and evaluating gene set analysis methods. We show that the datasets synthesized using Silver preserve complex gene-gene correlations and the distribution of expression values. Using Silver provides unbiased insight into how gene set analysis methods behave when applied to real datasets and real gene set databases. Our quantitative study of several well-established gene set databases reveals that commonly used gene set databases fall short in representing some phenotypes.

The methodologies and results of this research reveal the main challenges facing gene set analysis. We identify key factors that contribute to the lack of specificity and reproducibility of gene set analysis methods, establishing the direction for future research. The quantitative methodologies proposed in this thesis also facilitate the design and development of gene set analysis methods and gene set databases, and benefit a wide range of researchers who use high-throughput technologies.
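For readers unfamiliar with the two quantities the abstract relates, gene set overlap and over-representation analysis, the following is a minimal illustrative sketch and not the thesis's own implementation. It assumes the Jaccard index as the overlap measure (the abstract does not say which measure is used) and the usual one-sided hypergeometric test for over-representation. The function names and the toy gene identifiers are hypothetical.

```python
from math import comb


def jaccard_overlap(set_a, set_b):
    """Jaccard index of two gene sets (assumed overlap measure, for illustration)."""
    a, b = set(set_a), set(set_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def ora_p_value(gene_set, de_genes, background):
    """One-sided hypergeometric p-value for over-representation.

    Probability of seeing at least the observed number of differentially
    expressed (DE) genes in `gene_set` when DE genes are drawn at random
    from `background`.
    """
    background = set(background)
    gene_set = set(gene_set) & background   # ignore genes outside the background
    de_genes = set(de_genes) & background
    n_bg, n_set, n_de = len(background), len(gene_set), len(de_genes)
    observed = len(gene_set & de_genes)
    # Sum the upper tail P(X >= observed) of the hypergeometric distribution.
    tail = sum(
        comb(n_set, i) * comb(n_bg - n_set, n_de - i)
        for i in range(observed, min(n_set, n_de) + 1)
    )
    return tail / comb(n_bg, n_de)


# Toy example with hypothetical gene identifiers.
background = [f"g{i}" for i in range(20)]
set_a = ["g0", "g1", "g2", "g3", "g4"]
set_b = ["g3", "g4", "g5", "g6", "g7"]   # shares g3 and g4 with set_a
de_genes = ["g0", "g1", "g2", "g3"]      # all four fall inside set_a

overlap_ab = jaccard_overlap(set_a, set_b)
p_a = ora_p_value(set_a, de_genes, background)
p_b = ora_p_value(set_b, de_genes, background)
```

In this toy case `set_b` contains a differentially expressed gene only because it shares `g3` with `set_a`. In larger databases, shared genes of this kind are one way that overlapping gene sets can be reported as significant together.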



Gene set analysis, Enrichment analysis, Gene set databases, Gene set overlap, Gene expression



Doctor of Philosophy (Ph.D.)


Computer Science



