# Multiple hypothesis testing and multiple outlier identification methods

## Date

2010-03

## Authors

## Journal Title

## Journal ISSN

## Volume Title

## Publisher

## ORCID

## Type

## Degree Level

Doctoral

## Abstract

Traditional multiple hypothesis testing procedures, such as that of Benjamini and Hochberg, fix an error rate and determine the corresponding rejection region. In 2002 Storey proposed a fixed rejection region procedure and showed numerically that it can gain more power than the fixed error rate procedure of Benjamini and Hochberg while controlling the same false discovery rate (FDR). In this thesis it is proved that when the number of alternatives is small compared to the total number of hypotheses, Storey’s method can be less powerful than that of Benjamini and Hochberg. Moreover, the two procedures are compared by setting them to produce the same FDR. The difference in power between Storey’s procedure and that of Benjamini and Hochberg is near zero when the distance between the null and alternative distributions is large, but Benjamini and Hochberg’s procedure becomes more powerful as the distance decreases. It is shown that modifying the Benjamini and Hochberg procedure to incorporate an estimate of the proportion of true null hypotheses as proposed by Black gives a procedure with superior power.
Multiple hypothesis testing can also be applied to regression diagnostics. In this thesis, a Bayesian method is proposed to test multiple hypotheses, of which the i-th null and alternative hypotheses are that the i-th observation is not an outlier versus it is, for i=1,...,m. In the proposed Bayesian model, it is assumed that outliers have a mean shift, where the proportion of outliers and the mean shift respectively follow a Beta prior distribution and a normal prior distribution. It is proved in the thesis that for the proposed model, when there exists more than one outlier, the marginal distributions of the deletion residual of the i-th observation under both null and alternative hypotheses are doubly noncentral t distributions. The “outlyingness” of the i-th observation is measured by the marginal posterior probability that the i-th observation is an outlier given its deletion residual. An importance sampling method is proposed to calculate this probability. This method requires the computation of the density of the doubly noncentral F distribution and this is approximated using Patnaik’s approximation. An algorithm is proposed in this thesis to examine the accuracy of Patnaik’s approximation. The comparison of this algorithm’s output with Patnaik’s approximation shows that the latter can save massive computation time without losing much accuracy.
The proposed Bayesian multiple outlier identification procedure is applied to some simulated data sets. Various simulation and prior parameters are used to study the sensitivity of the posteriors to the priors. The area under the ROC curves (AUC) is calculated for each combination of parameters. A factorial design analysis on AUC is carried out by choosing various simulation and prior parameters as factors. The resulting AUC values are high for various selected parameters, indicating that the proposed method can identify the majority of outliers within tolerable errors. The results of the factorial design show that the priors do not have much effect on the marginal posterior probability as long as the sample size is not too small.
In this thesis, the proposed Bayesian procedure is also applied to a real data set obtained by Kanduc et al. in 2008. The proteomes of thirty viruses examined by Kanduc et al. are found to share a high number of pentapeptide overlaps to the human proteome. In a linear regression analysis of the level of viral overlaps to the human proteome and the length of viral proteome, it is reported by Kanduc et al. that among the thirty viruses, human T-lymphotropic virus 1, Rubella virus, and hepatitis C virus, present relatively higher levels of overlaps with the human proteome than the predicted level of overlaps. The results obtained using the proposed procedure indicate that the four viruses with extremely large sizes (Human herpesvirus 4, Human herpesvirus 6, Variola virus, and Human herpesvirus 5) are more likely to be the outliers than the three reported viruses. The results with thefour extreme viruses deleted confirm the claim of Kanduc et al.

## Description

## Keywords

mean shift, noncentrality parameter, area under ROC curve, receiver operating characteristic, false discovery rate, microarray, doubly noncentral t distribution, pentapeptide, amino acid sequence similarity

## Citation

## Degree

Doctor of Philosophy (Ph.D.)

## Department

Mathematics and Statistics

## Program

Mathematics and Statistics