Identifying cell types with single cell sequencing data
Date
2022-01-31
Authors
ORCID
0000-0003-3350-6350
Type
Thesis
Degree Level
Masters
Abstract
Single-cell RNA sequencing (scRNA-seq) techniques, which examine the genetic information of individual cells, provide unparalleled resolution for probing cellular heterogeneity. In contrast, traditional (bulk) RNA sequencing technologies measure the average RNA expression level across a large number of input cells, which is insufficient for studying heterogeneous systems. Hence, scRNA-seq technologies make it possible to tackle many previously inaccessible problems, such as rare cell type identification, cancer evolution, and cell lineage relationship inference.
Cell population identification is fundamental to the analysis of scRNA-seq data. Generally, the workflow of scRNA-seq analysis includes data processing, dropout imputation, feature selection, dimensionality reduction, similarity matrix construction, and unsupervised clustering. Many single-cell clustering algorithms rely on cell-cell similarity matrices, but many existing methods have not achieved the expected results. Analyzing scRNA-seq data sets poses unique challenges, including significant levels of biological and technical noise, so similarity matrix construction still deserves further study.
In my study, I present a new method, named Learning Sparse Similarity Matrices (LSSM), to construct cell-cell similarity matrices; several clustering methods are then applied to these matrices to identify cell populations from scRNA-seq data. Firstly, based on sparse subspace theory, each cell is expressed as a linear combination of the other cells of the same cell type. Secondly, I construct a convex optimization objective function to find the similarity matrix, which consists of the coefficients of these linear combinations. Thirdly, I design an algorithm that combines column-wise learning with a greedy strategy to solve the objective function. As a result, the large optimization problem over the whole similarity matrix is decomposed into a series of smaller optimization problems, one per column of the similarity matrix, and the sparsity of the whole matrix is ensured by the sparsity of each column. Fourthly, to select the best clustering method for identifying cell populations, I apply several clustering methods separately to the similarity matrices computed by LSSM from eight scRNA-seq data sets. The clustering results show that my method performs best when combined with spectral clustering (Laplacian eigenmaps + k-means clustering). In addition, compared with five state-of-the-art methods, my method outperforms most competing methods on the eight data sets. Finally, I combine LSSM with t-Distributed Stochastic Neighbor Embedding (t-SNE) to visualize the scRNA-seq data points in two-dimensional space. The results show that most data points of the same cell type lie close together, while points from different cell types are separated.
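The column-wise idea described above can be illustrated with a minimal sketch. This is not the LSSM algorithm itself, whose objective function and solver are not given in this abstract. As a stand-in, the sketch uses orthogonal matching pursuit (OMP), a standard greedy sparse-coding routine, to express each cell as a sparse combination of the other cells, one column at a time. The function names (`omp_column`, `sparse_similarity`), the genes-by-cells orientation of `X`, the sparsity level `n_nonzero`, and the synthetic two-subspace data are all assumptions made for illustration.

```python
import numpy as np

def omp_column(X, j, n_nonzero=3, tol=1e-10):
    """Greedily express cell j as a sparse linear combination of the other cells."""
    n_cells = X.shape[1]
    y = X[:, j]
    residual = y.copy()
    excluded = np.zeros(n_cells, dtype=bool)
    excluded[j] = True                      # a cell may not represent itself
    support, sol = [], np.zeros(0)
    for _ in range(n_nonzero):
        if np.linalg.norm(residual) < tol:  # already well represented
            break
        corr = np.abs(X.T @ residual)
        corr[excluded] = -np.inf            # skip cell j and cells already chosen
        k = int(np.argmax(corr))
        support.append(k)
        excluded[k] = True
        # Refit the coefficients on the selected cells by least squares.
        sol, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        residual = y - X[:, support] @ sol
    coef = np.zeros(n_cells)
    coef[support] = sol
    return coef

def sparse_similarity(X, n_nonzero=3):
    """Build a symmetric, sparse cell-cell similarity matrix column by column."""
    norms = np.linalg.norm(X, axis=0)
    Xn = X / np.where(norms > 0, norms, 1.0)   # unit-norm columns (cells)
    n_cells = X.shape[1]
    C = np.column_stack([omp_column(Xn, j, n_nonzero) for j in range(n_cells)])
    return np.abs(C) + np.abs(C).T             # symmetrize the coefficients

# Synthetic data (assumption): two "cell types", each lying in its own
# 3-dimensional subspace of a 20-gene expression space, 15 cells per type.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((20, 6)))
X = np.hstack([Q[:, :3] @ rng.standard_normal((3, 15)),
               Q[:, 3:] @ rng.standard_normal((3, 15))])
W = sparse_similarity(X)
```

On this noise-free toy data, each column is solved independently and has at most `n_nonzero` nonzero coefficients, so the full matrix is sparse by construction. A matrix such as `W` could then be passed to a spectral clustering routine that accepts a precomputed affinity, for example scikit-learn's `SpectralClustering(affinity="precomputed")`.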
Keywords
Clustering, Cell type identification, Sparse similarity learning, Single cell
Degree
Master of Science (M.Sc.)
Department
Biomedical Engineering
Program
Biomedical Engineering