
The Selective Decision Tree Classifier: A Novel Classifier based on Feature Selection






In the era of big data, it is crucial to develop and improve techniques that aid in data processing, such as data reduction. Feature selection is a data reduction technique that generates subsets of data that can be used to build machine learning models. Machine learning uses the processing power of a computer to build classification models from the input features. By pre-selecting the most relevant features in a data set, machine learning techniques can build simpler and more accurate models. This thesis proposes a novel classification method based on feature selection: the Selective Decision Tree Classifier (SDTC). The SDTC is derived from the Selective Bayesian Classifier (SBC). Both classifiers use Decision Trees (DTs) to rank the features of a data set. DTs organize features in a tree structure, with the most predictive features at the top of the tree, branching down into features that eventually lead to a classification. Features are thus evaluated and ranked by the level of the DT at which they appear. Both classification methods select features from the upper levels of a DT and feed those features into a machine learning model built to classify data. The SDTC allows the selection depth to vary, whereas the SBC uses a fixed depth of three levels. To scale to larger data sets, the SDTC uses DTs as the final machine learning model, whereas the SBC uses the Naive Bayesian model.

The SDTC is applied to three data sets. The first is the Level of Service Inventory (LSI) data set, which details over 72,000 responses to a recidivism risk assessment test and whether those individuals recidivated. The second, provided by the Saskatoon Police Service, contains information on missing children cases. The third provides the scores and reaction times recorded from the Computerized Assessment of Mild Cognitive Impairment (CAMCI) test, taken by individuals as a pre-diagnostic tool for Alzheimer's disease and dementia.

To demonstrate where the SDTC excels at building better predictive models, two other classification models were built from the data sets: DTs and the SBC. When comparing the SDTC to a DT model, the advantages of feature selection are clearly evident in the improved accuracy. On the LSI data set, the SDTC selected a single feature out of 43 and obtained over 76% accuracy, compared to the 71% accuracy obtained by a DT model using all the features. The highest accuracy, 92%, was seen using the SDTC on the Missing Persons data set, an increase of almost 7% over DTs while using only two of the 90 features. On the CAMCI data set, the SDTC achieved 60% accuracy using only 12 of the 33 features, compared to the 58% accuracy obtained by DTs using all the features. The flexibility of the SDTC also gave it distinct advantages over the SBC, in some cases improving predictive accuracy: the SDTC outperformed the SBC on two of the three data sets. The largest gain was on the Missing Persons data set, where the SDTC achieved 92% accuracy, 7% higher than the SBC. The demonstrated capability of the SDTC supports its potential for analyzing a variety of data sets and for obtaining a deeper understanding of how DT structures can be used for feature selection.
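The tree-depth-based selection described above can be sketched in a few lines. This is a minimal illustration, not the thesis's implementation: it assumes scikit-learn's `DecisionTreeClassifier` as the tree learner, a bundled demo data set in place of the thesis's data, and a hypothetical helper `features_in_top_levels` that walks the fitted tree and keeps the features used in split nodes within the first `depth` levels, mirroring the SDTC's tunable selection depth.

```python
# Sketch of SDTC-style feature selection (assumption: scikit-learn trees;
# names like `features_in_top_levels` are illustrative, not from the thesis).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def features_in_top_levels(clf, depth):
    """Indices of features used in split nodes within `depth` levels of the root."""
    t = clf.tree_
    selected = set()
    stack = [(0, 0)]  # (node_id, level), starting at the root
    while stack:
        node, level = stack.pop()
        # Stop at leaves (children == -1 in sklearn) or past the requested depth.
        if level >= depth or t.children_left[node] == -1:
            continue
        selected.add(t.feature[node])
        stack.append((t.children_left[node], level + 1))
        stack.append((t.children_right[node], level + 1))
    return sorted(selected)


X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Step 1: rank features by building a full decision tree.
ranker = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Step 2: keep only features from the top `depth` levels (the SDTC's
# variable depth; the SBC fixes this at three).
depth = 3
keep = features_in_top_levels(ranker, depth)

# Step 3: the final model is another decision tree on the reduced feature set.
final = DecisionTreeClassifier(random_state=0).fit(X_tr[:, keep], y_tr)
print(f"kept {len(keep)} of {X.shape[1]} features, "
      f"test accuracy {final.score(X_te[:, keep], y_te):.2f}")
```

With `depth` exposed as a parameter, the same routine reproduces the SBC's fixed three-level selection as a special case, which is the flexibility the abstract highlights.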



Feature Selection, Machine Learning, Selective Decision Tree Classifier, Data Analysis, Decision Trees, Entropy, Gini Impurity, Cognitive Impairment, Police, Missing Persons



Master of Science (M.Sc.)


Computer Science
