Influence of training dataset selection on the performance of a machine learning model
Date
2022-04-05
Authors
ORCID
0000-0001-6049-0230
Type
Thesis
Degree Level
Masters
Abstract
To observe the growth dynamics of canola flowers during the blooming season and to forecast the canola harvest, researchers at P2IRC, located at the University of Saskatchewan, have developed an application called ‘Flower Counter’.
The model was developed with a Deep Learning (DL) based Multi-column Convolutional Neural Network (MCNN) algorithm and the TensorFlow framework. It is an object counting model that counts canola flowers in images, based on what it learns from a given set of training images, called ‘ground-truths’.
This work aims to compose a training dataset that yields good accuracy and a robust object counting model, by experimenting with different training and testing combinations. Various evaluation techniques are used to assess the impact of the training dataset on the testing results and generalizability of the model.
The primary goal of this research is to define a good, diverse training dataset composition. A good composition covers the different characteristics present in the dataset that can affect the testing results and help create a robust object counting model. Different characteristics of the training and testing datasets are examined to identify the characteristics and features that most strongly affect the test results. A further objective is to evaluate the impact of training dataset selection on the accuracy of the testing results produced by the ML model. This work would help researchers and plant scientists understand the diversity of characteristics needed when composing a training dataset. By identifying the characteristics that affect testing results, it can also offer insights into reducing the manual effort required to create ground truth for training models.
Since the entire training of the model depends on datasets collected under diverse weather conditions, some factors could affect some of the experimental results. Training dataset selection has not been explored much as a research area, and this work offers insights into the generalization capability of the model and into the scope for using manual effort to obtain a robust object counting model.
Keywords
Machine Learning, Training Dataset selection
Degree
Master of Science (M.Sc.)
Department
Computer Science
Program
Computer Science