Data De-Duplication in NoSQL Databases

Brad, Nicoleta

dc.contributor.advisor	Deters, Ralph	en_US
dc.contributor.committeeMember	Cooke, John	en_US
dc.contributor.committeeMember	Vassileva, Julita	en_US
dc.contributor.committeeMember	Dinh, Anh	en_US
dc.creator	Brad, Nicoleta	en_US
dc.date.accessioned	2013-01-03T22:30:42Z
dc.date.available	2013-01-03T22:30:42Z
dc.date.created	2012-03	en_US
dc.date.issued	2012-05-22	en_US
dc.date.submitted	March 2012	en_US
dc.description.abstract	With the popularity and expansion of Cloud Computing, NoSQL databases (DBs) are becoming the preferred choice for storing data in the Cloud. Because they are highly de-normalized, these DBs tend to store significant amounts of redundant data. Data de-duplication (DD) plays an important role in reducing storage consumption, keeping data affordable to manage amid today’s explosive data growth. Numerous DD methodologies, such as chunking and delta encoding, are available today to optimize the use of storage. These technologies approach DD at the file and/or sub-file level, but this approach has never been optimal for NoSQL DBs. This research proposes Data De-Duplication in NoSQL Databases (DDNSDB), which applies DD at a higher level of abstraction, namely at the DB level. It uses structural information about the data (metadata), exploiting its granularity to identify and remove duplicates. The main goals of this research are: to maximally reduce the amount of duplicates in one type of NoSQL DB, namely the key-value store; to maximize the performance of the process such that the backup window is only marginally affected; and to design with horizontal scaling in mind so that it runs competitively on a Cloud Platform. Additionally, this research presents an analysis of the various types of NoSQL DBs (such as key-value, tabular/columnar, and document DBs) to understand the data models required for the design and implementation of DDNSDB. Primary experiments have demonstrated that DDNSDB can further reduce the NoSQL DB storage space compared with current archiving methods (from 17% to nearly 69% as more structural information is available).
Also, by following an optimized, adapted MapReduce architecture, DDNSDB proves to have a competitive performance advantage in a horizontal scaling cloud environment compared with a vertical scaling environment (from 28.8 milliseconds to 34.9 milliseconds as the number of parallel Virtual Machines grows).	en_US
dc.identifier.uri	http://hdl.handle.net/10388/ETD-2012-03-387	en_US
dc.language.iso	eng	en_US
dc.subject	duplicates	en_US
dc.subject	hash table	en_US
dc.subject	NoSQL	en_US
dc.subject	Cloud Computing	en_US
dc.title	Data De-Duplication in NoSQL Databases	en_US
dc.type.genre	Thesis	en_US
dc.type.material	text	en_US
thesis.degree.department	Computer Science	en_US
thesis.degree.discipline	Computer Science	en_US
thesis.degree.grantor	University of Saskatchewan	en_US
thesis.degree.level	Masters	en_US
thesis.degree.name	Master of Science (M.Sc.)	en_US
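
The abstract describes removing duplicates at the database level in a key-value store, and the subject keywords mention a hash table. The following is only a minimal sketch of that general idea, not the DDNSDB algorithm from the thesis. It assumes values are JSON-serializable and uses SHA-256 as the fingerprint; the function names `dedupe_key_value_store` and `restore` are hypothetical and chosen for illustration.

```python
import hashlib
import json


def dedupe_key_value_store(store):
    """Split a key-value store into unique values plus per-key references.

    Assumption: values are JSON-serializable. This is an illustrative
    hash-table approach, not the method implemented in the thesis.
    """
    unique_values = {}  # fingerprint -> value, each distinct value kept once
    references = {}     # original key -> fingerprint of its value
    for key, value in store.items():
        # Canonical JSON (sorted keys) so equal values hash identically.
        encoded = json.dumps(value, sort_keys=True).encode("utf-8")
        fingerprint = hashlib.sha256(encoded).hexdigest()
        unique_values.setdefault(fingerprint, value)
        references[key] = fingerprint
    return unique_values, references


def restore(unique_values, references):
    """Rebuild the original store from the de-duplicated form."""
    return {key: unique_values[fp] for key, fp in references.items()}


store = {
    "user:1": {"city": "Saskatoon", "plan": "basic"},
    "user:2": {"plan": "basic", "city": "Saskatoon"},  # same content as user:1
    "user:3": {"city": "Regina", "plan": "basic"},
}
unique_values, references = dedupe_key_value_store(store)
```

In this sketch the two identical records share one stored value, and `restore(unique_values, references)` returns the original mapping. Working on whole values rather than byte ranges is what distinguishes this from file-level chunking.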

Files

Original bundle

Name: BRAD-THESIS.pdf
Size: 1.5 MB
Format: Adobe Portable Document Format

Name: nid690-DataDeDuplicationInNoSQLDatabasesRevised.pdf
Size: 1.5 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1006 B
Format: Plain Text

Collections

Graduate Theses and Dissertations