MACRM: A Multi-agent Cluster Resource Management System

Sedighi Gilani, Mohammad Hossein

Files

SEDIGHI-GILANI-THESIS.pdf (2.63 MB)

Date

2016-05-16

Authors

Sedighi Gilani, Mohammad Hossein

Degree Level

Masters

Abstract

The falling cost of cluster computing has significantly increased its use over the last decade. As a result, the number of users, the size of clusters, and the diversity of jobs submitted to clusters have all grown. These changes call for a redesign of cluster resource management systems. The growth in the number of users and in the size of clusters requires a more scalable approach to resource management. Moreover, the ever-increasing use of clusters for a diverse range of computations demands fault-tolerant and highly available cluster management systems. Last but not least, serving highly parallel and interactive jobs in a cluster with hundreds of nodes requires high-throughput scheduling with a very short service time. This research presents MACRM, a multi-agent cluster resource management system. MACRM is an adaptive distributed/centralized resource management system that addresses the requirements of scalability, fault tolerance, high availability, and high-throughput scheduling. It breaks up resource management responsibilities and delegates them to different agents, which lets it scale in several respects. The modularity of MACRM's design also increases fault tolerance, because components are replicable and recoverable. Furthermore, MACRM has a very short service time under different loads: it maintains an average service time of less than 15 ms by adaptively switching between centralized and distributed decision making based on the cluster's load. Comparing MACRM with representative centralized and distributed systems (YARN [67] and Sparrow [52], respectively) shows several advantages. We show that MACRM scales better as the number of resources, users, or jobs in a cluster increases. MACRM also has faster and less expensive failure recovery mechanisms than the two other systems. Finally, our experiments show that MACRM's average service time is lower than that of the other systems, particularly under high load.

Keywords

Cluster computing, Resource management, Multi-agent systems

Degree

Master of Science (M.Sc.)

Department

Computer Science

Program

Computer Science

Advisor

Jamali, Nadeem

Committee

Roy, Chanchal; Stanley, Kevin; Gokaraju, Ramakrishna

URI

http://hdl.handle.net/10388/ETD-2016-04-2560

Collections

Graduate Theses and Dissertations
