A Survey on Resource Management and Scheduling in Grids

Kamesh Palanisamy
 Email: kpalanis@cs.kent.edu

Prepared for Prof. Javed I. Khan
Department of Computer Science, Kent State University
Date: May 2003





Abstract. Ensembles of distributed, heterogeneous resources, known as Computational Grids, have emerged as a popular platform for deploying large-scale, resource-intensive applications. By sharing resources across sites, a Grid can deliver high computing power far more cost-effectively than a single standalone supercomputing facility. Grid technologies promise to change the way we tackle complex problems: they enable large-scale aggregation and sharing of computational, data, and other resources across institutional boundaries, and harnessing these technologies effectively will transform scientific disciplines ranging from high-energy physics to the life sciences. This paper surveys the central scheduling aspects of Grid computing, discusses related resource management issues, and gives a brief survey of current Grid systems.

Keywords: Grid Computing, Grid Scheduling, Heterogeneous Computing, Grid Taxonomy, Grid Survey, Internet Computing.



Table of Contents:

Introduction
Emergence of Internet Computing and Grids
Grid Technology and Resource Management
Characteristics of Grids
Stages of Scheduling
Matching and Scheduling Taxonomy
Classification of Grid Scheduling
Scheduling Algorithms
Survey on some Grid Systems
Conclusions and other issues


Introduction   (top)

As computation, storage, and communication technologies steadily improve, increasingly large, complex, and resource-intensive applications are being developed in both research institutions and industry. Computational resources, however, are failing to keep pace with the demands of these applications. The power of network, storage, and computing resources is projected to double roughly every 9, 12, and 18 months, respectively. Servers and storage have continued to improve their price/performance rapidly by leveraging new innovations and manufacturing efficiencies, and the same trend has finally taken hold for bandwidth. Grand challenge applications in disciplines such as high-energy physics and bioinformatics are a driving force for the computing infrastructure of the future; their constantly increasing computational requirements often cannot be met by any single available system. The logical response is to move from a model based on discrete infrastructure components to a distributed computing model that fully leverages the capabilities of the whole infrastructure. Ensembles of distributed, heterogeneous resources, or Computational Grids, have emerged as a popular platform for deploying such applications. Resources can then be shared and used efficiently to obtain high computing power at far lower cost than a standalone supercomputing facility, with a correspondingly greater return on investment. Grid technologies promise to change the way we tackle complex problems: they enable large-scale aggregation and sharing of computational, data, and other resources across institutional boundaries, and harnessing them effectively will transform scientific disciplines ranging from high-energy physics to the life sciences. For users from multiple organizations to use a computing Grid efficiently and securely, they must belong to a virtual organization (VO) sharing a common goal and a common set of resources. A VO is a set of participants with various relationships that wish to share resources to perform some task; assigning resources, users, and applications to VOs is the fundamental technical value proposition of the Grid.


Emergence of Internet Computing and Grids:    (top)

Ever since the advent of automated computing, the thrust of development over the past fifty years has been to increase the raw power of the isolated processor. Although processor speeds have reached impressive heights, a single machine is still far too slow for complex engineering problems. One solution to the problem of inadequate computing power is to 'cluster' multiple individual computers, a standard practice in supercomputing centers, research labs, and industry.

But the inherent problem with clusters is that each remains a dedicated facility built at a single location, and is therefore not all-pervasive. Rapid improvements in communication technologies are leading many to consider more decentralized approaches to the problem of computing power. It has also been observed that most workstations are idle yet powered on most of the time, making them a good resource to harness. SETI@home is considered one of the biggest breakthroughs, if not the biggest, in the Internet computing arena: it harnesses about half a million PCs around the world, delivering roughly 1,000 CPU-years of computation per day, making it one of the most powerful special-purpose computing systems in the world today.

Internet computing is just a special case of something much more powerful: the ability of communities to share resources as they tackle common goals. Thus was born Grid technology, whose framework we explore in this paper. We discuss the various types of Grids and their taxonomy, and then turn to the major issue in Grids, resource management and scheduling, surveying the latest trends. Scientists and engineers envision an integrated Grid in which problems of different types can be routed to the most appropriate resource: dedicated supercomputers for specialized problems that require tightly coupled processors, and idle workstations for more latency-tolerant data analysis problems.


Grid Technology and Resource Management   (top)

The ubiquity of the Internet, together with the availability of powerful computers and high-speed network technologies as low-cost commodity components, is rapidly changing the computing landscape and society. These technology opportunities have made it possible to use wide-area distributed computers for solving large-scale problems, leading to what is popularly known as Grid computing. The term Grid is chosen as an analogy to the electric power grid, which provides consistent, pervasive, dependable, transparent access to electricity irrespective of its source. Such an approach to network computing is known by several names: metacomputing, scalable computing, global computing, Internet computing, and more recently Peer-to-Peer (P2P) computing.

Hierarchical and decentralized approaches are suitable for Grid resource and operational management, usually with respect to memory and processing power, because traditional centralized policies require complete state information, which is not feasible in the Grid scenario. The Grid resource broker mediates between producers and consumers. Resources are Grid-enabled by deploying low-level middleware on them. The core middleware deployed on a producer's Grid resources handles resource access authorization and permits only authorized users to access them. The user-level and core middleware on a consumer's resources support the creation of Grid-enabled applications, or the tools needed to run legacy applications on the Grid. Upon authenticating to the Grid, consumers interact with resource brokers to execute their applications on remote resources. The resource broker takes care of resource discovery, selection, aggregation, data and program transport, initiating execution on remote resources, and gathering results.


Characteristics of Grids:    (top)
Described below are some of the key aspects of Grid technology, which provide end users with a seamless computational environment.


Stages of Scheduling:  (top)


Matching and Scheduling Taxonomy  (top)

An application can be considered to be made up of subtasks, which may require different architectural capabilities. Each subtask is assigned to a particular machine (matching), and the subtasks assigned to a particular machine are ordered (scheduling) such that the overall execution time of the application is minimized. The combination of matching and scheduling subtasks to machines is defined as subtask mapping. The Purdue Heterogeneous Computing Taxonomy [16] classifies mapping heuristics according to their target platform and application models. A broader taxonomy, covering Grid resource management systems as a whole, is summarized below:

Taxonomy for Grid resource management [11]

Attribute of RMS                 Taxonomy
---------------------------      ----------------------------------------
Type of Service                  Computational, Data, or Service Grids
Machine Organization             Flat, Hierarchical, Cells
Resource Model                   Fixed or Extensible
Namespace Organization           Relational, Hierarchical, Graph
Quality of Service               Soft, Hard, None
Resource Information Store       Network Directory, Distributed Objects
Resource Discovery               Queries, Agents
Resource Info Dissemination      Batch/Periodic, Online/On-demand
Scheduler Organization           Centralized, Hierarchical, Decentralized
Scheduling Policy                System-centric, User-centric
State Estimation                 Predictive, Non-predictive
Rescheduling                     Periodic, Event-driven

Currently it is difficult to meaningfully compare different mapping approaches, or to extend existing work, without understanding the relationships among previous efforts. A researcher can use the above taxonomy to find mapping heuristics that use similar target platforms and application models.
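
To make the matching step concrete, the following minimal sketch (an illustration, not an algorithm from any of the surveyed systems) implements the classic min-min batch-mode mapping heuristic, in which the subtask with the smallest earliest completion time is mapped first. The ETC matrix and the function name are assumptions introduced for the example.

    # Minimal sketch of the classic min-min batch-mode mapping heuristic.
    # The ETC (expected time to compute) matrix is a hypothetical input:
    # etc[t][m] = estimated run time of subtask t on machine m.

    def min_min(etc):
        num_machines = len(etc[0])
        ready = [0.0] * num_machines          # time at which each machine frees up
        unmapped = set(range(len(etc)))
        mapping = {}
        while unmapped:
            best = None                       # (completion_time, task, machine)
            for t in unmapped:
                for m in range(num_machines):
                    ct = ready[m] + etc[t][m]  # completion time of t on m
                    if best is None or ct < best[0]:
                        best = (ct, t, m)
            ct, t, m = best                   # map the task with the overall
            mapping[t] = m                    # minimum completion time first
            ready[m] = ct
            unmapped.remove(t)
        return mapping, max(ready)            # assignment and resulting makespan

    # Example: 3 subtasks on 2 machines; machine 1 is much faster for task 2.
    print(min_min([[4.0, 6.0], [3.0, 5.0], [7.0, 2.0]]))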


Classification of Grid Scheduling [11]  (top)

In a centralized organization, there is only one scheduling controller responsible for system-wide decision making. Such an organization has several advantages, including easy management, simple deployment, and the ability to co-allocate resources. In a Grid RMS, however, the disadvantages of this organization, namely the lack of scalability, the lack of fault tolerance, and the difficulty of accommodating multiple policies, outweigh the advantages. Condor utilizes a centralized scheme.

The other two organizations, hierarchical and decentralized, have properties better suited to a Grid RMS scheduler. In a hierarchical organization, the scheduling controllers are arranged in a hierarchy; one obvious arrangement is to let higher-level controllers manage larger sets of resources and lower-level controllers manage smaller sets. Compared with centralized scheduling, hierarchical scheduling addresses the scalability and fault-tolerance issues while retaining some advantages of the centralized scheme, such as co-allocation. Many Grid systems, including 2K, Darwin, and Legion, use hierarchical schedulers.

A decentralized organization addresses several important issues such as fault tolerance, scalability, site autonomy, and multi-policy scheduling, but it makes management, usage tracking, and co-allocation harder. This scheme scales well to large networks, but the schedulers must coordinate with each other through some form of resource discovery or resource trading protocol. Systems such as Bond, MOL, and Ninf use decentralized scheduling approaches.

As resource availability in the Grid changes over time, scheduling systems need to be adaptive. This is achieved by evaluating the current schedule (state estimation), often with predictive techniques, and then developing a new schedule, i.e., rescheduling, to meet the user's requirements. Rescheduling can be initiated periodically or whenever some event occurs, as sketched below.
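
A minimal skeleton of such an adaptive scheduler is shown below; estimate_state, schedule_is_acceptable, and reschedule are hypothetical stand-ins for the components of a real RMS.

    import queue
    import random

    # Illustrative stand-ins only; a real RMS would supply these.
    def estimate_state():
        # A predictive estimator would forecast load; here we merely sample it.
        return {"load": random.random()}

    def schedule_is_acceptable(state):
        return state["load"] < 0.8            # e.g., a user-supplied threshold

    def reschedule(trigger, state):
        print("rescheduling on", trigger, "- observed load:", round(state["load"], 2))

    def adaptive_loop(events, period_sec=1.0, rounds=5):
        """Re-evaluate the schedule periodically or whenever an event arrives."""
        for _ in range(rounds):
            try:
                trigger = events.get(timeout=period_sec)  # event-driven path
            except queue.Empty:
                trigger = "periodic"                      # timer expired
            state = estimate_state()                      # state estimation
            if not schedule_is_acceptable(state):
                reschedule(trigger, state)                # develop a new schedule

    adaptive_loop(queue.Queue())   # no events queued, so purely periodic here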


Scheduling Algorithms  (top)

Efficient scheduling software should minimize idle processing time. No single algorithm can achieve optimal performance across all possible job execution time distributions; however, given more information, such as which distribution dominates the job spectrum, a scheduling system can choose a policy that will be near-optimal. Scheduling is said to be static when the processors on which jobs will run are assigned at compile time or before execution; dynamic scheduling, or load balancing, is performed at run time. To arrive at a scheduling decision, the scheduling system needs to take various parameters into consideration, including the following:

a) Resource Architecture and Configuration
b) Resource Capability (clock speed, memory size)
c) Resource State (such as CPU load, memory available, free disk storage)
d) Resource Requirements of an Application
e) Access Speed (such as disk access speed)
f) Free or Available Nodes
g) Priority (that the user has)
h) Queue Type and Length
i) Network Bandwidth, Load, and Latency (if jobs need to communicate)
j) Reliability of Resource and Connection
k) User Preference
l) Application Deadline
m) User Capacity/Willingness to Pay for Resource Usage
n) Resource Cost (in dollars that the user needs to pay to the resource owner)
o) Resource Cost Variation over Time (e.g., higher during the day, lower at night)
p) Historical Information, including Job Consumption Rate

Scheduling algorithms may use some combination of the above parameters, and needless to say, no algorithm can achieve an optimal solution considering all of them. It is always a tradeoff, and the priority attached to a particular parameter depends on the architecture model and the other criteria discussed earlier; the sketch below illustrates one simple way of combining such parameters.
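
As one illustration of such a tradeoff, the hypothetical broker below ranks candidate resources by a weighted score over a handful of the parameters above. The attribute names and weights are assumptions invented for the example, not taken from any surveyed system.

    # Hypothetical sketch: rank candidate resources with a weighted score over
    # a few of the parameters listed above (capability, load, cost, bandwidth).

    def score(res, w):
        return (w["speed"] * res["clock_ghz"]
                + w["bandwidth"] * res["bandwidth_mbps"] / 100.0
                - w["load"] * res["cpu_load"]
                - w["cost"] * res["cost_per_hour"])

    def select_resource(resources, w):
        return max(resources, key=lambda r: score(r, w))

    resources = [
        {"name": "clusterA", "clock_ghz": 2.4, "cpu_load": 0.7,
         "cost_per_hour": 1.0, "bandwidth_mbps": 100},
        {"name": "idlePCs", "clock_ghz": 1.0, "cpu_load": 0.1,
         "cost_per_hour": 0.1, "bandwidth_mbps": 10},
    ]

    # A cost-sensitive user weights cost heavily; a deadline-driven user would
    # weight speed instead -- precisely the tradeoff discussed above.
    weights = {"speed": 1.0, "load": 2.0, "cost": 3.0, "bandwidth": 1.0}
    print(select_resource(resources, weights)["name"])   # prints: idlePCs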


Survey on some Grid Systems  (top)
The following system descriptions are based on the survey in [11].

2K: A Distributed Operating System [21]

2K provides a flexible and adaptable architecture for delivering distributed services across a wide variety of platforms. It was intended for the development and deployment of distributed service applications rather than high-performance grand-challenge applications. It can be considered a demand service Grid that uses a flat RMS organization. Resource discovery is performed through agents, and dissemination is performed on demand by injecting mobile agents into the system. 2K uses a one-level hierarchical controller for scheduling network bandwidth and a decentralized controller for all other resources. The scheduling policy in this system appears to be fixed.

AppLeS: A Network Enabled Scheduler

Application-level scheduling (AppLeS) [19] focuses primarily on developing scheduling agents for individual applications on computational Grids. It uses other RMSs such as Globus, Legion, and NetSolve to execute application tasks, and provides templates for parametric and master-slave types of applications. It follows the resource management model of the underlying Grid middleware. The scheduler is central to the application and performs the mapping of jobs to resources, while the local resource schedulers perform the actual execution of application units. AppLeS can thus be classified as having a predictive heuristic state-estimation model, online rescheduling, and a fixed, application-oriented scheduling policy.

Condor: Cycle Stealing Technology for High Throughput Computing.

Condor [20] is a high-throughput computing environment that can manage a large collection of diversely owned machines and networks. It is well known for harnessing idle computers and can be configured to share resources. A resource agent runs on each machine, periodically advertising its services to a central collector. Resource requests and offers are described in the Condor classified advertisement (ClassAd) language, a declarative query language. Condor can be considered a computational Grid with a flat organization; resource discovery uses centralized queries with periodic push dissemination, and the scheduler is centralized.
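
The sketch below conveys the idea of ClassAd-style symmetric matchmaking, where a match requires the requirements of both the job and the machine to hold. It uses plain Python dictionaries and predicates rather than Condor's actual ClassAd expression language; the specific attributes are assumptions for the example.

    # Illustrative sketch of ClassAd-style symmetric matchmaking (not real
    # ClassAd syntax). Each side publishes attributes plus a 'requirements'
    # predicate over the other party's ad.

    def matches(job_ad, machine_ad):
        # A match requires both requirement predicates to hold.
        return (job_ad["requirements"](machine_ad) and
                machine_ad["requirements"](job_ad))

    job = {
        "owner": "alice",
        "image_size_mb": 64,
        "requirements": lambda m: m["arch"] == "INTEL" and m["memory_mb"] >= 64,
    }
    machine = {
        "arch": "INTEL",
        "memory_mb": 256,
        # The machine's owner donates cycles only to reasonably small jobs.
        "requirements": lambda j: j["image_size_mb"] <= 128,
    }

    print(matches(job, machine))   # True: both requirement predicates hold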

Darwin: Resource Management for Network Services

Darwin [21] is oriented towards resource management in network-based equipment, but it does provide mechanisms for scheduling computation on non-network nodes. An application provides an input graph that describes its resource requirements: a set of end nodes and the network connections between them. Darwin can provide hard network QoS, since its components run in routers and can control bandwidth at a low level using built-in router functions. The core component of Darwin is Xena, a request broker that performs global allocation of resources. Darwin uses a hierarchical fair service curve (H-FSC) scheduling algorithm for higher-level resource allocation; H-FSC was designed to efficiently support virtual networks for distributed applications. The RMS in Darwin is organized as a one-level hierarchy, because all requests are sent to a Xena request broker that interacts with its peer request brokers. The resource model is a fixed schema. Scheduling is hierarchical with non-predictive state estimation; rescheduling is event-driven and implemented by control delegates. The scheduling policy is fixed and system-oriented.

Globus: A Toolkit for Grid Computing

Globus [22] enables the modular deployment of Grid systems by providing the required basic services and capabilities. Its emphasis is on the hierarchical integration of Grid components and their services. Globus supports soft QoS via resource reservation. The predefined Globus scheduling policies can be extended with application-level schedulers such as Nimrod/G, AppLeS, and Condor/G. In the absence of an application-level scheduler, the Globus scheduler has a decentralized organization with an ad hoc, extensible scheduling policy.

Legion: A Grid Operating System

Legion [23] is an object-based metasystem that provides the software infrastructure for a Grid. It defines an API for object interaction but does not specify a programming language or communication protocol. It follows the hierarchical model and supports a mechanism to control the load on hosts. It provides resource reservation and the ability for application-level schedulers to perform periodic or batch scheduling. Legion's machine organization is hierarchical with a decentralized scheduler. It supplies default system-oriented scheduling policies, but it allows policy extensibility via a structured scheduling extension interface.

NetSolve: A Network Enabled Computational Kernel

NetSolve [24] is a network-enabled application server based on the client-agent-server paradigm, designed to solve computational science problems in a distributed environment. Communication between NetSolve clients, agents, and servers is performed using TCP/IP sockets. Agents can search for resources on a network, choose the best one available, execute the client request, and return the answer to the user. NetSolve follows the service Grid model with a hierarchical, cell-based machine organization.
NetSolve agents act as an information repository, maintaining a record of the resources available in the network; an agent acts as a resource broker, performing resource discovery and scheduling. NetSolve therefore has a decentralized scheduler organization.

Nimrod/G: Resource Broker and Economy Grid

Nimrod/G [25] is a Grid resource broker for managing and steering task-farming applications, such as parameter studies, on computational Grids. It follows a computational market-based model for resource management and provides support for formulating parameter studies, a single window to manage and control experiments, resource discovery, resource trading, and scheduling. It is used as the scheduling component of the Grid Architecture for Computational Economy (GRACE), a framework based on applying economic theories to Grid resource management. Nimrod/G follows a hierarchical model for resource management and supports resource reservation and QoS through the computational economy services of the GRACE infrastructure. Users specify QoS requirements such as a deadline, a budget, and a preferred optimization strategy. Resource estimation is performed through heuristics and historical load profiling. The scheduling policy is application-oriented and driven by user-defined requirements such as deadline and budget limitations; load balancing is performed through periodic rescheduling. A simple sketch of deadline-and-budget constrained selection follows.
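
The following hypothetical sketch illustrates deadline-and-budget constrained selection in the spirit of Nimrod/G: under a cost optimization it picks the cheapest resource that meets the deadline, and under a time optimization the fastest within budget. All names and numbers are invented for the example.

    # Hypothetical sketch of deadline-and-budget constrained selection.

    def choose(resources, jobs, deadline_hr, budget, optimize="cost"):
        feasible = []
        for r in resources:
            hours = jobs * r["hours_per_job"]
            cost = hours * r["cost_per_hour"]
            if hours <= deadline_hr and cost <= budget:
                feasible.append((r, hours, cost))
        if not feasible:
            return None                       # constraints cannot be met
        key = (lambda x: x[2]) if optimize == "cost" else (lambda x: x[1])
        return min(feasible, key=key)[0]["name"]

    resources = [
        {"name": "supercomputer", "hours_per_job": 0.5, "cost_per_hour": 5.0},
        {"name": "campus_grid", "hours_per_job": 2.0, "cost_per_hour": 0.5},
    ]
    # 10 jobs, 24-hour deadline, $15 budget: the supercomputer would cost $25,
    # so the cheaper campus grid wins under cost optimization.
    print(choose(resources, jobs=10, deadline_hr=24, budget=15))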


Conclusions and other issues   (top)

Grid computing is broad in its domain of application and raises research questions that span many areas of distributed computing and of computer science in general. A fundamental concern is security, together with the current lack of standards, although the Grid Forum is working to establish standards for protocols and other issues related to Grid technology.
Because Grid computing is all-pervasive, strict policies governing virtual organizations must be built so that a sound economic model can be followed. Scheduling, the focus of this paper, remains the biggest concern in Grids; a suitable compromise among competing criteria must be struck for Grids to achieve economy and succeed. Coordinated resource sharing, the goal of Grid systems, consists not only of exchanging files but of direct access to computers, software, data, and other resources. Policies should therefore be planned clearly and carefully: what is shared, who is allowed to share, and the conditions under which sharing occurs. Fault tolerance is another issue as the number of nodes in the Grid increases. The Grid is currently being built as a concerted effort among many institutions and is already supporting leading scientific applications. Eventually it will be possible to construct increasingly realistic models of the various classes of scientific problems at hand, providing a platform for the validation of research results in real-world scenarios.


References:  (top)

[1] K. Taura, A. Chien, A Heuristic Algorithm for Mapping Communicating Tasks on Heterogeneous Resources, Proceedings of IEEE, 2000. 

[2] H.A. James, K.A. Hawick, P.D. Coddington, Scheduling Independent Tasks on Metacomputing Systems, Proc. of Parallel and Distributed Computing Systems, 1999.

[3] J.B. Weissman, Scheduling Multi-Component Applications in Heterogeneous Wide-Area Networks, Proc. of IEEE 1999.

[4] V.D. Martino, M. Mililotti, Scheduling in a Grid Computing environment using Genetic Algorithms, Proc. of the International Parallel and Distributed Processing Symposium (IPDPS 2002).

[5] K. Kurowski, J. Nabrzyski, J. Pukacki, User Preference Driven Multiobjective Resource Management in Grid Environments, Proc. of CCGRID, 2001.

[6] K.K. Leung, N.H.C. Yung, P.Y.S. Cheung, Novel Neighborhood Search for Multiprocessor Scheduling with Pipelining, Proc. of HPC 2002.

[7] S.S. Vadhiyar, J.J. Dongarra, A Metascheduler for the Grid, HPDC-11 2002.

[8] H. Topcuoglu, S. Hariri, M.Y. Wu, Task Scheduling Algorithms for Heterogeneous Processors.

[9] I. Foster, The Anatomy of the Grid: Enabling Scalable Virtual Organizations, Proc. of the 1st International Symposium on Cluster Computing and the Grid (CCGRID '01).

[10] H. Casanova, Distributed Computing Research Issues in Grid Computing, ACM SIGACT News Distributed Computing Column 8, 2002.

[11] K. Krauter, R. Buyya, M. Maheswaran, A Taxonomy and Survey of Grid Resource Management Systems for Distributed Computing, Software: Practice and Experience, 2001.

[12] M. Maheswaran, H. J. Siegel, A Dynamic Matching and Scheduling Algorithm for Heterogeneous Computing Systems.

[13] Qing Ding, Guoliang Chen, A Benefit Function Mapping Heuristic for a Class of Meta-tasks in Grid Environments, CCGRID 2001.

[14] Hongtu Chen, M. Maheswaran, Distributed Dynamic Scheduling of Composite Tasks on Grid Computing Systems, IPDPS 2002.

[15] A.H. Alhusaini, V.K. Prasanna, C.S. Raghavendra, A Unified Resource Scheduling Framework for Heterogeneous Computing Environments, Supported by DARPA/ITO Quorum Program through the Naval Postgraduate School under subcontract number N62271-97-M-0931.

[16] T.D. Braun, H. J. Siegel, A Taxonomy for Describing Matching and Scheduling Heuristics for Mixed-Machine Heterogeneous Computing Systems.

[17] R. Buyya, Economic-based Distributed Resource Management and Scheduling for Grid Computing, Thesis Report 2002.

[18] N. H. Kapadia, On the Design of a Demand-based Network Computing System: The Purdue University Network-Computing Hubs, PhD Thesis, Purdue University, August 1999.

[19] V. Sunderam, A. Geist, J. Dongarra, and R. Manchek, The PVM Concurrent Computing System: Evolution, Experiences, and Trends, Parallel Computing Journal, Volume 20, Number 4, April 1994.

[20] D. Hensgen, T. Kidd, et al, An Overview of MSHN: The Management System for Heterogeneous Networks, 8th Workshop on Heterogeneous Computing Systems (HCW '99), San Juan, Puerto Rico, April 1999, invited.

[21] F. Kon, R. Campbell, M. Mickunas, and K. Nahrstedt, 2K: A Distributed Operating System for Dynamic Heterogeneous Environments, 9th IEEE International Symposium on High Performance Distributed Computing (HPDC'9) August 2000.

[22] D. Carvalho, F. Kon, et al, Management of Environments in 2K, 7th International Conference on Parallel and Distributed Systems (ICPADS-2000), Iwate Japan, July 4-7 2000.

[23] J. Gehring and A. Streit, Robust Resource Management for Metacomputers, 9th IEEE International Symposium on High Performance Distributed Computing, Pittsburgh, USA, 2000.

[24] S. Fitzgerald, I. Foster, C. Kesselman, G. von Laszewski, W. Smith, S. Tuecke, A Directory Service for Configuring High-Performance Distributed Computations, 6th IEEE Symp. on High-Performance Distributed Computing, 1997.

[25] I. Ekmecic, I. Tartalja, and V. Milutinovic, A survey of heterogeneous computing: Concepts and Systems, Proceedings of the IEEE, Vol 84, No 8, Aug 1996, pp. 1127-1144.


Relevant Links:  (top)

http://www.gridcomputing.org/
http://www.gridbus.org/
http://www.gridforum.org/
http://www.buyya.com/


Research Groups:  (top)

International Workshop on Grid Computing.
Conferences on High Performance Distributed Computing.


Scope of the Survey:  (top)

IEEE, ACM, and OhioLINK Digital Libraries.
Keywords: Grid Computing, Grid Scheduling, Heterogeneous Computing, Grid Taxonomy. The materials surveyed were published between 1998 and April 2003. Although there are many algorithms, some of which are trivial modifications of distributed computing algorithms, the paper cites only two representative algorithms: one that does not take network bandwidth into account and one that does.