22


Performance Optimization of Scientific Applications Using an Autonomic Computing Approach

Ioana Banicescu, Florina M. Ciorba, and Srishti Srivastava

22.1   INTRODUCTION

In most scientific applications, the presence of parallel loops is the main source of parallelism. To obtain high performance and to take advantage of parallelism in such applications, which are in general large, computationally intensive, and data parallel, these parallel loops are executed on multiple processors. Simplistically allocating an equal number of loop iterations to the constituent processors almost always delivers unsatisfactory application performance. Performance degradation is mostly due to factors such as interprocessor communication overhead, unequal processor capabilities, processor load differences, and processor synchronization, among others. The overhead related to the differences in processor loads, or the load imbalance, is in many cases the dominant factor causing the processors to finish executing their loop iterations at widely different times, with some processors remaining idle, while others remain heavily loaded. Load imbalance is caused by the interactive effects of irregularities in problem, algorithmic, and systemic characteristics ( [1], Chapter 4). Problem irregularities are mainly brought about by a nonuniform distribution of application data among processors, while algorithmic irregularities are often due to different conditional execution paths within loops. The systemic irregularities may be due to factors such as nonuniform memory access times, cache misses, and interrupts. Distribution of input data and variations of algorithmic nature cause intrinsic imbalance, while variations of systemic nature cause extrinsic imbalance [2].

To address the load imbalance problem, various scheduling algorithms have been developed. Such algorithms perform load balancing via loop scheduling since the parallel loops embody the majority of the application computational load. The interested reader is referred to the background work in factoring (FAC) [3], fractiling (FRAC) [4], weighted factoring (WF) [5], adaptive factoring (AF) [6, 7], adaptive weighted factoring (AWF) [8, 9], and variants of adaptive weighted factoring (AWF-B, AWF-C) [10]. These dynamic loop scheduling (DLS) algorithms have been shown to be the most effective for scheduling loop iterations to provide load balancing in unpredictable environments. These techniques are based on probabilistic analyses, rendering them inherently capable of addressing uncertainties coming from problem, algorithmic, and systemic characteristics. For a comprehensive review of these techniques, refer to References 11 and 12.

Selecting an effective and efficient scheduling algorithm from the currently available options to achieve load balancing for applications executing in an unpredictable environment is a difficult task. The difficulty is due to the complex nature of application characteristics, which may change during runtime, combined with the dynamic nature and unpredictability of the computing environment. Load balancing may be necessary in several parts of an application, and each part may require different scheduling algorithms for optimal performance. Furthermore, certain scientific applications require the execution of their computations repeatedly over the computational domain. The repetitive calculations are usually performed over a series of time steps. Such applications are referred to as time-stepping applications, and examples include heat solvers, solving time-dependent Euler equations, N-body simulations, and simulation of wavepacket dynamics. In applications requiring a large number of time steps, the load imbalance characteristics of each part may vary as the application execution progresses through the time steps. A scheduling algorithm selected offline, which performs well early in the application's lifetime, may later become inappropriate. Therefore, in this scenario, the selection of a scheduling algorithm for such a dynamic environment is a very difficult task, and a relatively more intelligent entity is needed to dynamically select during runtime the best scheduling algorithm for (possibly each part of) an application. The problematic issues associated with the offline selection of the appropriate DLS algorithm mandate the need for an autonomic selection mechanism. In general, autonomic computing (AC) systems are self-managing systems that can sense their operating environment, model their behavior in that environment, and take action either to change the environment or their behavior. An automatic self-managing system has the properties of self-configuration, self-healing, self-optimization, and self-protection. In this work, the AC aspect of the execution of an application is focused on the application's self-management attributes with respect to performance optimization. Therefore, the dynamic selection of the DLS algorithms must be based on the application performance during its execution. An intelligent agent is, therefore, needed to make informed decisions about which DLS algorithm to select in every particular case.

Over the past few years, intelligent systems have been developed and implemented in real-world problem domains, such as game playing [13, 14] and robotics [15–17]. A more related example is one of program selection [18]. In these applications, the system achieves its perceived intelligence through learning via the use of a computational technique called reinforcement learning (RL). RL is an active area of research in artificial intelligence, specifically in machine learning. Its objective is general, and therefore, it has potentially widespread applicability in different scientific contexts. RL uses a goal-oriented approach to solve problems by interacting with a complex and unpredictably changing environment. It solves the problems via learning, planning, and decision making.

Extensive research has been conducted in recent years to use RL strategies for addressing load balancing problems [19–22]. Specifically, RL techniques have been used for the dynamic load balancing of coarse-grain data-intensive applications, solving real-time systems management problems, and adaptively scheduling tasks in heterogeneous multiprocessor systems. These approaches make various assumptions that are described later in Section 22.4.2. An RL agent implementing two RL techniques, Q-learning and state‒action‒reward‒state‒action (SARSA) [23], has been integrated with a scientific parallel application characterized by a large number of time steps (a time-stepping application) [24, 25]. This RL agent provides a generic framework for the autonomic selection of the best suited DLS algorithm for load balancing time-stepping scientific applications in a dynamic environment using RL. Employing this framework improved the performance of a scientific application as compared to the traditional offline selection methods. The large number of time steps in the application gives the RL agent the opportunity to learn and to make informed decisions about the DLS algorithm selection.

In this chapter, the application of RL to realize an autonomic mechanism for the selection of DLS techniques is proposed. The benefits, significance, and usefulness of a DLS-with-RL approach are demonstrated for the performance improvement of time-stepping scientific applications containing computationally intensive parallel loops. The simulation of wave packets using the quantum trajectory method (QTM) is considered as a case study.

22.2   SCIENTIFIC APPLICATIONS AND THEIR PERFORMANCE

Scientific applications are typically data parallel, irregular, and often, computationally intensive. They contain loops with a large number of iterations and are easily amenable to parallelization. In applications such as partial differential equations (PDEs), Monte Carlo simulations, N-body simulations, sparse matrix computations, and unstructured grid solvers, iteration execution times could vary due to the algorithm, such as, for instance, conditional statements, or the problem, for instance, nonuniform data distribution. Even in applications where there are no variances in iteration lengths due to algorithmic or problem characteristics, loop iteration execution times may vary due to interferences from other iterations, other applications, or systemic variances, such as, for instance, the operating system or network latency. The cumulative effect of variances in loop iteration execution times could, ultimately, lead to processor load imbalance and, therefore, to severe performance degradation of parallel applications. A fundamental performance trade-off in scheduling parallel loop iterations is that of balancing processor loads while minimizing communication and scheduling overhead. In addition to problems encountered in load balancing on homogeneous systems, the different speeds of processors together with their different architectures and memory capacities can significantly impact performance. Moreover, in the case of heterogeneous networks of workstations, loads and availability of the workstations are unpredictable, and thus, it is difficult to know in advance the effective speed of each machine.

Depending on the number of time steps required by the algorithm to find the solution to a problem, scientific applications can be broadly classified as noniterative and iterative simulations. The iterative applications often contain a large number of time steps, whereby with each time step, the solution to the problem converges to the overall solution that is reached at the end of the computation. Examples of such applications include N-body simulations, wavepacket simulations, and computational fluid dynamics (CFD) simulations. An application in this class evolves over N time steps and is therefore referred to as time stepping (Fig. 22.1). Within a single time step, L computationally intensive parallel loops are executed, possibly with different lengths and nonuniform iteration execution times. Typically, there are also other statements between the loops, and the results of the computations from the earlier executed loops might be needed by other, later executed loops. Since the loops are parallel, DLS will be utilized for load balancing to minimize the loop completion times. General-purpose loop scheduling routines have already been developed and can be integrated into compilers, or even into such applications as load balancing tools [26–30].

image

FIGURE 22.1. High-level structure of a time-stepping application with L parallel loops executed over N time steps.
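As a minimal illustration, the structure in Figure 22.1 can be sketched as follows (schematic only, with illustrative constants; it is not the code of any particular application):

```c
/* Schematic of a time-stepping application: N time steps, each executing
 * L computationally intensive parallel loops whose iterations would be
 * handed out by a DLS routine (constants here are illustrative). */
#define N_STEPS 1000   /* number of time steps */
#define L_LOOPS 3      /* parallel loop sections per time step */

static void scheduled_parallel_loop(int loop_id, int n_iters) {
    (void)loop_id;
    for (int i = 0; i < n_iters; i++) {
        /* iteration body; execution time may vary from iteration to iteration */
    }
}

int main(void) {
    for (int t = 0; t < N_STEPS; t++) {           /* time-stepping loop */
        for (int l = 0; l < L_LOOPS; l++) {
            scheduled_parallel_loop(l, 100000);   /* iterations distributed via DLS */
        }
        /* serial statements and data exchange between loops would go here */
    }
    return 0;
}
```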

22.3   LOAD BALANCING VIA DLS

The problem of load imbalance in scientific applications on parallel and distributed environments has often been addressed by loop scheduling algorithms. Extensive research has been dedicated to developing these algorithms, integrating them in runtime systems, and implementing them in applications. Over the years, there has been significant advancement in the development of loop scheduling algorithms that address load imbalance problems with increasing levels of complexity. The development of loop scheduling algorithms includes a wide range of algorithms from static scheduling (STATIC) to self-scheduling (SS). Among them, the DLS algorithms represent a prominent class of loop scheduling algorithms and include fixed-size chunking (FSC), guided self-scheduling (GSS), FAC, WF, FRAC, AF, AWF, and its variants (AWF-B, AWF-C) [3, 4, 6, 7, 9, 31–35].

All the DLS techniques are based on probabilistic analyses and schedule loop iterations in chunks with variable sizes. The chunk sizes are dynamically chosen at runtime as a function of the number of iterations, n, and the number of processors, p, and have a high probability of being completed within the optimal time. The techniques also accommodate changes present in the heterogeneous system and handle most of the performance degradation factors caused by predictable phenomena (e.g., irregular data) and unpredictable phenomena (e.g., data access latency and operating system interference). The DLS algorithms are broadly classified as nonadaptive and adaptive algorithms. The nonadaptive algorithms include SS, FSC, GSS, FAC, and WF [3, 5, 32, 34, 35], and determine the chunk size based on the assumptions or information available before starting the loop execution. The adaptive algorithms include AWF, AWF-B, AWF-C, and AF [6–10, 12], and obtain information about the workloads during the loop execution or from previous executions of the loop iterations. The modeling of the adaptive techniques is more complex, allowing them to address a larger number of load imbalance factors than the nonadaptive techniques. These scheduling techniques have been incorporated into load balancing libraries and have undergone extensive testing on a wide variety of applications (e.g., CFD, quantum physics, automatic quadrature routines, and N-body simulations) on both distributed-memory and shared-memory environments [27, 36]. For a comprehensive review of the DLS techniques and their effectiveness in improving the performance of scientific applications, the reader is referred to Reference 11.
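For concreteness, the sketch below shows simplified, commonly cited chunk-size rules for three of the nonadaptive techniques: SS assigns one iteration at a time, GSS assigns 1/p of the remaining iterations, and FAC, in its widely used simplified form, assigns per batch half of the remaining iterations split evenly among the p processors. These are textbook approximations, not necessarily the exact formulas implemented in the library discussed in this chapter; WF additionally scales the chunks by fixed processor weights, and the adaptive techniques fold in execution-time statistics measured at runtime.

```c
/* Simplified chunk-size rules for a few nonadaptive DLS techniques. */
#include <stdio.h>

static int chunk_ss(int remaining, int p)  { (void)remaining; (void)p; return 1; }

/* GSS: ceil(R / p), where R is the number of remaining iterations. */
static int chunk_gss(int remaining, int p) { return (remaining + p - 1) / p; }

/* Simplified FAC: each batch assigns half of R, in p equal chunks: ceil(R / (2p)). */
static int chunk_fac(int remaining, int p) { return (remaining + 2 * p - 1) / (2 * p); }

int main(void) {
    int n = 1000, p = 4, remaining = n;
    printf("remaining  SS  GSS  FAC\n");
    while (remaining > 0) {
        int g = chunk_gss(remaining, p);
        printf("%9d  %2d  %4d  %4d\n",
               remaining, chunk_ss(remaining, p), g, chunk_fac(remaining, p));
        remaining -= g;   /* advance the loop using the GSS chunk, as an example */
    }
    return 0;
}
```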

22.4   THE USE OF MACHINE LEARNING IN IMPROVING THE PERFORMANCE OF SCIENTIFIC APPLICATIONS

In this section, an introduction is given on machine learning and its various techniques used for solving load balancing problems in scientific applications on parallel and distributed environments. Extensive research has been conducted over the years in the area of machine learning, and many of its techniques are being used either for solving scientific problems or for improving their performance.

22.4.1   Machine Learning Basics

The main research objective of machine learning is the study and development of learning algorithms that induce intelligent programs through experience. The main idea is based on the interaction between an intelligent system called the agent (or learner) and the environment in which it operates. Machine learning can be categorized into supervised learning (SL), unsupervised learning (UL), and RL.

22.4.1.1   Supervised Learning   In SL, the training data consist of examples of both inputs and desired outputs, thus enabling learning a function. Knowledge is gained from a supervisor (or teacher) that uses a sample of input‒output pairs, while the SL agent is instructed what actions to perform. The agent should then be able to generalize from the presented data to unseen examples. In situations where data labeling comes at a cost, a method known as active learning may be used, where the agent chooses which data to label. Learning is conducted offline, and obtaining the input‒output pairs for real-time applications under all possible environmental conditions is often impossible because an exhaustive representation of all possible situations is not feasible, especially in a dynamically changing computing environment. Examples of SL algorithms include logistic regression and naive Bayes classifier.

22.4.1.2   Unsupervised Learning   The training set of vectors does not have associated function values in UL. The problem is typically that of partitioning the training set into subsets, in some appropriate way. In this case, UL can be regarded as a problem of learning a function, in which the value of the function is the name of the subset to which an input vector belongs. A UL agent may learn to reduce the problem size, although it may not learn the correct outputs. The agent helps in finding the relationships and extracting regularities from the learning process, rather than in finding the correct solution. The procedures in UL attempt to find natural partitions of patterns. UL methods have application in problems involving taxonomies, where there is a need to invent ways to classify data into meaningful categories. Examples of UL methods include clustering and dimensionality reduction.

22.4.1.3   Reinforcement Learning   RL is an intermediate technique between the two extremes, SL and UL, using an approach that involves solving the problems via learning, planning, and decision making. Unlike SL, the RL agent is not told what to do, nor is it left on its own to learn the correct outputs, as in UL. An RL system learns about the environment online, during its operation, through a trial-and-error process [37]. The agent receives an immediate reinforcement, or “reward,” for taking a particular action, and observes the effect of the action on the “state” of the environment. The RL agent is thus enabled to learn the optimal path to the goal by learning through experience that comes from the feedback it gains about the states, actions, and rewards in unpredictably changing environments. Learning does not require any input‒output pairs, being done online, such that the problem is concurrently solved while learning [37]. The trial-and-error learning mechanism and the concept of reward make RL distinct from other learning techniques. A challenging problem and a key aspect in RL is the trade-off between exploration and exploitation. To exploit is to use the best experienced action, while to explore requires the agent to try new actions to discover better action selections for the future. Different strategies have been designed to address this trade-off problem, including ideas from optimal control theory and stochastic approximation [37].
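One widely used rule for balancing the two, given here only as an illustration (the chapter does not prescribe a particular rule), is the ε-greedy policy, which selects the action with the highest estimated value with probability 1 − ε and a uniformly random action otherwise:

$$
\pi(a \mid s) =
\begin{cases}
1-\varepsilon+\dfrac{\varepsilon}{|A(s)|}, & a = \arg\max_{a'} Q(s,a'),\\[6pt]
\dfrac{\varepsilon}{|A(s)|}, & \text{otherwise},
\end{cases}
$$

where |A(s)| is the number of actions available in state s and a small ε (e.g., 0.1) controls how often the agent deviates from the currently best-valued action.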

RL approaches can be model based or model free. In a model-based approach, the agent uses a model M and a utility function UM, such that M executes a control policy for the purpose of learning about the environment. Conversely, in the model-free approach, the agent uses an action-value function Q to learn the policy directly by interacting with the environment, without storing past experience (states, actions, and rewards) or learning a model. Examples of model-based learning are Dyna [17], prioritized sweeping [38], Queue-Dyna [39], and real-time dynamic programming [40]. Model-based learning, however, is not applicable to the autonomic selection of algorithms because the model of the environment M is unknown, and building the model of the environment for real-time applications is a time-consuming process. Therefore, a model-free learning approach is herein proposed and used. An example of a model-free approach is temporal difference (TD) learning. TD methods update estimates based on other learned estimates and do not require a model. The TD class of RL algorithms includes Q-learning and SARSA [23].

figure002

FIGURE 22.2. Components of an RL system, where I is the set of inputs (i), R is the set of rewards (r), B is the set of policies for action selection, a is an action, T is the transition function, and s is a state.

22.4.2   RL and Its Use in Scientific Applications

22.4.2.1   An RL System   In a standard RL model, an agent is connected to its environment via perception and action (Fig. 22.2). The environment can be in any state s from a set of distinct states. The transition function T drives the change in the state of the environment and depends on the action a performed by the agent on the current state of the environment. The selection of a depends on the reward r the agent receives and also on the observation of the current state s of the environment. The reward is a scalar reinforcement signal from a set R, which communicates the changes in the states of the environment to the agent and helps in deciding whether a state is desirable or not. The reward is used to guide the learning process, to specify a wide range of planning goals, and to maximize the benefits achieved in the long run. The states and, therefore, the rewards in RL are well defined by the Markov decision process (MDP). This implies that the decisions made by the RL agent do not depend on previous states or actions but are considered to be functions of the current state of the environment. At a specific time instance, the agent is guided in the selection of actions by a policy B. A policy maps the states of the environment to the actions to be taken when the agent is in a particular known state [41].

The general learning scenario in an RL system is as follows. The agent receives an input i from the set of inputs I, reflecting the current state of the environment, and takes an action a chosen according to a policy B. The resulting transition changes the state of the environment, and the value of this transition is communicated to the agent as a reward. This reinforcement signal allows the agent to choose actions that tend to maximize its goal function, that is, the long-run sum of the values of the reinforcement signal. The agent can learn to do this over time via systematic trial and error, guided by a wide variety of algorithms.

SARSA and Q-learning are RL algorithms that use a TD-based, model-free approach to learning. SARSA (Fig. 22.3a) learns the transitions from one state‒action pair to another state‒action pair and derives the policy in a greedy manner; to update the value function, it needs to know the next action to be taken. Q-learning (Fig. 22.3b) uses delayed reinforcement and updates its estimates using the action that maximizes the Q function, thereby directly approximating the optimal action-value function Q* independent of the policy being followed. Both techniques use a pair of learning parameters, namely, the learning rate (α) and the discount factor (γ).
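In their standard one-step forms [23], both algorithms maintain an action-value estimate Q(s, a) and differ only in the value used to bootstrap from the successor state: SARSA uses the action actually taken next (on-policy), whereas Q-learning uses the greedy action (off-policy):

$$
\text{SARSA:} \quad Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\bigl[r_{t+1} + \gamma\,Q(s_{t+1},a_{t+1}) - Q(s_t,a_t)\bigr],
$$

$$
\text{Q-learning:} \quad Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\bigl[r_{t+1} + \gamma\,\max_{a} Q(s_{t+1},a) - Q(s_t,a_t)\bigr].
$$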

22.4.2.2   RL in Scientific Applications   In recent years, extensive research has been conducted to employ machine learning strategies for load balancing problems. RL techniques have been used for dynamic load balancing of coarse-grain data-intensive applications [19]. Each processor used an independent RL agent learning to request an optimal chunk size of data from the master. The goal was to minimize the blocking time for other processors. The experiments have only been conducted using synthetic applications, which were not iterative. The improvements were reported against a static load balancer. With an increasing number of slaves, the task of reducing the blocking time becomes harder. The prospects for applying RL within AC systems have been studied, and several recent case studies have been described to demonstrate interesting challenges in applying various RL approaches to real-time systems management problems [20]. One such case study showed that combining the strengths of both RL and queuing models in a hybrid RL approach supports an autonomic resource allocation in a given computing system for a set of applications [21]. The use of a model-based RL approach to adaptively schedule tasks in heterogeneous multiprocessor systems has also been investigated, and a number of reinforcement-based schedulers have been proposed [22]. The schedules are assumed to be created before runtime, with the communication times known a priori and the processor speeds remaining constant. These assumptions pose certain limitations to the applicability of the schedulers to specific problems and systems.

In general, the techniques described above have addressed integration of RL strategies at a coarser granularity and not at a finer granularity level within an application. A time-stepping scientific application requiring dynamic load balancing provides an excellent environment for the use of RL. Irregularities in the application and in the underlying computing system may evolve unpredictably. Hence, a load balancing algorithm performing well in earlier time steps may prove to be inefficient in later time steps. However, a large number of time steps provide the agents with ample opportunities to learn. An RL agent following the model-free learning approach becomes very useful for autonomically selecting the appropriate load balancing algorithm during the lifetime of an application.

In Reference 24, an RL agent incorporating Q-learning and SARSA was embedded into a parallel scientific application, namely, simulation of wavepacket dynamics using the QTM. To the best of our knowledge, Reference 24 is the first work that has attempted to integrate an RL agent with a parallel scientific application for autonomic DLS algorithm selection. Subsequent works investigated and reported on a performance comparison of the QTM application in terms of Tp with and without the RL agent, with varying learning rates (α) and discount factors (γ), and the influence of a particular RL technique for a particular set of learning parameters (α, γ) [25, 42].

In the approach described in detail later in Section 22.5, the number of RL agents is equal to the number of computationally intensive parallel loops, which is smaller than in the case of each processor using an RL agent, as considered in Reference 19. In the approach proposed herein, RL agents are embedded within an application to optimize its performance, which is simultaneously improved via DLS. This is in contrast to the approach where application load imbalance is mitigated only via the use of RL agents alone [19]. The approach presented herein uses model-free RL techniques (Q-learning and SARSA), given that it is impossible to train the model of the environment for all possible conditions, as required by model-based RL techniques. This is in contrast to the case of a specific RL framework for solving scheduling problems as mentioned in Reference 22.

22.5   DESIGN STRATEGIES AND AN INTEGRATED FRAMEWORK

In this section, a framework is described for integrating RL into a class of scientific applications for an autonomic selection of DLS algorithms to improve applications' performance in parallel and distributed environments. The work presented herein is focused on one of the self-management aspects of AC systems, which is directed toward performance optimization.

The adaptive and nonadaptive DLS algorithms described in Section 22.3 have been incorporated into scientific applications, and their performance has been analyzed by the use of load balancing tools [27, 36]. However, for time-stepping scientific applications with large numbers of time steps executing in an unpredictable environment, and given several DLS algorithms, it is difficult to dynamically find the optimal one. One has to test all the scheduling algorithms in order to find the best strategy to balance the parallel loops. The current implementation of the scheduling algorithms does not explore the possibility of choosing at runtime, among all the DLS algorithms, the optimal one for a particular computing environment on which the application is executing and for the specific time at which it is executed. This limitation raises a compelling need for designing an autonomic agent that can use machine learning (ML) techniques to dynamically determine an optimal scheduling algorithm through learning, by simultaneously interacting with the environment and the scheduling algorithms.

22.5.1   Integrated Framework for Algorithm Selection

The foundations of the design layout for an integrated framework are described herein, with a focus on the autonomic selection of scheduling algorithms for performance optimization of large-scale scientific applications using RL strategies. The design goals for the proposed selection framework are the following: (1) automatic runtime selection of an optimal DLS in an unpredictably changing environment; (2) provision of a generic design framework for addressing application, architectural, and computing environment irregularities; (3) portability across different computing platforms; (4) cost minimization and efficient utilization of computing resources; and (5) provision for incorporating new DLS algorithms and RL techniques for future enhancements to solve a wide variety of scientific problems.

figure003

FIGURE 22.3. (a) SARSA (b) Q-learning.

figure004

FIGURE 22.4. (a) Interaction between the learning agent, scheduler, and the runtime environment. (b) State‒action transition diagram.

22.5.1.1   Integrated Framework Design   The process of selecting the best scheduling algorithm depending on the dynamic environmental conditions can be considered as an MDP (Fig. 22.4) with states, actions, parallel costs, transition functions, and a “value minimization function” or the “objective function.” Therefore, RL techniques are suitable for solving the problem. A high-level description of the proposed selection framework, illustrating the interaction between the learning agent, the scheduler, and the runtime environment, is depicted in Figure 22.4a.

The states are represented by the various scheduling algorithms, including information about the execution of a particular algorithm during the previous time steps. The scheduling algorithms are equivalent in terms of the problem they solve, and address the dynamic nature of the runtime environment at different complexity levels. The actions constitute the selection of a particular scheduling algorithm based on the a priori performance of all algorithms and of the recently executed algorithm. The state‒action transitions (Fig. 22.4a) are provided from the state‒action transition matrix, updated after every execution of the chosen algorithm and used in the action selection process during the next time step (iteration or episode). The Q-learning and SARSA reinforcement techniques are used for obtaining the optimal value selection and for finding the optimal policy. The “cost” represented by the execution time of the current time step (iteration) is used in the decision of selecting the scheduling algorithm for execution during the next time step. The goal of the selection process is to reduce the overall computation time by making effective use of the parallel execution on processors.

The state‒action transition diagram, illustrated in Figure 22.4b, is described next. Each node of the graph represents a state si. The arcs from one node to another represent actions. The dotted lines represent transitions in which the previous state and the next state are the same. After receiving feedback, the system may remain in the same state (i.e., continue using the same algorithm) or move to a different state. In the scenario of the problem being considered, the states are defined by the various DLS algorithms. The feedback value and the Q-values are obtained after the execution of a particular algorithm and are used for learning the best scheduling algorithm for the next time step. Here, a state corresponds to an algorithm, while the reward is defined by the execution time obtained from the runtime environment (Fig. 22.4a). An action is defined by the decision to execute or not to execute an algorithm. Multiple state transitions are exhibited by the changes in state during runtime.

The integrated framework for algorithm selection, illustrated in Figure 22.5, is designed to consist of three components: (1) application, (2) scheduling, and (3) RL.

figure005

FIGURE 22.5. General design of the integrated framework for algorithm selection.

(i) The application component contains the information and computations specific to a time-stepping application, which consists of multiple parallel loops with a large number of iterations. As described earlier in Section 22.2, the performance of the application is affected by predictable and unpredictable factors. The data distribution may be irregular or may dynamically change during runtime. The problem characteristics may also change for different architectures, workloads, and processors. These factors are very difficult to determine accurately. There may be several loops with different execution patterns, requiring different algorithms to efficiently address the load imbalance problem. This component contains such applications to which scheduling strategies are applied to address the load imbalance problem dynamically at runtime. The selection of the appropriate scheduling algorithm is provided by RL techniques.

(ii) The scheduling component provides DLS algorithms for performance improvement of executing parallel loops in scientific applications that are subject to load imbalance due to problem, algorithmic, or systemic characteristics. The DLS algorithms are based on a master‒worker strategy, and their description has been provided in Section 22.3. At the beginning of the scheduling process, one processor acts as a master processor or a “scheduler” and assigns the partitions of the iteration domain to be executed by the workers. These partitions may be either “replicated” or “distributed” on the processors, and a detailed discussion on various implementations has been provided in Reference 11. All other processors act as worker processors, and each processor initially works on chunks from its own partition. When there is severe imbalance in the workloads, a fast processor may be involved in work on partitions initially assigned to slower processors. After completing the execution of all loop iterations, the computed results are sent to all processors to synchronize the data that may be required for further computations. The scheduler coordinates with the worker processors in assigning chunk sizes of loop iterations, receiving computed results, and sending termination signals to the processors when there is no pending work and they have completed their assigned work. The scheduler broadcasts the results of the computation to all the processors after the termination of the loop execution. The interaction between the scheduler and the worker processors is facilitated by the transfer of messages for specific functions; a minimal sketch of this master‒worker interaction is given after the component descriptions below.

(iii) The RL component contains algorithms for the autonomic selection of the DLS algorithms during application runtime, depending on the performance characteristics of the parallel loop under consideration. This component is an interface between the runtime environment and the scheduling layer, providing control and optimization to the scheduling component. It addresses the problem of adaptive learning of an autonomous agent through its interaction with the dynamic environment by using a combination of exploration and exploitation techniques.

The RL component consists of the states, actions, rewards, and the environment. It monitors the performance of the selected DLS algorithms and specifies the optimal DLS algorithm by interacting with the runtime environment. For the problem of algorithm selection for scientific applications, the states are the “DLS algorithms,” while an action is a “decision to select the optimal DLS algorithm.” A reward is an “evaluative feedback” reflecting the performance of the selected scheduling algorithms. The application is provided with a set of scheduling algorithms. The RL system learns to select (using an RL method, such as, for instance, Q-learning or SARSA) the best scheduling algorithm through its interaction with the set of scheduling routines and the assessment of the algorithm's performance. This leads to the selection and application of a particular DLS algorithm under unpredictable environmental conditions, a selection that otherwise could not be accomplished optimally.

All these components coordinate among themselves to improve the overall performance of the scientific application. The functionality of each component is kept modular, such that the optimization and the software development in one component can take place independent of the ones in other components. One advantage of using this approach is that no a priori information is required regarding the problem characteristics or regarding the runtime environment before starting the execution of an application.
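The following is the minimal sketch of the master‒worker interaction announced under component (ii). It is illustrative only: the message tags, the GSS-like chunk rule, and the function names are assumptions made for the sake of a compilable example, not the actual protocol of the chapter's load balancing library; the reward-driven algorithm selection of component (iii) would sit on top of such a loop.

```c
/* Master-worker chunk serving over MPI; requires at least two processes. */
#include <mpi.h>

#define TAG_REQUEST 1   /* worker asks the scheduler for work   (assumed tag) */
#define TAG_WORK    2   /* scheduler replies with [start, size] (assumed tag) */
#define TAG_DONE    3   /* scheduler signals no pending work    (assumed tag) */

static void execute_chunk(int start, int size) { (void)start; (void)size; }

int main(int argc, char **argv) {
    int rank, p, n = 100000;                      /* n: total loop iterations */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    if (rank == 0) {                              /* master, i.e., the "scheduler" */
        int remaining = n, next = 0, active = p - 1, dummy = 0;
        MPI_Status st;
        while (active > 0) {
            MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, TAG_REQUEST,
                     MPI_COMM_WORLD, &st);
            if (remaining > 0) {
                int chunk[2];                     /* GSS-like rule: ceil(R / workers) */
                chunk[0] = next;
                chunk[1] = (remaining + p - 2) / (p - 1);
                next += chunk[1];
                remaining -= chunk[1];
                MPI_Send(chunk, 2, MPI_INT, st.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
            } else {                              /* termination signal */
                MPI_Send(&dummy, 1, MPI_INT, st.MPI_SOURCE, TAG_DONE, MPI_COMM_WORLD);
                active--;
            }
        }
    } else {                                      /* worker processors */
        int dummy = 0, chunk[2];
        MPI_Status st;
        for (;;) {
            MPI_Send(&dummy, 1, MPI_INT, 0, TAG_REQUEST, MPI_COMM_WORLD);
            MPI_Recv(chunk, 2, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_DONE) break;
            execute_chunk(chunk[0], chunk[1]);    /* work on the assigned chunk */
        }
    }
    /* The results would then be synchronized on all processors, e.g., via
     * MPI_Bcast of the updated arrays, before the next loop or time step. */
    MPI_Finalize();
    return 0;
}
```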

22.5.1.2   An RL System for Time-Stepping Applications   A high-level general description of a time-stepping application with a number of L computationally intensive sections expressed as parallel loops is given in Figure 22.1 and has been presented in detail in Section 22.2. As these loops execute in parallel, appropriate DLS techniques are selected and applied for load balancing, such that the overall parallel execution time is minimized, while the overall performance is maximized.

A common practice for the selection of an appropriate DLS technique is to use application profiling employing iteratively all available DLS algorithms and to choose the algorithm that gives the optimal performance for a particular application run. This is feasible only if the application has a single parallel loop section that does not significantly change its characteristics at every time step. This approach also requires that the parallel execution environment be dedicated to the current application. The difficulty of the selection problem increases even more when the application contains several parallel loops, each having unique load balancing requirements that vary with every time step. Moreover, the dynamic nature of a parallel and distributed system, represented by events such as unpredictable network latencies or operating system interferences, further complicates the selection problem. The cumulative performance degradation resulting from these uncertainties justifies the need for employing at runtime an intelligent agent for the selection of the best DLS algorithm for executing each parallel loop section.

The algorithm selection problem is addressed by providing a generic design of an RL system to autonomically determine at runtime the optimal scheduling algorithm for a time-stepping application using RL techniques. Figure 22.6 illustrates the design of the proposed RL system, derived by adding the loop scheduling context to the environment of the generic RL system in Figure 22.2.

The goal of the proposed RL system for time-stepping applications is to minimize the total time spent by the application executing the parallel loops. This implies minimizing the completion time of each loop invocation. During the first few invocations of the loop, the agent simply specifies each algorithm in the library in a round-robin fashion, in the absence of prior knowledge about the characteristics of the loop. When sufficient knowledge is obtained during this initial learning period, the agent applies an adaptive learning policy B (Q-learning or SARSA) on the accumulated information to select an algorithm (action a—select a particular DLS method) from the library of DLS algorithms, and the environment moves to another state s (application is using the particular selected DLS method).

figure006

FIGURE 22.6. RL system for autonomic selection of DLS methods, where I is the set of inputs i (set of methods, current time step, set of loop IDs); R is the set of rewards r (loop execution time); B is the set of policies for action selection (Q-learning, SARSA); a is an action (using a particular DLS method); and s is a state (application is using DLS method).

figure007

FIGURE 22.7. Integrating an RL system in time-stepping applications with parallel loops. (a) Serial form. (b) Parallel form. (c) With RL system.

The loop completion time using the selected DLS algorithm determines a performance level, which is the basis of the reward r for the action a taken by the RL agent. The application communicates information i about the changed state s and the reward r to the RL agent, for continuous learning by the policy B. If the agent takes action only after a specified number of loop invocations, the application simply reuses the algorithm associated with the current state s, denoted by the loopback arrow from the environment to the library (Fig. 22.6).
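A compact sketch of such a per-loop agent is given below. For clarity, it collapses the state space to a single state (so the table is indexed by the DLS method only), uses an ε-greedy rule after the round-robin warm-up, and treats the parallel cost (p × loop time) as a quantity to be minimized, so the greedy successor term uses the minimum estimated cost, mirroring the usual maximum over rewards. The function and variable names are hypothetical; this is not the chapter's implementation.

```c
/* Per-loop learning agent: round-robin warm-up, epsilon-greedy selection,
 * and a Q-learning-style update driven by the measured parallel cost. */
#include <stdlib.h>

#define NUM_DLS 9               /* number of methods in the scheduling library */

typedef struct {
    double q[NUM_DLS];          /* estimated cost of using each DLS method */
    int    last;                /* method used for the previous invocation */
    int    step;                /* current time step (episode) */
    double alpha, gamma, eps;   /* learning rate, discount factor, exploration rate */
} rl_agent;

/* Select a DLS method for the next loop invocation. */
int agent_select(rl_agent *a) {
    int choice;
    if (a->step < NUM_DLS) {
        choice = a->step;                   /* warm-up: try each method once */
    } else if ((double)rand() / RAND_MAX < a->eps) {
        choice = rand() % NUM_DLS;          /* explore */
    } else {
        choice = 0;                         /* exploit: lowest estimated cost */
        for (int m = 1; m < NUM_DLS; m++)
            if (a->q[m] < a->q[choice]) choice = m;
    }
    a->last = choice;
    return choice;
}

/* Feed back the parallel cost (p x loop completion time) of the last choice. */
void agent_update(rl_agent *a, int p, double loop_time) {
    double cost = p * loop_time;
    int best = 0;                           /* greedy successor value (minimum cost) */
    for (int m = 1; m < NUM_DLS; m++)
        if (a->q[m] < a->q[best]) best = m;
    a->q[a->last] += a->alpha * (cost + a->gamma * a->q[best] - a->q[a->last]);
    a->step++;
}

int main(void) {
    rl_agent agent = { {0}, 0, 0, 0.5, 0.5, 0.1 };    /* illustrative parameters */
    for (int t = 0; t < 100; t++) {                   /* time-stepping loop */
        int m = agent_select(&agent);
        double loop_time = 1.0 + 0.01 * m;            /* stand-in for a measured time */
        agent_update(&agent, 8, loop_time);           /* p = 8 processors, for example */
    }
    return 0;
}
```

In the chapter's formulation, the states themselves are the DLS algorithms, so the table would instead be indexed by state‒action pairs, but the select/update cycle performed at every time step is the same.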

To better illustrate the use of the proposed strategy, its application for improving the performance of a time-stepping application with a single parallel loop is shown in Figure 22.7. The left part outlines the serial form of the time-stepping application. The middle part illustrates the parallel form in which a DLS algorithm is integrated. The part on the right shows the integration of RL, complementing the parallel form. Even though the above proposed system is described for a single loop, the strategy is suitable for a wide variety of scientific applications with one or several parallel loops. This can be seen in Figure 22.8, where RL is integrated into a time-stepping application with L parallel loops.

figure008

FIGURE 22.8. High-level structure of a time-stepping application with L parallel loops integrated with a number of L RL agents.

The components of the above system (Figs. 22.5 and 22.6) work independently. The application can execute the loop without the use of the loop scheduling library. The library can be linked to any other time-stepping application with parallel loops. The RL agent can also be used by other time-stepping applications. However, the integration of these components yields a highly flexible design and allows upgrades to be incorporated into any component without affecting the others, such as additional scheduling algorithms for the library, new learning policies into the agent, and more accurate formulas into the application.

22.5.2   Implementation and Evaluation of the Integrated Framework

22.5.2.1   Integrating RL in a QTM with DLS   The integrated framework is tested on a computationally intensive scientific application, namely, the wavepacket simulation using the QTM [43, 44]. This is a time-stepping application with five parallel loops. The generic serial code of the QTM application is illustrated in Figure 22.9. In this scientific application code, a set of pseudoparticles is used to represent a physical particle. Each pseudoparticle executes a quantum trajectory governed by the Lagrangian quantum fluid dynamics‒equation of motion (QFD-EOM) and the quantum potential. The pseudoparticles form a wave packet, which collectively represents the physical particle. The arrays containing values of the positions, velocities, and probability densities of the n pseudoparticles representing the wave packet are denoted by r[.], v[.], and ρ[.], respectively. These arrays are initialized with appropriate values at the beginning of the simulation. During each time step, values for the classical potential V[.], classical force fc[.], quantum potential Q[.], quantum force fq[.], and derivative of velocity dv[.] are derived from r[.], v[.], and ρ[.] of the previous time step. The moving weighted least squares (MWLS) algorithm is used for curve fitting when computing Q[.], fq[.], and dv[.], and the resulting curve is differentiated to obtain the required derivatives. For each pseudoparticle, the algorithm solves an overdetermined linear system of size np × nb. A detailed description of the MWLS algorithm is given in Reference 44. The values of r[.], v[.], and ρ[.] are then updated for use in the next time step. In the algorithm, loops 2‒4 can be combined into a single loop; however, these are separated to be suitable for loop scheduling with DLS algorithms.

figure009

FIGURE 22.9. Serial QTM.
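As general background on the MWLS step (the exact formulation used in the QTM code is given in Reference 44), a weighted least squares fit around a pseudoparticle can be written as

$$
\hat{c} \;=\; \arg\min_{c}\,\bigl\|W^{1/2}(Ac-b)\bigr\|_2^2 \;=\; (A^{T}WA)^{-1}A^{T}Wb,
$$

where A is the np × nb basis matrix evaluated at the neighboring pseudoparticles, b holds the np sampled values, W is the diagonal matrix of moving weights, and c collects the nb basis coefficients; the fitted curve is then differentiated to obtain the required derivatives at the pseudoparticle position. This is the generic weighted least squares solution; the implementation solves the overdetermined system with its own numerical routine.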

The execution profile of a straightforward serial implementation of the QTM indicates that the bulk of the total execution time is spent in the MWLS routine called by loop 1 (quantum potential), loop 2 (quantum force), and loop 3 (derivative of velocity). Thus, a significant decrease in the overall simulation time can be achieved by distributing the iterations of these loops. Each of these loops is a parallel loop, for which the iterations can be executed in any order without affecting the correctness of the algorithm. Therefore, the DLS algorithms may appropriately be applied. (Loops 4 and 5 are also parallel loops, but the scheduling overhead will be more expensive than the computations as they contain only simple formulae and are computationally less intensive.)

figure010

FIGURE 22.10. QTM with DLS.

In previous studies, one specific DLS algorithm was used for all the three loops, with no concern regarding their specific computational requirements (Fig. 22.10). The algorithm was specified statically during the initiation of the computation process and did not change throughout the application execution. To cater to the specific characteristics of these different computational loops, RL techniques are employed in selecting the optimal DLS algorithm for each of the three individual loops in the QTM application (Fig. 22.11).

Considering the computation of the quantum trajectory problem using wave-packet simulations, the performance degradation due to load imbalance during its execution in a distributed environment has effectively been addressed by novel DLS algorithms [43]. The RL problem is mapped to the QTM application as follows. The design framework shown in Figure 22.6 is replicated for each of the three parallel loop sections of the QTM application, resulting in one RL agent for each of these sections (Fig. 22.8). Each RL agent is an autonomic element and acts independently, working only with the parallel loop section for which it is employed. There is no interagent communication. The scheduling algorithms for achieving load balancing are modeled as different states, dynamically selected depending on the variability of the environment. The actions determine the particular choice of an algorithm from one state to another. The transitions from one state to another take place as shown in Figure 22.4b, with the arcs representing the transitions. The environment is a dynamic system of clusters on which the application is executed. The reinforcement signal or reward is modeled using the parallel execution cost, which is the product of the number of processors and the loop execution time, and is given as feedback to the reinforcement learner. Each of the three loops (loops 1, 2, and 3) has an individual reinforcement learner capable of selecting the algorithm best suited for that loop, thereby reducing the overall computing time when compared to the traditional approach of exhaustive selection and execution. The total number of time steps in the application is modeled as episodes.

figure011

FIGURE 22.11. QTM with DLS and RL.

22.5.2.2   Implementation of QTM with DLS and RL   The QTM application is coded in Fortran 90. The library of loop scheduling algorithms contains implementations of the nine loop scheduling methods described earlier in Section 22.3. The loop scheduling methods contained in the library can be classified into three distinct categories: equal-size chunks (STATIC, FSC, and MFSC), decreasing-size chunks (GSS, FAC, AWF, AWF-B, and AWF-C), and variable-size chunks (AF). MFSC is modified FSC, where the number of chunks is the same as in FAC (i.e., the MFSC technique has the same overhead cost as FAC). In AWF-B, the processor weights are updated after every batch, and chunks are assigned in batches (as in FAC), while AWF-C is similar to AWF-B, with the difference that weights are updated after each chunk rather than after each batch.

The loop scheduling algorithms are based on a master‒worker strategy. The distribution of data to worker processors can be implemented as follows: (1) The processors receive the data from the master together with the size of the chunk they are assigned to execute; (2) the data are initially replicated on all worker processors, and each processor works only on the data required by the chunk size it was assigned by the master; or (3) initially, the data are distributed among the participating worker processors, and each processor is assigned a specific portion of the iterations, to later work in chunk sizes determined by the master at runtime. For the experiments reported in this work, the data are replicated on all processors.

The loop scheduling algorithms contain utility functions in Fortran 90 and use the MPI message-passing paradigm for the execution of the QTM application in a distributed-memory environment. The RL component, coded in C, is called from the Fortran 90 application [24]. The architecture is generic for any operating system on which it is built. The design is independent of the underlying architecture and chooses the appropriate functions during the execution. The interface between the RL agent and the QTM simulation code requires only two additional statements for each of the three parallel loops (as illustrated in Fig. 22.7b and 22.11). Before a loop starts execution, the agent is called to compute the index of the scheduling algorithm, and after the loop ends execution, the agent is supplied with the algorithm index and the loop completion time.
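The call pattern, rendered here in C with hypothetical names and stub bodies purely to show where the two statements go (in the actual code these calls are made from Fortran 90 into the C agent), might look as follows:

```c
/* Illustration of the two-call interface around one scheduled loop. */
#include <stdio.h>
#include <time.h>

/* Stubs standing in for the RL agent and the scheduled parallel loop. */
static int  agent_select(int loop_id) { (void)loop_id; return 0; }
static void agent_update(int loop_id, int method, double t) {
    printf("loop %d: method %d took %.3f s\n", loop_id, method, t);
}
static void run_loop_with_dls(int method) { (void)method; /* parallel loop body */ }

int main(void) {
    for (int step = 0; step < 3; step++) {       /* time-stepping loop */
        int m = agent_select(1);                 /* statement 1: before the loop  */
        clock_t t0 = clock();                    /* MPI_Wtime would be used in the
                                                    real MPI code; clock() keeps
                                                    this stub self-contained      */
        run_loop_with_dls(m);                    /* e.g., loop 1: quantum potential */
        double dt = (double)(clock() - t0) / CLOCKS_PER_SEC;
        agent_update(1, m, dt);                  /* statement 2: after the loop   */
    }
    return 0;
}
```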

22.6   EXPERIMENTAL RESULTS, ANALYSIS, AND EVALUATION

The verification and validation of this novel approach are discussed through experiments and their analysis in this section.

22.6.1   Experimental Setup and Results

To evaluate and analyze the significance and usefulness of the proposed RL system, two sets of experiments have been designed and conducted.

Experiment #1   Test the hypothesis that on a given number of processors, the QTM application using different load balancing methods selected by the RL agent during time stepping (LEARN) will perform better than the application using a single load balancing method determined before runtime (NOLEARN).

LEARN has two levels representing the learning methods used: Q-learning and SARSA. NOLEARN has eight levels representing the loop scheduling algorithms contained in the load balancing library: STATIC, FSC, MFSC, GSS, FAC, AF, AWF, and AWF-B. LEARN and NOLEARN are the two groups of load balancing techniques and together define the factor m, which is hypothesized to affect the performance of the application. The QTM application was run on different numbers of processors, as illustrated in Figure 22.12, line 5. The experiment is, thus, a two-factor factorial experiment with m as the first factor and p as the second factor. A free particle represented as a wave packet of n = 501 pseudoparticles was simulated for N = 10,000 time steps.

figure012

FIGURE 22.12. Algorithm used for running the two-factorial < m × p > Experiment #1.

figure013

FIGURE 22.13. Algorithm used for running the quad-factorial < RL × p × α, γ > Experiment #2.

Experiment #2   (a) Study the effects of the learning parameters α and γ on the parallel runtime Tp of the QTM application, and (b) identify the number of times each DLS method was chosen by the RL agent.

This experiment is described by the algorithm shown in Figure 22.13. The aim of this experiment is to investigate, for the two learning techniques of the RL agent, the effects of the α and γ parameters' combinations on the completion time Tp of the QTM application. An additional objective is to compare the DLS selection results obtained when the two RL techniques (Q-learning and SARSA) are used. In line 5 of Figure 22.13, the QTM application is executed using < RL, p, α, γ >, and a particular DLS algorithm is selected by the RL agent. Given that Experiment #1 indicates that, in particular, the amount of concurrency present in the selected QTM application for the chosen simulation settings is not suitable for a number of processors larger than 12, no more than 12 processors have been considered in this experiment. A free particle represented as a wave packet of n = 1001 pseudoparticles was simulated for N = 500 time steps.

The testbed for performing both experiments was a general-purpose Linux cluster of 1038 Pentium III (1.0 and 1.266 GHz) processors with the Red Hat Linux operating system. Two processors reside in one node; 32 nodes are connected via a fast Ethernet switch to comprise a rack; and the racks are connected via gigabit Ethernet uplinks. On this cluster, the queuing system (PBS) attempts to assign homogeneous computing nodes to a job, but this is not guaranteed. Network traffic volume is also not predictable because the cluster is general purpose. Other jobs were running along with the simulations.

22.6.2   Performance Analysis and Evaluation

Experiment #1   To model the variability of the performance measurements induced by the underlying computing architecture, the application was run with five replicates (as indicated in Fig. 22.12, line 6). The parallel execution time Tp for each <m × p> factor combination was measured and averaged across the five replicates. The mean Tp values are graphed in Figure 22.14 and annotated with letters. The letters, ranging from “a” through “v,” represent the statistical groupings from the means least squares method. Mean values annotated with the same letter belong to one statistical group. The differences of the mean values in the same statistical group are not significantly different from zero according to the t-statistics at 0.05 significance. From the analysis of the results in Figure 22.14, the following observations can be made:

Obs. 1   For each p, the Tp of the application with either RL technique (from the LEARN set) is significantly lower than the Tp of the application without learning (i.e., a DLS method from the NOLEARN set).

figure014

FIGURE 22.14. Mean parallel time (Tp) for wavepacket simulation using QTM with 501 pseudoparticles using different load balancing methods and RL techniques with an increasing number of processors. Mean values with the same letter are not significantly different at the 0.05 significance level via t-statistics using the means least squares method.

Obs. 2   For each p, there is no significant difference between the Tp obtained using Q-learning or SARSA.

Obs. 3   For the LEARN set, there is a significant drop in the Tp when p is increased from p = 2 to p = 8. Tp does not significantly change, however, as p is further increased from p = 12 to p = 24. When using RL, the optimum p for the application with 501 pseudoparticles is p* = 12.

Obs. 4   For the NOLEARN set, STATIC has the worst Tp from p = 2 to p = 8, but has better Tp than most other techniques in NOLEARN from p = 16 to p = 24. The explanation is that, with a fixed problem size, the performance of a dynamic scheduling method degrades with additional processors due to the increase in scheduling overhead. It is well known that STATIC has no scheduling overhead; therefore, it is not penalized as the dynamic techniques are when using more processors.

Obs. 5   The Tp for the LEARN set at p = 2 is not significantly different from the Tp for the NOLEARN set at p = 4. Similarly, the Tp for LEARN at p = 4 is statistically comparable to the Tp for NOLEARN at p = 8, with the exception of STATIC and FSC. For p ≤ 8, the QTM application using RL on p processors has statistically the same Tp as the application without RL on the next higher p. The Tp for the LEARN set at p = 12 is even significantly better than the Tp for the NOLEARN set at p = 16.

These results validate the suitability of RL as a viable procedure for online selection of DLS algorithms from a library to improve the performance of a class of large, time-stepping scientific applications with computationally intensive parallel loops.

Experiment #2(a)   Figure 22.15 shows the mean Tp values for all combinations of (α, γ) values, for 12 processors, using the Q-learning and SARSA techniques. Each of the values in the charts is a mean of the completion times of five runs of the QTM application, as indicated in Figure 22.13, line 4.

A close inspection of the plots for each case indicates negligible variations in the Tp when the learning rate α and the discount factor γ are varied between 0.1 and 0.9. Similar charts, obtained for four and eight processors, showed the same trends, except that Tp for the two cases is bounded by 8000 and 4500 seconds, respectively. Figure 22.15 indicates that, for a given number of processors (i.e., p = 4, 8, or 12) and a given RL technique (i.e., Q-learning or SARSA), the values of α and γ do not affect the completion time Tp of the QTM application.

This observation is also supported by the data in Figure 22.16. In the figure, the number of processors is denoted by p; the RL technique Q-learning is denoted by “RL 0,” while the RL technique SARSA is denoted by “RL 1.” The extremely small standard deviation values relative to the mean Tp values indicate negligible variations in Tp for all six cases.

Hypothesis Test 1: The following steps were performed to statistically support the above claims:

Step 1   Arbitrarily select an α-value for a particular p, and for a particular RL technique.

Step 2   Vary the γ-value from 0.1 to 0.9.

Step 3   Perform a hypothesis test between the differences in the respective mean values.

figure015

figure016

FIGURE 22.15. Mean parallel completion time (Tp) for 12 processors with varying learning rate α and discount factor γ. (a) Q-learning. (b) SARSA.

The α-values selected for Q-learning are 0.6, 0.3, and 0.2 for 4, 8, and 12 processors, respectively. The corresponding t-scores for a 95% confidence interval (CI) were calculated. Most of the t-scores are bounded by ± 2.776 (the critical points for a 95% CI). Similarly, the α-values selected for SARSA are 0.4, 0.7, and 0.5 for 4, 8, and 12 processors, respectively. Again, the corresponding t-scores for a 95% CI were calculated and identical results were obtained for SARSA as were obtained for Q-learning, indicating that most of the t-scores are bounded by the 95% CI critical values, ± 2.776 in this case as well.
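The critical value ±2.776 quoted above is the two-sided 5% point of the t distribution with 4 degrees of freedom, which is consistent with a paired comparison over the five replicates (n = 5); in that reading, each t-score has the usual form

$$
t = \frac{\bar{d}}{s_d/\sqrt{n}},
$$

where the numerator is the mean of the n per-replicate differences in Tp between the two settings being compared and s_d is their standard deviation, and |t| < 2.776 indicates that the difference is not statistically significant at the 0.05 level.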


FIGURE 22.16. Tp statistics graph showing the mean, variance, and standard deviation for the different experiment scenarios. The numbers 0 and 1 indicate Q-learning and SARSA, respectively. The numbers 4, 8, and 12 are the numbers of processors used.

The calculations were then repeated with a fixed γ and a varying α under identical constraints; that is, the values fixed for γ are the same as those previously selected for α, and the α-values were varied from 0.1 to 0.9. The t-scores obtained for a 95% CI give the same indication as before: most of them are bounded by ±2.776, and there is no significant difference in Tp for a given RL technique and processor combination when α and γ are varied. As a result, it can be stated with a 95% confidence level that the Tp of the QTM application is not sensitive to variations of α and γ for a given RL technique and a given number of processors.

In both scenarios, however, a small number of outlier t-scores occur. These are attributed to the fact that the cluster used for the experiments was a nondedicated, shared, general-purpose system and was, therefore, subject to interference, such as unpredictability in application scheduling and in the processor rack assignments of the cluster nodes.

Hypothesis Test 2: The next set of calculations is carried out to verify whether there are significant differences in Tp when different RL techniques are used. To this end, the following steps were followed:

Step 1   Arbitrarily fix a set of nine (α, γ) pairs: (0.1, 0.3), (0.2, 0.7), (0.3, 0.2), (0.4, 0.6), (0.5, 0.1), (0.6, 0.9), (0.7, 0.4), (0.8, 0.5), and (0.9, 0.8).

Step 2   For a particular p, perform the hypothesis test on the differences between the mean Tp values obtained with the Q-learning and SARSA techniques (a code sketch of this comparison follows the steps).
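The sketch below illustrates this procedure. It re-declares a paired t-score helper so that the snippet is self-contained, and it iterates over the nine fixed (α, γ) pairs; the tp_runs() accessor and its synthetic data are hypothetical placeholders standing in for the five recorded Tp values of each setting.

```python
import numpy as np

rng = np.random.default_rng(0)

ALPHA_GAMMA_PAIRS = [(0.1, 0.3), (0.2, 0.7), (0.3, 0.2), (0.4, 0.6), (0.5, 0.1),
                     (0.6, 0.9), (0.7, 0.4), (0.8, 0.5), (0.9, 0.8)]

def tp_runs(rl_technique, alpha, gamma, p):
    """Hypothetical stand-in for the five recorded Tp values (seconds)."""
    base = 36000.0 / p                       # illustrative 1/p scaling only
    return base + rng.normal(0.0, 0.002 * base, size=5)

def paired_t_score(runs_a, runs_b):
    d = np.asarray(runs_a) - np.asarray(runs_b)
    return d.mean() / (d.std(ddof=1) / np.sqrt(d.size))

def compare_rl_techniques(p):
    """Hypothesis Test 2: Q-learning versus SARSA for each (alpha, gamma) pair."""
    for alpha, gamma in ALPHA_GAMMA_PAIRS:
        t = paired_t_score(tp_runs("Q-learning", alpha, gamma, p),
                           tp_runs("SARSA", alpha, gamma, p))
        # |t| <= 2.776 means no significant difference at the 95% level (df = 4).
        print(f"alpha={alpha}, gamma={gamma}: t={t:+.3f}, "
              f"significant={abs(t) > 2.776}")

compare_rl_techniques(p=12)
```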

Once again, the analysis indicated that the t-scores are bounded by the ±2.776 critical values, and no significant difference in Tp was found for a given α, γ, and p combination when the RL technique was varied. Therefore, it can be stated with a 95% confidence level that, for a given p, the completion time Tp is also not sensitive to the type of RL technique used.

Experiment #2(b)   Figure 22.17 presents the diagrams showing the number of times each of the DLS algorithms was selected for a given p and RL technique. In the figure, the numbers 4, 8, and 12 are the numbers of processors used, while “QV,” “QF,” and “DV” represent loop 1, loop 2, and loop 3 of the QTM application, respectively (Figs. 22.9–22.11). As can be seen, there are significant differences between the selection patterns of the two RL techniques, that is, in the number of times each of the DLS algorithms was selected. This indicates that, to succeed, different RL techniques select the various DLS algorithms at different rates, even though their overall performance is similar.

figure018

FIGURE 22.17. DLS algorithm selection pattern for various p and the two RL techniques. (a) Q-learning. (b) SARSA.

22.6.2.1   Summary of Experiments   Each computationally intensive parallel loop is assigned an RL agent, which automatically selects a DLS algorithm from a library to minimize the loop completion time. Two RL techniques (Q-learning and SARSA) are applied, and both use a pair of learning parameters (α, γ). The parallel performance of the QTM application with and without the RL agent is investigated and compared by means of pairwise comparisons via t-statistics using the least squares means method. The analysis of the results indicates that, for any number of processors p, the simulation performs statistically better with the RL agent than without it, and that there is no significant difference in the performance of the simulation for any p using either of the two RL techniques. The analysis also identifies the optimal number of processors p* for the wavepacket size tested. For p < p*, the performance of the simulation with the RL agent is comparable, or even superior, to that obtained without the RL agent on a higher number of processors. The influence of the learning parameters α and γ on the performance of the QTM application is also investigated, together with the influence of using a particular RL technique for given values of α and γ. In addition, the differences in the number of times each of the DLS algorithms is selected by the two RL techniques are studied. The analysis shows that, for a fixed p, the simulation completion time is insensitive to the values of α and γ used in the experiments, and that there is no advantage in choosing one RL technique over the other, even though the two techniques differ significantly in the number of times each of them selects the various DLS algorithms. These claims are statistically validated by the experimental results.
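To make this mechanism concrete, the following is a minimal, hypothetical sketch of such a per-loop agent: the state is a discretized loop completion time, the actions are indices into the DLS library, the reward is the negative loop time, and both the Q-learning and SARSA update rules are shown. The class name, the action set, the state discretization, and the epsilon-greedy exploration are illustrative assumptions, not the implementation used in this chapter.

```python
import random

# Assumed DLS action set for illustration; the actual library contents may differ.
DLS_LIBRARY = ["STATIC", "FSC", "FAC", "WF", "AWF-B", "AWF-C", "AF"]

class LoopSchedulerAgent:
    """Hypothetical per-loop agent; one instance per computationally intensive loop."""

    def __init__(self, alpha=0.5, gamma=0.5, epsilon=0.1, n_states=10, sarsa=False):
        self.alpha, self.gamma, self.epsilon, self.sarsa = alpha, gamma, epsilon, sarsa
        self.n_states = n_states
        # Q-table: one row per discretized loop-time state, one column per DLS algorithm.
        self.q = [[0.0] * len(DLS_LIBRARY) for _ in range(n_states)]

    def discretize(self, loop_time, t_min, t_max):
        """Map a measured loop completion time onto one of n_states bins (assumed state model)."""
        frac = (loop_time - t_min) / max(t_max - t_min, 1e-9)
        return max(0, min(int(frac * self.n_states), self.n_states - 1))

    def select(self, state):
        """Epsilon-greedy choice of the DLS algorithm index for the next time step."""
        if random.random() < self.epsilon:
            return random.randrange(len(DLS_LIBRARY))
        row = self.q[state]
        return row.index(max(row))

    def update(self, state, action, loop_time, next_state, next_action):
        """One temporal-difference update; the reward is the negative loop completion time."""
        reward = -loop_time
        if self.sarsa:      # SARSA bootstraps on the action actually taken next
            target = reward + self.gamma * self.q[next_state][next_action]
        else:               # Q-learning bootstraps on the greedy next action
            target = reward + self.gamma * max(self.q[next_state])
        self.q[state][action] += self.alpha * (target - self.q[state][action])
```

In each time step, such an agent would call select() before the loop executes, measure the loop completion time, and then call update() with the observed time, mirroring the selection-and-feedback cycle described above.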

22.7   CONCLUSIONS, FUTURE WORK, AND OPEN PROBLEMS

In this chapter, an AC mechanism focused on the performance self-optimization of scientific applications is presented. The mechanism uses RL for the selection of DLS techniques. The benefits, significance, and usefulness of a DLS-with-RL approach are demonstrated for the performance improvement of a time-stepping scientific application (the simulation of wavepacket dynamics using the QTM), which contains computationally intensive parallel loops with nonuniform iteration execution times. The optimal number of processors is also identified for a fixed problem size. The DLS-with-RL approach uses two model-free RL techniques (Q-learning and SARSA) for online DLS algorithm selection while solving the scientific problem. The results show that the DLS-with-RL approach performs statistically better than the DLS-only approach, regardless of the RL technique used (and the number of times each of the DLS algorithms is selected) or the values chosen for the learning parameters.

In conclusion, these results validate the suitability of RL as a viable AC approach for improving the performance of a class of large, time-stepping scientific applications with computationally intensive parallel loops. This has been accomplished via the online selection of algorithms from a DLS library. The approach has numerous advantages, including improved performance and adaptability, and the capacity to learn a load balancing policy for a particular computing environment and application without extensive programming and analysis. One important advantage of this approach is that the RL techniques used in this chapter consider only one state variable, namely, the loop execution time at a given time-step iteration, which results in fast execution. There is, however, no loss of generality, because the DLS techniques are based on probabilistic analyses and already implicitly and simultaneously address, at both the algorithm and system levels, the causes of irregularity that would result in load imbalance during the execution of a time-stepping application. Furthermore, the proposed approach is portable and generic, facilitating the continuous and successive integration of new DLS or RL techniques.

Future work plans include the integration of other types of RL techniques in the present framework and evaluation with other time-stepping applications. One open question regards the scalability of RL-based approaches for performance optimization of scientific applications. The large size of computing systems today and the even larger expected size of future systems mandate the need to study the scalability of the proposed RL approach in terms of increased number of processors and increased number of RL agents.

ACKNOWLEDGMENTS

The authors would like to thank the National Science Foundation for its support of this work through the following grants: CAREER #9984465, ITR #0081303, and IIP #1034897. This work was also partially supported by the German Research Foundation (DFG) in the Collaborative Research Center 912 “Highly Adaptive Energy-Efficient Computing.” Special thanks go to Ricolindo L. Cariño, Sumithra Dhandayuthapani, Jaderick P. Pabico, and Mahbubur Rashid for their prior contribution leading to the current work.

REFERENCES

[1] N.R. Satish, “Compile time task and resource allocation of concurrent applications to multiprocessor systems.” PhD Thesis, EECS Department, University of California, Berkeley, January, 2009.

[2] C. Boneti, R. Gioiosa, F.J. Cazorla, and M. Valero, “Using hardware resource allocation to balance HPC applications,” in Parallel and Distributed Computing (A. Ros, ed.), Chapter 7. Rijeka, Croatia: InTech, 2010.

[3] S.F. Hummel, E. Schonberg, and L.E. Flynn, “Factoring: A method for scheduling parallel loops,” Communications of the ACM, 35(8): 90‒101, 1992.

[4] I. Banicescu and S.F. Hummel, “Balancing processor loads and exploiting data locality in N-body simulations,” ACM/IEEE Conference on Supercomputing (SC 1995) (on CDROM), 1995.

[5] S.F. Hummel, J. Schmidt, R.N. Uma, and J. Wein, “Load-sharing in heterogeneous systems via weighted factoring,” in 8th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA 1996), pp. 318‒328, 1996.

[6] I. Banicescu and Z. Liu, “Adaptive factoring: A dynamic scheduling method tuned to the rate of weight changes,” in High Performance Computing Symposium (HPC 2000), pp. 122‒129, 2000.

[7] I. Banicescu and V. Velusamy, “Load balancing highly irregular computations with the adaptive factoring,” in 16th IEEE International Parallel and Distributed Processing Symposium (IPDPS-HCW 2002) (on CDROM), 2002.

[8] I. Banicescu and V. Velusamy, “Performance of scheduling scientific applications with adaptive weighted factoring,” in 15th IEEE International Parallel and Distributed Processing Symposium (IPDPS-HCW 2001) (on CDROM), 2001.

[9] I. Banicescu, V. Velusamy, and J. Devaprasad, “On the scalability of dynamic scheduling scientific applications with adaptive weighted factoring,” Cluster Computing, 6(3): 213‒226, 2003.

[10] R.L. Cariño and I. Banicescu, “Dynamic scheduling parallel loops with variable iterate execution times,” in 16th IEEE International Parallel and Distributed Processing Symposium (IPDPS-PDSECA 2002) (on CDROM), 2002.

[11] I. Banicescu and R.L. Cariño, “Addressing the stochastic nature of scientific computations via dynamic loop scheduling,” Electronic Transactions on Numerical Analysis, Special Issue on Combinatorial Scientific Computing, 21: 66‒80, 2005.

[12] R.L. Cariño and I. Banicescu, “Dynamic load balancing with adaptive factoring methods in scientific applications,” The Journal of Supercomputing, 44(1): 41‒63, 2008.

[13] M.L. Littman, “Markov games as a framework for multi-agent reinforcement learning,” 11th International Conference on Machine Learning (ICML 1994), pp. 157‒163, 1994.

[14] G. Tesauro, “Temporal difference learning and TD-Gammon,” Communications of the ACM, 38(3): 58‒68, 1995.

[15] S. Mahadevan and J. Connell, “Automatic programming of behavior-based robots using reinforcement learning,” 9th National Conference on Artificial Intelligence (NCAI 1991), 1991.

[16] M.J. Mataric, “Reward functions for accelerated learning,” 11th International Conference on Machine Learning (ICML 1994), 1994.

[17] S. Schaal and C.G. Atkeson, “Robot juggling: An implementation of memory-based learning,” Control Systems Magazine, 14(1): 57‒71, 1994.

[18] M.G. Lagoudakis and M.L. Littman, “Algorithm selection using reinforcement learning,” 17th International Conference on Machine Learning (ICML 2000), pp. 511‒518, 2000.

[19] J. Parent, K. Verbeeck, J. Lemeire, A. Nowe, K. Steenhaut, and E. Dirkx, “Adaptive load balancing of parallel applications with multi-agent reinforcement learning on heterogeneous systems,” Scientific Programming, Distributed Computing and Applications, 12: 71‒79, 2004.

[20] G. Tesauro, “Reinforcement learning in autonomic computing: A manifesto and case studies,” IEEE Internet Computing, 11(1): 22‒30, 2007.

[21] G. Tesauro, N.K. Jong, R. Das, and M.N. Bennani, “On the use of hybrid reinforcement learning for autonomic resource allocation,” Cluster Computing, 10: 287‒299, 2007.

[22] A.Y. Zomaya, M. Clements, and S. Olariu, “A framework for reinforcement-based scheduling in parallel processor systems,” IEEE Transactions on Parallel and Distributed Systems, 9(3): 249‒260, 1998.

[23] C.J.C.H. Watkins and P. Dayan, “Q-learning,” Machine Learning, 8(3‒4): 279‒292, 1992.

[24] S. Dhandayuthapani, “Automatic selection of dynamic loop scheduling algorithms for load balancing using reinforcement learning.” Master's Thesis, Mississippi State University, 2004.

[25] S. Dhandayuthapani, I. Banicescu, R.L. Cariño, E. Hansen, J.P. Pabico, and M. Rashid, “Automatic selection of loop scheduling algorithms using reinforcement learning,” in Challenges of Large Applications in Distributed Environments (CLADE 2005), pp. 87‒94, 2005.

[26] M. Balasubramaniam, K. Barker, I. Banicescu, N. Chrisochoides, J.P. Pabico, and R.L. Cariño, “A novel dynamic load balancing library for cluster computing,” in 3rd IEEE International Symposium on Parallel and Distributed Computing, International Workshop on Algorithms, Models and Tools for Parallel Computing on Heterogeneous Networks (ISPDC/HeteroPar '04), pp. 346‒352, 2004.

[27] I. Banicescu, R.L. Cariño, J.P. Pabico, and M. Balasubramaniam, “Design and implementation of a novel dynamic load balancing library for cluster computing,” Parallel Computing, 31: 736‒756, 2005.

[28] R.L. Cariño and I. Banicescu, “A load balancing tool for distributed parallel loops,” in International Workshop on Challenges of Large Applications in Distributed Environments (CLADE 2003), pp. 39‒46, 2003.

[29] R.L. Cariño and I. Banicescu, “A load balancing tool for distributed parallel loops,” Cluster Computing, 8: 313‒321, 2005.

[30] R.L. Cariño and I. Banicescu, “A tool for a two-level dynamic load balancing strategy in scientific applications,” Scalable Computing: Practice and Experience, Special Issue on Practical Aspects of Large-Scale Distributed Computing, 8(3): 249‒261, 2007.

[31] I. Banicescu, S. Ghafoor, V. Velusamy, S. Russ, and M. Bilderback, “Experiences from integrating algorithmic and systemic load balancing strategies,” Concurrency and Computation: Practice and Experience, 13(2): 121‒139, 2001.

[32] Z. Fang, P. Tang, P.-C. Yew, and C.-Q. Zhu, “Dynamic processor self-scheduling for general parallel nested loops,” IEEE Transactions on Computers, 39: 919‒929, 1990.

[33] S.F. Hummel, I. Banicescu, C. Wang, and J. Wein, “Load balancing and data locality via fractiling: An experimental study,” in Languages, Compilers and Run-Time Systems for Scalable Computers (B.K. Szymanski and B. Sinharoy, eds.), Chapter 7, pp. 85‒98. Boston: Kluwer Academic Publishers, 1996.

[34] C.P. Kruskal and A. Weiss, “Allocating independent subtasks on parallel processors,” IEEE Transactions on Software Engineering, 11(10): 1001‒1016, 1985.

[35] C.D. Polychronopoulos and D.J. Kuck, “Guided self-scheduling: A practical scheduling scheme for parallel supercomputers,” IEEE Transactions on Computers, 36(12): 1425‒1439, 1987.

[36] K. Govindaswamy, “An API for adaptive loop scheduling in shared address space architectures.” Master's Thesis, Mississippi State University, July, 2003.

[37] R.S. Sutton and A.G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: The MIT Press, 1998.

[38] A.W. Moore and C.G. Atkeson, “Prioritized sweeping: Reinforcement learning with less data and less real time,” Machine Learning, 13: 103‒130, 1993.

[39] J. Peng and R.J. Williams, “Efficient learning and planning with the Dyna framework,” Adaptive Behavior, 1(4): 437‒454, 1993.

[40] A.G. Barto, S.J. Bradtke, and S.P. Singh, “Learning to act using real-time dynamic programming,” Artificial Intelligence, 72(1): 81‒138, 1995.

[41] L.P. Kaelbling, M.L. Littman, and A.W. Moore, “Reinforcement learning: A survey,” Journal of Artificial Intelligence Research, 4: 237‒285, 1996.

[42] M. Rashid, I. Banicescu, and R.L. Cariño, “Investigating a dynamic loop scheduling with reinforcement learning approach to load balancing scientific applications,” in 7th IEEE International Symposium on Parallel and Distributed Computing (ISPDC 2008), pp. 123‒130, 2008.

[43] R.L. Cariño, I. Banicescu, R.K. Vadapalli, C.A. Weatherford, and J. Zhu, “Parallel adaptive quantum trajectory method for wavepacket simulation,” in 17th IEEE International Parallel and Distributed Processing Symposium (IPDPS-PDSECA 2003), (on CDROM), 2003.

[44] R.L. Cariño, I. Banicescu, R.K. Vadapalli, C.A. Weatherford, and J. Zhu, “Message passing parallel adaptive quantum trajectory method,” in High Performance Scientific and Engineering Computing (L.T. Yang and Y. Pan, eds.), Chapter 9, pp. 127‒139. Norwell, MA: Kluwer Academic Publishers, 2004.
