Managing and Optimizing a Complex RAC Environment
by Syed Jaffar Hussain and Tariq Farooq
Managing any Real Application Clusters (RAC) environment, whether on a small or large scale, comes with its fair share of risks and challenges. At a certain point, balancing the various components and layers involved in setting up a well-oiled, well-tuned Oracle RAC cluster becomes an art, demanding considerable skill and a high level of expertise. This chapter gives you a first-hand shot at tackling some of the problems head-on in an effective and practical manner. There may be some overlap between this chapter and other chapters within certain knowledge streams; this is intentional, as some topics have been re-summarized within this chapter to fit the context of the subject at hand.
The overall goal of this chapter is to present ideas, tips, tricks, and methodologies for managing and optimizing large-scale cluster environments for cost-effectiveness, better performance, and simplified management. It is essential that you carefully study the pros and cons of every point mentioned in this chapter before you decide to apply any changes to your existing environment.
The following topics are presented in this chapter to help optimize any large-scale cluster environment:
Shared vs. Non-Shared Oracle Homes
In a typical cluster deployment, whether small, medium, or large sized, each node must consist of at least a Grid Infrastructure (GI) home and an RDBMS home, where the Oracle Clusterware, Automatic Storage Management (ASM), and RDBMS software binaries are installed. Though the storage requirement for an individual home is between 4 GB and 8 GB, it is best to have at minimum a 100-GB file system under which the homes will be deployed, to sustain future requirements such as upgrades and patching and to maintain adequate free space for various cluster and database logs.
When you have a complex cluster environment with a large number of nodes, say 20 or more, and the Oracle Homes on each node are configured on a local file system, you not only require a large amount of space to house them (say, 2 TB for 20 nodes), but managing the environment also becomes complex and expensive from a storage cost perspective.
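The 2-TB figure follows directly from the 100-GB-per-node file system recommended above; here is a quick back-of-the-envelope check (the numbers are illustrative, not Oracle-mandated):

```shell
# Rough storage math for local (non-shared) Oracle Homes, assuming the
# recommended 100 GB of file system per node.
nodes=20
gb_per_node=100
total_gb=$((nodes * gb_per_node))
echo "${total_gb} GB (~$((total_gb / 1000)) TB) of local storage for ${nodes} nodes"
# prints: 2000 GB (~2 TB) of local storage for 20 nodes
```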
The administrative complexity and cost can be cut down by sharing the Oracle Homes across the nodes in a cluster. Oracle gives you the flexibility to install the software once in a shared cluster file system and share it among all the nodes in a RAC cluster. The key advantage of this approach is that it not only reduces the overall installation duration by skipping the remote copy operation, but also drastically reduces the storage requirements on the other nodes. You can use any Oracle-certified or -supported shared storage to install the Oracle Homes.
During a fresh cluster installation, when a shared storage location is specified for the software, OUI automatically detects and adjusts to the environment and bypasses the remote-node software copy step, speeding up the installation. In addition, this kind of setup requires less storage for the software installation and potentially reduces the duration when you plan to deploy a patch on the existing environment or to upgrade it. Adding nodes also becomes faster when shared Oracle Homes are in place.
The out-of-place upgrade option introduced in 11gR2 (11.2.0.2) doesn't disturb the current Oracle Homes much, as it doesn't require prolonged downtime during the course of the upgrade. In contrast, patch deployment in a shared home requires complete cluster-wide downtime, unlike with non-shared Oracle Homes.
Regardless of shared or non-shared Oracle Homes, both settings have their share of strengths and weaknesses. The following highlights some pros and cons of shared Oracle Homes:
Above all, you can also configure a mixed environment: both shared and non-shared Oracle Homes in a single cluster. In such a mixed environment, you can implement shared Oracle Homes on selected nodes to take advantage of the storage and other benefits discussed a little earlier. The following configuration is possible in a single environment:
Figure 7-1 depicts non-shared, shared, and mixed Oracle Home environment settings.
Figure 7-1. Shared/non-shared/mixed OH environment
A new feature introduced with 11gR2, server pools, is a dynamic grouping of Oracle RAC instances into abstracted RAC pools, by virtue of which a RAC cluster can be efficiently and effectively partitioned, managed, and optimized. Simply put, they are a logical grouping of server resources in a RAC cluster and enable the automatic on-demand startup and shutdown of RAC instances on various nodes of a RAC cluster.
Server pools generally translate into a hard decoupling of resources within a RAC cluster. Databases must be policy managed, as opposed to admin managed (the older way of RAC server partitioning: pre-11gR2) for them to be organized as/within server pools.
Each cluster-managed database service has a one-to-one relationship with server pools; a DB service cannot be a constituent of multiple server pools at the same time.
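Because of this one-to-one relationship, the pool is named when the service is defined. As a sketch only (the database, service, and pool names here are hypothetical, not values from this chapter's environment), a cluster-managed service could be bound to a pool like this:

```shell
$ srvctl add service -d orcl -s rptsrv -serverpool srvpl_pbd -cardinality UNIFORM
$ srvctl config service -d orcl -s rptsrv    # shows the owning server pool
```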
Server pools and policy management of RAC database instances come in really handy when you have large RAC clusters with a lot of nodes. However, this technology can be used for policy-based management of smaller clusters as well.
Types of Server Pools
Server pools are classified into two types, both of which are explained in detail as follows:
As the name implies, system-defined server pools are created automatically by default with a new installation. Server pools of this type house any and all non-policy-managed databases. System-defined server pools are further broken up into two subcategories: GENERIC and FREE.
In addition to RAC servers assigned to a system-defined server pool, all non-policy-managed (admin-managed) databases are housed within the GENERIC server pool. The attributes of a GENERIC server pool are not modifiable.
All unassigned RAC servers automatically get attached to the FREE server pool by default. Common attributes of a server pool (MIN_SIZE, MAX_SIZE, SERVER_NAMES) are not modifiable within the FREE server pool. However, the IMPORTANCE and ACL properties of the FREE server pool can be changed.
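To illustrate, the modifiable attributes of the FREE pool can be changed with crsctl, while the fixed ones cannot; the attribute value below is arbitrary:

```shell
$ crsctl modify serverpool Free -attr "IMPORTANCE=10"   # allowed
$ crsctl status serverpool Free -f                      # confirm the change
```

Attempting the same modification on MIN_SIZE, MAX_SIZE, or SERVER_NAMES of the Free pool raises an error instead.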
The attributes of user-defined server pools are listed as follows. This methodology is termed “server categorization” by Oracle.
Creating and Managing Server Pools
Server pools (with RAC servers attached to them) can easily be defined at database creation time within the Database Configuration Assistant (DBCA) as well as with the SRVCTL and CRSCTL command-line utilities. Additionally, they can be created/managed from Oracle Enterprise Manager (recommended approach).
Some examples of command-line and GUI tool options for adding and managing server pools are listed as follows.
$ crsctl add serverpool bsfsrvpool0l -attr "MIN_SIZE=1,MAX_SIZE=2,IMPORTANCE=100"
$ crsctl delete serverpool <poolname>
$ crsctl status serverpool -f
$ srvctl config srvpool
Figure 7-3. Examples: Creating/attaching to a server pool using DBCA
Figure 7-6. Creating and managing server pools within Oracle Enterprise Manager Cloud Control
Planning and Designing RAC Databases
Before a new RAC database is deployed in a complex or large-scale cluster environment, it is essential that you carefully plan and design the database resources, keeping in view the requirements from both application and business points of view. When you plan a new database, you have to consider the core requirements of the application and future business demands, such as increased workload and how to control the resources on the server. When multiple databases are configured on a node, each database has different CPU and memory requirements: some databases need more resources, while others need much less. By default, however, there is no control over CPU consumption by the instances running on the node.
In the subsequent sections, we give you some valuable tips to optimize RAC databases using some of the features introduced in recent and current Oracle versions, for example, instance caging and policy-managed databases.
From 11gR2 onward, you are given a choice between two RAC database management configuration types during database creation within DBCA: admin managed and policy managed. Depending on the application and business needs, and considering workload demands, you can choose either configuration option. A typical admin-managed database is somewhat static: the database, and its connect configuration as well, is assigned to a specific set of nodes in the cluster. You define your requirements during database or service creation, pinning them to the particular nodes on which they will be deployed and reside. A policy-managed database, on the other hand, is more dynamic in nature and acts according to workload requirements. For a policy-managed database, resources are allocated in advance, keeping future requirements in mind, and can be adjusted just in time dynamically without much change to the connection configuration.
In a large-scale production RAC setup with a huge number of databases deployed, it is common for each database's requirements and usage to differ considerably from the others'. In contrast to a traditional admin-managed database, a policy-managed database lets you define in advance the number of instances and the resources needed to support the expected workload. Policy-managed databases are generally useful in very-large-scale cluster deployments, and the benefits include the following:
Deploying a Policy-Managed Database
A policy-managed database is configured using the DBCA. Likewise, any preexisting admin-managed database can be effortlessly converted to a policy-managed database in place. Unlike with a typical admin-managed database, when you create a new policy-managed database you need to specify a server pool on which the instances will run. You can select a preexisting server pool from the list or create a new server pool with the required cardinality (number of servers), as shown in the database placement screenshot in Figure 7-7.
Figure 7-7. DBCA policy-managed DB configuration
With regard to policy-managed database configuration, the RAC configuration type and the selection or creation of a server pool are the only requirements that differ from a typical admin-managed database. The rest of the database creation steps remain unchanged.
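For completeness, a policy-managed database can also be registered against a server pool from the command line with srvctl; the database name and home path below are illustrative assumptions, not values from this chapter's environment:

```shell
$ srvctl add database -d pmdb -o /u01/app/oracle/product/12.1.0/dbhome_1 -serverpool srvpl_pbd
$ srvctl start database -d pmdb
$ srvctl status database -d pmdb
```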
Upgrading a Policy-Managed Database
A direct upgrade of a pre-12c admin-managed database to a 12c policy-managed database is not supported; this will have to be a combo upgrade/migration process. The following steps are required:
Migrating an Admin-Managed Database to a Policy-Managed Database
This section provides the essential background required to convert an admin-managed database to a policy-managed database. The following is a practical step-by-step conversion procedure:
$ srvctl config database -d <dbname>
$ srvctl status database -d <dbname>
$ srvctl config service -d <dbname>
$ srvctl status service -d <dbname>
$ srvctl add srvpool -serverpool srvpl_pbd -min 0 -max 4 -servers node1,node2,node3,node4 -category <name> -verbose
$ srvctl config srvpool -serverpool srvpl_pbd    # verify the details
$ srvctl stop database -d <dbname>
$ srvctl modify database -d <dbname> -serverpool srvpl_pbd
The preceding srvctl modify command converts an admin-managed to a policy-managed database and places the database into the new policy-managed server pool, named srvpl_pbd.
$ srvctl start database -d <dbname>
$ srvctl config database -d <dbname>
Post-conversion, when you verify the database configuration details, you will notice that the management type has changed to "Database is policy managed" instead of "Database is admin managed." You will also observe a change in the instance name: an additional underscore (_) appears before the instance number, in contrast to the previous name.
If the database and its services are managed with OEM Cloud Control 12c, you also need to run through a corresponding set of changes in that environment to reflect the conversion.
$ mv orapwDBNAME1 orapwDBNAME_1
$ srvctl modify service -d <dbname> -s <service_name> -serverpool srvpl_pbd -cardinality UNIFORM|SINGLETON
Note We strongly recommend using the SCAN IPs to connect to the database to avoid connection issues when a database is moved between the nodes defined in the server pool.
When multiple instances are consolidated on a RAC node, resource distribution, consumption, and management become a challenge for a DBA/DMA. Each instance typically has different requirements and behavior during peak and off-peak business timeframes, and there is a high probability that a CPU-hungry instance could impact the performance of other instances configured on the same node. Instance caging, a powerful new feature in Oracle 11gR2, lets DBAs significantly simplify and control each instance's overall CPU/core usage by restricting it to a certain number of CPUs. Large-scale RAC environments can benefit greatly from this option.
Instance caging is deployed by setting two database initialization parameters on the instance: cpu_count and resource_manager_plan. The following procedure explains the practical steps required to enable instance caging on an instance:
SQL> ALTER SYSTEM SET CPU_COUNT=2 SCOPE=BOTH SID='INSTANCE_NAME';
SQL> ALTER SYSTEM SET RESOURCE_MANAGER_PLAN = default_plan|myplan SCOPE=BOTH SID='INSTANCE_NAME';
The first SQL statement restricts the database instance's CPU resource consumption to 2, irrespective of the number of CPUs present on the local node; the instance's CPU use will be limited to at most the power of 2 CPUs. If the database is policy managed and tends to move between nodes, you need to plan the cpu_count value carefully, considering the resource availability on all the nodes in the server pool.
The second SQL statement enables the Resource Manager facility and activates either the default plan or a user-defined resource plan. An additional background process, the Virtual Scheduler for Resource Manager (VKRM), is started when Resource Manager is enabled. To verify whether instance caging is enabled on an instance, use the following examples:
SQL> show parameter cpu_count
SQL> SELECT instance_caging FROM v$rsrc_plan WHERE is_top_plan = 'TRUE';
When instance caging is configured and enabled, it restricts CPU usage for the individual instances and prevents any one instance from consuming excessive CPU on the node.
After enabling the feature, CPU usage can be monitored by querying the gv$rsrc_consumer_group and gv$rsrcmgrmetric_history dynamic views. These views provide useful input about instance caging, including per-minute detail of CPU consumption and throttling feedback for the past hour.
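As one hedged example of such monitoring (the column selection is ours; adjust to taste), the following query pulls the per-minute caging statistics from gv$rsrcmgrmetric_history:

```sql
SQL> SELECT inst_id,
            TO_CHAR(begin_time, 'HH24:MI') AS minute,
            consumer_group_name,
            cpu_consumed_time,    -- CPU consumed (ms) in the interval
            cpu_wait_time         -- time spent throttled (ms) in the interval
       FROM gv$rsrcmgrmetric_history
      ORDER BY inst_id, begin_time;
```

A consistently high cpu_wait_time relative to cpu_consumed_time suggests the cage is too tight for the workload.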
To disable instance caging, you simply need to reset the cpu_count parameter to 0 on the instance.
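Disabling the cage is thus the mirror image of enabling it, for example:

```sql
SQL> ALTER SYSTEM SET CPU_COUNT=0 SCOPE=BOTH SID='INSTANCE_NAME';
```

Setting cpu_count back to 0 restores the default behavior, in which Oracle computes the value automatically from the CPUs available on the node.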
Note cpu_count and resource_manager_plan are dynamic initialization parameters, which can be altered without instance downtime. Individual instances within a RAC database can have different cpu_count values. However, frequent modification of the cpu_count parameter is not recommended.
Figure 7-8 shows how to partition a 16-CPU/core box (node) using instance caging to limit different instances on the node to a specific amount of CPU usage.
Figure 7-8. Example instance caging a 16-CPU/core box for consolidated RAC databases
Small- vs. Large-Scale Cluster Setups
It's an open secret that most IT requirements are derived from business needs. To meet those requirements, you will have either small- or large-scale cluster setups in your environments. This part of the chapter focuses on some useful comparisons between small- and large-scale cluster setups and also addresses the complexity involved in large cluster setups.
It is pretty difficult to jump in and say straightaway whether a small-scale cluster setup is better than a large-scale one, as the needs of one organization are totally dissimilar to those of another. However, once you thoroughly understand the benefits and risks of the two types of setup, you can decide which is the best option to proceed with.
Typically, in any cluster setup, there is a huge degree of coordination in the CRS, ASM, and instance processes involved between the nodes.
Typically, a large-scale cluster setup is a complex environment with a large deployment of resources. In a large-scale cluster setup, the following situations can be anticipated:
On the other hand, a smaller cluster setup with fewer nodes is easier to manage and involves less complexity. If it is possible to deploy several small clusters instead of one large-scale setup, that is often the better option, considering the complexity and effort required for the two types.
Split-Brain Scenarios and How to Avoid Them
Split-brain scenarios are a RAC DBA or DMA’s worst nightmare. They are synonymous with a few interchangeable terms: node evictions, fencing, STONITH (Shoot the Node in the Head), etc.
The term split-brain scenario originates from its namesake in the medical world: split-brain syndrome, in which the connection between the two hemispheres of the brain is severed, resulting in significant impairment of normal brain function. Split-brain scenarios in a RAC cluster can be defined as functional overlapping of separate instances, each in its own world without any guaranteed consistency or coordination between them. Basically, communication is lost among the nodes of a cluster, nodes are evicted/kicked out of the cluster, and node/instance reboots result. This can happen for a variety of reasons, ranging from hardware failure on the private cluster interconnect to nodes becoming unresponsive because of CPU/memory starvation issues.
The next section covers the anatomy of node evictions in detail.
Split-brain situations generally translate into a painful experience for the business and IT organizations, mainly because of the unplanned downtime and the corresponding impact on the business involved. Figure 7-9 shows a typical split-brain scenario:
Figure 7-9. Typical split-brain scenario within an Oracle RAC cluster
Here are some best practices that you can employ to eliminate or at least mitigate potential split-brain scenarios, which also serve as good performance tuning tips and tricks as well (a well-tuned system has significantly less potential for overall general failure, including split-brain conditions):
Understanding, Debugging, and Preventing Node Evictions
Node Evictions—Synopsis and Overview
A node eviction is the mechanism (a piece of code) designed within the Oracle Clusterware to ensure cluster consistency and maintain overall cluster health by removing any node that either suffers critical issues or doesn't respond to the other nodes' requests in a timely manner. For example, when a node in the cluster is hung, or is suffering critical problems such as network or disk latency too severe to maintain the heartbeat within the internal timeout value, or when the cluster stack or Clusterware is unhealthy, the node will leave the cluster and perform a fast self-reboot to ensure overall cluster health. When a node doesn't respond to another node's request in a timely manner, it receives a poison packet through disk or network with the instruction to leave the cluster by killing itself. When the problematic node reads the poison (kill) packet, it evicts itself and leaves the cluster, then performs a fast reboot to restore a healthy environment among the remaining nodes. A fast reboot generally doesn't wait to flush pending I/O; therefore, the node is fenced right after the reboot.
Node eviction is indeed one of the worst nightmares for RAC DBAs, posing never-ending challenges on every occurrence. Typically, when you maintain a complex or large-scale cluster setup with a huge number of nodes, frequent node evictions are to be anticipated. Therefore, node eviction is one of the key areas to which you have to pay close attention.
In general, a node eviction means unscheduled server downtime; an unscheduled downtime can cause service disruption, and frequent node evictions can even impact an organization's overall reputation. In this section, we present the most common symptoms that lead to a node eviction, and also the crucial cluster log files that should be examined to analyze and debug the issue and find the root cause of the eviction.
If you have been a cluster administrator for a while, we are pretty sure you have confronted a node eviction at least once in your environment. The generic node eviction information logged in some of the cluster logs sometimes doesn't actually point toward the real root cause of the problem. Therefore, you need to gather and refer to various types of log information, from the cluster, the platform, OS Watcher files, etc., in order to find the root cause of the issue.
A typical warning message about a particular node being evicted, outlined hereunder, is printed in the cluster alert.log of a surviving node just a few seconds before the eviction. Though you can't prevent the node eviction in such a short span, the warning message does inform you which node is about to leave the cluster:
[ohasd(6525)]CRS-8011:reboot advisory message from host: node04,component: cssmonit,with time stamp: L-2013-03-17-05:24:38.904
[ohasd(6525)]CRS-8013:reboot advisory message text: Rebooting after limit 28348 exceeded; disk timeout 28348, network timeout 27829,
last heartbeat from CSSD at epoch seconds 1363487050.476, 28427 milliseconds ago based on invariant clock value of 4262199865
2013-03-17 05:24:53.928
[cssd(7335)]CRS-1612:Network communication with node node04 (04) missing for 50% of timeout interval. Removal of this node from cluster in 14.859 seconds
2013-03-17 05:25:02.028
[cssd(7335)]CRS-1611:Network communication with node node04 (04) missing for 75% of timeout interval. Removal of this node from cluster in 6.760 seconds
2013-03-17 05:25:06.068
[cssd(7335)]CRS-1610:Network communication with node node04 (04) missing for 90% of timeout interval. Removal of this node from cluster in 2.720 seconds
2013-03-17 05:25:09.842
[cssd(7335)]CRS-1601:CSSD Reconfiguration complete. Active nodes are node01,node02,node03.
2013-03-17 05:37:53.185
[cssd(7335)]CRS-1601:CSSD Reconfiguration complete. Active nodes are node01,node02,node03.
The preceding extract is from the node01 alert.log file of a 4-node cluster and records the eviction of node04. You can see that the first warning appeared when the outgoing node missed 50% of the timeout interval, followed by the 75% and 90% warning messages. The reboot advisory message identifies the component that actually initiated the node eviction; in this example it was cssmonit, which triggered the eviction due to lack of memory on the node.
Here is the continuous extract from the ocssd.log file from a surviving node, node01, about node04 eviction:
2013-03-17 05:24:53.928: [ CSSD][53]clssnmPollingThread: node node04 (04) at 50% heartbeat fatal, removal in 14.859 seconds
2013-03-17 05:24:53.928: [ CSSD][53]clssnmPollingThread: node node04 (04) is impending reconfig, flag 461838, misstime 15141
2013-03-17 05:24:53.938: [ CSSD][53]clssnmPollingThread: local diskTimeout set to 27000 ms, remote disk timeout set to 27000, impending reconfig status(1)
2013-03-17 05:24:53.938: [ CSSD][40]clssnmvDHBValidateNCopy: node 04, node04, has a disk HB, but no network HB, DHB has rcfg 287, wrtcnt, 77684032, LATS 4262516131, lastSeqNo 0, uniqueness 1356505641, timestamp 1363487079/4262228928
2013-03-17 05:24:53.938: [ CSSD][49]clssnmvDHBValidateNCopy: node 04, node04, has a disk HB, but no network HB, DHB has rcfg 287, wrtcnt, 77684035, LATS 4262516131, lastSeqNo 0, uniqueness 1356505641, timestamp 1363487079/4262228929
2013-03-17 05:24:54.687: [ CSSD][54]clssnmSendingThread: sending status msg to all nodes
2013-03-17 05:25:02.028: [ CSSD][53]clssnmPollingThread: node node04 (04) at 75% heartbeat fatal, removal in 6.760 seconds
2013-03-17 05:25:02.767: [ CSSD][54]clssnmSendingThread: sending status msg to all nodes
2013-03-17 05:25:06.068: [ CSSD][53]clssnmPollingThread: node node04 (04) at 90% heartbeat fatal, removal in 2.720 seconds, seedhbimpd 1
2013-03-17 05:25:06.808: [ CSSD][54]clssnmSendingThread: sending status msg to all nodes
2013-03-17 05:25:06.808: [ CSSD][54]clssnmSendingThread: sent 4 status msgs to all nodes
In the preceding, the clssnmPollingThread, a CSS daemon thread whose responsibility is to scan and verify the active node members, periodically reports that node04 is missing its heartbeat and also records the timestamp of the pending node eviction.
The following text from the ocssd.log on node01 exhibits the details about the node being evicted, node cleanup, node leaving cluster, and rejoining sequence:
2013-03-17 05:25:08.807: [ CSSD][53]clssnmPollingThread: Removal started for node node04 (14), flags 0x70c0e, state 3, wt4c 0
2013-03-17 05:25:08.807: [CSSD][53]clssnmMarkNodeForRemoval: node 04, node04 marked for removal
2013-03-17 05:25:08.807: [ CSSD][53]clssnmDiscHelper: node04, node(04) connection failed, endp (00000000000020c8), probe(0000000000000000), ninf->endp 00000000000020c8
2013-03-17 05:25:08.807: [ CSSD][53]clssnmDiscHelper: node 04 clean up, endp (00000000000020c8), init state 5, cur state 5
2013-03-17 05:25:08.807: [GIPCXCPT][53] gipcInternalDissociate: obj 600000000246e210 [00000000000020c8] { gipcEndpoint : localAddr '
2013-03-17 05:25:08.821: [ CSSD][1]clssgmCleanupNodeContexts(): cleaning up nodes, rcfg(286)
2013-03-17 05:25:08.821: [ CSSD][1]clssgmCleanupNodeContexts(): successful cleanup of nodes rcfg(286)
2013-03-17 05:25:09.724: [ CSSD][56]clssnmDeactivateNode: node 04, state 5
2013-03-17 05:25:09.724: [ CSSD][56]clssnmDeactivateNode: node 04 (node04) left cluster
Node Evictions—Top/Common Causes and Factors
The following are only a few of the most common symptoms and factors that lead to node evictions, sudden death of the cluster stack, reboots, and unhealthy status:
Consult the following trace/log files to gather crucial information for diagnosing and identifying the real cause of a node eviction:
With the incorporation of rebootless fencing in 11gR2, under certain circumstances a node won't actually reboot completely; instead, Oracle attempts to stop the GI stack on the node gracefully to avoid a complete node eviction. For example, when the GI home or the /var file system runs out of space on the device, or the cssd/crsd/evmd processes become unhealthy, this behavior (in contrast to pre-11gR2) prevents a total node reboot. After resolving the underlying issue, restart the individual stack resources to bring the cluster stack back to a healthy state. However, if rebootless fencing fails to stop one or more resources, the cssdagent or cssdmonitor will fall back to a complete node eviction and perform a fast reboot. In this case, a warning message about the resource failure is written to the alert.log file.
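After fixing the underlying issue (for example, freeing space in the GI home or /var), the stopped stack can be restarted as sketched below; the node name is illustrative:

```shell
$ crsctl start cluster -n node04    # restart the Clusterware stack on the affected node
$ crsctl check cluster -all         # verify stack health across all cluster nodes
```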
Extended Distance (Stretch) Clusters—Synopsis, Overview, and Best Practices
Extended distance (stretch) clusters (Figure 7-10) are a very rare species and justifiably so. RAC clusters are generally architected to reside within a single geographical location. Stretch clusters are generally defined as RAC clusters with nodes spanning over multiple data centers, resulting in a geographically spread-out private cluster interconnect.
Figure 7-10. Example configuration: Extended distance (stretch) RAC cluster
Basically, stretch clusters are used for a very specific reason: building Active-Active RAC clusters spread across geographically distant locations. However, due to network latency and cost issues, Active-Active Multi-Master GoldenGate is often a more viable option than an extended distance cluster. It is important to understand that stretch clusters are not, and should not be used as, a replication/Disaster Recovery technology; Data Guard is the recommended technology for DR/failover purposes.
Extended Distance (Stretch) Clusters: Setup/Configuration Best Practices
Following are good design best practices critical to the smooth operation of extended distance clusters:
Following are the pros/benefits/advantages of extended distance clusters:
By the same token, extended distance clusters are limited in their capabilities by the following disadvantages/cons:
Setup and Configuration—Learning the New Way of Things
OUI
OUI has come quite a ways in its long and distinguished history. The installer grows more powerful, intuitive, and easy to use with each new version; as an integral tool, it automates quite a bit of the setup and configuration of an Oracle RAC cluster according to industry-standard best practices.
Shown in Figures 7-11, 7-12, 7-13, and 7-14 are some examples of recent enhancements in OUI.
Figure 7-11. Examples of recent enhancements/features added to the OUI for setting up a RAC cluster
Figure 7-12. Examples of recent enhancements/features added to the OUI for setting up a RAC cluster
Figure 7-13. Examples of recent enhancements/features added to the OUI for setting up a RAC cluster
Figure 7-14. Examples of recent enhancements/features added to the OUI for setting up a RAC cluster
Oracle Enterprise Manager Cloud Control 12c
Oracle Enterprise Manager is the end-to-end tool for configuring, managing, and monitoring RAC clusters. Features like conversion of Oracle databases from single-instance to RAC in Oracle Enterprise Manager Cloud Control 12c are some of the many powerful features that should be leveraged, especially when databases need to be managed or configured en masse.
It is important to note that Oracle Enterprise Manager, along with RAC and ASM, forms the Database-as-a-Service cloud paradigm. This underscores the significance of OEM within the RAC ecosystem as a whole: in other words, they integrally complement each other to constitute the Oracle Database Cloud.
Figure 7-15. Example activity: Configuring/managing RAC within Oracle Enterprise Manager Cloud Control 12c
Figure 7-16. Example activity: Configuring/managing RAC within Oracle Enterprise Manager Cloud Control 12c
Figure 7-17. Example activity: Configuring/managing RAC within Oracle Enterprise Manager Cloud Control 12c
RAC Installation and Setup—Considerations and Tips for OS Families: Linux, Solaris, and Windows
There may be some overlap in the material presented in this section with other chapters. The intent is to summarize proven and well-known tips, tricks, and best practices in setting up and configuring Oracle RAC clusters in the various variants of popular OS families of Linux, Solaris, and Windows. The list is long and expansive and therefore, not everything can be covered in detail; salient bullet points follow.
RAC Database Performance Tuning: A Quick n’ Easy Approach
OEM12c is a leap forward in sophisticated DB management, monitoring, and administration—a great tool for quick, easy, and intuitive performance tuning of a RAC cluster. This section is an overview, and further details are presented in other chapters.
OEM Cloud Control facilitates and accelerates identification and fixing of DB bottlenecks, problematic SQL statements, locks, contention, concurrency, and other potential crisis situations, including monitoring/logging into hung databases.
Holistic graphical analysis can quickly and easily help a RAC DBA/DMA to identify performance bottlenecks at the cluster/instance level, resulting in lower incident response times.
The 3 A’s of Performance Tuning
The 3 A’s of performance tuning in a RAC Database, all of which are easily accessible in quite a bit of drill-down detail within OEM12c, are listed in the following:
The preceding are supplemented by the following new/recent features within OEM12cR2:
Some of these much-needed tools/features are incorporated within the diagnostics and tuning pack, which is essential for enhanced and effective monitoring, troubleshooting, and performance tuning the Oracle Database Server family including RAC.
Slow times occur in a RAC database far more frequently than downtime does. The perfect tool to attack this problem is the OEM12cR2 ➤ Performance Home Page, which, as the heart and soul of the visual monitoring/analysis methodology, gives you a clear and easy-to-interpret overview of RAC DB health in real time as well as from a historical perspective.
OEM 12c R2 ➤ Performance Home Page
The Performance Home Page within OEM12cR2 is all about graphs, charts, and colors; in other words, pictures! It is an intuitive starting point for effective and easy-to-learn performance tuning of the Oracle Database Server family, including RAC.
Some examples of powerful and easy-to-use performance tuning features within OEM 12cR2 are shown in Figures 7-18 and 7-19.
Figure 7-18. OEM 12c ➤ Targets ➤ Select RAC DB ➤ Home Page gives performance, resources, and SQL summary
Figure 7-19. Performance Home Page: The stargate of Oracle RAC (and non-RAC) DB tuning in OEM Cloud Control
“Performance” Menu ➤ “Performance Home” homepage:
Understanding the colors on the various graphical depictions and the information/severity level they represent is very important to this approach: generally, the redder the color, the more trouble the RAC DB is in.
Figure 7-20. Cluster ➤ Performance: CPU, memory, and disk utilization within OEM12c
Figure 7-21. Cluster ➤ Performance: CPU, memory, and disk utilization within OEM12c
Figure 7-22. Cluster cache coherency link/menu item: The main link/entry page for RAC cluster monitoring/tuning
Figure 7-23. Cluster cache coherency link/menu item: The main link/entry page for RAC cluster monitoring/tuning
Figure 7-24. Cluster cache coherency link/menu item: The main link/entry page for RAC cluster monitoring/tuning
The preceding examples are just some of many powerful features for a quick n’ easy way of performance tuning RAC clusters within OEM12cR2.
Summary
RAC lies at the most complex end of the Oracle Database Server family spectrum and can be challenging to get one's arms around, especially in large-scale setups. Managing and optimizing any RAC cluster requires a considerable amount of skill, experience, and expertise. This chapter covered the most essential knowledge streams related to easy, policy-based management of Oracle RAC clusters. From node evictions and split-brain scenarios to tips and tricks for various OS families, server pools, stretched clusters, and a whole lot more, a wide and expansive range of topics was covered to help efficiently and effectively ease the daily load of a RAC DBA/DMA.