Chapter 22 Configuring the Database for Backup and Recovery

Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Images

CHAPTER 22 Configuring the Database for Backup and Recovery

Exam Objectives

• 62.1.16.1 Identify the Types of Failures that Can Occur in an Oracle Database

• 62.1.16.2 Tune Instance Recovery

• 62.1.16.3 Identify the Importance of Checkpoints, Redo Log Files, and Archive Log Files

• 62.1.16.5 Configure ARCHIVELOG Mode

• 62.1.16.4,
63.1.3.2 Configure the Fast Recovery Area

• 63.1.3.3 Configure Control Files and Redo Log Files for Recoverability

This chapter summarizes what can go wrong in an Oracle database and the steps necessary to ensure that data will never be lost. This requires an understanding of the instance recovery mechanism and the usage of certain files.

Possible Failures and Their Resolution

Eventually your database will have some kind of failure whether it be a network failure, user error, disk drive failure, or memory corruption issue causing an instance failure. To prepare for this nearly inescapable occurrence, it is necessary to have a backup and recovery plan in place. This backup and recovery plan should be tested on a regular basis to make sure that the recovery techniques will be successful if they are ever needed.

To create a successful backup and recovery strategy, you must first understand the types of failures and how Oracle responds to each of them. Some require no user or database administrator (DBA) intervention, and others do. Which backup and recovery solutions you leverage depend on many factors, including how fast you need to recover from a failure and how many resources you want to dedicate to the solution. Your investment in recovery infrastructure is proportional to the cost of lost productivity or business income if the database is unavailable for a day, an hour, or a minute.

Failure Categories

The types of failures or errors you may encounter fall into two general categories: physical and logical. Physical errors are generally hardware errors or software errors in the applications using the database, while logical errors are typically at the end-user level (database users or administrators). The categories of failures are as follows:

• Statement failure A user’s SELECT or Data Manipulation Language (DML) statement failed because of permissions, syntax, or resource limits.

• User error The user mistakenly dropped a table or deleted the wrong rows in a table.

• User process failure The connection to the database failed because of a client disconnect or unexpected shutdown.

• Network failure As the name implies, the network connection between the client and the server (database) failed because of network hardware or protocol errors.

• Instance failure The database instance crashed because of a bug, operating system (OS) errors, memory corruption, or even power loss to the server.

• Media failure There were disk drive physical errors or a controller hardware failure.

These failure categories are in order of least serious and easiest to recover from to the most serious and harder to recover from. Specifically, statement failures are the easiest to recover from since they are almost always user managed: you rewrite the SELECT or INSERT statement so it does not generate a syntax error. Accidentally dropping an important table can be recovered by the user or by the database administrator depending on whether the dropped table is still in the recycle bin or whether the DBA has given the user privileges to use various flashback features. Recovery from network, instance, or media failure will necessitate varying actions, the choice of which will depend on the criticality and severity of any lost data.

Your backup and recovery strategy will need to account for all of these failures, even though some of these failures are easier to recover from than others. For example, an instance failure may be because of a power outage. The recovery of committed transactions is relatively painless and automatic as long as your online redo log files are intact and multiplexed.

Oracle Backup and Recovery Solutions

Your recovery time objective (RTO) is the target in which a recovery operation must be completed to meet your customer’s or client’s service level agreement (SLA). Various Oracle tools and technologies will apply depending on the RTO:

• Days or hours Recovery Manager (RMAN), and optionally Oracle Secure Backup, can recover your database in days or hours if the entire database is lost because of natural disaster or disk failures.

• Hours or minutes Various flashback technologies, either user initiated or DBA initiated, can recover database objects usually while the database is still available to other users. Although the database needs to be temporarily shut down, by using a flashback database the DBA can roll back the entire database to a point in time earlier in the day or even weeks ago depending on the storage available in the Fast Recovery Area and the restore points defined.

• Minutes or seconds If the database must be available continuously with downtimes no longer than a few minutes, Oracle solutions such as Data Guard or Real Application Cluster (RAC) can fail over to a backup database or an alternative running instance with minimal or no intervention from the DBA.

• Recovery analysis Regardless of the type of failure or RTO, the Oracle Data Recover Advisor makes it easy to quickly determine the type of failure and the fastest way to recover from a specific type of failure.

Recovery Manager

Recovery Manager is the primary tool you use to back up, restore, and recover database objects from the table level (new to Oracle Database 12c) to the datafile, tablespace, and, of course, database level. RMAN has many uses outside of backup and recovery, including the cloning or duplication of a database to another location.

A key component of RMAN is a special location for backup and recovery objects called the Fast Recovery Area (FRA). While this area is ideally a disk group in ASM, it can also be located in an operating system (OS) file system. Regardless of location, it is a centralized place for all backup and recovery objects. The FRA is managed based on size and your recovery objectives, whether that’s based on the recovery window or on the number of backups you need to retain. Using an FRA is optional but is considered to be best practice.

Oracle Secure Backup

In conjunction with RMAN, Oracle Secure Backup (OSB) will take RMAN backups and copy them to a tape device or to cloud storage to prevent loss of data from a catastrophic failure at a data center. OSB also provides an extension to RMAN at the OS level to back up Linux servers as well as any attached storage such as in a network attached storage (NAS) appliance.

Oracle Data Guard

Oracle Data Guard is one of Oracle’s high-availability (HA) solutions to ensure near-real-time availability because of a failure of the primary database or to prevent database corruptions. A standby database (several of which may be created) is instantiated from a copy of the primary database and receives redo from the primary database and applies this redo to update its datafiles. Thus, the standby is kept synchronized with the primary. A standby database can also play the temporary role of a read-only copy of the database for reporting purposes and therefore free up resources on the primary database for better response time in an online transaction processing (OLTP) environment.

Another type of standby database is called a logical standby database. Instead of continuously applying redo to a physical copy of the primary database, a logical standby database converts the redo into equivalent DML SQL. Therefore, the standby database is logically equivalent to the standby database but will almost certainly not have the identical physical structure of the primary database. A logical standby is not really part of a fault-tolerant environment; it is a separate database optimized as a data warehouse that happens to contain the same data as the primary.

Instance Recovery and the Impossibility of Database Corruption

An instance failure could be caused by a power outage, rebooting the server, issuing a SHUTDOWN ABORT command, or anything that causes the instance background processes to terminate and the System Global Area (SGA) to be destroyed—all this without any attempt to flush changed buffers in the cache to the datafiles or to roll back any in-flight transactions. In principle, instance recovery is nothing more than using the contents of the online log files to rebuild the database buffer cache to the state it was in before the crash. This will replay all changes extracted from the online redo log files that refer to blocks that had not been written to disk at the time of the crash. Once this has been done, the database can be opened. This phase of recovery, known as the roll forward, reinstates all changes—changes to data blocks and changes to undo blocks—for both committed and uncommitted transactions. Each redo record has the bare minimum of information needed to reconstruct a change: the block address and the new values. During a roll forward, each redo record is read, the appropriate block is loaded from the datafiles into the database buffer cache, and the change is applied. Then the block is written back to disk.

Once the roll forward is complete, it is as though the crash had never occurred. But at that point, there will be uncommitted transactions in the database; these must be rolled back, and Oracle will do that automatically in the rollback phase of instance recovery. However, that happens after the database has been opened for use. If a user connects and hits some data that needs to be rolled back and hasn’t yet been, this is not a problem; the roll forward phase will have populated the undo segment that was protecting the uncommitted transaction, so the server can roll back the change in the normal manner for read consistency.

Instance recovery is automatic and unavoidable, so how do you invoke it? By issuing a STARTUP command. When starting an instance, after mounting the controlfile and before opening the database, the System Monitor process (SMON) checks the file headers of all the datafiles and online redo log files. At this point, if there had been an instance failure, it is apparent because the file headers are all out of sync. So, SMON goes into the instance recovery routine, and the database is opened only after the roll-forward phase has completed.

Tuning the instance recovery process can (theoretically) be done by setting the RECOVERY_PARALLELISM instance parameter to specify the number of processes that will participate in the instance recovery operation. Setting this explicitly to 0 or 1 will disable parallel recovery. However, from release 10.1, leaving this parameter on the default will allow Oracle to launch whatever number of processes are appropriate. That actual number used is written out to the alert log. This is an excerpt from an alert log following a SHUTDOWN ABORT command:

Images

It should now be apparent that there is always enough information in the redo log stream to reconstruct all work done up to the point at which the crash occurred and furthermore that this includes reconstructing the undo information needed to roll back transactions that were in progress at the time of the crash. But for the final proof, consider the following scenario.

User JOHN has started a transaction. He has updated one row of a table with some new values, and his server process has copied the old values to an undo segment. But before these updates were done, his server process wrote out the changes to the log buffer. User ROOPESH has also started a transaction. Neither has committed; nothing has been written to disk. If the instance crashed now, there would be no record whatsoever of either transaction, not even in the redo logs. So, neither transaction would be recovered—but that is not a problem. Neither was committed, so they should not be recovered; uncommitted work must never be saved.

Then user JOHN commits his transaction. This triggers LGWR to flush the log buffer to the online redo log files, which means that the changes to both the table and the undo segments for both JOHN’s transaction and ROOPESH’s transaction are now in the redo log files, together with a commit record for JOHN’s transaction. Only when the write has completed is the “commit complete” message returned to JOHN’s user process. But there is still nothing in the datafiles. If the instance fails at this point, the roll forward phase will reconstruct both the transactions, but when all the redo has been processed, there will be no commit record for ROOPESH’s update; that signals SMON to roll back ROOPESH’s change but leave JOHN’s in place.

But what if DBWR has written some blocks to disk before the crash? It might be that JOHN (or another user) was continually requerying his data but that ROOPESH had made his uncommitted change and not looked at the data again. DBWn will therefore decide to write ROOPESH’s changes to disk in preference to JOHN’s; DBWn will always tend to write inactive blocks rather than active blocks. So now, the datafiles are storing ROOPESH’s uncommitted transaction but missing JOHN’s committed transaction. This is as bad a corruption as you can have. But think it through. If the instance crashes now—a power cut, perhaps, or a SHUTDOWN ABORT—the roll forward will still be able to sort out the mess. There will always be enough information in the redo stream to reconstruct committed changes; that is obvious, because a commit isn’t completed until the write is done. But because LGWR flushes all changes to all blocks to the log files, there will also be enough information to reconstruct the undo segment needed to roll back ROOPESH’s uncommitted transaction.

To summarize, because LGWR always writes ahead of DBWn and because it writes in real time on commit, there will always be enough information in the redo stream to reconstruct any committed changes that had not been written to the datafiles and to roll back any uncommitted changes that had been written to the data files. This instance recovery mechanism of redo and rollback makes it absolutely impossible to corrupt an Oracle database—so long as there has been no physical damage to the redo log.

Checkpoints and the Redo Log

Blocks in the buffer cache that have been changed are checkpointed to disk by the database writer (DBWn) process (or processes). The change vectors that were applied to these blocks will have already been written to the redo stream by the LGWR process. The DBWn writes according to a lazy algorithm, and the LGWR writes according to an aggressive algorithm that is near real time and is real time on COMMIT.

The Checkpointing Mechanism

The checkpoint position (the point in the redo stream from which instance recovery must start following a crash) is advanced automatically by the DBWn. This process is known as incremental checkpointing. In addition, there may be full checkpoints and partial checkpoints.

An incremental checkpoint is part of normal database activity. The DBWn processes decide there are sufficient blocks in the cache that have been updated (these are known as dirty buffers) and that it is time to write a few of them to disk. The algorithm to select which changed buffers to write is based on how long ago the change was made and how active the buffer is. In general, a buffer will be written only if it has been changed (there is no point in writing a buffer that has not been changed) and idle (there is no point in writing a buffer if it is continually being used). Never forget that there is no correlation between committing changes and writing blocks to disk and that DBWn writes only the minimum number of blocks necessary.

A full checkpoint occurs when all dirty buffers are written to disk. In normal running, there might be 100,000 dirty buffers in the cache, but the DBWn would write just a few hundred of them for the incremental checkpoint. For a full checkpoint, it will write the lot. This entails a great deal of work, including high CPU and disk usage while the checkpoint is in progress and reduced performance for user sessions. Full checkpoints are bad for business. Because of this, there will never be a full checkpoint except in two circumstances: an orderly shutdown and at the DBA’s request.

When the database is shut down with the NORMAL, IMMEDIATE, or TRANSACTIONAL option, there is a checkpoint; all dirty buffers are flushed to disk by the DBWn before the database is closed and dismounted. This means that when the database is opened again, no recovery will be needed. A clean shutdown is always desirable and is necessary before some operations (such as enabling archivelog mode). A full checkpoint can be signaled at any time with this command:

A partial checkpoint is necessary and occurs automatically as part of certain operations. Depending on the operation, the partial checkpoint will affect different buffers. These operations are detailed here:

Images

Protecting the Online Redo Log Files

Remember that an Oracle database requires at least two online log file groups to function so that it can switch between them. You may need to add more groups for performance reasons, but two are required. Each group consists of one or more members, which are the physical files. Only one member per group is required for Oracle to function, but at least two members per group are required for safety.

The one thing that a DBA is not allowed to do is to lose all copies of the current online log file group. If that happens, you will lose data. The only way to protect against data loss when you lose all members of the current group is to configure a Data Guard environment for zero data loss, which is not a trivial exercise. Why is it so critical that you not lose all members of the current group? Think about instance recovery. After a crash, SMON will use the contents of the current online log file group for roll-forward recovery to repair any corruptions in the database. If the current online log file group is not available, perhaps because it was not multiplexed and media damage has destroyed the one member, then SMON cannot do this. And if SMON cannot correct corruptions with roll forward, you cannot open the database.

If a member of a redo log file group is damaged or missing, the database will remain open if there is a surviving member. This contrasts with the controlfile, where damage to any copy will crash the database immediately. Similarly, groups can be added or removed and members of groups can be added or moved while the database is open, as long as there are always at least two groups and each group has at least one valid member.

If you create a database with the Database Configuration Assistant (DBCA), by default you will have three groups, but unless you instructed DBCA to use Oracle Managed Files and a Fast Recovery Area, they will have only one member each. Two views will tell you the state of your redo logs: V$LOG will have one row per group, and V$LOGFILE will have one row per log file member. Figure 22-1 shows an example of online redo log configuration.

Images

Figure 22-1 Online redo log file configuration

The first query shows that this database has three log file groups. The current group—the one LGWR is writing to at the moment—is group 2; the other groups are inactive, meaning first that the LGWR is not writing to them and second that in the event of an instance failure SMON would not require them for instance recovery. In other words, the checkpoint position has advanced into group 2. The SEQUENCE# column tells you that there have been 497 log switches since the database was created (or since the log sequence was reset using ALTER DATABASE OPEN RESETLOGS). This number is incremented with each log switch. The MEMBERS column shows that each group consists of only one member—and that is seriously bad news and should be corrected as soon as possible.

The second query shows the individual online redo log files. Each file is part of one group (identified by GROUP#, which is the join column back to V$LOG) and has a unique name. The STATUS column should always be null, as shown. If the member has not yet been used, typically because the database has only just been opened and no log switches have occurred, the status will be INVALID; this will be there only until the first log switch. If the status is persistently INVALID, you have a problem.

Then there is an ALTER SYSTEM SWITCH LOGFILE command to force a log switch. The log switch would happen automatically, eventually, if there were any DML statements in progress. The previous query shows that after the log switch, group 3 is now the current group that LGWR is writing to, at log switch sequence 498. The previously current group, group 2, has status ACTIVE. This means that it would still be needed by SMON for instance recovery if the instance failed now. In a short time, as the checkpoint position advances, it will become INACTIVE. Issuing an ALTER SYSTEM CHECKPOINT command would force the checkpoint position to come up to date, and group 2 would then become inactive immediately.

To protect the database against loss of data in the event of damage to an online redo log file group, multiplex it. Following the example in Figure 22-1, to add multiplexed copies to the online log, you would use a command such as this for each log file group:

Archivelog Mode and the Archiver Processes

Oracle guarantees that your database is never corrupted through use of the online redo log files to repair any corruptions caused by an instance failure. This is automatic and unavoidable. But to guarantee no loss of data following a media failure, it is necessary to have a record of all changes applied to the database since the last backup of the database; this is not enabled by default. The online redo log files are overwritten as log switches occur, so the history of change vectors is, by default, not kept—but the transition to archivelog mode ensures that no online redo log file is overwritten unless it has been copied to an archive log file first. So, there will be a series of archive log files that represent a complete history of all changes ever applied to the database. If a datafile is damaged at any time, it will then be possible to restore a backup of the datafile and apply the changes from the archive log redo stream to bring it up to date.

By default, a database is created in noarchivelog mode. This means that online redo log files are overwritten by log switches with no copy being made first. It is still impossible to corrupt the database, but data would be lost if the datafiles were damaged by media failure. Once the database is transitioned to archivelog mode, it is impossible to lose data—provided that all the archive log files generated since the last backup are available.

Once a database is converted to archivelog mode, a new background process will start automatically. This is the archiver process, ARCn. By default Oracle will start four of these processes, but you can have up to 30. In earlier releases of the database, it was necessary to start this process either by using a SQL*Plus command or by setting the initialization parameter LOG_ARCHIVE_START, but from release 10g onward Oracle automatically starts the archiver processes if the database is in archivelog mode.

The archiver processes will copy the current online redo log file to an archive log file after each log switch, thus generating a continuous chain of log files that can be used for recovering a backup. The name and location of these archive log files are controlled by initialization parameters. For safety, the archive log files can be multiplexed, just as the online log files can be multiplexed—but eventually they should be migrated to offline storage, such as a tape library. The Oracle instance takes care of creating the archive logs with the ARCn process, but the migration to tape must be controlled by the DBA, either by using operating system commands or by using the Recovery Manager utility RMAN.

The transition to archivelog mode can be done only while the database is in mount mode after a clean shutdown, and it must therefore be done by a user with a SYSDBA connection. It is also necessary to set the initialization parameters that control the names and locations of the archive logs generated. Clearly, these names must be unique or archive logs could be overwritten by other archive logs. To ensure unique filenames, it is possible to embed variables such as the log switch sequence number in the archive log filenames. These variables may be used to embed unique values in archive log filenames:

Images

The minimum archiving necessary to ensure that recovery from a restored backup will be possible is to set one archive destination. But for safety, it will usually be a requirement to multiplex the archive log files by specifying two or more destinations, ideally on different disks served by different controllers. It is possible to specify up to 31 archive destinations, giving you that many copies of each filled online redo log file. This is perhaps excessive for safety.

One archive destination? Good idea. Two destinations? Sure, why not? But 30?

The reason for so many possible destinations is distributed systems. For the purposes of this book and the OCP exam, an archive log destination will always be a directory on the machine hosting the database—and two destinations on local disks will usually be sufficient. But the destination can be an Oracle Net alias, specifying the address of a listener on a remote computer. This is the key to zero data loss: The redo stream can be shipped across the network to a remote database, where it can be applied to give a real-time backup. This is the Data Guard mechanism.

Backup and Recovery: Configuration

Configuring a database for recoverability means ensuring that certain critical files are multiplexed. These files are the online redo log files and the controlfile. Adjusting the online redo log file configuration is an online operation, whereas adjusting the controlfile configuration requires a shutdown/startup.

To determine the names and locations of the controlfile copies, query either a view or a parameter.

Images

The preceding queries show that this database is running with two copies of the controlfile.

Exercise 22-1: Investigate and Adjust the Redo Log Configuration In this exercise, you will investigate the configuration of the redo log.

1. Connect to your database as user SYSTEM using SQL*Plus.

2. Document the configuration of redo log with this query:

This will show the log file members, their status, their size, and the group to which they belong. If your database is the default database, it will have three groups each of one member, with a size of 50MB.

3. Determine the archivelog mode of the database and whether ARCn is running with these commands:

Note that the mode is an attribute of the database, but archiving is an attribute of the instance.

4. Add another member to each of the online log file groups. Choose any directory and filename suitable for your operating system. For example, following the situation described in Figure 22-1, the commands could be as follows:

Images

5. Run this query to check the status of your log file members:

Note that the new members will be INVALID.

6. Cycle through the log file groups by executing this command a few times:

7. Rerun the query from step 5. The status of each member should now have cleared. If it is still invalid, something is wrong. Look at the database instance alert log to diagnose what the problem might be.

Configure the Fast Recovery Area

The Fast Recovery Area is a disk destination used as the default location for recovery-related files. It is controlled with two instance parameters:

• db_recovery_file_dest

• db_recovery_file_dest_size

The first of these parameters nominates the location. This can be a file system directory or an Automatic Storage Management (ASM) disk group. It is possible for several databases to share a common destination; each database will have its own directory structure (created automatically) in the destination. The second parameter limits the amount of space in the destination that the database will occupy; it says nothing about how much space is actually available in the destination. The configuration and usage of the Fast Recovery Area are shown in two views:

• v$recovery_file_dest

• v$recovery_area_usage

The files that will be written to the Fast Recovery Area (unless specified otherwise) include the following:

• Recovery Manager backups

• Archive redo log files

• Database flashback logs

RMAN, the Recovery Manager, can manage space within the flash recovery area; it can delete files that are no longer needed according to its configured policies for retaining copies and backup of files. In an ideal situation, the Fast Recovery Area will be large enough to store a complete copy of the database, plus any archive logs and incremental backups that would be necessary to recover the copy plus multiplexed copies of the online redo log files and the controlfile.

The database backup routines should also include backing up the Fast Recovery Area to tape, thus implementing a strategy of primary, secondary, and tertiary storage.

• Primary storage is the live database, on disk.

• Secondary storage is a disk-based copy of the database plus files needed for fast recovery.

• Tertiary storage is long-term backups, usually in a tape library.

RMAN can manage the whole cycle, namely, backup of the database from primary to secondary and migration of backups from secondary to tertiary storage. Such a system can be implemented in a fashion that will allow near-instant recovery following a failure, combined with the ability to take the database back in time if this is ever necessary.

The Fast Recovery Area can be reconfigured at any time, without affecting any files within it. Changes will apply only to files created subsequently. Figure 22-2 shows how to determine the Fast Recovery Area configuration, followed by statements to adjust both its location and its size.

Images

Figure 22-2 Determining the Fast Recovery Area configuration

Configure ARCHIVELOG Mode

A database is, by default, created in noarchivelog mode. The transition to archivelog mode is straightforward, but it does require downtime. The process is as follows:

1. Shut down the database cleanly.

2. Start up in mount mode.

3. Issue the command ALTER DATABASE ARCHIVELOG;.

4. Open the database.

5. Perform a full backup.

Following a default installation, the archive logs will be written to only one destination, which will be the Fast Recovery Area. This is specified by an implicit setting for the LOG_ARCHIVE_DEST_1 parameter, visible in the V$ARCHIVE_DEST view. If the parameters that enable the Fast Recovery Area have not been set, they will go to a platform-specific destination (the $ORACLE_HOME/dbs directory for Unix systems). The final command in Figure 22-3, ARCHIVE LOG LIST, shows summary information about the archiving configuration, including that the database is running in archivelog mode, that the ARCn process is running, and that the archive log files are being written to the fast recovery area.

Images

Figure 22-3 Enabling and verifying archivelog mode with SQL*Plus

A full backup is an essential step for the transition to archivelog mode. Following the transition, all backups made earlier are useless. The backup can be made while the database is open or closed, but until it is made, the database is not protected at all.

Exercise 22-2: Enable Archivelog Mode In this exercise, you will transition your database into archivelog mode. This is an essential procedure for completing later chapters on backup and recovery.

1. Connect to your database with Database Control as user SYS. (Why SYS? Because you will have to stop and start the instance.)

2. Follow the steps shown in Figure 22-3 to enable archivelog mode.

3. Confirm that archiving is working by forcing a log switch and an archive.

4. Confirm that the archive log file has been generated in the fast recovery area.