11.1. Understanding Database Failure Types

There are six general categories for database-related failures. Understanding what category a failure belongs in will help you to more quickly understand the nature of the recovery effort you need to use to reverse the effects of the failure and maintain a high level of availability and performance in your database. The six general categories of failures are as follows:

Statement A single database operation fails, such as a DML (Data Manipulation Language) statement— INSERT, UPDATE, and so on.

User process A single database connection fails.

Network A network component between the client and the database server fails, and the session is disconnected from the database.

User error An error message is not generated, but the operation's result, such as dropping a table, is not what the user intended.

Instance The database instance fails unexpectedly.

Media One or more of the database files is lost, deleted, or corrupted.

In the next six sections, we'll provide details on these failure types and suggest some possible solutions for each failure type. For one particular type of failure, media failure, we'll provide more detailed solutions for recovery later in this chapter.

11.1.1. Statement Failures

Statement failures occur when a single database operation fails, such as a single INSERT statement or the creation of a table. In the list that follows are a few of the most common problems and their solutions when a statement fails.

Attempts to access tables without the appropriate privileges Provide the appropriate privileges or create views on the tables and grant privileges on the view.

Running out of space Add space to the tablespace, increase the user's quota on the tablespace, or enable resumable space allocation.

Logic errors in applications Work with developers to correct program errors or provide additional logic in the application to recover gracefully from unavoidable errors.

Although granting user privileges or additional quotas within a tablespace solves many of these problems, also consider whether there are any gaps in the user education process that might lead to some of these problems in the first place.

11.1.2. User Process Failures

The abnormal termination of a user session is categorized as a user process failure ; any uncommitted transaction must be cleaned up. The PMON (process monitor) background process periodically checks all user processes to ensure that the session is still connected. If the PMON finds a disconnected session, it rolls back the uncommitted transaction and releases all locks held by the disconnected process. Causes for user process failures typically fall into one of these categories:

  • A user closes their SQL*Plus window without logging out.

  • The workstation reboots suddenly before the application can be closed.

  • The application program causes an exception and closes before the application can be terminated normally.

  • A user process times out and Oracle disconnects the session.

A small percentage of user process failures is generally no cause for concern unless it becomes chronic; it may be a sign that user education is lacking—for example, training users to terminate the application gracefully before shutting down their workstation.

11.1.3. Network Failures

Depending on the locations of your workstation and your server, getting from your workstation to the server over the network might involve a number of hops: you might traverse several local switches and WAN routers to get to the database. From a network perspective, this configuration provides a number of points where failure can occur. These types of failures are called network failures.

In addition to hardware failures between the server and client, a listener process on the Oracle server can fail or the network card on the server itself can fail. To guard against these kinds of failures, you can provide redundant network paths from your clients to the server, as well as additional listener connections on the Oracle server and redundant network cards on the server.

11.1.4. User Error Failures

Even if all your redundant hardware is at peak performance, and your users have been trained to disconnect from their Oracle sessions properly, users can still inadvertently delete or modify data in tables or drop an index. This is known as a user error failure. Although these operations succeed from a statement point of view, they might not be logically correct: the DROP TABLE command worked fine, but you really didn't want to drop that table!

If data was inadvertently deleted from a table, and not yet committed, a ROLLBACK statement will undo the damage. If a COMMIT has already been performed, you have a number of options at your disposal, such as using data in the undo tablespace for a Flashback Query or using data in the archived and online redo logs with the LogMiner utility, available as a command-line or GUI interface.

You can recover a dropped table using Oracle's recycle bin functionality: a dropped table is stored in a special structure in the tablespace and is available for retrieval as long as the space occupied by the table in the tablespace is not needed for new objects. Even if the table is no longer in the tablespace's recycle bin, depending on the criticality of the dropped table, you can use either tablespace point in time recovery (TSPITR) or Flashback Database Recovery to recover the table, taking into consideration the potential data loss for other objects stored in the same tablespace for TSPITR or in the database if you use Flashback Database Recovery.

NOTE

TSPITR and Flashback Database Recovery are beyond the scope of this book but are covered in more detail in OCP: Oracle 10 g Administration II Study Guide (Sybex, 2005).

If the inadvertent changes are limited to a small number of tables that have few or no interdependencies with other database objects, Flashback Table functionality is most likely the right tool to bring back the table to a point of time in the past.

Later in this chapter, in the section "Performing Recovery Operations," we'll show you how to recover dropped tables from the recycle bin using Flashback Drop, how to retrieve deleted rows from a table using Flashback Query functionality, use Flashback Table to bring a table back to a point of time in the past along with its dependent objects, and use LogMiner to query online and archived redo logs for the previous state of modified rows.

11.1.5. Instance Failures

An instance failure occurs when the instance shuts down without synchronizing all the database files to the same system change number (SCN), requiring a recovery operation the next time the instance is started. Many of the reasons for an instance failure are out of your direct control; in these situations, you can minimize the impact of these failures by tuning instance recovery.

NOTE

We will show you how to tune instance recovery later in this chapter in the section "Tuning Instance Recovery."

Here are a few causes for instance failure:

  • A power outage

  • A server hardware failure

  • Failure of an Oracle background process

  • Emergency shutdown procedures (intentional power outage or SHUTDOWN ABORT)

In all these scenarios, the solution is easy: run the STARTUP command, and let Oracle automatically perform instance recovery using the online redo logs and undo data in the undo tablespace. If the cause of the instance failure is related to an Oracle background process failure, you can use the alert log and process-specific trace files to debug the problem. The EM Database Control makes it easy to review the contents of the alert log and any other alerts generated right before the point of failure.

11.1.6. Media Failures

Another type of failure that is somewhat out of your control is media failure. A media failure is any type of failure that results in the loss of one or more database files: datafiles, control files, or redo log files. Although the loss of other database-related files such as an init.ora file or a server parameter file (SPFILE) is of great concern, Oracle Corporation does not consider it a media failure. The database file can be lost or corrupted for a number of reasons:

  • Failure of a disk drive

  • Failure of a disk controller

  • Inadvertent deletion or corruption of a database file

Following the best practices defined in Chapter 10 by adequately mirroring control files, redo log files, and ensuring that full backups and their subsequent archived redo log files are available will keep you prepared for any type of media failure.

In the next section, we will show you how to recover from the loss of control files, datafiles, and redo log files.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset