2 Event management

2.1 PURPOSE AND OBJECTIVES (SO 4.1.1)

The purpose of event management is to manage events throughout their lifecycle. This lifecycle consists of detecting events, making sense of them and determining the appropriate control action, all of which are coordinated by the event management process.

Event management is therefore the basis for operational monitoring and control. If events are programmed to communicate operational information as well as warnings and exceptions, they can be used as a basis for automating many routine operations management activities.

The objectives of the event management process are to:

  • Detect all changes of state that have significance for the management of a configuration item (CI) or IT service
  • Determine the appropriate action for events and ensure communication to the appropriate functions
  • Provide the trigger for the execution of many processes and operations management activities
  • Provide comparison of actual operating performance against design standards and service level agreements (SLAs)
  • Provide a basis for service assurance, reporting and service improvement.

2.2 SCOPE (SO 4.1.2)

Event management can be applied to any aspect of service management that needs to be controlled and can be automated. This includes:

  • Configuration items (CIs): monitoring of CIs to confirm they remain in a required state or automating frequent changing of a CI state, and updating the configuration management system (CMS) accordingly
  • Environmental conditions
  • Software licence monitoring to ensure optimum and legal licence utilization and allocation
  • Security
  • Normal activity such as tracking usage or performance.

Event management and monitoring are closely related but different. Event management generates and detects specific notifications for monitoring, whereas monitoring detects and tracks these notifications but also actively monitors conditions that do not generate events; for example, to check that devices are operating within acceptable limits.

2.3 VALUE TO THE BUSINESS AND SERVICE LIFECYCLE (SO 4.1.3)

Event management typically provides indirect value to the business, which can be determined on the basis of the following:

  • Early detection of incidents, often leading to assignment for resolution prior to any actual service outage
  • Enabling automated activities to be managed by exception, reducing both downtime and the need for costly real-time monitoring
  • Integration into other service management processes, which can enable detection and notification of status changes or exceptions, triggering an early response and improving process performance
  • Automated operations, which increase efficiency and reduce the need for expensive human resources.

2.4 POLICIES, PRINCIPLES AND BASIC CONCEPTS (SO 4.1.4)

Examples of event management policies might include:

  • Event notifications should go only to those responsible for handling the resulting actions, with event routing information being constantly maintained
  • Event management and support should be centralized as much as reasonably possible
  • Changes and additions to the rule base will need to be under the control of change management
  • Common messaging and logging standards and protocols should be used
  • Event handling actions should be automated wherever possible
  • There should be a standard classification scheme referencing common handling and escalation processes. Notification of incidents and problems should be aligned to the organization’s existing categorization and prioritization policies
  • All recognized events should be captured and logged, available for data manipulation, filtering and reporting to support incident and problem diagnosis activities.

2.4.1 Types of event

There are three types of event:

  • Informational: an event that requires no action and is usually logged and retained for an agreed time. It is typically used in regular operations to check the status of a device or service, or to confirm that an activity has been completed (e.g. a notification that a scheduled task has finished or a user has logged in). This type of event can be used to generate activity statistics
  • Warning: an unusual but not exceptional operation, usually raised when a device is approaching a threshold, indicating that closer monitoring or checking is required. This type of situation may either resolve itself or require operator intervention; for example, if the completion time for a transaction is 10% longer than normal. The rules or policies for warnings are defined in the monitoring and control objectives for the device or service
  • Exception: a service or device is operating abnormally and action is required; for example, an attempted logon with the incorrect password, a device with an unacceptable utilization rate, or a service that is down. This may mean an SLA and an operational level agreement (OLA) have been breached and the business impacted.
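The three event types can be illustrated with a minimal sketch based on the transaction-time example above. The function name and the 50% exception threshold are assumptions made for illustration; only the 10% warning figure comes from the text, and real thresholds would be taken from the monitoring and control objectives for the CI.

```python
from enum import Enum


class EventType(Enum):
    INFORMATIONAL = "informational"
    WARNING = "warning"
    EXCEPTION = "exception"


def classify_transaction_time(observed_s: float, normal_s: float,
                              warning_ratio: float = 1.10,
                              exception_ratio: float = 1.50) -> EventType:
    """Classify a transaction completion time against its normal value.

    warning_ratio mirrors the "10% longer than normal" example;
    exception_ratio is a hypothetical value chosen for this sketch.
    """
    ratio = observed_s / normal_s
    if ratio >= exception_ratio:
        return EventType.EXCEPTION
    if ratio >= warning_ratio:
        return EventType.WARNING
    return EventType.INFORMATIONAL
```

With these assumed thresholds, a transaction that normally takes 2.0 seconds would raise a warning at 2.3 seconds and an exception at 4.0 seconds.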

2.4.2 Filtering of events

Filtering enables focused management and control of significant events. Strategies for filtering include:

  • Integrating filtering into service management processes
  • Designing new services that consider event management
  • Formally evaluating the effectiveness of filtering
  • Planning for the deployment of event management across the entire IT infrastructure.

2.4.3 Designing for event management (SO 4.1.4.3)

Design for event management should take place during the design of a service supported by service operations functions. Event management is the basis for monitoring service performance and availability against targets agreed during the availability and capacity management processes. The designed events should then be tested and evaluated during service transition.

Once event management has been deployed, day-to-day operations may identify additional events and other improvements through continual improvement.

Key design considerations include:

  • What needs to be monitored?
  • What type of monitoring is required?
  • When should an event be generated?
  • What information needs communicating?
  • Who are the messages intended for?
  • Who will be responsible for handling the event?

Specific design areas include:

  • Instrumentation: defining and designing how to monitor and control the IT infrastructure and services. Mechanisms to be designed include event generation, classification, communication, escalation, logging and storage
  • Error messaging: providing meaningful error messages and codes within software applications for inclusion in events
  • Event detection and alert mechanisms: designing and populating tools with the criteria and rules to filter, correlate and escalate events. Design of event detection and alert mechanisms requires knowledge of:
    • Business processes and service level requirements
    • Who is going to support the CI and what they need to know to support the event and diagnose problems
    • Normal and abnormal operation levels and the significance of repeat events
    • CI dependencies and relationships
    • Significance of multiple similar events.

2.4.4 Event rule sets and correlation engines (SO 4.1.4.4)

A rule set consists of a number of rules that define how the event messages for a specific event will be processed and evaluated. The rules are typically embedded into monitoring and event handling technologies, consisting of algorithms which correlate events that have been generated (e.g. CI state changes) to create logical additional events that need to be communicated (e.g. service or business impact events). These algorithms can be coded into event management tools referred to as correlation engines.
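As a rough illustration of how correlated CI state changes can create an additional, logical event, the sketch below derives a service impact event from CI failure events. The dependency map, CI names and event kinds are all hypothetical; a real correlation engine would take dependencies from the CMS and apply a much richer rule set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    source: str  # CI or service the event concerns
    kind: str    # e.g. "ci_down", "service_degraded"


# Hypothetical dependency map: service -> CIs it relies on.
SERVICE_DEPENDENCIES = {
    "online-ordering": {"web-01", "web-02"},
}


def correlate(ci_events: list[Event]) -> list[Event]:
    """Derive service impact events from CI state-change events."""
    down = {e.source for e in ci_events if e.kind == "ci_down"}
    derived = []
    for service, cis in sorted(SERVICE_DEPENDENCIES.items()):
        failed = cis & down
        if failed == cis:
            # Every supporting CI has failed: the service itself is down.
            derived.append(Event(service, "service_down"))
        elif failed:
            # Some, but not all, supporting CIs have failed.
            derived.append(Event(service, "service_degraded"))
    return derived
```

In this sketch the loss of one web server produces a "service_degraded" event, while the loss of both produces "service_down".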

2.5 PROCESS ACTIVITIES, METHODS AND TECHNIQUES (SO 4.1.5)

2.5.1 Event occurrence

Events occur continuously but not all are detected or registered. It is therefore important that those which need to be detected are understood so they can be appropriately designed and managed (see Figure 2.1).

2.5.2 Event notification

There are two ways in which notification of events can take place:

  • A management tool interrogates devices to collect data, i.e. polling
  • CIs generate notifications under predefined conditions that were designed and built into the CI.

Service design should define which events need to be generated and, for each CI, specify how this should be done. During service transition, event generation is set up and tested. In many organizations, a standard set of events is used initially and tuned over time.
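The two notification styles can be contrasted with a small simulation. The `Device` class, the disk-usage condition and the message format are assumptions for illustration only; a real CI would expose its state through an agent or a management protocol.

```python
from typing import Callable, Optional


class Device:
    """Simulated CI with a predefined disk-usage condition."""

    def __init__(self, name: str, disk_limit_pct: float,
                 notify: Optional[Callable[[str], None]] = None):
        self.name = name
        self.disk_limit_pct = disk_limit_pct
        self.disk_used_pct = 0.0
        self._notify = notify  # callback used for CI-generated notifications

    def set_disk_used(self, pct: float) -> None:
        self.disk_used_pct = pct
        # CI-generated notification: raised by the CI itself when the
        # predefined condition is met.
        if self._notify and pct >= self.disk_limit_pct:
            self._notify(f"{self.name}: disk usage {pct:.0f}%")


def poll(devices: list[Device]) -> list[str]:
    """Polling: the management tool interrogates each device."""
    return [f"{d.name}: disk usage {d.disk_used_pct:.0f}%"
            for d in devices if d.disk_used_pct >= d.disk_limit_pct]
```

A device created without a `notify` callback stays silent when its limit is exceeded, so its condition is only discovered the next time `poll` runs.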

Decision-making about events is much easier when meaningful data, targeted for a specific audience, is included in event notifications.

2.5.3 Event detection

Event notifications are detected by an agent running on the same system or by a management tool.

2.5.4 Event logging

All events should be recorded, either as an event record or as an entry in the system log. Where system logs are used, they need to be checked regularly, with clear instructions on any action required. Event management procedures need to define how long events and logs are kept before being archived.
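A retention rule of this kind might be applied as in the sketch below. The 90-day retention period, the `logged_at` field name and the function name are assumptions; the actual period is whatever the event management procedures define.

```python
from datetime import datetime, timedelta


def split_for_archive(records, now, retention=timedelta(days=90)):
    """Partition event records into (kept, to_archive) by age.

    Each record is a dict with a 'logged_at' datetime. The 90-day
    default is a hypothetical retention period.
    """
    cutoff = now - retention
    kept = [r for r in records if r["logged_at"] >= cutoff]
    to_archive = [r for r in records if r["logged_at"] < cutoff]
    return kept, to_archive
```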

2.5.5 First-level correlation and filtering

This stage determines whether to communicate an event or ignore it. Ignored events are typically logged with no further action.

Filtering eliminates duplicates and unwanted events that cannot be disabled.

Filtering undertakes the initial level of ‘correlation’, i.e. an assessment of whether the event is informational, a warning or an exception. Filtering is not always necessary; for some CIs every event is significant and events go straight to event correlation.
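A minimal sketch of first-level filtering follows, assuming that events arrive in time order as (timestamp, source, kind) tuples. The 60-second duplicate window and the names used are illustrative assumptions. Ignored events are returned rather than discarded so that they can still be logged.

```python
def first_level_filter(events, unwanted_kinds, window_s=60):
    """Split events into (forwarded, ignored).

    events: iterable of (timestamp_s, source, kind) tuples in time order.
    unwanted_kinds: event kinds that cannot be disabled at source but
    should not be communicated further.
    """
    last_forwarded = {}  # (source, kind) -> timestamp of last forwarded event
    forwarded, ignored = [], []
    for ts, source, kind in events:
        key = (source, kind)
        previous = last_forwarded.get(key)
        if kind in unwanted_kinds:
            ignored.append((ts, source, kind))
        elif previous is not None and ts - previous < window_s:
            # Duplicate of an event already forwarded inside the window.
            ignored.append((ts, source, kind))
        else:
            last_forwarded[key] = ts
            forwarded.append((ts, source, kind))
    return forwarded, ignored
```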

2.5.6 Event significance

Events need to be categorized: recommended categories are ‘informational’, ‘warning’ and ‘exception’, as described in section 2.4.1.

2.5.7 Second-level event correlation

The meaning of the event is normally determined by the correlation engine which compares the event with a set of criteria (called business rules) in a predefined order to establish the level and type of business impact. The correlation engine is programmed in line with the performance standards defined during service design, plus any additional guidance specific to the operating environment, such as the number of similar events or a comparison of utilization information in the event of reaching minimum or maximum thresholds.
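Evaluating business rules in a predefined order can be sketched as a first-match rule list. The rule names, event fields, thresholds and impact levels below are hypothetical; they only illustrate the ordering principle and the use of similar-event counts and utilization thresholds mentioned above.

```python
from typing import Callable, NamedTuple


class Rule(NamedTuple):
    name: str
    matches: Callable[[dict], bool]
    impact: str


# Rules are evaluated top to bottom; the first match determines the impact.
RULES = [
    Rule("repeated failures", lambda e: e.get("similar_events", 0) >= 3, "high"),
    Rule("max threshold reached", lambda e: e.get("utilization_pct", 0) >= 95, "medium"),
    Rule("min threshold reached", lambda e: e.get("utilization_pct", 100) <= 5, "low"),
]


def assess(event: dict, rules=RULES):
    """Return (rule name, impact) for the first matching rule, else None."""
    for rule in rules:
        if rule.matches(event):
            return rule.name, rule.impact
    return None
```

Because the first match wins, an event that satisfies several rules is assessed by the one placed highest in the list, so the order of the rules is itself part of the design.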

2.5.8 Action and response selection

If the correlation activity recognizes an event, a response is required. The action initiates the appropriate response.

At this point one or more responses can be chosen in any combination. Figure 2.1 shows some options. Options include:

  • Auto-responses generated for defined events where the response will initiate the action and then evaluate whether it was completed successfully, such as when rebooting a device
  • An alert raised for human intervention, containing all the information necessary for the person to determine the appropriate action
  • Incident, problem and/or change records generated:
    • Incident records can be generated immediately when an exception is detected or as determined by the correlation engine, including as much information about the event as possible
    • Problem records are typically updated to link an incident to an existing problem
    • Requests for change (RFCs) can be generated immediately when an exception is detected or when correlation identifies that a change is needed.
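The way responses can be combined might look like the following sketch. The event fields (`auto_response`, `needs_human`, `known_problem_id`, `change_needed`) and the response strings are assumptions made for illustration, not fields of any particular tool.

```python
def select_responses(event: dict) -> list[str]:
    """Choose one or more responses for a correlated event."""
    responses = []
    if event.get("auto_response"):
        responses.append(f"auto-response: {event['auto_response']}")
    if event.get("needs_human"):
        responses.append("alert: notify operator")
    if event.get("type") == "exception":
        responses.append("incident: open record")
        if event.get("known_problem_id"):
            # Link the new incident to an existing problem record.
            responses.append(
                f"problem: link incident to {event['known_problem_id']}")
    if event.get("change_needed"):
        responses.append("change: raise RFC")
    return responses
```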

2.5.9 Event review

Because of the high volumes involved, not all events can be formally reviewed. However, significant events or exceptions do need to be reviewed and trends monitored. Reviews should not duplicate any other reviews done as part of other processes such as change management. They should check that events have been handled properly and that the handover between event management and the other processes is effective.

Event reviews also provide input into continual improvement, and the evaluation and audit of the event management process.

2.5.10 Event closure

Most events are not ‘opened’ or ‘closed’; informational events are only logged. Auto-response events are typically closed by the generation of a second event, triggered on completion of the action initiated. Events linked to incidents, problems or changes are formally closed with a link to the relevant record from the other process.
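The two closure paths can be sketched with a small tracker. The class, its method names and the record reference format (such as "INC-1001") are hypothetical and serve only to show the difference between closure by a completion event and closure by a link to another process's record.

```python
class EventTracker:
    """Tracks only those events that are actually opened and closed."""

    def __init__(self):
        self.open = {}    # event id -> description
        self.closed = {}  # event id -> closure note

    def opened(self, event_id: str, description: str) -> None:
        self.open[event_id] = description

    def completion_event(self, event_id: str) -> None:
        # A second event, generated when the automated action completes,
        # closes the original auto-response event.
        self.open.pop(event_id)
        self.closed[event_id] = "closed by completion event"

    def close_with_link(self, event_id: str, record_ref: str) -> None:
        # Events handled through another process are closed with a
        # reference to that process's record.
        self.open.pop(event_id)
        self.closed[event_id] = f"closed, linked to {record_ref}"
```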

2.6 TRIGGERS, INPUTS, OUTPUTS AND INTERFACES (SO 4.1.6)

Triggers include:

  • Exceptions to any level of CI performance defined in the design specifications, OLAs or standard operating procedures
  • Exceptions to an automated procedure or process
  • An exception within a business process monitored by event management
  • Completion of an automated task or job
  • A status change in a server or database CI
  • Access of an application or database by a user or automated procedure or job
  • A predefined threshold is reached; for example, by a device, database or application.

Inputs include:

  • Operational and service level requirements associated with events
  • Alarms, alerts and thresholds for recognizing events
  • Event correlation tables, rules, event codes and automated response solutions
  • Roles and responsibilities for recognizing and communicating events
  • Operational procedures for recognizing, logging, escalating and communicating events.

Outputs include:

  • Event communications and escalations
  • Event logs
  • Events that indicate an incident has occurred
  • Events that indicate the potential breach of an SLA or OLA objective
  • Events and alerts that indicate completion status of deployment, operational or other support activities
  • A service knowledge management system (SKMS) populated with event information and history.

Event management can interface with any process that requires monitoring and control. Examples of interfacing include:

  • Service level management (SLM): ensures that any event with potential impact on SLAs is detected early and any failures are rectified as soon as possible
  • Information security management: allows potentially significant business security events to be detected and acted upon
  • Capacity and availability management: defines significant events, thresholds and responses for event management to monitor, detect and respond to when they occur. Event management should also produce reports on patterns of events and potential areas of improvement
  • Service asset and configuration management: uses events to determine the current status of CIs in the infrastructure
  • Knowledge management: processes events for inclusion in knowledge management systems. For example, patterns of performance can be correlated with business activity and used as input into future design and strategy decisions
  • Change management: interfaces with event management to identify conditions that may require a response or action
  • Incident and problem management: requires information on events that may require a response or action to resolve incidents and problems
  • Access management: events can be used to detect unauthorized access attempts and security breaches.

2.7 INFORMATION MANAGEMENT (SO 4.1.7)

The following information is used in event management:

  • Simple network management protocol (SNMP) messages: a standard way of communicating technical information about the status of components of an IT infrastructure
  • Management information bases (MIBs) of IT devices: an MIB is the database on each device that contains information about that device, including, for example, its operating system and configuration of system parameters. The ability to interrogate MIBs and compare them to a norm is critical to being able to generate events
  • Vendor’s monitoring software
  • Correlation engines containing detailed rules to determine the significance and appropriate response to events
  • Event records for all types of event: the format and content depend on the tool being used, but typically include the device, component, type of failure, date and time, parameters in exception, unique identifier and value.
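The typical content of an event record listed above can be sketched as a simple data structure. The class and field names are assumptions, since the actual format depends on the tool in use.

```python
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class EventRecord:
    device: str
    component: str
    failure_type: str
    parameter: str          # the parameter in exception
    value: float            # the value observed for that parameter
    occurred_at: datetime   # date and time of the event
    # Unique identifier, generated automatically if not supplied.
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))


record = EventRecord(
    device="srv-01",
    component="disk0",
    failure_type="threshold exceeded",
    parameter="used_pct",
    value=97.5,
    occurred_at=datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc),
)
```

Calling `asdict(record)` yields a plain dictionary that could be written to an event log or passed to a reporting tool.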

2.8 CRITICAL SUCCESS FACTORS AND KEY PERFORMANCE INDICATORS (SO 4.1.8)

The efficiency and effectiveness of the process can be measured by identifying critical success factors (CSFs) for the process, each CSF being supported by key performance indicators (KPIs):

  • CSF: Detecting all changes of state that have significance for the management of CIs and IT services:
    • KPI: Number and ratio of events compared with the number of incidents
    • KPI: Number and percentage of each type of event per platform or application versus total number of platforms and applications underpinning live IT services (looking to identify IT services that may be at risk for lack of capability to detect their events)
  • CSF: Ensuring all events are communicated to the appropriate functions that need to be informed or take further control actions:
    • KPI: Number and percentage of events that required human intervention and whether this was performed
    • KPI: Number of incidents that occurred and percentage of these that were triggered without a corresponding event
  • CSF: Providing the trigger, or entry point, for the execution of many service operation processes and operations management activities:
    • KPI: Number and percentage of events that required human intervention and whether this was performed
  • CSF: Providing the means to compare actual operating performance and behaviour against design standards and SLAs:
    • KPI: Number and percentage of incidents that were resolved without impact to the business (indicates the overall effectiveness of the event management process and underpinning solutions)
    • KPI: Number and percentage of events that resulted in incidents or changes
    • KPI: Number and percentage of events caused by existing problems or known errors (this may result in a change to the priority of work on that problem or known error)
    • KPI: Number and percentage of events indicating performance issues (for example, growth in the number of times an application exceeded its transaction thresholds over the past six months)
    • KPI: Number and percentage of events indicating potential availability issues (e.g. failovers to alternative devices, or excessive workload swapping)
  • CSF: Providing a basis for service assurance, reporting and service improvement:
    • KPI: Number and percentage of repeated or duplicated events (this will help in the tuning of the correlation engine to eliminate unnecessary event generation and can also be used to assist in the design of better event generation functionality in new services)
    • KPI: Number of events/alerts generated without actual degradation of service/functionality (false positives – indication of the accuracy of the instrumentation parameters, important for continual service improvement).
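Several of these KPIs are simple counts and percentages over the event log. The sketch below computes a few of them, assuming each event is a dict carrying hypothetical boolean flags (`human_intervention`, `raised_incident`, `false_positive`); the flag names and the result keys are illustrative only.

```python
def percentage(part: int, total: int) -> float:
    """Percentage of part in total, or 0.0 when there is nothing to count."""
    return 0.0 if total == 0 else 100.0 * part / total


def event_kpis(events: list[dict], incident_count: int) -> dict:
    """Compute a handful of event management KPIs from logged events."""
    total = len(events)
    human = sum(1 for e in events if e.get("human_intervention"))
    to_incident = sum(1 for e in events if e.get("raised_incident"))
    false_pos = sum(1 for e in events if e.get("false_positive"))
    return {
        "events_per_incident": (
            None if incident_count == 0 else total / incident_count),
        "pct_human_intervention": percentage(human, total),
        "pct_resulting_in_incident": percentage(to_incident, total),
        "false_positive_count": false_pos,
    }
```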

2.9 CHALLENGES AND RISKS (SO 4.1.9)

Challenges include the following:

  • Difficulty of obtaining funding for the necessary tools and effort needed to install them and exploit their benefits
  • Setting the correct level of filtering
  • Difficulty and high cost of deploying the necessary monitoring agents across the entire IT infrastructure
  • The additional network traffic generated by automated monitoring activities might have negative impacts on the planned capacity levels of the network
  • The time and high cost of acquiring the necessary skills
  • Setting up the necessary processes in order to deploy the event management tools.

Risks include:

  • Failure to obtain adequate funding
  • Setting an incorrect level of filtering
  • Failure to maintain momentum in deploying monitoring agents across the IT infrastructure.

2.10 ROLES AND RESPONSIBILITIES (SO 6.7.8)

2.10.1 Event management process owner

Responsibilities include:

  • Carrying out the generic process owner role for the event management process (see section 1.5 for more detail)
  • Planning and managing support for event management tools and processes
  • Working with other process owners to ensure an integrated approach to the design and implementation of event, incident, request fulfilment, access and problem management.

2.10.2 Event management process manager

Responsibilities include:

  • Carrying out the generic process manager role for the event management process (see section 1.5 for more detail)
  • Planning and managing support for event management tools and processes
  • Coordinating interfaces between event management and other service management processes.

2.10.3 Other event management roles

The service desk is not typically involved in event management, unless an event requires some response that is within the scope of the service desk’s defined activity, such as notifying a user that a report is ready. Generally, this type of activity is performed by the operations bridge, unless the service desk and operations bridge have been combined.

However, for events identified as incidents the service desk is responsible for:

  • Investigating and resolving events identified as incidents, and then escalating them to the appropriate service operation team
  • Communication of information about this type of incident to the relevant technical or application management team and, where appropriate, the user.

Technical and application management staff play several important roles:

  • During service design: participation in designing the warranty aspects of the service such as classifying events, updating correlation engines, or ensuring that any auto-responses are defined
  • During service transition: testing the service to ensure that events are properly generated and that the defined responses are appropriate
  • During service operation: performing event management for the systems under their control; dealing with incidents and problems related to events
  • If event management activities are delegated, ensuring that the staff are adequately trained and that they have access to the appropriate tools to enable them to perform these tasks.

IT operations management staff fulfil the following roles:

  • Event monitoring and first-line response may be delegated to IT operations management
  • Event monitoring is commonly delegated to the operations bridge where it exists. The operations bridge can coordinate or perform the responses required or provide first-level support.