The checklist for the capture of requirements is as follows:
- Ability to keep mission-critical systems at an uptime of X percent, for example, 99.2% per month. Mission-critical systems are defined as:
  - Ad placement (storefront, paper production, auto-publish, bulk upload)
  - Core search
  - Transactions
  - Ad management (renew)
  - Notifications
  - Lead generation
- Ability to keep non-mission-critical systems at an uptime of X percent, for example, 98.3% per month.
- Ability to schedule downtime for maintenance on the site and/or components/systems
- Have the key stakeholders signed off on the availability requirements?
- Are the requirements driven by business goals and needs?
- Can requirements be met by the selected software and hardware platform?
- Are the strategies for business continuity and disaster recovery defined?
- Do stakeholders have realistic expectations pertaining to unplanned downtime?
- Do requirements consider different classes of service?
- Do requirements strike a balance between business goals and cost?
- Do requirements take into account future business needs, such as moving to a longer business day?
- What is the business/financial cost of a one-minute outage?
- How much of an outage is acceptable?
- Are scheduled outages tolerable?
- Do availability requirements consider both batch and online availability?
- Do requirements take into account periodic variations, such as period-end processing?
- What are industry and domain benchmarks for availability?
- What are end users' availability expectations based on surveys?
- What is the maximum time the system can be unavailable before the business is severely impacted? This helps in determining the recovery time objective (RTO).
- What is the acceptable data loss? This helps in determining the recovery point objective (RPO).
- Are there analyst recommendations for application availability?
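The uptime percentages in the checklist above translate directly into a monthly downtime budget, which can make the questions about acceptable outages and outage cost more concrete. The following is a minimal sketch, assuming a 30-day month and a hypothetical cost of 500 currency units per minute of outage; neither figure comes from the checklist, and the function names are illustrative only.

```python
def downtime_budget_minutes(uptime_percent: float, days_in_period: int = 30) -> float:
    """Return the downtime, in minutes, that an uptime target allows over a period.

    days_in_period defaults to 30, an assumed month length.
    """
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    total_minutes = days_in_period * 24 * 60
    return total_minutes * (100 - uptime_percent) / 100


def outage_cost(minutes_down: float, cost_per_minute: float) -> float:
    """Return the cost of an outage, assuming cost grows linearly with duration."""
    return minutes_down * cost_per_minute


# Uptime targets taken from the checklist examples.
critical = downtime_budget_minutes(99.2)
non_critical = downtime_budget_minutes(98.3)

# Hypothetical figure, for illustration only.
COST_PER_MINUTE = 500.0

print(f"Mission-critical budget: {critical:.1f} minutes per month")          # 345.6
print(f"Non-mission-critical budget: {non_critical:.1f} minutes per month")  # 734.4
print(f"Cost of using the full critical budget: {outage_cost(critical, COST_PER_MINUTE):.0f}")
```

Under these assumptions, a 99.2% target leaves roughly 5.8 hours of downtime per month, and a 98.3% target leaves roughly 12.2 hours. Whether scheduled maintenance counts against that budget is a decision the requirements need to state explicitly.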
The checklist for architecture definition is as follows:
- Is emphasis given to restoring data from incomplete or corrupt backups?
- Does the application respond gracefully to faults, including logging and reporting them appropriately?
- Does the solution establish the time needed to recover from failure?
- Does the backup technique provide transactional integrity of restored data?
- Does the target architecture meet the availability requirements?
- Does the backup technique support online backup, with an acceptable degradation in performance?
- Is a standby site defined, if appropriate? Is the standby site identical to the primary site, or does it run with reduced performance?
- Have you established the mechanisms to switch from production to the standby site?
- Is the impact of the availability solution on performance assessed and acceptable?
- Is the architecture evaluated for bottlenecks, single points of failure, and other weaknesses?
- Does the fault-tolerant model extend to all vulnerable entities in the landscape?
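When checking whether the target architecture meets the availability requirements and where its single points of failure lie, a rough end-to-end estimate can help. The sketch below uses the standard formulas for components in series (all must be up) and redundant components (any one suffices). It assumes component failures are independent, which real systems often violate, and the tier names and availability figures are hypothetical.

```python
from math import prod


def serial_availability(availabilities):
    """Availability of a chain in which every component must be up."""
    return prod(availabilities)


def redundant_availability(availabilities):
    """Availability of a redundant group in which any one component suffices."""
    return 1 - prod(1 - a for a in availabilities)


# Hypothetical per-tier availabilities, for illustration only.
WEB, APP, DB = 0.999, 0.999, 0.995

# A single database is a single point of failure for the whole chain.
single_db = serial_availability([WEB, APP, DB])

# Two redundant databases, assuming independent failures and perfect failover.
clustered_db = serial_availability([WEB, APP, redundant_availability([DB, DB])])

print(f"Single database:    {single_db:.2%}")     # 99.30%
print(f"Redundant database: {clustered_db:.2%}")  # 99.80%
```

The estimate is optimistic because it ignores failover time and correlated failures, so it is better suited to comparing design options than to predicting actual uptime.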