06 Performance and Availability Management
Performance Management
Performance Management is the process of managing the performance of IT services to ensure that they meet the agreed service levels. Performance Management includes the following activities:
- Monitoring: Monitoring the performance of IT services to identify performance issues and bottlenecks.
- Analysis: Analyzing the performance data to identify the root cause of performance issues.
- Optimization: Tuning IT services to improve their performance and remove bottlenecks.
- Reporting: Reporting on the performance of IT services to stakeholders and management.
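As a minimal sketch of how monitoring, analysis, and reporting fit together, the following Python snippet checks measured response times against an agreed threshold. The function names, the choice of the 95th percentile as the metric, and the threshold values are illustrative assumptions, not part of any particular service level agreement.

```python
from statistics import quantiles


def p95(samples_ms):
    """Return the 95th percentile of response-time samples (milliseconds)."""
    # quantiles(..., n=100) returns 99 cut points; index 94 is the 95th percentile.
    return quantiles(samples_ms, n=100, method="inclusive")[94]


def sla_report(samples_ms, threshold_ms):
    """Summarize whether the measured p95 stays within the agreed threshold."""
    observed = p95(samples_ms)
    return {
        "p95_ms": observed,
        "threshold_ms": threshold_ms,
        "meets_sla": observed <= threshold_ms,
    }


if __name__ == "__main__":
    # Hypothetical monitoring data: 100 samples between 1 ms and 100 ms.
    samples = list(range(1, 101))
    print(sla_report(samples, threshold_ms=200))
```

A report like this covers the monitoring and reporting activities; finding the root cause of a missed threshold remains a separate analysis step.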
Performance Tuning
Performance tuning is the process of adjusting IT services to improve their performance and reduce bottlenecks.
This can include generic activities, such as deactivating unnecessary services or defragmenting hard drives. It also includes operating-system-specific activities, such as cleaning registry entries or enlarging page files on Windows, or optimizing kernel parameters on Linux.
Adding more resources or clustering/parallelizing services can also be part of performance tuning, but it comes at a higher cost.
Capacity Planning
Capacity Planning is the process of planning the capacity of IT services to ensure that they meet the current and future demand. Capacity planning has the following benefits for an organization:
- Cost Reduction: Organizations reduce costs by avoiding over-provisioning.
- Improved Performance: Sufficient capacity avoids bottlenecks and thereby improves performance.
- Shared Understanding: All stakeholders have a shared understanding of the capacity requirements.
- Investment: Investments in IT services are aligned with business requirements.
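To illustrate planning for future demand, the sketch below estimates how many months remain until current usage reaches the available capacity. It assumes a constant compound monthly growth rate, which is a simplification; the function name and the example numbers are hypothetical.

```python
import math


def months_until_exhausted(current_usage, capacity, monthly_growth_rate):
    """Estimate the months until usage reaches capacity.

    Assumes compound growth: usage * (1 + rate) ** months.
    Returns 0 if capacity is already reached and None if usage does not grow.
    """
    if current_usage <= 0 or capacity <= 0:
        raise ValueError("usage and capacity must be positive")
    if current_usage >= capacity:
        return 0
    if monthly_growth_rate <= 0:
        return None
    # Solve usage * (1 + rate) ** m >= capacity for m and round up.
    months = math.log(capacity / current_usage) / math.log(1 + monthly_growth_rate)
    return math.ceil(months)


if __name__ == "__main__":
    # Hypothetical example: 500 GB used of 1000 GB, growing 5 % per month.
    print(months_until_exhausted(500, 1000, 0.05))
```

An estimate like this gives stakeholders a concrete date to discuss, which supports both the shared understanding and the investment planning listed above.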
Availability Management
Availability Management is the process of managing the availability of IT services to ensure that they meet the agreed service levels.
Availability can be ensured through two kinds of measures:
- proactive: redundancy, failover, clustering, monitoring, …
- reactive: incident management, problem management, …
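Agreed service levels for availability are often stated as a percentage. The following sketch converts such a percentage into the downtime it permits and computes the measured availability from uptime and downtime. The function names and the default period of one non-leap year (8760 hours) are assumptions made for this example.

```python
def allowed_downtime_hours(availability_percent, period_hours=365 * 24):
    """Return the downtime (in hours) permitted by an availability target."""
    return period_hours * (1 - availability_percent / 100)


def measured_availability(uptime_hours, downtime_hours):
    """Return the achieved availability in percent."""
    return 100 * uptime_hours / (uptime_hours + downtime_hours)


if __name__ == "__main__":
    # A 99.9 % target over one year leaves roughly 8.76 hours of downtime.
    print(round(allowed_downtime_hours(99.9), 2))
    print(round(measured_availability(8751.24, 8.76), 3))
```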
Application Performance Management
Application Performance Management (APM) is the process of monitoring and managing the performance of individual software applications, rather than the performance of IT services as a whole.
Simple Network Management Protocol (SNMP)
SNMP is a widely supported protocol for monitoring and managing network devices. It supports GET and SET operations to read and write data, as well as TRAP operations, with which a device notifies the management system of events.
The Management Information Base (MIB) is a database of objects that can be monitored and managed by SNMP. The MIB is organized in a tree structure, with each object identified by an Object Identifier (OID). The MIB is split into a standardized part and a vendor-specific part.
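The tree structure of OIDs can be illustrated with a short sketch that classifies an OID by its prefix. It assumes the commonly used subtrees `1.3.6.1.2.1` (mib-2) for standardized objects and `1.3.6.1.4.1` (enterprises) for vendor-specific objects. The function names are hypothetical, and no SNMP library or network access is involved.

```python
# Assumed prefixes for the two parts of the MIB tree.
STANDARD_PREFIX = (1, 3, 6, 1, 2, 1)  # iso.org.dod.internet.mgmt.mib-2
VENDOR_PREFIX = (1, 3, 6, 1, 4, 1)    # iso.org.dod.internet.private.enterprises


def parse_oid(oid):
    """Turn a dotted OID string into a tuple of integers (path in the tree)."""
    return tuple(int(part) for part in oid.strip(".").split("."))


def classify_oid(oid):
    """Return which part of the MIB tree an OID belongs to."""
    nodes = parse_oid(oid)
    if nodes[:len(STANDARD_PREFIX)] == STANDARD_PREFIX:
        return "standard"
    if nodes[:len(VENDOR_PREFIX)] == VENDOR_PREFIX:
        return "vendor-specific"
    return "other"


if __name__ == "__main__":
    print(classify_oid("1.3.6.1.2.1.1.3.0"))   # sysUpTime.0 in mib-2
    print(classify_oid(".1.3.6.1.4.1.9.2.1"))  # an OID below enterprises
```

Each number in an OID selects one branch of the tree, so comparing prefixes is enough to tell which subtree an object lives in.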
High Availability
High Availability is the ability of a system to remain operational continuously for a long period of time. High Availability is achieved by implementing redundancy and failover mechanisms to ensure that the system remains operational even in the event of a failure.
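The effect of redundancy can be estimated with a simple calculation: a redundant group is unavailable only if all of its members are down at the same time. The sketch below assumes that members fail independently and that any single member can carry the full load; both are idealizations, and the function name is hypothetical.

```python
def combined_availability(single_availability, replicas):
    """Availability of a group in which one working replica is sufficient.

    single_availability is a fraction between 0 and 1.
    Assumes independent failures: the group fails only if every replica fails.
    """
    return 1 - (1 - single_availability) ** replicas


if __name__ == "__main__":
    # Two replicas at 99 % each yield about 99.99 % under these assumptions.
    print(round(combined_availability(0.99, 2), 6))
```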
Types of High Availability
- Active/Passive: One system is active and the other is passive. The passive system takes over when the active system fails.
- Active/Active: Both systems are active and share the load. If one system fails, the other system takes over the load.
- N+1: Multiple systems are active, but one system is kept in reserve to take over if one of the active systems fails.
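The difference between Active/Passive and Active/Active can be sketched as two load-distribution functions. The node names and the representation of load as a fraction per node are assumptions for this example; a real cluster manager would also handle health checks and state replication.

```python
def active_passive(nodes, healthy):
    """Route all load to the first healthy node in priority order."""
    for node in nodes:
        if node in healthy:
            return {node: 1.0}
    return {}  # no healthy node left


def active_active(nodes, healthy):
    """Split the load evenly across all healthy nodes."""
    alive = [node for node in nodes if node in healthy]
    return {node: 1.0 / len(alive) for node in alive}


if __name__ == "__main__":
    nodes = ["node-a", "node-b"]
    print(active_passive(nodes, healthy={"node-a", "node-b"}))
    print(active_passive(nodes, healthy={"node-b"}))
    print(active_active(nodes, healthy={"node-a", "node-b"}))
    print(active_active(nodes, healthy={"node-b"}))
```

In both variants the surviving node ends up with the full load after a failure, so it has to be sized accordingly.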
Failure Behavior
- Fail-Safe: The system enters a safe state when it fails.
- Fail-Passive: The system produces no result when it fails.
- Fail-Operational: The system continues to operate despite failures (e.g. quorum-based systems).
- Fail-Stop: The system stops when it fails.
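The quorum-based systems mentioned above as an example of fail-operational behavior can be sketched with a majority rule: the cluster keeps operating as long as more than half of its members respond. The function names are hypothetical, and the strict-majority rule is one common choice rather than the only possible quorum definition.

```python
def has_quorum(responding, cluster_size):
    """Return True if a strict majority of the cluster is responding."""
    return responding > cluster_size // 2


def tolerated_failures(cluster_size):
    """Return how many members may fail while a strict majority remains."""
    return (cluster_size - 1) // 2


if __name__ == "__main__":
    for size in (3, 4, 5):
        print(size, tolerated_failures(size))
```

Under this rule a cluster of four tolerates no more failures than a cluster of three, which is why quorum-based clusters are often built with an odd number of members.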
Tolerance
Availability Environment Classification (AEC) of the Harvard Research Group (HRG):
| HRG Class | Description | Explanation |
|---|---|---|
| AEC-0 | conventional | can be interrupted, data integrity not essential |
| AEC-1 | highly reliable | can be interrupted, data integrity must be guaranteed |
| AEC-2 | high availability | cannot be interrupted, or only for a short time |
| AEC-3 | fault resilient | must not be interrupted during defined timeslots |
| AEC-4 | fault tolerant | uninterrupted operation must be guaranteed 24/7 |
| AEC-5 | disaster tolerant | must be operational even in case of a disaster |