
06 Performance and Availability Management

Performance Management

Performance Management is the process of managing the performance of IT services to ensure that they meet the agreed service levels. Performance Management includes the following activities:

Monitoring: Continuously measuring the performance of IT services to detect performance issues and bottlenecks.
Analysis: Evaluating the collected performance data to identify the root cause of performance issues.
Optimization: Tuning IT services to improve their performance and remove bottlenecks.
Reporting: Reporting on the performance of IT services to stakeholders and management.
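As an illustration of the monitoring and analysis steps, the following Python sketch checks measured response times against a service level. The 95th-percentile target of 200 ms, the sample data and the function names are hypothetical; real SLAs define their own metrics and thresholds.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Return the nearest-rank p-th percentile (0 < p <= 100) of the samples."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def check_sla(samples_ms: list[float], p: float = 95.0, threshold_ms: float = 200.0) -> dict:
    """Compare the p-th percentile response time with a (hypothetical) SLA threshold."""
    value = percentile(samples_ms, p)
    return {
        "percentile": p,
        "value_ms": value,
        "threshold_ms": threshold_ms,
        "met": value <= threshold_ms,
        # Individual slow requests are a starting point for root-cause analysis.
        "slow_requests": sum(1 for s in samples_ms if s > threshold_ms),
    }


if __name__ == "__main__":
    response_times_ms = [120, 135, 150, 180, 95, 210, 160, 140, 130, 450]
    print(check_sla(response_times_ms))
```

A percentile is used instead of the mean because a few slow requests can matter to users while barely moving the average: the mean of the sample data is 177 ms, below the threshold, whereas the 95th percentile is 450 ms.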

Performance Tuning

Performance tuning is the optimization step of performance management: changing IT services and the systems they run on so that they perform better and bottlenecks are reduced.

This includes generic activities, such as deactivating unnecessary services or defragmenting hard drives, as well as operating-system-specific activities, such as cleaning registry entries or enlarging the page file on Windows, or optimizing kernel parameters on Linux.

Adding more resources or clustering/parallelizing services can also be part of performance tuning, but usually incurs considerably higher costs.

Capacity Planning

Capacity Planning is the process of planning the capacity of IT services to ensure that they meet the current and future demand. Capacity planning has the following benefits for an organization:

Cost Reduction: Avoiding over-provisioning reduces costs.
Improved Performance: Anticipating demand helps avoid bottlenecks before they occur.
Shared Understanding: All stakeholders share a common understanding of the capacity requirements.
Investment: Investments in IT services are aligned with business requirements.
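A simple form of capacity planning is to extrapolate historical usage and estimate when the available capacity will be exhausted. The Python sketch below fits a least-squares linear trend to hypothetical monthly storage figures; real demand is often non-linear or seasonal, so this is only a first approximation.

```python
from typing import Optional


def fit_linear_trend(values: list[float]) -> tuple[float, float]:
    """Least-squares fit of y = intercept + slope * x for x = 0, 1, ..., n-1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two data points")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def periods_until_full(values: list[float], capacity: float) -> Optional[float]:
    """Periods after the last observation until the trend reaches capacity.

    Returns None if usage is flat or shrinking (capacity is never reached).
    """
    intercept, slope = fit_linear_trend(values)
    if slope <= 0:
        return None
    x_full = (capacity - intercept) / slope
    return max(0.0, x_full - (len(values) - 1))


if __name__ == "__main__":
    used_gb = [500, 520, 540, 560, 580]  # hypothetical monthly storage usage
    print(periods_until_full(used_gb, capacity=1000))  # 21.0 months of headroom
```

With growth of 20 GB per month, the 1000 GB volume would be full in about 21 months; in practice, action would be planned well before that, for example at a utilization threshold agreed with stakeholders.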

Availability Management

Availability Management is the process of managing the availability of IT services to ensure that they meet the agreed service levels.

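Availability can be ensured in two ways:

Proactive: measures that prevent outages or limit their impact, e.g. redundancy, failover, clustering, monitoring, …
Reactive: processes that restore the service after an outage, e.g. incident management, problem management, …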
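Availability is commonly expressed as the fraction of the agreed service time during which a service was actually available, or, in terms of failure and repair times, as MTBF / (MTBF + MTTR), where MTBF is the mean time between failures and MTTR the mean time to repair. The Python sketch below computes both forms and the yearly downtime budget for a given target; the figures are illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def availability(agreed_hours: float, downtime_hours: float) -> float:
    """Fraction of the agreed service time during which the service was available."""
    return (agreed_hours - downtime_hours) / agreed_hours


def availability_from_mtbf(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures and mean time to repair."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_budget_hours(target: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Maximum downtime per period that still meets an availability target."""
    return (1 - target) * period_hours


if __name__ == "__main__":
    print(f"{availability(720, 3.6):.3%}")  # 30-day month with a 3.6 h outage
    print(f"{availability_from_mtbf(999, 1):.3%}")
    for target in (0.99, 0.999, 0.9999):
        print(f"{target:.2%} -> {downtime_budget_hours(target):.2f} h/year")
```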

Application Performance Management

Application Performance Management (APM) is the process of monitoring and managing the performance of individual software applications, i.e. at the application level rather than at the level of the IT service as a whole.
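APM tools typically instrument the application code itself and record, for example, how long individual operations take. The Python sketch below illustrates the idea with a hypothetical `traced` decorator that keeps call durations in memory; real APM agents usually instrument code automatically and send the data to a central backend.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-memory metrics store: function name -> call durations in seconds.
call_durations = defaultdict(list)


def traced(func):
    """Record the wall-clock duration of every call, including calls that raise."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            call_durations[func.__qualname__].append(time.perf_counter() - start)
    return wrapper


@traced
def load_customer(customer_id: int) -> dict:
    time.sleep(0.01)  # stand-in for a database query
    return {"id": customer_id}


def summary() -> dict:
    """Per-function call count, average and maximum duration in milliseconds."""
    return {
        name: {
            "calls": len(d),
            "avg_ms": 1000 * sum(d) / len(d),
            "max_ms": 1000 * max(d),
        }
        for name, d in call_durations.items()
        if d
    }


if __name__ == "__main__":
    for customer_id in range(5):
        load_customer(customer_id)
    print(summary())
```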

Simple Network Management Protocol (SNMP)

SNMP is a widely supported protocol for monitoring and managing network devices. It supports GET and SET operations to read and write data, as well as TRAP messages with which devices notify the management system of events.

The Management Information Base (MIB) describes the objects that can be monitored and managed via SNMP. The MIB is organized as a tree, with each object identified by an Object Identifier (OID). The MIB is split into a standardized part (e.g. the mib-2 subtree 1.3.6.1.2.1) and a vendor-specific part (the enterprises subtree 1.3.6.1.4.1).
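To make the tree structure concrete, the Python sketch below models a tiny, hypothetical agent that stores MIB objects keyed by OID and supports GET, SET and a walk over a subtree. sysDescr.0 (1.3.6.1.2.1.1.1.0) and sysName.0 (1.3.6.1.2.1.1.5.0) are real objects from the standardized mib-2 subtree; the enterprise number 99999 and its object are placeholders for a vendor-specific branch. Networking, access control and TRAPs are not modeled.

```python
MIB_2 = (1, 3, 6, 1, 2, 1)        # iso.org.dod.internet.mgmt.mib-2 (standardized)
ENTERPRISES = (1, 3, 6, 1, 4, 1)  # iso.org.dod.internet.private.enterprises (vendor-specific)


def parse_oid(text: str) -> tuple[int, ...]:
    """Convert dotted notation such as '1.3.6.1.2.1.1.5.0' into a tuple of ints."""
    return tuple(int(part) for part in text.strip(".").split("."))


def branch(oid: tuple[int, ...]) -> str:
    """Classify an OID as standardized (mib-2) or vendor-specific (enterprises)."""
    if oid[: len(MIB_2)] == MIB_2:
        return "standard"
    if oid[: len(ENTERPRISES)] == ENTERPRISES:
        return "vendor-specific"
    return "other"


class ToyAgent:
    """Hypothetical in-memory SNMP agent: no networking, no access control, no TRAPs."""

    def __init__(self) -> None:
        self.objects = {
            parse_oid("1.3.6.1.2.1.1.1.0"): "Example router",  # sysDescr.0 (read-only)
            parse_oid("1.3.6.1.2.1.1.5.0"): "core-rtr-01",     # sysName.0 (read-write)
            parse_oid("1.3.6.1.4.1.99999.1.0"): 42,            # placeholder vendor object
        }
        self.writable = {parse_oid("1.3.6.1.2.1.1.5.0")}

    def get(self, oid: str):
        key = parse_oid(oid)
        if key not in self.objects:
            raise KeyError(f"noSuchObject: {oid}")
        return self.objects[key]

    def set(self, oid: str, value) -> None:
        key = parse_oid(oid)
        if key not in self.writable:
            raise PermissionError(f"notWritable: {oid}")
        self.objects[key] = value

    def walk(self, prefix: str) -> list[tuple[str, object]]:
        """Return all objects below `prefix` in OID order (int tuples sort numerically)."""
        root = parse_oid(prefix)
        return [
            (".".join(map(str, oid)), value)
            for oid, value in sorted(self.objects.items())
            if oid[: len(root)] == root
        ]


if __name__ == "__main__":
    agent = ToyAgent()
    print(agent.get("1.3.6.1.2.1.1.5.0"))
    agent.set("1.3.6.1.2.1.1.5.0", "edge-rtr-02")
    for oid, value in agent.walk("1.3.6.1.2.1.1"):
        print(oid, value, branch(parse_oid(oid)))
```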

High Availability

High Availability is the ability of a system to remain operational continuously for a long period of time. High Availability is achieved by implementing redundancy and failover mechanisms to ensure that the system remains operational even in the event of a failure.
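Why redundancy helps can be quantified under a simplifying assumption: if components fail independently, a chain of components that are all required has availability A_1 · A_2 · … · A_n, while n redundant replicas, of which at least one must be up, reach 1 − (1 − A)^n. The Python sketch below applies both formulas; real failures are often correlated (shared power, the same software bug on every node), so the figures for redundant setups tend to be optimistic.

```python
from math import prod


def serial_availability(availabilities: list[float]) -> float:
    """All components are required, e.g. load balancer, application server and database."""
    return prod(availabilities)


def parallel_availability(a: float, replicas: int) -> float:
    """At least one of `replicas` independent, identical components must be up."""
    return 1 - (1 - a) ** replicas


if __name__ == "__main__":
    single = 0.99
    for n in (1, 2, 3):
        print(n, f"{parallel_availability(single, n):.6f}")
    # A chain of three 99.9 % components is less available than any single one.
    print(f"{serial_availability([0.999, 0.999, 0.999]):.6f}")
```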

Types of High Availability

Active/Passive: One system is active and the other is passive. The passive system takes over when the active system fails.
Active/Active: Both systems are active and share the load. If one system fails, the other system takes over the load.
N+1: N systems are active and share the load, while one additional system is kept in reserve to take over if any of the active systems fails.
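The difference between these types can be sketched as routing decisions. In the hypothetical Python classes below, `ActivePassive` sends all traffic to the primary until it is marked as failed, `ActiveActive` spreads requests across all healthy nodes, and `can_absorb_failures` expresses the N+1 sizing rule. Health checks, state replication and failback are deliberately left out.

```python
class Cluster:
    """Base class that tracks which nodes are healthy (hypothetical, in-memory only)."""

    def __init__(self, nodes: list[str]):
        self.nodes = list(nodes)
        self.healthy = {node: True for node in self.nodes}

    def mark_down(self, node: str) -> None:
        self.healthy[node] = False

    def healthy_nodes(self) -> list[str]:
        nodes = [n for n in self.nodes if self.healthy[n]]
        if not nodes:
            raise RuntimeError("no healthy node left")
        return nodes


class ActivePassive(Cluster):
    """The first node is active; the standby only receives traffic after a failover."""

    def __init__(self, primary: str, standby: str):
        super().__init__([primary, standby])

    def route(self) -> str:
        return self.healthy_nodes()[0]


class ActiveActive(Cluster):
    """All healthy nodes share the load (simple round-robin by request id)."""

    def route(self, request_id: int) -> str:
        nodes = self.healthy_nodes()
        return nodes[request_id % len(nodes)]


def can_absorb_failures(total_nodes: int, nodes_needed: int, failures: int = 1) -> bool:
    """N+1 sizing: after `failures` nodes fail, enough nodes remain to carry the load."""
    return total_nodes - failures >= nodes_needed


if __name__ == "__main__":
    ap = ActivePassive("node-a", "node-b")
    print(ap.route())
    ap.mark_down("node-a")
    print(ap.route())  # failover to the standby
    aa = ActiveActive(["n1", "n2", "n3"])
    aa.mark_down("n2")
    print([aa.route(i) for i in range(4)])
    print(can_absorb_failures(total_nodes=4, nodes_needed=3))  # N = 3, plus 1 spare
```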

Failure Behavior

Fail-Safe: The system fails into a safe state.
Fail-Passive: The system produces no result when it fails, rather than an incorrect one.
Fail-Operational: The system continues to operate despite failures (e.g. quorum-based systems).
Fail-Stop: The system stops when it fails.
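Quorum-based systems, named above as an example of fail-operational behavior, typically require a strict majority of their nodes before they continue to operate. The Python sketch below computes such a majority quorum; it also shows that a partition without a majority stops, i.e. behaves fail-stop, which prevents two halves of a split cluster from operating independently. The function names are illustrative.

```python
def quorum_size(cluster_size: int) -> int:
    """Smallest strict majority of the cluster."""
    if cluster_size < 1:
        raise ValueError("cluster needs at least one node")
    return cluster_size // 2 + 1


def can_operate(reachable_nodes: int, cluster_size: int) -> bool:
    """A partition keeps operating only if it still holds a majority."""
    return reachable_nodes >= quorum_size(cluster_size)


def tolerated_failures(cluster_size: int) -> int:
    """Number of node failures the cluster survives while keeping a quorum."""
    return cluster_size - quorum_size(cluster_size)


if __name__ == "__main__":
    for size in (3, 4, 5):
        print(size, quorum_size(size), tolerated_failures(size))
    # A 5-node cluster split 3/2: only the 3-node side keeps operating.
    print([can_operate(part, 5) for part in (3, 2)])
```

A 4-node cluster tolerates no more failures than a 3-node cluster (one in both cases), which is why odd cluster sizes are common.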

Tolerance

Availability environment classes (AEC) as defined by the Harvard Research Group (HRG):

AEC Class	Name	Requirement
AEC-0	conventional	can be interrupted, data integrity not essential
AEC-1	highly reliable	can be interrupted, data integrity must be guaranteed
AEC-2	high availability	must not be interrupted, or only for a short time
AEC-3	fault resilient	must not be interrupted during defined timeslots
AEC-4	fault tolerant	uninterrupted operation must be guaranteed 24/7
AEC-5	disaster tolerant	must remain operational even in case of a disaster
