Knowledge Overview

Time Investment

10 minutes reading time
20-30 minutes hands-on practice

Guide Content

Linux clustering with Pacemaker provides enterprise-grade high availability by managing cluster resources and automating failover. Pacemaker integrates with Corosync for cluster communication and keeps services running through node failures, which makes it essential for mission-critical applications that require 99.9% uptime or better.

Table of Contents

  1. What is Linux Clustering with Pacemaker?
  2. How Does Pacemaker Cluster Architecture Work?
  3. Why Choose Pacemaker for High Availability Clustering?
  4. How to Install Pacemaker Cluster Software?
  5. How to Configure Corosync Communication Layer?
  5. How to Set Up a Basic Pacemaker Cluster?
  7. How to Configure Cluster Resources?
  8. How to Implement STONITH Fencing?
  9. How to Test Failover Scenarios?
  10. How to Monitor Cluster Health?
  11. Troubleshooting Common Clustering Issues
  12. Advanced Pacemaker Configuration
  13. Frequently Asked Questions

What is Linux Clustering with Pacemaker?

Linux clustering with Pacemaker is a comprehensive high availability solution that manages cluster resources across multiple nodes. Pacemaker serves as the cluster resource manager (CRM), responsible for starting, stopping, and monitoring services within the cluster, and it fails services over automatically when a node crashes or enters a maintenance window.

Core Components of Pacemaker Clustering

Pacemaker clustering consists of several essential components working together:

1. Pacemaker Cluster Resource Manager

Bash
# Check Pacemaker version and status
pcs --version
systemctl status pacemaker

2. Corosync Communication Engine

Bash
# Verify Corosync cluster communication
systemctl status corosync
corosync-cmapctl | grep members

3. Resource Agents

Bash
# List available resource agents
pcs resource agents
pcs resource agents ocf:heartbeat

4. STONITH/Fencing Mechanisms

Bash
# Check available fence agents
pcs stonith list
fence_ipmilan --help

Benefits of Linux Clustering with Pacemaker

Implementing Linux clustering with Pacemaker delivers significant operational advantages:

  • Automated Failover: Seamlessly transfers services between nodes without manual intervention
  • Resource Management: Intelligently manages service dependencies and startup sequences
  • Split-Brain Protection: Prevents data corruption through advanced fencing mechanisms
  • Scalable Architecture: Supports clusters from 2 to 32+ nodes depending on requirements
  • Enterprise Integration: Compatible with major Linux distributions and enterprise applications

How Does Pacemaker Cluster Architecture Work?

Understanding the Pacemaker cluster architecture makes both implementation and troubleshooting far easier. The architecture is layered, with each component handling a specific part of keeping services highly available.

Cluster Communication Stack

Bash
# Display cluster stack information
pcs status
pcs cluster pcsd-status

The Pacemaker communication stack consists of:

1: Hardware and Network Infrastructure

  • Dedicated cluster networks for heartbeat communication
  • Shared storage systems (SAN, NFS, DRBD)
  • Power management interfaces (IPMI, iLO, DRAC)
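
Before going further, it is worth confirming this infrastructure is actually reachable; the interface name, addresses, and IPMI credentials below are placeholders for your own environment.

Bash
# Quick infrastructure sanity checks (interface and addresses are examples)
ip -br link show                                  # dedicated cluster NICs present and up?
ping -c 2 -I eth1 192.168.2.11                    # heartbeat network reachable?
ipmitool -I lanplus -H 192.168.1.20 -U admin -P password chassis status   # power management reachable?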

2: Corosync Messaging Layer

Bash
# Configure Corosync authentication
corosync-keygen
systemctl restart corosync

3: Pacemaker Resource Management

Bash
# View cluster resource management
pcs resource show
pcs constraint show

Resource Management Workflow

Pacemaker manages cluster resources through a sophisticated workflow:

  1. Resource Discovery: Identifies available resources and their current states
  2. Policy Engine: Applies configuration rules and constraints
  3. Transition Engine: Coordinates resource state changes
  4. Local Resource Manager: Executes resource operations on nodes
Bash
# Monitor resource management workflow
pcs status --full
crm_mon --one-shot

Quorum and Split-Brain Prevention

Cluster quorum mechanisms prevent split-brain scenarios:

Bash
# Configure quorum settings
pcs quorum status
pcs quorum expected-votes 3
pcs property set no-quorum-policy=ignore

Why Choose Pacemaker for High Availability Clustering?

Pacemaker offers compelling advantages over alternative high availability solutions, and enterprise environments benefit from its mature feature set and extensive documentation. Understanding these advantages helps justify the implementation effort.

Technical Advantages

Feature            Pacemaker                      Alternative Solutions
Resource Agents    100+ supported agents          Limited agent support
Fencing Methods    Multiple STONITH types         Basic fencing only
Constraint Types   Location, order, colocation    Limited constraint options
Node Limits        32+ nodes supported            Often limited to 2-4 nodes
Documentation      Extensive official docs        Varying documentation quality

Enterprise Integration Features

1. Red Hat Enterprise Linux Integration

Bash
# Install on RHEL/CentOS
sudo yum install pcs pacemaker corosync fence-agents-all
sudo systemctl enable pcsd pacemaker corosync

2. SUSE Linux Enterprise Server Support

Bash
# Install on SLES
sudo zypper install ha-cluster-bootstrap crmsh
sudo ha-cluster-init

3. Ubuntu Server Compatibility

Bash
# Install on Ubuntu
sudo apt update
sudo apt install pacemaker corosync crmsh fence-agents

Performance Characteristics

Pacemaker delivers excellent performance metrics:

  • Failover Time: Typically 30-60 seconds depending on configuration
  • Resource Overhead: Minimal CPU and memory consumption
  • Network Traffic: Efficient heartbeat protocol with configurable intervals
  • Scalability: Linear performance scaling with additional nodes
Bash
# Monitor cluster performance
pcs status
iostat -x 1
iftop -i eth0

How to Install Pacemaker Cluster Software?

Installing the Pacemaker cluster software requires careful preparation and systematic execution, because a clean installation is the foundation for reliable cluster operations. The procedure varies slightly between Linux distributions but follows the same basic steps.

Prerequisites and System Requirements

Before installing Pacemaker, ensure systems meet minimum requirements:

Hardware Requirements:

  • Minimum 2 nodes with identical hardware configurations
  • At least 2GB RAM per node (4GB+ recommended)
  • Dedicated network interfaces for cluster communication
  • Shared storage or data replication mechanisms
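
As a rough pre-flight sketch, these requirements can be checked from the shell before installing any cluster packages; the 2 GB threshold and interface count here are only illustrative.

Bash
# Minimal pre-flight checks: RAM, NIC count, and time sync (thresholds are examples)
free -g | awk '/^Mem:/ {print ($2 >= 2 ? "RAM: OK" : "RAM: below 2GB")}'
nics=$(ls /sys/class/net | grep -vc '^lo$')
echo "Network interfaces (excluding lo): $nics"
timedatectl | grep "System clock synchronized"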

Network Configuration:

Bash
# Configure static IP addresses
sudo nmcli con mod eth0 ipv4.addresses 192.168.1.10/24
sudo nmcli con mod eth0 ipv4.gateway 192.168.1.1
sudo nmcli con mod eth0 ipv4.dns 8.8.8.8
sudo nmcli con up eth0

# Test connectivity between nodes
ping -c 3 192.168.1.11
telnet 192.168.1.11 22

System Preparation:

Bash
# Synchronize system clocks
sudo systemctl enable chronyd
sudo systemctl start chronyd
chronyc sources -v

# Configure hostname resolution
sudo hostnamectl set-hostname node1.cluster.local
echo "192.168.1.10 node1.cluster.local node1" >> /etc/hosts
echo "192.168.1.11 node2.cluster.local node2" >> /etc/hosts

Red Hat Enterprise Linux Installation

1: Enable High Availability Repository

Bash
# Register system and enable HA add-on
sudo subscription-manager repos --enable=rhel-8-for-x86_64-highavailability-rpms

# Install cluster packages
sudo dnf install pcs pacemaker corosync fence-agents-all

2: Configure Firewall Rules

Bash
# Allow cluster communication ports
sudo firewall-cmd --permanent --add-service=high-availability
sudo firewall-cmd --permanent --add-port=2224/tcp
sudo firewall-cmd --permanent --add-port=3121/tcp
sudo firewall-cmd --permanent --add-port=5405/udp
sudo firewall-cmd --reload

3: Start and Enable Services

Bash
# Enable cluster services
sudo systemctl enable pcsd
sudo systemctl start pcsd

# Set cluster user password
echo 'clusterpw' | sudo passwd --stdin hacluster

Ubuntu Server Installation

Step 1: Install Packages

Bash
# Update package repository
sudo apt update

# Install clustering software
sudo apt install pacemaker corosync crmsh fence-agents resource-agents

Step 2: Configure Authentication

Bash
# Set hacluster user password
sudo passwd hacluster

# Configure SSH key authentication
sudo -u hacluster ssh-keygen -t rsa -N ""
sudo -u hacluster ssh-copy-id hacluster@node2

SUSE Linux Enterprise Server Installation

Step 1: Install HA Pattern

Bash
# Install HA cluster pattern
sudo zypper install -t pattern ha_sles
sudo zypper install ha-cluster-bootstrap

Step 2: Initialize Cluster

Bash
# Run cluster bootstrap wizard
sudo ha-cluster-init

# Configure cluster authentication
sudo ha-cluster-join -c node1

Installation Verification

Verify Installation Success:

Bash
# Check service status
sudo systemctl status pacemaker corosync pcsd

# Verify cluster software versions
pacemakerd --version
corosync -v
pcs --version

# Test cluster communication
pcs host auth node1 node2 -u hacluster -p clusterpw   # 'pcs cluster auth' on pcs 0.9.x
pcs cluster status

How to Configure Corosync Communication Layer?

Configuring the Corosync communication layer establishes the cluster's messaging infrastructure. Correct Corosync settings keep heartbeat traffic efficient and help prevent split-brain conditions, so they are crucial for cluster stability.

Corosync Configuration File Structure

Primary Configuration File: /etc/corosync/corosync.conf

Bash
# Generate initial configuration
sudo pcs cluster setup mycluster node1 node2 --start --enable

# View generated configuration
sudo cat /etc/corosync/corosync.conf

Basic Corosync Configuration

Bash
# Contents of /etc/corosync/corosync.conf (cluster configuration)
totem {
    version: 2
    cluster_name: mycluster
    clear_node_high_bit: yes
    crypto_cipher: aes256
    crypto_hash: sha256
    
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.1.0
        broadcast: yes
        mcastport: 5405
    }
}

logging {
    fileline: off
    to_logfile: yes
    to_syslog: yes
    logfile: /var/log/cluster/corosync.log
    debug: off
    timestamp: on
    logger_subsys {
        subsys: QUORUM
        debug: off
    }
}

nodelist {
    node {
        ring0_addr: 192.168.1.10
        name: node1
        nodeid: 1
    }
    node {
        ring0_addr: 192.168.1.11
        name: node2
        nodeid: 2
    }
}

quorum {
    provider: corosync_votequorum
    expected_votes: 2
    two_node: 1
}
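
If corosync.conf is edited by hand rather than generated by pcs cluster setup, the same file must exist on every node before the change takes effect. One way to propagate it (the node name is an example):

Bash
# Push the edited configuration to all nodes (or copy it manually), then restart the stack
sudo pcs cluster sync                  # sync corosync.conf to all cluster nodes
# alternatively: sudo scp /etc/corosync/corosync.conf node2:/etc/corosync/
sudo pcs cluster stop --all && sudo pcs cluster start --all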

Advanced Communication Settings

Redundant Ring Configuration:

Bash
# Configure dual network rings for reliability
interface {
    ringnumber: 0
    bindnetaddr: 192.168.1.0
    broadcast: yes
    mcastport: 5405
}

interface {
    ringnumber: 1
    bindnetaddr: 10.0.0.0
    broadcast: yes
    mcastport: 5406
}

Heartbeat Timing Optimization:

Bash
# Configure heartbeat intervals
totem {
    token: 3000
    token_retransmits_before_loss_const: 10
    join: 100
    consensus: 3000
    max_messages: 20
    send_join: 45
}
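
To confirm which totem values the running cluster actually picked up, read them back from the Corosync in-memory database; exact key names vary slightly between Corosync versions.

Bash
# Read back the active totem timing values
sudo corosync-cmapctl | grep -E "totem\.(token|consensus|join)"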

Corosync Authentication

Generate Authentication Key:

Bash
# Create cluster authentication key
sudo corosync-keygen
sudo scp /etc/corosync/authkey node2:/etc/corosync/
sudo chmod 400 /etc/corosync/authkey

Verify Authentication:

Bash
# Test authentication across nodes
sudo corosync-cmapctl | grep members
sudo corosync-cmapctl | grep runtime

Communication Testing

Monitor Cluster Communication:

Bash
# Real-time communication monitoring
sudo corosync-cmapctl -b runtime.totem.pg.mrp.srp.members
sudo journalctl -f -u corosync

# Network traffic analysis
sudo tcpdump -i eth0 port 5405
sudo netstat -tulpn | grep 5405

Performance Tuning:

Bash
# Optimize network buffers
echo 'net.core.rmem_max = 134217728' | sudo tee -a /etc/sysctl.conf
echo 'net.core.wmem_max = 134217728' | sudo tee -a /etc/sysctl.conf
sudo sysctl -p

# Ignore broadcast ICMP echo requests (reduces noise on the cluster network)
echo 1 | sudo tee /proc/sys/net/ipv4/icmp_echo_ignore_broadcasts

How to Set Up a Basic Pacemaker Cluster?

Setting up a basic Pacemaker cluster involves a series of systematic steps that establish the cluster identity and its operational parameters. This initial configuration is the foundation for resource management and failover, and it shapes long-term cluster stability and performance.

Cluster Creation Process

1: Cluster Authentication

Bash
# Authenticate cluster nodes
sudo pcs host auth node1 node2 -u hacluster -p clusterpw

# Verify pcsd is running and nodes are reachable
sudo pcs status pcsd

2: Cluster Initialization

Bash
# Create and start cluster
sudo pcs cluster setup mycluster node1 node2 --start --enable

# Alternative method with advanced options
sudo pcs cluster setup mycluster \
    node1 addr=192.168.1.10 \
    node2 addr=192.168.1.11 \
    transport knet \
    --start --enable

3: Cluster Verification

Bash
# Check cluster status
sudo pcs cluster status
sudo pcs status

# Verify node membership
sudo pcs node attribute
sudo corosync-cmapctl | grep members

Initial Cluster Configuration

Configure Global Cluster Properties:

Bash
# Set basic cluster properties
sudo pcs property set stonith-enabled=false  # Temporarily disable until fencing is configured
sudo pcs property set no-quorum-policy=ignore  # For 2-node clusters only
sudo pcs property set cluster-recheck-interval=2min

# Configure failure handling and stickiness (resource defaults, not cluster properties)
sudo pcs resource defaults resource-stickiness=100
sudo pcs resource defaults migration-threshold=3
sudo pcs resource defaults failure-timeout=60s

View Current Configuration:

Bash
# Display cluster properties
sudo pcs property show
sudo pcs property list

# Export cluster configuration
sudo pcs config
sudo crm configure show

Node Management

Add Additional Nodes:

Bash
# Add new node to existing cluster
sudo pcs cluster node add node3 addr=192.168.1.12

# Update cluster membership
sudo pcs cluster reload corosync
sudo pcs cluster start node3

Remove Nodes:

Bash
# Gracefully remove node from cluster
sudo pcs cluster node remove node3
# On the removed node itself, clear leftover configuration with: pcs cluster destroy

Node Maintenance Mode:

Bash
# Put node in maintenance mode
sudo pcs node maintenance node2

# Remove from maintenance mode
sudo pcs node unmaintenance node2

# Standby mode for planned maintenance
sudo pcs node standby node2
sudo pcs node unstandby node2

Cluster Communication Testing

Verify Cluster Messaging:

Bash
# Test cluster communication
sudo pcs cluster stop --all
sudo pcs cluster start --all

# Monitor cluster logs
sudo journalctl -f -u pacemaker -u corosync

# Check cluster timing
sudo crm_mon --one-shot --timing-details

Network Connectivity Tests:

Bash
# Test multicast communication
omping -c 10 -T 30 192.168.1.10 192.168.1.11

# Monitor network latency
ping -f 192.168.1.11
hping3 -c 100 -i u100 192.168.1.11

Basic Troubleshooting

Common Startup Issues:

Bash
# Check service dependencies
sudo systemctl status pacemaker corosync pcsd

# Verify authentication
sudo pcs host auth node1 node2 --force

# Reset cluster if necessary
sudo pcs cluster destroy --all

Configuration Validation:

Bash
# Validate cluster configuration
sudo pcs cluster verify
sudo crm_verify -LV

# Test configuration changes against an offline copy of the CIB, then push it
sudo pcs cluster cib /tmp/test-cib.xml
sudo pcs -f /tmp/test-cib.xml property set maintenance-mode=false
sudo pcs cluster cib-push /tmp/test-cib.xml --config

How to Configure Cluster Resources?

Configuring cluster resources lets Pacemaker manage services, applications, and their dependencies automatically. Well-designed resource definitions and constraints are what make failover reliable and keep resources placed sensibly across nodes, so understanding resource types and constraints is essential.

Resource Types Overview

Pacemaker supports multiple resource types through standardized resource agents:

1. OCF (Open Cluster Framework) Resources

Bash
# List OCF resource agents
sudo pcs resource agents ocf:heartbeat
sudo pcs resource agents ocf:pacemaker

# Describe commonly used OCF agents
sudo pcs resource describe ocf:heartbeat:IPaddr2
sudo pcs resource describe ocf:heartbeat:apache
sudo pcs resource describe ocf:heartbeat:mysql

2. LSB (Linux Standard Base) Resources

Bash
# List system service resources
sudo pcs resource agents lsb
sudo pcs resource agents systemd

# Describe service-based resources
sudo pcs resource describe systemd:httpd
sudo pcs resource describe systemd:postgresql

3. STONITH (Fencing) Resources

Bash
# List fencing agents
sudo pcs resource agents stonith
sudo pcs stonith list

Creating Basic Resources

Virtual IP Resource:

Bash
# Create floating IP resource
sudo pcs resource create VirtualIP IPaddr2 \
    ip=192.168.1.100 \
    cidr_netmask=24 \
    nic=eth0 \
    op monitor interval=30s

# Verify resource creation
sudo pcs resource show VirtualIP
sudo pcs status

Apache Web Server Resource:

Bash
# Create Apache HTTP resource
sudo pcs resource create WebServer apache \
    configfile="/etc/httpd/conf/httpd.conf" \
    statusurl="http://localhost/server-status" \
    op monitor interval=20s timeout=40s \
    op start timeout=60s \
    op stop timeout=60s

# Check resource status
sudo pcs resource show WebServer

Database Resource:

Bash
# Create MySQL database resource
sudo pcs resource create Database mysql \
    binary="/usr/bin/mysqld_safe" \
    config="/etc/mysql/my.cnf" \
    datadir="/var/lib/mysql" \
    user="mysql" \
    op monitor interval=30s timeout=30s

# Monitor database resource
sudo pcs resource show Database

Resource Groups

Create Resource Groups:

Bash
# Group related resources together
sudo pcs resource group add WebServerGroup VirtualIP WebServer

# Add resource to existing group
sudo pcs resource group add WebServerGroup Database

# View group configuration
sudo pcs resource show WebServerGroup

Resource Group Benefits:

  • Simplified management of related resources
  • Automatic start/stop ordering within group
  • Colocation constraints applied automatically
  • Reduced configuration complexity
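
Because a group is scheduled as a single unit, the easiest way to see these benefits in action is to move the whole group and watch its members follow in their defined order, using the group created above:

Bash
# Move the whole group; members relocate together in their defined order
sudo pcs resource move WebServerGroup node2
sudo pcs status | grep -A 5 "Resource Group: WebServerGroup"

# Remove the location constraint created by the move
sudo pcs resource clear WebServerGroup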

Resource Constraints

Colocation Constraints:

Bash
# Ensure resources run on same node
sudo pcs constraint colocation add WebServer with VirtualIP INFINITY

# Prevent resources from running together
sudo pcs constraint colocation add Database with WebServer -INFINITY

# Show colocation constraints
sudo pcs constraint colocation show

Order Constraints:

Bash
# Define resource startup sequence
sudo pcs constraint order VirtualIP then WebServer

# Mandatory ordering with timing
sudo pcs constraint order start VirtualIP then start WebServer \
    kind=Mandatory symmetrical=true

# View order constraints
sudo pcs constraint order show

Location Constraints:

Bash
# Prefer specific nodes for resources
sudo pcs constraint location WebServer prefers node1=50

# Avoid certain nodes
sudo pcs constraint location Database avoids node2=INFINITY

# Show location constraints
sudo pcs constraint location show

Advanced Resource Configuration

Clone Resources:

Bash
# Create clone resource for active/active services
# (simultaneous mounts require a cluster filesystem such as GFS2)
sudo pcs resource create SharedStorage Filesystem \
    device="/dev/sdb1" \
    directory="/shared" \
    fstype="gfs2" \
    --clone

# Adjust clone parameters after creation (the clone is named SharedStorage-clone by default)
sudo pcs resource meta SharedStorage-clone \
    clone-max=2 \
    clone-node-max=1 \
    notify=true

Master/Slave Resources:

Bash
# Create the replication resource (promotion is configured below)
sudo pcs resource create DBReplication mysql \
    binary="/usr/bin/mysqld_safe" \
    config="/etc/mysql/my.cnf" \
    replication_user="repl" \
    replication_passwd="replpass"

# Configure master/slave parameters ('pcs resource promotable' replaces this on pcs 0.10+)
sudo pcs resource master DBMaster DBReplication \
    master-max=1 \
    master-node-max=1 \
    clone-max=2 \
    clone-node-max=1 \
    notify=true

Resource Monitoring and Operations

Operations:

Bash
# Manual resource operations
sudo pcs resource disable WebServer
sudo pcs resource enable WebServer
sudo pcs resource restart WebServer

# Move resource to specific node
sudo pcs resource move WebServer node2
sudo pcs resource clear WebServer

# Cleanup failed resources
sudo pcs resource cleanup WebServer

Monitoring:

Bash
# Monitor resource status
sudo pcs resource show --full
sudo crm_mon --one-shot --inactive

# Resource history and failures
sudo pcs resource failcount show WebServer
sudo pcs resource debug-start WebServer

How to Implement STONITH Fencing?

Implementing STONITH fencing prevents data corruption and split-brain scenarios. STONITH (Shoot The Other Node In The Head) guarantees that a failed or unreachable node cannot keep touching shared resources, which is why proper fencing is critical for production clusters.

STONITH Concepts and Requirements

Understanding STONITH Mechanisms:

  • Power-based Fencing: Controls node power through IPMI, iLO, or PDUs
  • Network-based Fencing: Isolates nodes through network switches
  • Storage-based Fencing: Blocks storage access from failed nodes
  • Hypervisor Fencing: Controls virtual machines through hypervisor APIs

STONITH Requirements:

Bash
# Check available fencing agents
sudo pcs stonith list | grep -i ipmi
sudo pcs stonith list | grep -i vmware
sudo pcs stonith list | grep -i libvirt

# Verify hardware fencing capabilities
ipmitool -I lanplus -H 192.168.1.20 -U admin -P password power status

IPMI Fencing Configuration

Configure IPMI Fencing:

Bash
# Create IPMI fence device for node1
sudo pcs stonith create fence-node1 fence_ipmilan \
    pcmk_host_list="node1" \
    ipaddr="192.168.1.20" \
    username="admin" \
    password="fencepass" \
    lanplus="true" \
    op monitor interval="60s"

# Create IPMI fence device for node2
sudo pcs stonith create fence-node2 fence_ipmilan \
    pcmk_host_list="node2" \
    ipaddr="192.168.1.21" \
    username="admin" \
    password="fencepass" \
    lanplus="true" \
    op monitor interval="60s"

STONITH Location Constraints:

Bash
# Prevent nodes from fencing themselves
sudo pcs constraint location fence-node1 avoids node1=INFINITY
sudo pcs constraint location fence-node2 avoids node2=INFINITY

# Verify STONITH configuration
sudo pcs stonith show
sudo pcs constraint show --full

VMware vSphere Fencing

Configure VMware Fencing:

Bash
# Create VMware fence device
sudo pcs stonith create fence-vmware fence_vmware_soap \
    ipaddr="vcenter.example.com" \
    username="cluster@vsphere.local" \
    password="vmwarepass" \
    ssl="1" \
    op monitor interval="60s"

# Map VMs to cluster nodes
sudo pcs stonith create fence-vm1 fence_vmware_soap \
    ipaddr="vcenter.example.com" \
    username="cluster@vsphere.local" \
    password="vmwarepass" \
    plug="cluster-node1-vm" \
    pcmk_host_list="node1" \
    ssl="1"

Shared Storage Fencing

Configure SAN-based Fencing:

Bash
# Create SAN fence device
sudo pcs stonith create fence-san fence_scsi \
    devices="/dev/sdb,/dev/sdc" \
    pcmk_host_list="node1,node2" \
    op monitor interval="60s"

# Configure reservation keys
echo "node1:0x123456789" > /etc/fence_scsi.conf
echo "node2:0x987654321" >> /etc/fence_scsi.conf

Network Switch Fencing

Configure Network Fencing:

Bash
# Create network switch fence device
sudo pcs stonith create fence-switch fence_cisco_ucs \
    ipaddr="switch.example.com" \
    username="admin" \
    password="switchpass" \
    plug="1/1/1,1/1/2" \
    pcmk_host_list="node1,node2" \
    ssl="1"

STONITH Testing and Validation

Enable STONITH:

Bash
# Enable STONITH globally
sudo pcs property set stonith-enabled=true

# Configure STONITH timeout
sudo pcs property set stonith-timeout=120s
sudo pcs property set stonith-action=reboot

Test STONITH Operations:

Bash
# Test fence devices
sudo stonith_admin --reboot=node2 --verbose
sudo fence_ipmilan -a 192.168.1.21 -l admin -p fencepass -P -o reboot

# Verify fencing history
sudo stonith_admin --history=node2
sudo pcs stonith history show

STONITH Monitoring:

Bash
# Monitor STONITH resources
sudo pcs stonith show --full
sudo crm_mon --watch-fencing

# Check STONITH logs
sudo journalctl -u pacemaker | grep -i stonith
sudo grep -i stonith /var/log/cluster/corosync.log

Advanced STONITH Configuration

Multi-level Fencing:

Bash
# Configure cascaded fencing levels
sudo pcs stonith level add 1 node1 fence-ipmi-node1
sudo pcs stonith level add 2 node1 fence-pdu-node1
sudo pcs stonith level add 3 node1 fence-switch-node1

# Show fencing levels
sudo pcs stonith level show

STONITH Resource Groups:

Bash
# Group STONITH resources
sudo pcs resource group add fencing-group fence-node1 fence-node2

# Clone a shared fence device for redundancy
sudo pcs stonith create fence-shared fence_vmware_soap \
    ipaddr="vcenter.example.com" \
    username="cluster@vsphere.local" \
    password="vmwarepass" \
    ssl="1"
sudo pcs resource clone fence-shared clone-max=2 clone-node-max=1

How to Test Failover Scenarios?

Testing failover scenarios validates cluster reliability and surfaces problems before they reach production. Thorough testing confirms that resources migrate cleanly and that services stay available under a range of failure conditions, so systematic failover testing is an essential part of cluster validation.

Planned Failover Testing

Manual Resource Migration:

Bash
# Move resource between nodes
sudo pcs resource move WebServer node2

# Verify resource migration
sudo pcs status
sudo pcs resource show WebServer

# Clear movement constraint
sudo pcs resource clear WebServer

Node Standby Testing:

Bash
# Put node in standby mode
sudo pcs node standby node1

# Monitor resource migration
watch 'sudo pcs status'

# Return node to service
sudo pcs node unstandby node1

Service Failure Simulation:

Bash
# Stop Apache service manually
sudo systemctl stop httpd

# Observe cluster response
sudo pcs status
sudo journalctl -u pacemaker -f

# Verify automatic restart
sudo pcs resource cleanup WebServer

Network Failure Testing

Network Interface Failure:

Bash
# Simulate network interface failure
sudo ip link set eth0 down

# Monitor cluster behavior
sudo pcs status
sudo corosync-cmapctl | grep members

# Restore network interface
sudo ip link set eth0 up

Split-Brain Scenario Testing:

Bash
# Block cluster communication (node1)
sudo iptables -A INPUT -s 192.168.1.11 -j DROP
sudo iptables -A OUTPUT -d 192.168.1.11 -j DROP

# Monitor quorum behavior
sudo pcs status
sudo corosync-quorumtool

# Restore communication
sudo iptables -F

Hardware Failure Simulation

Disk Failure Testing:

Bash
# Simulate disk failure
echo offline > /sys/block/sdb/device/state

# Monitor shared storage resources
sudo pcs resource show SharedStorage
sudo df -h /shared

# Restore disk
echo running > /sys/block/sdb/device/state

Memory Pressure Testing:

Bash
# Create memory pressure
stress --vm 2 --vm-bytes 1G --timeout 300s

# Monitor cluster response
sudo pcs status
sudo free -m
sudo top

Application-Level Testing

Database Failover Testing:

Bash
# Kill MySQL process
sudo pkill -9 mysqld

# Verify cluster response
sudo pcs resource show Database
sudo pcs status

# Check database recovery
mysql -u root -p -e "SHOW STATUS"

Web Server Testing:

Bash
# Test web service availability
while true; do
    curl -s http://192.168.1.100 || echo "Failed at $(date)"
    sleep 1
done

# Monitor during failover
sudo pcs resource move WebServer node2

STONITH Testing

Controlled STONITH Testing:

Bash
# Test fence device manually
sudo fence_ipmilan -a 192.168.1.21 -l admin -p fencepass -P -o status

# Trigger STONITH through cluster
sudo stonith_admin --reboot=node2 --verbose

# Monitor STONITH logs
sudo journalctl -u pacemaker | grep stonith

Node Crash Simulation:

Bash
# Simulate kernel panic (WARNING: Will reboot node!)
echo c > /proc/sysrq-trigger

# Monitor cluster response from surviving node
sudo pcs status
sudo crm_mon --one-shot

Performance Testing

Load Testing During Failover:

Bash
# Generate load on web server
ab -n 10000 -c 50 http://192.168.1.100/

# Trigger failover during load test
sudo pcs resource move WebServer node2

# Measure downtime
time curl http://192.168.1.100

Resource Utilization Testing:

Bash
# Monitor resource usage
iostat -x 1
sar -u 1
iftop -i eth0

# Test with high load
stress --cpu 4 --io 2 --vm 2 --vm-bytes 512M --timeout 300s

Automated Testing Scripts

Failover Test Script:

Bash
#!/bin/bash
# Comprehensive failover testing script

# Function to test resource migration
test_resource_failover() {
    local resource=$1
    local target_node=$2
    
    echo "Testing $resource failover to $target_node"
    sudo pcs resource move $resource $target_node
    
    # Wait for migration
    sleep 30
    
    # Verify resource status
    if sudo pcs status | grep -q "$resource.*Started.*$target_node"; then
        echo "PASS: $resource successfully failed over to $target_node"
    else
        echo "FAIL: $resource failover to $target_node failed"
    fi
    
    # Clear constraints
    sudo pcs resource clear $resource
}

# Test all resources
test_resource_failover "WebServer" "node2"
test_resource_failover "VirtualIP" "node2"
test_resource_failover "Database" "node2"

Continuous Monitoring Script:

Bash
#!/bin/bash
# Monitor cluster health during testing

while true; do
    timestamp=$(date '+%Y-%m-%d %H:%M:%S')
    
    # Check cluster status
    if sudo pcs status >/dev/null 2>&1; then
        status="HEALTHY"
    else
        status="UNHEALTHY"
    fi
    
    # Log status
    echo "$timestamp - Cluster Status: $status" >> cluster_test.log
    
    # Check individual resources
    sudo pcs resource show | grep -E "(Started|Stopped|Failed)" >> cluster_test.log
    
    sleep 10
done

How to Monitor Cluster Health?

Monitoring cluster health catches problems early and enables proactive maintenance. Effective monitoring covers resource status, node health, and performance metrics; done well, it prevents service disruptions and shows where the cluster can be tuned.

Real-time Monitoring Tools

Native Pacemaker Monitoring:

Bash
# Real-time cluster monitoring
sudo crm_mon -1rfA
sudo crm_mon --watch-fencing --one-shot

# Continuous monitoring with refresh
watch -n 5 'sudo pcs status'
sudo crm_mon -r -f

Pacemaker Status Commands:

Bash
# Comprehensive status view
sudo pcs status --full
sudo pcs resource show --full
sudo pcs constraint show --full

# Node-specific information
sudo pcs node attribute
sudo pcs node utilization
sudo pcs cluster pcsd-status

Log Analysis and Monitoring

Cluster Log Monitoring:

Bash
# Real-time log monitoring
sudo journalctl -f -u pacemaker -u corosync
sudo tail -f /var/log/cluster/corosync.log

# Log filtering for specific events
sudo journalctl -u pacemaker --since "1 hour ago" | grep -i error
sudo journalctl -u corosync --since today | grep -i warn

Log Analysis Scripts:

Bash
#!/bin/bash
# Cluster log analyzer

# Function to analyze recent errors
analyze_cluster_logs() {
    echo "=== Cluster Error Analysis ==="
    
    # Pacemaker errors
    echo "Pacemaker errors in last hour:"
    journalctl -u pacemaker --since "1 hour ago" | grep -i error | tail -10
    
    # Corosync warnings
    echo "Corosync warnings in last hour:"
    journalctl -u corosync --since "1 hour ago" | grep -i warn | tail -10
    
    # Resource failures
    echo "Resource failures:"
    crm_mon --one-shot | grep -A 5 "Failed Resource Actions"
}

analyze_cluster_logs

Performance Monitoring

Resource Performance Metrics:

Bash
# Monitor resource CPU usage
ps aux | grep -E "(httpd|mysqld|corosync|pacemaker)"

# Memory usage monitoring
sudo pmap -d $(pgrep pacemaker)
sudo pmap -d $(pgrep corosync)

# Network traffic monitoring
sudo iftop -i eth0 -f "port 5405"
sudo netstat -i

System Resource Monitoring:

Bash
# Comprehensive system monitoring
iostat -x 1 5
vmstat 1 5
sar -u -r -n DEV 1 5

# Disk I/O monitoring for shared storage
sudo iotop -a -o -d 1
sudo pidstat -d 1

Custom Monitoring Scripts

Cluster Health Check Script:

Bash
#!/bin/bash
# Comprehensive cluster health checker

LOGFILE="/var/log/cluster_health.log"
ALERT_EMAIL="admin@example.com"

# Function to check cluster status
check_cluster_status() {
    if ! sudo pcs status >/dev/null 2>&1; then
        echo "$(date): CRITICAL - Cluster status check failed" >> $LOGFILE
        return 1
    fi
    
    # Check for failed resources
    failed_resources=$(sudo pcs status | grep -c "FAILED")
    if [ "$failed_resources" -gt 0 ]; then
        echo "$(date): WARNING - $failed_resources failed resources detected" >> $LOGFILE
        return 1
    fi
    
    return 0
}

# Function to check node connectivity
check_node_connectivity() {
    local nodes=("node1" "node2")
    
    for node in "${nodes[@]}"; do
        if ! ping -c 1 "$node" >/dev/null 2>&1; then
            echo "$(date): CRITICAL - Node $node unreachable" >> $LOGFILE
            return 1
        fi
    done
    
    return 0
}

# Function to check quorum
check_quorum() {
    quorum_status=$(sudo corosync-quorumtool -s | awk '/^Quorate:/ {print $2}')
    
    if [ "$quorum_status" != "Yes" ]; then
        echo "$(date): CRITICAL - Cluster lost quorum" >> $LOGFILE
        return 1
    fi
    
    return 0
}

# Main health check
main() {
    if ! check_cluster_status || ! check_node_connectivity || ! check_quorum; then
        # Send alert email
        echo "Cluster health check failed. See $LOGFILE for details." | \
        mail -s "Cluster Alert" $ALERT_EMAIL
        exit 1
    fi
    
    echo "$(date): INFO - Cluster health check passed" >> $LOGFILE
}

main

SNMP Monitoring Integration

Configure SNMP for Cluster Monitoring:

Bash
# Install SNMP packages
sudo yum install net-snmp net-snmp-utils

# Configure SNMP daemon
echo "rocommunity public" >> /etc/snmp/snmpd.conf
echo "syslocation Datacenter" >> /etc/snmp/snmpd.conf
echo "syscontact admin@example.com" >> /etc/snmp/snmpd.conf

# Start SNMP service
sudo systemctl enable snmpd
sudo systemctl start snmpd

SNMP Monitoring Queries:

Bash
# Query system information
snmpwalk -v2c -c public node1 1.3.6.1.2.1.1

# Monitor network interfaces
snmpwalk -v2c -c public node1 1.3.6.1.2.1.2.2.1.10

# CPU and memory monitoring
snmpwalk -v2c -c public node1 1.3.6.1.4.1.2021.11

Integration with External Monitoring

Prometheus Integration:

Bash
# Install Prometheus node exporter (version is an example; check the releases page)
NODE_EXPORTER_VERSION=1.8.2
wget https://github.com/prometheus/node_exporter/releases/download/v${NODE_EXPORTER_VERSION}/node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz
tar xvfz node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz
sudo cp node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64/node_exporter /usr/local/bin/

# Create systemd service
sudo tee /etc/systemd/system/node_exporter.service << 'EOF'
[Unit]
Description=Node Exporter
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl enable node_exporter
sudo systemctl start node_exporter

Grafana Dashboard Configuration:

Bash
# Sample PromQL queries for Grafana dashboard panels
up{job="cluster-nodes"}
rate(node_cpu_seconds_total[5m])
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100

Alerting and Notification

Configure Email Alerts:

Bash
# Install mail utilities
sudo yum install mailx postfix

# Configure Postfix for local delivery
echo "relayhost = smtp.example.com" >> /etc/postfix/main.cf
sudo systemctl enable postfix
sudo systemctl start postfix

Cluster Event Notifications:

Bash
#!/bin/bash
# Cluster event notification script

# Monitor cluster events
sudo journalctl -f -u pacemaker | while read line; do
    if echo "$line" | grep -qE "(CRITICAL|ERROR|FAIL)"; then
        echo "$(date): $line" | mail -s "Cluster Alert" admin@example.com
    fi
done

Troubleshooting Common Clustering Issues

Troubleshooting clustering issues requires a systematic diagnostic approach and familiarity with the typical failure patterns. Effective troubleshooting minimizes downtime and prevents a single fault from cascading across the cluster, so these diagnostic techniques are essential for cluster administrators.

Split-Brain Scenarios

Identifying Split-Brain Conditions:

Bash
# Check quorum status on all nodes
sudo corosync-quorumtool -s
sudo pcs status | grep -i quorum

# Verify node membership
sudo corosync-cmapctl | grep members
sudo crm_node -l

Split-Brain Resolution:

Bash
# Force quorum on the surviving partition by lowering expected votes
sudo corosync-quorumtool -e 1

# Alternative: Set expected votes
sudo pcs quorum expected-votes 1

# Restart failed nodes after resolution
sudo pcs cluster start node2
sudo pcs cluster status

Preventing Split-Brain:

Bash
# Configure quorum device (QDevice)
sudo pcs quorum device add model net host=qnetd.example.com algorithm=lms

# For two-node clusters, relax the quorum policy and rely on fencing
sudo pcs property set no-quorum-policy=ignore
sudo pcs property set stonith-enabled=true

Resource Startup Failures

Diagnosing Resource Issues:

Bash
# Check resource status and errors
sudo pcs resource show WebServer
sudo pcs resource failcount show WebServer

# Detailed resource information
sudo crm_resource --resource WebServer --locate
sudo crm_resource --resource WebServer --get-parameter configfile

Resource Debugging:

Bash
# Debug resource startup
sudo pcs resource debug-start WebServer
sudo pcs resource debug-stop WebServer

# Manual resource agent testing
sudo /usr/lib/ocf/resource.d/heartbeat/apache start
sudo OCF_ROOT=/usr/lib/ocf OCF_RESOURCE_INSTANCE=WebServer \
     /usr/lib/ocf/resource.d/heartbeat/apache monitor

Common Resource Fixes:

Bash
# Clear resource failures
sudo pcs resource cleanup WebServer

# Reset failcount
sudo pcs resource failcount reset WebServer

# Force resource restart
sudo pcs resource restart WebServer

Network Communication Problems

Corosync Communication Issues:

Bash
# Check Corosync status
sudo corosync-cmapctl | grep -E "(members|status)"
sudo corosync-cfgtool -s

# Test multicast connectivity
omping -c 5 192.168.1.10 192.168.1.11

Network Troubleshooting:

Bash
# Verify cluster ports
sudo netstat -tulpn | grep -E "(5405|2224|3121)"
sudo firewall-cmd --list-all | grep high-availability

# Test TCP connectivity
telnet 192.168.1.11 2224
nc -zv 192.168.1.11 3121

Communication Recovery:

Bash
# Restart Corosync service
sudo systemctl restart corosync
sudo systemctl status corosync

# Reset cluster communication
sudo pcs cluster stop --all
sudo pcs cluster start --all

Authentication and Authorization Issues

pcsd Authentication Problems:

Bash
# Reset hacluster password
echo 'newpassword' | sudo passwd --stdin hacluster

# Re-authenticate cluster nodes
sudo pcs host auth node1 node2 -u hacluster -p newpassword --force

# Check pcsd service
sudo systemctl status pcsd
sudo systemctl restart pcsd

Certificate Issues:

Bash
# Check pcsd SSL certificate files
ls -la /var/lib/pcsd/
openssl x509 -in /var/lib/pcsd/pcsd.crt -noout -dates

# Regenerate certificates if needed
sudo pcs pcsd sync-certificates

STONITH and Fencing Issues

Troubleshooting:

Bash
# Test fence devices manually
sudo fence_ipmilan -a 192.168.1.20 -l admin -p password -P -o status

# Check STONITH resource status
sudo pcs stonith show fence-node1
sudo pcs constraint show | grep stonith

Configuration Validation:

Bash
# List the fence devices able to fence a given node
sudo stonith_admin --list node2

# Check STONITH history
sudo stonith_admin --history '*'
sudo pcs stonith history show

Recovery:

Bash
# Cleanup STONITH resources
sudo pcs stonith cleanup fence-node1

# Reset STONITH if necessary
sudo pcs stonith disable fence-node1
sudo pcs stonith enable fence-node1

Performance and Timing Issues

Cluster Timing Problems:

Bash
# Check cluster timing
sudo crm_mon --timing-details
sudo corosync-cfgtool -s

# Adjust timing and failure-handling parameters
sudo pcs property set cluster-recheck-interval=1min
sudo pcs resource defaults migration-threshold=3

Resource Timeout Issues:

Bash
# Increase operation timeouts
sudo pcs resource update WebServer op monitor interval=30s timeout=60s
sudo pcs resource update WebServer op start timeout=120s

# Check resource operation history
sudo pcs resource op defaults
sudo crm_resource --resource WebServer --list-operations

Log Analysis for Troubleshooting

Comprehensive Log Analysis:

Bash
# Analyze cluster logs for errors
sudo grep -i error /var/log/cluster/corosync.log
sudo journalctl -u pacemaker --since "1 hour ago" | grep -E "(error|warn|crit)"

# Extract resource operation logs
sudo grep -A 5 -B 5 "WebServer.*failed" /var/log/messages

Log Rotation and Cleanup:

Bash
# Configure log rotation
sudo tee /etc/logrotate.d/cluster << 'EOF'
/var/log/cluster/*.log {
    daily
    rotate 30
    compress
    delaycompress
    missingok
    notifempty
}
EOF

# Manual log cleanup
sudo find /var/log/cluster -name "*.log" -mtime +30 -delete

Emergency Recovery Procedures

Single Node Recovery:

Bash
# Start cluster in single-node mode
sudo pcs property set no-quorum-policy=ignore
sudo pcs cluster start node1

# Import resources from offline node
sudo pcs resource show --full
sudo pcs resource move WebServer node1

Complete Cluster Recovery:

Bash
# Reset cluster configuration
sudo pcs cluster destroy --all

# Rebuild cluster from backup
sudo pcs cluster setup mycluster node1 node2 --start --enable
sudo pcs cluster cib-push cluster-backup.xml

Disaster Recovery:

Bash
# Emergency cluster shutdown
sudo pcs cluster stop --all --force
sudo pcs cluster destroy --all

# Restore from configuration backup
sudo cp /root/cluster-backup.xml /tmp/cluster-restore.xml
sudo pcs cluster cib-push /tmp/cluster-restore.xml --config

Advanced Pacemaker Configuration

Advanced Pacemaker configuration supports sophisticated scenarios such as multi-site clustering, complex resource dependencies, and performance tuning. Understanding these options maximizes cluster flexibility without sacrificing reliability.

Multi-Site Clustering

Booth Ticket Manager:

Bash
# Install booth for multi-site clustering
sudo yum install booth

# Configure booth arbitrator
echo 'transport="UDP"
port="9929"
arbitrator="arbitrator.example.com"
site="192.168.1.10"
site="192.168.2.10"
ticket="web-ticket"
expire="600"
timeout="10"
retries="5"' > /etc/booth/booth.conf

# Synchronize booth configuration
sudo scp /etc/booth/booth.conf node2:/etc/booth/
sudo scp /etc/booth/booth.conf arbitrator:/etc/booth/

Geo-Clustering Configuration:

Bash
# Create the resource, then bind it to the booth ticket with a ticket constraint
sudo pcs resource create WebCluster apache \
    configfile="/etc/httpd/conf/httpd.conf"

sudo pcs constraint ticket add web-ticket WebCluster loss-policy=stop

# Configure booth resource
sudo pcs resource create booth booth \
    config="/etc/booth/booth.conf" \
    op monitor interval="10s"

Complex Resource Dependencies

Advanced Constraint Configuration:

Bash
# Complex ordering constraints
sudo pcs constraint order set VirtualIP WebServer Database \
    setoptions kind=Optional symmetrical=false

# Conditional constraints based on resource state
sudo pcs constraint order promote DatabaseMaster then start WebServer
sudo pcs constraint colocation add WebServer with master DatabaseMaster INFINITY

Resource Sets and Groups:

Bash
# Create complex resource sets
sudo pcs constraint order set DatabaseMaster WebServer VirtualIP \
    sequential=true \
    require-all=false \
    action=start \
    role=Master

# Advanced colocation with roles
sudo pcs constraint colocation add WebServer with master DatabaseMaster \
    INFINITY node-attribute=datacenter

Performance Optimization

Cluster Performance Tuning:

Bash
# Optimize cluster timing parameters
sudo pcs property set dc-deadtime=20s
sudo pcs property set election-timeout=5s
sudo pcs property set shutdown-escalation=5min

# Configure batch processing
sudo pcs property set batch-limit=30
sudo pcs property set migration-limit=1

Resource Operation Optimization:

Bash
# Optimize resource monitoring
sudo pcs resource op defaults record-pending=true

# Configure operation intervals
sudo pcs resource update WebServer \
    op monitor interval=30s timeout=20s \
    op start timeout=60s interval=0s \
    op stop timeout=60s interval=0s

Custom Resource Agents

Creating Custom OCF Resource Agent:

Bash
# Create custom resource agent directory
sudo mkdir -p /usr/lib/ocf/resource.d/custom

# Sample custom resource agent
cat << 'EOF' > /usr/lib/ocf/resource.d/custom/myapp
#!/bin/bash
#
# Custom application resource agent
#
# OCF parameters:
# OCF_RESKEY_config
# OCF_RESKEY_pid_file

. ${OCF_ROOT}/lib/heartbeat/ocf-shellfuncs

MYAPP_CONFIG=${OCF_RESKEY_config:-"/etc/myapp/myapp.conf"}
MYAPP_PID=${OCF_RESKEY_pid_file:-"/var/run/myapp.pid"}

myapp_start() {
    myapp_monitor
    if [ $? -eq $OCF_SUCCESS ]; then
        return $OCF_SUCCESS
    fi
    
    /usr/bin/myapp -c $MYAPP_CONFIG &
    echo $! > $MYAPP_PID
    
    if myapp_monitor; then
        return $OCF_SUCCESS
    else
        return $OCF_ERR_GENERIC
    fi
}

myapp_stop() {
    if [ -f $MYAPP_PID ]; then
        pid=$(cat $MYAPP_PID)
        kill $pid
        rm -f $MYAPP_PID
    fi
    return $OCF_SUCCESS
}

myapp_monitor() {
    if [ -f $MYAPP_PID ]; then
        pid=$(cat $MYAPP_PID)
        if ps -p $pid > /dev/null 2>&1; then
            return $OCF_SUCCESS
        fi
    fi
    return $OCF_NOT_RUNNING
}

case "$1" in
    start)      myapp_start;;
    stop)       myapp_stop;;
    monitor)    myapp_monitor;;
    *)          echo "Usage: $0 {start|stop|monitor}"
                exit $OCF_ERR_UNIMPLEMENTED;;
esac

exit $?
EOF

# Make executable and verify the agent is visible to pcs
sudo chmod +x /usr/lib/ocf/resource.d/custom/myapp
sudo pcs resource agents ocf:custom

Cluster Policies and Rules

Advanced Cluster Policies:

Bash
# Configure cluster policies
sudo pcs property set cluster-infrastructure=corosync
sudo pcs property set start-failure-is-fatal=false
sudo pcs property set stop-orphan-resources=true
sudo pcs property set stop-orphan-actions=true

# Maintenance policies
sudo pcs property set maintenance-mode=false
sudo pcs property set enable-startup-probes=true

Resource Stickiness and Preferences:

Bash
# Configure resource stickiness
sudo pcs resource defaults resource-stickiness=100

# Node preference scoring
sudo pcs constraint location WebServer rule score=100 \
    datacenter eq primary

# Time-based constraints
sudo pcs constraint location WebServer rule score=-INFINITY \
    date gt 2024-12-31

Monitoring and Alerting Integration

Advanced Monitoring Configuration:

Bash
# Configure cluster monitoring resource
sudo pcs resource create cluster-mon ClusterMon \
    extra_options="-r -n" \
    htmlfile="/var/www/html/cluster.html" \
    op monitor interval=15s

# Email notification resource
sudo pcs resource create mail-notify ocf:heartbeat:MailTo \
    email="admin@example.com" \
    subject="Cluster Alert" \
    op monitor interval=10s

Custom Monitoring Scripts:

Bash
# Create monitoring resource agent
sudo pcs resource create custom-monitor systemd:custom-monitor \
    op monitor interval=30s timeout=20s \
    op start timeout=60s interval=0s \
    op stop timeout=60s interval=0s

Security Hardening

Cluster Security Configuration:

Bash
# Configure secure communication
sudo pcs property set cluster-name=production-cluster
sudo corosync-keygen -l

# Corosync traffic encryption: set these inside the totem {} section
# of /etc/corosync/corosync.conf on every node, then restart the cluster
#   crypto_cipher: aes256
#   crypto_hash: sha256
sudo pcs cluster stop --all && sudo pcs cluster start --all

Access Control and Authentication:

Bash
# Configure role-based access
sudo pcs acl role create monitor-role \
    read xpath //crm_config \
    read xpath //nodes \
    read xpath //resources \
    read xpath //status

# Assign roles to users and enable ACL enforcement
sudo pcs acl user create monitor-user monitor-role
sudo pcs acl enable

Frequently Asked Questions

What is the difference between Pacemaker and Keepalived?

Pacemaker provides comprehensive cluster resource management: constraint-based placement policies, complex dependencies, many resource agent types, enterprise-grade fencing, and multi-node clusters. Keepalived is a lighter-weight option focused on IP failover and simple service monitoring; it uses the VRRP protocol for address management and requires far less configuration.
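
For contrast, a minimal keepalived configuration that floats a single IP with VRRP looks like this (interface, router ID, and address are examples):

Bash
# /etc/keepalived/keepalived.conf - minimal VRRP example
vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    virtual_ipaddress {
        192.168.1.100/24
    }
}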

How many nodes can Pacemaker support in a cluster?

Pacemaker officially supports clusters of 2 to 32 nodes, although the practical limit depends on the network and the resources being managed. Larger clusters require carefully tuned heartbeat intervals and communication timeouts; most production deployments run 2-8 nodes for a good balance of availability and manageability.

What happens during a split-brain scenario?

During a split-brain, each partition believes the other nodes have failed and tries to take exclusive control of resources. STONITH fencing prevents data corruption by forcibly powering off or isolating the suspect nodes, and quorum ensures that only the partition holding a majority of votes keeps running; minority partitions stop their resources until connectivity is restored.

Can Pacemaker work with Docker containers?

Pacemaker can manage Docker containers through resource agents such as ocf:heartbeat:docker, handling container startup, monitoring, and failover. For container-native workloads Kubernetes usually provides better orchestration, but Pacemaker works well when a few containers need to run alongside traditional applications and infrastructure services.
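
As a sketch, a container can be defined like any other resource when the ocf:heartbeat:docker agent from the resource-agents package is available; the image, container name, and port mapping below are examples.

Bash
# Manage a container as a cluster resource
sudo pcs resource create WebApp ocf:heartbeat:docker \
    image="nginx:latest" \
    name="webapp" \
    run_opts="-p 80:80" \
    op monitor interval=30s timeout=30s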

How does Pacemaker handle shared storage?

Pacemaker manages shared storage through resource agents for DRBD, SAN LUNs, and NFS/NAS mounts, and access must be coordinated carefully to avoid corruption. Storage-level fencing (for example SCSI reservations) prevents more than one node from writing to the same device during a failure.
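
A common pattern is a DRBD-backed filesystem: the ocf:linbit:drbd agent (shipped with the DRBD packages) runs as a master/slave resource and the filesystem is mounted only where DRBD is primary. A minimal sketch, assuming a DRBD resource named r0 is already configured; the resource names, device, and mount point are examples.

Bash
# DRBD as a master/slave cluster resource (r0 must already exist in DRBD)
sudo pcs resource create DrbdData ocf:linbit:drbd \
    drbd_resource="r0" \
    op monitor interval=30s
sudo pcs resource master DrbdDataMaster DrbdData \
    master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true

# Mount the filesystem only on the node where DRBD is primary
sudo pcs resource create DrbdFS Filesystem \
    device="/dev/drbd0" directory="/data" fstype="ext4"
sudo pcs constraint colocation add DrbdFS with master DrbdDataMaster INFINITY
sudo pcs constraint order promote DrbdDataMaster then start DrbdFS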

What is the recommended heartbeat interval?

A heartbeat (totem token) interval of 1-3 seconds suits most networks; shorter values detect failures faster but increase network overhead and the risk of false positives on congested links. A common production baseline is a token timeout of 3000 ms with token_retransmits_before_loss_const set to 10, which balances responsiveness and stability.

How do I backup Pacemaker configuration?

Pacemaker configuration backup involves exporting the Cluster Information Base (CIB) and related configuration files:

Bash
# Export CIB configuration
sudo pcs cluster cib cluster-backup.xml

# Backup important configuration files
sudo tar -czf cluster-config-backup.tar.gz \
    /etc/corosync/corosync.conf \
    /etc/corosync/authkey \
    cluster-backup.xml

Can I run active/active configuration with Pacemaker?

Active/active clustering with Pacemaker depends on the application supporting concurrent access. Stateless services such as web servers run active/active easily as clone resources, while databases and anything on shared storage usually need master/slave (promotable) or active/passive configurations.
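
For a stateless service, active/active is simply a matter of cloning the resource so a copy runs on every node; a minimal sketch based on the Apache example used earlier (the resource name is an example):

Bash
# Run Apache on all cluster nodes simultaneously
sudo pcs resource create WebActive apache \
    configfile="/etc/httpd/conf/httpd.conf" \
    --clone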

How do I upgrade Pacemaker in production?

Production Pacemaker upgrades require careful planning and rolling upgrade procedures:

Bash
# Rolling upgrade procedure
sudo pcs node maintenance node1
sudo yum update pacemaker corosync
sudo pcs node unmaintenance node1
sudo pcs node maintenance node2
sudo yum update pacemaker corosync
sudo pcs node unmaintenance node2

What monitoring tools integrate with Pacemaker?

Pacemaker integrates with enterprise monitoring solutions such as Nagios, Zabbix, Prometheus, and SNMP-based systems. The native crm_mon tool provides real-time cluster status, and most monitoring platforms offer Pacemaker-specific plugins for comprehensive cluster health monitoring.


Additional Resources

Official Documentation and References

Community Resources and Forums

Related LinuxTips.pro Articles

Prerequisites:

  • Intermediate Linux system administration experience
  • Understanding of networking concepts and TCP/IP
  • Familiarity with systemd service management
  • Basic knowledge of virtualization or hardware management
  • Experience with package management and firewall configuration

Learning Outcomes: After completing this guide, readers will understand Pacemaker cluster architecture, successfully configure two-node high availability clusters, implement proper STONITH fencing mechanisms, create and manage cluster resources with constraints, perform comprehensive failover testing, and troubleshoot common clustering issues effectively.

Estimated Reading Time: 45-60 minutes
Difficulty Level: Advanced
Last Updated: November 2025