Linux Clustering with Pacemaker: High Availability Setup Guide
Knowledge Overview
Time Investment
45-60 minutes reading time
20-30 minutes hands-on practice
Guide Content
Linux clustering with Pacemaker provides enterprise-grade high availability by managing cluster resources and automating failover. Pacemaker integrates with Corosync for robust cluster communication and keeps services running through node failures, which makes it essential for mission-critical applications that require 99.9% uptime or better.
Table of Contents
- What is Linux Clustering with Pacemaker?
- How Does Pacemaker Cluster Architecture Work?
- Why Choose Pacemaker for High Availability Clustering?
- How to Install Pacemaker Cluster Software?
- How to Configure Corosync Communication Layer?
- How to Setup Basic Pacemaker Cluster?
- How to Configure Cluster Resources?
- How to Implement STONITH Fencing?
- How to Test Failover Scenarios?
- How to Monitor Cluster Health?
- Troubleshooting Common Clustering Issues
- Advanced Pacemaker Configuration
- Frequently Asked Questions
What is Linux Clustering with Pacemaker?
Linux clustering with Pacemaker is a comprehensive high availability solution that manages resources across multiple nodes. Pacemaker acts as the cluster resource manager (CRM), responsible for starting, stopping, and monitoring services within the cluster, and it fails services over automatically when a node fails or enters a maintenance window.
Core Components of Pacemaker Clustering
Pacemaker clustering consists of several essential components working together:
1. Pacemaker Cluster Resource Manager
# Check Pacemaker version and status
pcs --version
systemctl status pacemaker
2. Corosync Communication Engine
# Verify Corosync cluster communication
systemctl status corosync
corosync-cmapctl | grep members
3. Resource Agents
# List available resource agents
pcs resource agents
pcs resource agents ocf:heartbeat
4. STONITH/Fencing Mechanisms
# Check available fence agents
pcs stonith list
fence_ipmilan --help
Benefits of Linux Clustering with Pacemaker
Implementing Linux clustering with Pacemaker delivers significant operational advantages:
- Automated Failover: Seamlessly transfers services between nodes without manual intervention
- Resource Management: Intelligently manages service dependencies and startup sequences
- Split-Brain Protection: Prevents data corruption through advanced fencing mechanisms
- Scalable Architecture: Supports clusters from 2 to 32+ nodes depending on requirements
- Enterprise Integration: Compatible with major Linux distributions and enterprise applications
How Does Pacemaker Cluster Architecture Work?
Understanding the Pacemaker cluster architecture makes implementation and troubleshooting much easier. The architecture is layered, with each component handling a specific part of maintaining high availability.
Cluster Communication Stack
# Display cluster stack information
pcs status
pcs cluster pcsd-status
The Pacemaker communication stack consists of:
1: Hardware and Network Infrastructure
- Dedicated cluster networks for heartbeat communication
- Shared storage systems (SAN, NFS, DRBD)
- Power management interfaces (IPMI, iLO, DRAC)
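Before going further, it is worth sanity-checking this layer. The commands below verify a dedicated cluster interface and an IPMI power interface; the interface name, addresses, and credentials are placeholders for your environment:
# Confirm the dedicated cluster interface is up (adjust the interface name)
ip -br addr show eth1
# Confirm the peer node answers over the cluster network
ping -c 3 -I eth1 192.168.1.11
# Confirm the IPMI/iLO interface responds (placeholder BMC address and credentials)
ipmitool -I lanplus -H 192.168.1.20 -U admin -P password chassis status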
2: Corosync Messaging Layer
# Configure Corosync authentication
corosync-keygen
systemctl restart corosync
3: Pacemaker Resource Management
# View cluster resource management
pcs resource show
pcs constraint show
Resource Management Workflow
Pacemaker manages cluster resources through a sophisticated workflow:
- Resource Discovery: Identifies available resources and their current states
- Policy Engine: Applies configuration rules and constraints
- Transition Engine: Coordinates resource state changes
- Local Resource Manager: Executes resource operations on nodes
# Monitor resource management workflow
pcs status --full
crm_mon --one-shot
Quorum and Split-Brain Prevention
Cluster quorum mechanisms prevent split-brain scenarios:
# Configure quorum settings
pcs quorum status
pcs quorum expected-votes 3
pcs property set no-quorum-policy=ignore
Why Choose Pacemaker for High Availability Clustering?
Pacemaker offers compelling advantages over alternative high availability solutions, and enterprise environments benefit from its mature feature set and extensive documentation. Understanding these advantages helps justify implementation decisions.
Technical Advantages
| Feature | Pacemaker | Alternative Solutions |
|---|---|---|
| Resource Agents | 100+ supported agents | Limited agent support |
| Fencing Methods | Multiple STONITH types | Basic fencing only |
| Constraint Types | Location, order, colocation | Limited constraint options |
| Node Limits | 32+ nodes supported | Often limited to 2-4 nodes |
| Documentation | Extensive official docs | Varying documentation quality |
Enterprise Integration Features
1. Red Hat Enterprise Linux Integration
# Install on RHEL/CentOS
sudo yum install pcs pacemaker corosync fence-agents-all
sudo systemctl enable pcsd pacemaker corosync
2. SUSE Linux Enterprise Server Support
# Install on SLES
sudo zypper install ha-cluster-bootstrap crmsh
sudo ha-cluster-init
3. Ubuntu Server Compatibility
# Install on Ubuntu
sudo apt update
sudo apt install pacemaker corosync crmsh fence-agents
Performance Characteristics
Pacemaker delivers excellent performance metrics:
- Failover Time: Typically 30-60 seconds depending on configuration
- Resource Overhead: Minimal CPU and memory consumption
- Network Traffic: Efficient heartbeat protocol with configurable intervals
- Scalability: Linear performance scaling with additional nodes
# Monitor cluster performance
pcs status
iostat -x 1
iftop -i eth0
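To verify the failover-time figure on your own cluster, you can poll a clustered service while triggering a failover. The rough sketch below assumes the VirtualIP resource (192.168.1.100) configured later in this guide and reports how long the address stayed unreachable:
#!/bin/bash
# Rough failover downtime measurement against a clustered virtual IP
VIP=192.168.1.100
down_start=""
while true; do
    if curl -s -m 1 "http://$VIP/" > /dev/null; then
        if [ -n "$down_start" ]; then
            echo "Service restored after $(( $(date +%s) - down_start )) seconds"
            down_start=""
        fi
    else
        # Record when the outage began
        [ -z "$down_start" ] && down_start=$(date +%s)
    fi
    sleep 1
done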
How to Install Pacemaker Cluster Software?
Installing Pacemaker cluster software requires careful preparation and systematic execution, since the installation forms the foundation for reliable cluster operations. Procedures vary slightly between Linux distributions but follow the same principles.
Prerequisites and System Requirements
Before installing Pacemaker, ensure systems meet minimum requirements:
Hardware Requirements:
- Minimum 2 nodes with identical hardware configurations
- At least 2GB RAM per node (4GB+ recommended)
- Dedicated network interfaces for cluster communication
- Shared storage or data replication mechanisms
Network Configuration:
# Configure static IP addresses
sudo nmcli con mod eth0 ipv4.addresses 192.168.1.10/24
sudo nmcli con mod eth0 ipv4.gateway 192.168.1.1
sudo nmcli con mod eth0 ipv4.dns 8.8.8.8
sudo nmcli con up eth0
# Test connectivity between nodes
ping -c 3 192.168.1.11
telnet 192.168.1.11 22
System Preparation:
# Synchronize system clocks
sudo systemctl enable chronyd
sudo systemctl start chronyd
chronyc sources -v
# Configure hostname resolution
sudo hostnamectl set-hostname node1.cluster.local
echo "192.168.1.10 node1.cluster.local node1" >> /etc/hosts
echo "192.168.1.11 node2.cluster.local node2" >> /etc/hosts
Red Hat Enterprise Linux Installation
1: Enable High Availability Repository
# Register system and enable HA add-on
sudo subscription-manager repos --enable=rhel-8-for-x86_64-highavailability-rpms
# Install cluster packages
sudo dnf install pcs pacemaker corosync fence-agents-all
2: Configure Firewall Rules
# Allow cluster communication ports
sudo firewall-cmd --permanent --add-service=high-availability
sudo firewall-cmd --permanent --add-port=2224/tcp
sudo firewall-cmd --permanent --add-port=3121/tcp
sudo firewall-cmd --permanent --add-port=5405/udp
sudo firewall-cmd --reload
3: Start and Enable Services
# Enable cluster services
sudo systemctl enable pcsd
sudo systemctl start pcsd
# Set cluster user password
echo 'clusterpw' | sudo passwd --stdin hacluster
Ubuntu Server Installation
Step 1: Install Packages
# Update package repository
sudo apt update
# Install clustering software
sudo apt install pacemaker corosync crmsh fence-agents resource-agents
Step 2: Configure Authentication
# Set hacluster user password
sudo passwd hacluster
# Configure SSH key authentication
sudo -u hacluster ssh-keygen -t rsa -N ""
sudo -u hacluster ssh-copy-id hacluster@node2
SUSE Linux Enterprise Server Installation
Step 1: Install HA Pattern
# Install HA cluster pattern
sudo zypper install -t pattern ha_sles
sudo zypper install ha-cluster-bootstrap
Step 2: Initialize Cluster
# Run cluster bootstrap wizard
sudo ha-cluster-init
# Configure cluster authentication
sudo ha-cluster-join -c node1
Installation Verification
Verify Installation Success:
# Check service status
sudo systemctl status pacemaker corosync pcsd
# Verify cluster software versions
pacemaker --version
corosync -v
pcs --version
# Authenticate the nodes with pcsd (older pcs releases use 'pcs cluster auth')
sudo pcs host auth node1 node2 -u hacluster -p clusterpw
pcs cluster status
How to Configure Corosync Communication Layer?
The Corosync layer provides the cluster messaging infrastructure. Correct Corosync configuration ensures efficient heartbeat communication and helps prevent split-brain conditions, so understanding its settings is crucial for cluster stability.
Corosync Configuration File Structure
Primary Configuration File: /etc/corosync/corosync.conf
# Generate initial configuration
sudo pcs cluster setup mycluster node1 node2 --start --enable
# View generated configuration
sudo cat /etc/corosync/corosync.conf
Basic Corosync Configuration
# Create cluster configuration
totem {
    version: 2
    cluster_name: mycluster
    clear_node_high_bit: yes
    crypto_cipher: aes256
    crypto_hash: sha256
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.1.0
        broadcast: yes
        mcastport: 5405
    }
}
logging {
    fileline: off
    to_logfile: yes
    to_syslog: yes
    logfile: /var/log/cluster/corosync.log
    debug: off
    timestamp: on
    logger_subsys {
        subsys: QUORUM
        debug: off
    }
}
nodelist {
    node {
        ring0_addr: 192.168.1.10
        name: node1
        nodeid: 1
    }
    node {
        ring0_addr: 192.168.1.11
        name: node2
        nodeid: 2
    }
}
quorum {
    provider: corosync_votequorum
    expected_votes: 2
    two_node: 1
}
Advanced Communication Settings
Redundant Ring Configuration:
# Configure dual network rings for reliability
interface {
    ringnumber: 0
    bindnetaddr: 192.168.1.0
    broadcast: yes
    mcastport: 5405
}
interface {
    ringnumber: 1
    bindnetaddr: 10.0.0.0
    broadcast: yes
    mcastport: 5406
}
Heartbeat Timing Optimization:
# Configure heartbeat intervals
totem {
    token: 3000
    token_retransmits_before_loss_const: 10
    join: 100
    consensus: 3000
    max_messages: 20
    send_join: 45
}
Corosync Authentication
Generate Authentication Key:
# Create cluster authentication key
sudo corosync-keygen
sudo scp /etc/corosync/authkey node2:/etc/corosync/
sudo chmod 400 /etc/corosync/authkey
Verify Authentication:
# Test authentication across nodes
sudo corosync-cmapctl | grep members
sudo corosync-cmapctl | grep runtime
Communication Testing
Monitor Cluster Communication:
# Real-time communication monitoring
sudo corosync-cmapctl -b runtime.totem.pg.mrp.srp.members
sudo journalctl -f -u corosync
# Network traffic analysis
sudo tcpdump -i eth0 port 5405
sudo netstat -tulpn | grep 5405
Performance Tuning:
# Optimize network buffers
echo 'net.core.rmem_max = 134217728' | sudo tee -a /etc/sysctl.conf
echo 'net.core.wmem_max = 134217728' | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
# Optionally ignore broadcast ICMP echo requests (general hardening)
sudo sysctl -w net.ipv4.icmp_echo_ignore_broadcasts=1
How to Setup Basic Pacemaker Cluster?
Setting up a basic Pacemaker cluster involves a series of configuration steps that establish the cluster's identity and operational parameters. This initial configuration underpins resource management and failover, and it affects long-term stability and performance.
Cluster Creation Process
1: Cluster Authentication
# Authenticate cluster nodes
sudo pcs host auth node1 node2 -u hacluster -p clusterpw
# Verify authentication status
sudo pcs status pcsd node1 node2
2: Cluster Initialization
# Create and start cluster
sudo pcs cluster setup mycluster node1 node2 --start --enable
# Alternative method with advanced options
sudo pcs cluster setup mycluster \
node1 addr=192.168.1.10 \
node2 addr=192.168.1.11 \
transport knet \
--start --enable
3: Cluster Verification
# Check cluster status
sudo pcs cluster status
sudo pcs status
# Verify node membership
sudo pcs status nodes
sudo corosync-cmapctl | grep members
Initial Cluster Configuration
Configure Global Cluster Properties:
# Set basic cluster properties
sudo pcs property set stonith-enabled=false # Temporarily disable
sudo pcs property set no-quorum-policy=ignore # For 2-node clusters
sudo pcs resource defaults resource-stickiness=100
# Configure failure handling (resource defaults, not cluster properties)
sudo pcs resource defaults migration-threshold=3
sudo pcs resource defaults failure-timeout=60s
sudo pcs property set cluster-recheck-interval=2min
View Current Configuration:
# Display cluster properties
sudo pcs property show
sudo pcs property list
# Export cluster configuration
sudo pcs config
sudo crm configure show
Node Management
Add Additional Nodes:
# Add new node to existing cluster
sudo pcs cluster node add node3 addr=192.168.1.12
# Update cluster membership
sudo pcs cluster reload corosync
sudo pcs cluster start node3
Remove Nodes:
# Gracefully remove node from cluster
sudo pcs cluster node remove node3
# To wipe the cluster configuration from the removed node, run on node3:
sudo pcs cluster destroy
Node Maintenance Mode:
# Put node in maintenance mode
sudo pcs node maintenance node2
# Remove from maintenance mode
sudo pcs node unmaintenance node2
# Standby mode for planned maintenance
sudo pcs node standby node2
sudo pcs node unstandby node2
Cluster Communication Testing
Verify Cluster Messaging:
# Test cluster communication
sudo pcs cluster stop --all
sudo pcs cluster start --all
# Monitor cluster logs
sudo journalctl -f -u pacemaker -u corosync
# Check cluster timing
sudo crm_mon --one-shot --timing-details
Network Connectivity Tests:
# Test multicast communication
omping -c 10 -T 30 192.168.1.10 192.168.1.11
# Monitor network latency
ping -f 192.168.1.11
hping3 -c 100 -i u100 192.168.1.11
Basic Troubleshooting
Common Startup Issues:
# Check service dependencies
sudo systemctl status pacemaker corosync pcsd
# Verify authentication
sudo pcs host auth node1 node2 -u hacluster -p clusterpw
# Reset cluster if necessary
sudo pcs cluster destroy --all
Configuration Validation:
# Validate cluster configuration
sudo pcs cluster verify
sudo crm_verify -LV
# Test configuration changes against a copy of the CIB before pushing
sudo pcs cluster cib /tmp/test-cib.xml
sudo pcs cluster cib-push /tmp/test-cib.xml --config
How to Configure Cluster Resources?
Configuring cluster resources enables automated management of services, applications, and their dependencies. Proper resource configuration ensures reliable failover and sensible resource placement across nodes, so understanding resource types and constraints is essential for effective cluster management.
Resource Types Overview
Pacemaker supports multiple resource types through standardized resource agents:
1. OCF (Open Cluster Framework) Resources
# List OCF resource agents
sudo pcs resource agents ocf:heartbeat
sudo pcs resource agents ocf:pacemaker
# Describe commonly used OCF agents
sudo pcs resource describe ocf:heartbeat:IPaddr2
sudo pcs resource describe ocf:heartbeat:apache
sudo pcs resource describe ocf:heartbeat:mysql
2. LSB (Linux Standard Base) Resources
# List system service resources
sudo pcs resource agents lsb
sudo pcs resource agents systemd
# Describe service-based agents
sudo pcs resource describe systemd:httpd
sudo pcs resource describe systemd:postgresql
3. STONITH (Fencing) Resources
# List fencing agents
sudo pcs resource agents stonith
sudo pcs stonith list
Creating Basic Resources
Virtual IP Resource:
# Create floating IP resource
sudo pcs resource create VirtualIP IPaddr2 \
ip=192.168.1.100 \
cidr_netmask=24 \
nic=eth0 \
op monitor interval=30s
# Verify resource creation
sudo pcs resource show VirtualIP
sudo pcs status
Apache Web Server Resource:
# Create Apache HTTP resource
sudo pcs resource create WebServer apache \
configfile="/etc/httpd/conf/httpd.conf" \
statusurl="http://localhost/server-status" \
op monitor interval=20s timeout=40s \
op start timeout=60s \
op stop timeout=60s
# Check resource status
sudo pcs resource show WebServer
Database Resource:
# Create MySQL database resource
sudo pcs resource create Database mysql \
binary="/usr/bin/mysqld_safe" \
config="/etc/mysql/my.cnf" \
datadir="/var/lib/mysql" \
user="mysql" \
op monitor interval=30s timeout=30s
# Monitor database resource
sudo pcs resource show Database
Resource Groups
Create Resource Groups:
# Group related resources together
sudo pcs resource group add WebServerGroup VirtualIP WebServer
# Add resource to existing group
sudo pcs resource group add WebServerGroup Database
# View group configuration
sudo pcs resource show WebServerGroup
Resource Group Benefits:
- Simplified management of related resources
- Automatic start/stop ordering within group
- Colocation constraints applied automatically
- Reduced configuration complexity
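A quick way to confirm a group's membership and its implied start order is to list the configured groups and watch the status output while the group starts:
# List resource groups and their members (members start top to bottom)
sudo pcs resource group list
# Watch the group start in order
watch 'sudo pcs status'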
Resource Constraints
Colocation Constraints:
# Ensure resources run on same node
sudo pcs constraint colocation add WebServer with VirtualIP INFINITY
# Prevent resources from running together
sudo pcs constraint colocation add Database with WebServer -INFINITY
# Show colocation constraints
sudo pcs constraint colocation show
Order Constraints:
# Define resource startup sequence
sudo pcs constraint order VirtualIP then WebServer
# Mandatory ordering with timing
sudo pcs constraint order start VirtualIP then start WebServer \
kind=Mandatory symmetrical=true
# View order constraints
sudo pcs constraint order show
Location Constraints:
# Prefer specific nodes for resources
sudo pcs constraint location WebServer prefers node1=50
# Avoid certain nodes
sudo pcs constraint location Database avoids node2=INFINITY
# Show location constraints
sudo pcs constraint location show
Advanced Resource Configuration
Clone Resources:
# Create clone resource for active/active services
sudo pcs resource create SharedStorage Filesystem \
device="/dev/sdb1" \
directory="/shared" \
fstype="ext4" \
--clone
# Configure clone parameters
sudo pcs resource meta SharedStorage-clone \
clone-max=2 \
clone-node-max=1 \
notify=true
Master/Slave Resources:
# Create master/slave resource
sudo pcs resource create DBReplication mysql \
binary="/usr/bin/mysqld_safe" \
config="/etc/mysql/my.cnf" \
replication_user="repl" \
replication_passwd="replpass"
# Configure master/slave parameters
sudo pcs resource master DBMaster DBReplication \
master-max=1 \
master-node-max=1 \
clone-max=2 \
clone-node-max=1 \
notify=true
Resource Monitoring and Operations
Operations:
# Manual resource operations
sudo pcs resource disable WebServer
sudo pcs resource enable WebServer
sudo pcs resource restart WebServer
# Move resource to specific node
sudo pcs resource move WebServer node2
sudo pcs resource clear WebServer
# Cleanup failed resources
sudo pcs resource cleanup WebServer
Monitoring:
# Monitor resource status
sudo pcs resource show --full
sudo crm_mon --one-shot --inactive
# Resource history and failures
sudo pcs resource failcount show WebServer
sudo pcs resource debug-start WebServer
How to Implement STONITH Fencing?
STONITH fencing prevents data corruption and split-brain scenarios in cluster environments. STONITH (Shoot The Other Node In The Head) guarantees that failed nodes cannot interfere with cluster operations, which makes proper fencing configuration critical for production deployments.
STONITH Concepts and Requirements
Understanding STONITH Mechanisms:
- Power-based Fencing: Controls node power through IPMI, iLO, or PDUs
- Network-based Fencing: Isolates nodes through network switches
- Storage-based Fencing: Blocks storage access from failed nodes
- Hypervisor Fencing: Controls virtual machines through hypervisor APIs
STONITH Requirements:
# Check available fencing agents
sudo pcs stonith list | grep -i ipmi
sudo pcs stonith list | grep -i vmware
sudo pcs stonith list | grep -i libvirt
# Verify hardware fencing capabilities
ipmitool -I lanplus -H 192.168.1.20 -U admin -P password power status
IPMI Fencing Configuration
Configure IPMI Fencing:
# Create IPMI fence device for node1
sudo pcs stonith create fence-node1 fence_ipmilan \
pcmk_host_list="node1" \
ipaddr="192.168.1.20" \
username="admin" \
password="fencepass" \
lanplus="true" \
op monitor interval="60s"
# Create IPMI fence device for node2
sudo pcs stonith create fence-node2 fence_ipmilan \
pcmk_host_list="node2" \
ipaddr="192.168.1.21" \
username="admin" \
password="fencepass" \
lanplus="true" \
op monitor interval="60s"
STONITH Location Constraints:
# Prevent nodes from fencing themselves
sudo pcs constraint location fence-node1 avoids node1=INFINITY
sudo pcs constraint location fence-node2 avoids node2=INFINITY
# Verify STONITH configuration
sudo pcs stonith show
sudo pcs constraint show --full
VMware vSphere Fencing
Configure VMware Fencing:
# Create VMware fence device
sudo pcs stonith create fence-vmware fence_vmware_soap \
ipaddr="vcenter.example.com" \
username="cluster@vsphere.local" \
password="vmwarepass" \
ssl="1" \
op monitor interval="60s"
# Map VMs to cluster nodes
sudo pcs stonith create fence-vm1 fence_vmware_soap \
ipaddr="vcenter.example.com" \
username="cluster@vsphere.local" \
password="vmwarepass" \
plug="cluster-node1-vm" \
pcmk_host_list="node1" \
ssl="1"
Shared Storage Fencing
Configure SAN-based Fencing:
# Create SAN fence device
sudo pcs stonith create fence-san fence_scsi \
devices="/dev/sdb,/dev/sdc" \
pcmk_host_list="node1,node2" \
op monitor interval="60s"
# Configure reservation keys
echo "node1:0x123456789" > /etc/fence_scsi.conf
echo "node2:0x987654321" >> /etc/fence_scsi.conf
Network Switch Fencing
Configure Network Fencing:
# Create network switch fence device
sudo pcs stonith create fence-switch fence_cisco_ucs \
ipaddr="switch.example.com" \
username="admin" \
password="switchpass" \
plug="1/1/1,1/1/2" \
pcmk_host_list="node1,node2" \
ssl="1"
STONITH Testing and Validation
Enable STONITH:
# Enable STONITH globally
sudo pcs property set stonith-enabled=true
# Configure STONITH timeout
sudo pcs property set stonith-timeout=120s
sudo pcs property set stonith-action=reboot
Test STONITH Operations:
# Test fence devices
sudo stonith_admin --reboot=node2 --verbose
sudo fence_ipmilan -a 192.168.1.21 -l admin -p fencepass -P -o reboot
# Verify fencing history
sudo stonith_admin --history=node2
sudo pcs stonith history show
STONITH Monitoring:
# Monitor STONITH resources
sudo pcs stonith show --full
sudo crm_mon --watch-fencing
# Check STONITH logs
sudo journalctl -u pacemaker | grep -i stonith
sudo grep -i stonith /var/log/cluster/corosync.log
Advanced STONITH Configuration
Multi-level Fencing:
# Configure cascaded fencing levels
sudo pcs stonith level add 1 node1 fence-ipmi-node1
sudo pcs stonith level add 2 node1 fence-pdu-node1
sudo pcs stonith level add 3 node1 fence-switch-node1
# Show fencing levels
sudo pcs stonith level show
STONITH Resource Groups:
# Group STONITH resources
sudo pcs resource group add fencing-group fence-node1 fence-node2
# Clone STONITH for redundancy
sudo pcs stonith create fence-shared fence_vmware_soap \
--clone \
clone-max=2 \
clone-node-max=1
How to Test Failover Scenarios?
Testing failover scenarios validates cluster reliability and exposes problems before production deployment. Thorough testing confirms that resources migrate correctly and services stay available under a range of failure conditions, so systematic failover testing is an essential part of cluster validation.
Planned Failover Testing
Manual Resource Migration:
# Move resource between nodes
sudo pcs resource move WebServer node2
# Verify resource migration
sudo pcs status
sudo pcs resource show WebServer
# Clear movement constraint
sudo pcs resource clear WebServer
Node Standby Testing:
# Put node in standby mode
sudo pcs node standby node1
# Monitor resource migration
watch 'sudo pcs status'
# Return node to service
sudo pcs node unstandby node1
Service Failure Simulation:
# Stop Apache service manually
sudo systemctl stop httpd
# Observe cluster response
sudo pcs status
sudo journalctl -u pacemaker -f
# Confirm the cluster restarted the service, then clear the failure record
sudo pcs resource cleanup WebServer
Network Failure Testing
Network Interface Failure:
# Simulate network interface failure
sudo ip link set eth0 down
# Monitor cluster behavior
sudo pcs status
sudo corosync-cmapctl | grep members
# Restore network interface
sudo ip link set eth0 up
Split-Brain Scenario Testing:
# Block cluster communication (node1)
sudo iptables -A INPUT -s 192.168.1.11 -j DROP
sudo iptables -A OUTPUT -d 192.168.1.11 -j DROP
# Monitor quorum behavior
sudo pcs status
sudo corosync-quorumtool
# Restore communication
sudo iptables -F
Hardware Failure Simulation
Disk Failure Testing:
# Simulate disk failure
echo offline > /sys/block/sdb/device/state
# Monitor shared storage resources
sudo pcs resource show SharedStorage
sudo df -h /shared
# Restore disk
echo running > /sys/block/sdb/device/state
Memory Pressure Testing:
# Create memory pressure
stress --vm 2 --vm-bytes 1G --timeout 300s
# Monitor cluster response
sudo pcs status
sudo free -m
sudo top
Application-Level Testing
Database Failover Testing:
# Kill MySQL process
sudo pkill -9 mysqld
# Verify cluster response
sudo pcs resource show Database
sudo pcs status
# Check database recovery
mysql -u root -p -e "SHOW STATUS"
Web Server Testing:
# Test web service availability
while true; do
curl -s http://192.168.1.100 || echo "Failed at $(date)"
sleep 1
done
# Monitor during failover
sudo pcs resource move WebServer node2
STONITH Testing
Controlled STONITH Testing:
# Test fence device manually
sudo fence_ipmilan -a 192.168.1.21 -l admin -p fencepass -P -o status
# Trigger STONITH through cluster
sudo stonith_admin --reboot=node2 --verbose
# Monitor STONITH logs
sudo journalctl -u pacemaker | grep stonith
Node Crash Simulation:
# Simulate kernel panic (WARNING: Will reboot node!)
echo c > /proc/sysrq-trigger
# Monitor cluster response from surviving node
sudo pcs status
sudo crm_mon --one-shot
Performance Testing
Load Testing During Failover:
# Generate load on web server
ab -n 10000 -c 50 http://192.168.1.100/
# Trigger failover during load test
sudo pcs resource move WebServer node2
# Measure downtime
time curl http://192.168.1.100
Resource Utilization Testing:
# Monitor resource usage
iostat -x 1
sar -u 1
iftop -i eth0
# Test with high load
stress --cpu 4 --io 2 --vm 2 --vm-bytes 512M --timeout 300s
Automated Testing Scripts
Failover Test Script:
#!/bin/bash
# Comprehensive failover testing script

# Function to test resource migration
test_resource_failover() {
    local resource=$1
    local target_node=$2

    echo "Testing $resource failover to $target_node"
    sudo pcs resource move "$resource" "$target_node"

    # Wait for migration
    sleep 30

    # Verify resource status (pcs status prints "Started <node>")
    if sudo pcs status | grep -q "$resource.*Started $target_node"; then
        echo "PASS: $resource successfully failed over to $target_node"
    else
        echo "FAIL: $resource failover to $target_node failed"
    fi

    # Clear constraints
    sudo pcs resource clear "$resource"
}

# Test all resources
test_resource_failover "WebServer" "node2"
test_resource_failover "VirtualIP" "node2"
test_resource_failover "Database" "node2"
Continuous Monitoring Script:
#!/bin/bash
# Monitor cluster health during testing

while true; do
    timestamp=$(date '+%Y-%m-%d %H:%M:%S')

    # Check cluster status
    if sudo pcs status >/dev/null 2>&1; then
        status="HEALTHY"
    else
        status="UNHEALTHY"
    fi

    # Log status
    echo "$timestamp - Cluster Status: $status" >> cluster_test.log

    # Check individual resources
    sudo pcs resource show | grep -E "(Started|Stopped|Failed)" >> cluster_test.log

    sleep 10
done
How to Monitor Cluster Health?
Monitoring cluster health enables early detection of issues and proactive maintenance. Effective monitoring covers resource status, node health, and performance metrics, and it helps prevent service disruptions while keeping the cluster performing well.
Real-time Monitoring Tools
Native Pacemaker Monitoring:
# Real-time cluster monitoring
sudo crm_mon -1rfA
sudo crm_mon --watch-fencing --one-shot
# Continuous monitoring with refresh
watch -n 5 'sudo pcs status'
sudo crm_mon -r -f
Pacemaker Status Commands:
# Comprehensive status view
sudo pcs status --full
sudo pcs resource show --full
sudo pcs constraint show --full
# Node-specific information
sudo pcs node attribute
sudo pcs node utilization
sudo pcs cluster pcsd-status
Log Analysis and Monitoring
Cluster Log Monitoring:
# Real-time log monitoring
sudo journalctl -f -u pacemaker -u corosync
sudo tail -f /var/log/cluster/corosync.log
# Log filtering for specific events
sudo journalctl -u pacemaker --since "1 hour ago" | grep -i error
sudo journalctl -u corosync --since today | grep -i warn
Log Analysis Scripts:
#!/bin/bash
# Cluster log analyzer

# Function to analyze recent errors
analyze_cluster_logs() {
    echo "=== Cluster Error Analysis ==="

    # Pacemaker errors
    echo "Pacemaker errors in last hour:"
    journalctl -u pacemaker --since "1 hour ago" | grep -i error | tail -10

    # Corosync warnings
    echo "Corosync warnings in last hour:"
    journalctl -u corosync --since "1 hour ago" | grep -i warn | tail -10

    # Resource failures
    echo "Resource failures:"
    crm_mon --one-shot --failcounts | grep -A 5 "Failed Resource Actions"
}

analyze_cluster_logs
Performance Monitoring
Resource Performance Metrics:
# Monitor resource CPU usage
ps aux | grep -E "(httpd|mysqld|corosync|pacemaker)"
# Memory usage monitoring
sudo pmap -d $(pgrep pacemaker)
sudo pmap -d $(pgrep corosync)
# Network traffic monitoring
sudo iftop -i eth0 -f "port 5405"
sudo netstat -i
System Resource Monitoring:
# Comprehensive system monitoring
iostat -x 1 5
vmstat 1 5
sar -u -r -n DEV 1 5
# Disk I/O monitoring for shared storage
sudo iotop -a -o -d 1
sudo pidstat -d 1 5
Custom Monitoring Scripts
Cluster Health Check Script:
#!/bin/bash
# Comprehensive cluster health checker

LOGFILE="/var/log/cluster_health.log"
ALERT_EMAIL="admin@example.com"

# Function to check cluster status
check_cluster_status() {
    if ! sudo pcs status >/dev/null 2>&1; then
        echo "$(date): CRITICAL - Cluster status check failed" >> $LOGFILE
        return 1
    fi

    # Check for failed resources
    failed_resources=$(sudo pcs status | grep -c "FAILED")
    if [ "$failed_resources" -gt 0 ]; then
        echo "$(date): WARNING - $failed_resources failed resources detected" >> $LOGFILE
        return 1
    fi

    return 0
}

# Function to check node connectivity
check_node_connectivity() {
    local nodes=("node1" "node2")

    for node in "${nodes[@]}"; do
        if ! ping -c 1 "$node" >/dev/null 2>&1; then
            echo "$(date): CRITICAL - Node $node unreachable" >> $LOGFILE
            return 1
        fi
    done

    return 0
}

# Function to check quorum
check_quorum() {
    quorum_status=$(sudo corosync-quorumtool -s | grep "Quorate" | awk '{print $2}')
    if [ "$quorum_status" != "Yes" ]; then
        echo "$(date): CRITICAL - Cluster lost quorum" >> $LOGFILE
        return 1
    fi

    return 0
}

# Main health check
main() {
    if ! check_cluster_status || ! check_node_connectivity || ! check_quorum; then
        # Send alert email
        echo "Cluster health check failed. See $LOGFILE for details." | \
            mail -s "Cluster Alert" $ALERT_EMAIL
        exit 1
    fi

    echo "$(date): INFO - Cluster health check passed" >> $LOGFILE
}

main
SNMP Monitoring Integration
Configure SNMP for Cluster Monitoring:
# Install SNMP packages
sudo yum install net-snmp net-snmp-utils
# Configure SNMP daemon
echo "rocommunity public" >> /etc/snmp/snmpd.conf
echo "syslocation Datacenter" >> /etc/snmp/snmpd.conf
echo "syscontact admin@example.com" >> /etc/snmp/snmpd.conf
# Start SNMP service
sudo systemctl enable snmpd
sudo systemctl start snmpd
SNMP Monitoring Queries:
# Query system information
snmpwalk -v2c -c public node1 1.3.6.1.2.1.1
# Monitor network interfaces
snmpwalk -v2c -c public node1 1.3.6.1.2.1.2.2.1.10
# CPU and memory monitoring
snmpwalk -v2c -c public node1 1.3.6.1.4.1.2021.11
Integration with External Monitoring
Prometheus Integration:
# Install Prometheus node exporter (pick the release appropriate for your environment)
NODE_EXPORTER_VERSION=1.8.2
wget https://github.com/prometheus/node_exporter/releases/download/v${NODE_EXPORTER_VERSION}/node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz
tar xvfz node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz
sudo cp node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64/node_exporter /usr/local/bin/
# Create systemd service
sudo tee /etc/systemd/system/node_exporter.service > /dev/null << 'EOF'
[Unit]
Description=Node Exporter
After=network.target
[Service]
Type=simple
ExecStart=/usr/local/bin/node_exporter
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl enable node_exporter
sudo systemctl start node_exporter
Grafana Dashboard Configuration:
# Sample Grafana dashboard query
up{job="cluster-nodes"}
rate(node_cpu_seconds_total[5m])
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100
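Cluster health can also be fed into Prometheus through the node_exporter textfile collector. The sketch below assumes node_exporter was started with --collector.textfile.directory=/var/lib/node_exporter/textfile_collector and that it runs as root (for example from cron) so crm_mon can read the cluster status:
#!/bin/bash
# Export a simple pacemaker_cluster_healthy gauge for the textfile collector
TEXTFILE_DIR=/var/lib/node_exporter/textfile_collector
mkdir -p "$TEXTFILE_DIR"
if crm_mon --simple-status 2>/dev/null | grep -q "^CLUSTER OK"; then
    healthy=1
else
    healthy=0
fi
# Write atomically so Prometheus never scrapes a partial file
echo "pacemaker_cluster_healthy $healthy" > "$TEXTFILE_DIR/pacemaker.prom.$$"
mv "$TEXTFILE_DIR/pacemaker.prom.$$" "$TEXTFILE_DIR/pacemaker.prom"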
Alerting and Notification
Configure Email Alerts:
# Install mail utilities
sudo yum install mailx postfix
# Configure a Postfix relay host for outbound alerts
echo "relayhost = smtp.example.com" | sudo tee -a /etc/postfix/main.cf
sudo systemctl enable postfix
sudo systemctl start postfix
Cluster Event Notifications:
#!/bin/bash
# Cluster event notification script
# Monitor cluster events
sudo journalctl -f -u pacemaker | while read line; do
if echo "$line" | grep -qE "(CRITICAL|ERROR|FAIL)"; then
echo "$(date): $line" | mail -s "Cluster Alert" admin@example.com
fi
done
Troubleshooting Common Clustering Issues
Troubleshooting clustering issues requires a systematic diagnostic approach and familiarity with typical failure patterns. Effective troubleshooting minimizes downtime and prevents cascading failures, so these diagnostic techniques are essential skills for cluster administrators.
Split-Brain Scenarios
Identifying Split-Brain Conditions:
# Check quorum status on all nodes
sudo corosync-quorumtool -s
sudo pcs status | grep -i quorum
# Verify node membership
sudo corosync-cmapctl | grep members
sudo crm_node -l
Split-Brain Resolution:
# Lower the expected vote count on the surviving partition to regain quorum
sudo corosync-quorumtool -e 1
# Alternative: Set expected votes
sudo pcs quorum expected-votes 1
# Restart failed nodes after resolution
sudo pcs cluster start node2
sudo pcs cluster status
Preventing Split-Brain:
# Configure quorum device (QDevice)
sudo pcs quorum device add model net host=qnetd.example.com algorithm=lms
# For two-node clusters: relax the quorum policy but keep fencing enabled
sudo pcs property set no-quorum-policy=ignore
sudo pcs property set stonith-enabled=true
Resource Startup Failures
Diagnosing Resource Issues:
# Check resource status and errors
sudo pcs resource show WebServer
sudo pcs resource failcount show WebServer
# Detailed resource information
sudo crm_resource --resource WebServer --locate
sudo crm_resource --resource WebServer --get-parameter configfile
Resource Debugging:
# Debug resource startup
sudo pcs resource debug-start WebServer
sudo pcs resource debug-stop WebServer
# Manual resource agent testing
sudo /usr/lib/ocf/resource.d/heartbeat/apache start
sudo OCF_ROOT=/usr/lib/ocf OCF_RESOURCE_INSTANCE=WebServer \
/usr/lib/ocf/resource.d/heartbeat/apache monitor
Common Resource Fixes:
# Clear resource failures
sudo pcs resource cleanup WebServer
# Reset failcount
sudo pcs resource failcount reset WebServer
# Force resource restart
sudo pcs resource restart WebServer
Network Communication Problems
Corosync Communication Issues:
# Check Corosync status
sudo corosync-cmapctl | grep -E "(members|status)"
sudo corosync-cfgtool -s
# Test multicast connectivity
omping -c 5 192.168.1.10 192.168.1.11
Network Troubleshooting:
# Verify cluster ports
sudo netstat -tulpn | grep -E "(5405|2224|3121)"
sudo firewall-cmd --list-all | grep high-availability
# Test TCP connectivity
telnet 192.168.1.11 2224
nc -zv 192.168.1.11 3121
Communication Recovery:
# Restart Corosync service
sudo systemctl restart corosync
sudo systemctl status corosync
# Reset cluster communication
sudo pcs cluster stop --all
sudo pcs cluster start --all
Authentication and Authorization Issues
PCsd Authentication Problems:
# Reset hacluster password
echo 'newpassword' | sudo passwd --stdin hacluster
# Re-authenticate cluster nodes
sudo pcs host auth node1 node2 -u hacluster -p newpassword --force
# Check pcsd service
sudo systemctl status pcsd
sudo systemctl restart pcsd
Certificate Issues:
# Check SSL certificates
sudo openssl x509 -in /var/lib/pcsd/pcsd.crt -noout -dates
ls -la /var/lib/pcsd/
# Regenerate certificates if needed
sudo pcs pcsd sync-certificates
STONITH and Fencing Issues
Troubleshooting:
# Test fence devices manually
sudo fence_ipmilan -a 192.168.1.20 -l admin -p password -P -o status
# Check STONITH resource status
sudo pcs stonith show fence-node1
sudo pcs constraint show | grep stonith
Configuration Validation:
# List the fence devices registered for a node
sudo stonith_admin --list node2
# Check STONITH history
sudo stonith_admin --history=*
sudo pcs stonith history show
Recovery:
# Cleanup STONITH resources
sudo pcs stonith cleanup fence-node1
# Reset STONITH if necessary
sudo pcs stonith disable fence-node1
sudo pcs stonith enable fence-node1
Performance and Timing Issues
Cluster Timing Problems:
# Check cluster timing
sudo crm_mon --timing-details
sudo corosync-cfgtool -s
# Adjust timing parameters
sudo pcs property set cluster-recheck-interval=1min
sudo pcs resource defaults migration-threshold=3
Resource Timeout Issues:
# Increase operation timeouts
sudo pcs resource update WebServer op monitor interval=30s timeout=60s
sudo pcs resource update WebServer op start timeout=120s
# Check resource operation history
sudo pcs resource op defaults
sudo crm_resource --resource WebServer --list-operations
Log Analysis for Troubleshooting
Comprehensive Log Analysis:
# Analyze cluster logs for errors
sudo grep -i error /var/log/cluster/corosync.log
sudo journalctl -u pacemaker --since "1 hour ago" | grep -E "(error|warn|crit)"
# Extract resource operation logs
sudo grep -A 5 -B 5 "WebServer.*failed" /var/log/messages
Log Rotation and Cleanup:
# Configure log rotation
echo "/var/log/cluster/*.log {
daily
rotate 30
compress
delaycompress
missingok
notifempty
}" > /etc/logrotate.d/cluster
# Manual log cleanup
sudo find /var/log/cluster -name "*.log" -mtime +30 -delete
Emergency Recovery Procedures
Single Node Recovery:
# Start cluster in single-node mode
sudo pcs property set no-quorum-policy=ignore
sudo pcs cluster start node1
# Import resources from offline node
sudo pcs resource show --full
sudo pcs resource move WebServer node1
Complete Cluster Recovery:
# Reset cluster configuration
sudo pcs cluster destroy --all
# Rebuild cluster from backup
sudo pcs cluster setup mycluster node1 node2 --start --enable
sudo pcs cluster cib-push cluster-backup.xml
Disaster Recovery:
# Emergency cluster shutdown
sudo pcs cluster stop --all --force
sudo pcs cluster destroy --all
# Restore from configuration backup
sudo cp /root/cluster-backup.xml /tmp/cluster-restore.xml
sudo pcs cluster cib-push /tmp/cluster-restore.xml --config
Advanced Pacemaker Configuration
Advanced Pacemaker configuration supports sophisticated scenarios such as multi-site clustering, complex resource dependencies, and performance tuning. Understanding these options gets the most flexibility and reliability out of a cluster.
Multi-Site Clustering
Booth Ticket Manager:
# Install booth for multi-site clustering
sudo yum install booth
# Configure booth (same file on both sites and the arbitrator)
sudo tee /etc/booth/booth.conf > /dev/null << 'EOF'
transport="UDP"
port="9929"
arbitrator="arbitrator.example.com"
site="192.168.1.10"
site="192.168.2.10"
ticket="web-ticket"
expire="600"
timeout="10"
retries="5"
EOF
# Synchronize booth configuration
sudo scp /etc/booth/booth.conf node2:/etc/booth/
sudo scp /etc/booth/booth.conf arbitrator:/etc/booth/
Geo-Clustering Configuration:
# Create the ticket-dependent resource
sudo pcs resource create WebCluster apache \
configfile="/etc/httpd/conf/httpd.conf"
# Tie the resource to the booth ticket
sudo pcs constraint ticket add web-ticket WebCluster loss-policy=stop
# Configure the booth site resource
sudo pcs resource create booth ocf:pacemaker:booth-site \
config="/etc/booth/booth.conf" \
op monitor interval="10s"
Complex Resource Dependencies
Advanced Constraint Configuration:
# Complex ordering constraints
sudo pcs constraint order set VirtualIP WebServer Database \
setoptions kind=Optional symmetrical=false
# Conditional constraints based on resource state
sudo pcs constraint order promote DatabaseMaster then start WebServer
sudo pcs constraint colocation add WebServer with master DatabaseMaster INFINITY
Resource Sets and Groups:
# Create complex resource sets
sudo pcs constraint order set DatabaseMaster WebServer VirtualIP \
sequential=true \
require-all=false \
action=start \
role=Master
# Advanced colocation with roles
sudo pcs constraint colocation add WebServer with master DatabaseMaster \
INFINITY node-attribute=datacenter
Performance Optimization
Cluster Performance Tuning:
# Optimize cluster timing parameters
sudo pcs property set dc-deadtime=20s
sudo pcs property set election-timeout=5s
sudo pcs property set shutdown-escalation=5min
# Configure batch processing
sudo pcs property set batch-limit=30
sudo pcs property set migration-limit=1
Resource Operation Optimization:
# Optimize resource monitoring
sudo pcs resource op defaults record-pending=true
# Configure operation intervals
sudo pcs resource update WebServer \
op monitor interval=30s timeout=20s \
op start timeout=60s interval=0s \
op stop timeout=60s interval=0s
Custom Resource Agents
Creating Custom OCF Resource Agent:
# Create custom resource agent directory
sudo mkdir -p /usr/lib/ocf/resource.d/custom
# Sample custom resource agent
sudo tee /usr/lib/ocf/resource.d/custom/myapp > /dev/null << 'EOF'
#!/bin/bash
#
# Custom application resource agent
#
# OCF parameters:
#   OCF_RESKEY_config
#   OCF_RESKEY_pid_file

. ${OCF_ROOT}/lib/heartbeat/ocf-shellfuncs

MYAPP_CONFIG=${OCF_RESKEY_config:-"/etc/myapp/myapp.conf"}
MYAPP_PID=${OCF_RESKEY_pid_file:-"/var/run/myapp.pid"}

myapp_meta_data() {
    cat <<END
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="myapp" version="1.0">
  <version>1.0</version>
  <longdesc lang="en">Manages the custom myapp service</longdesc>
  <shortdesc lang="en">Custom myapp resource agent</shortdesc>
  <parameters>
    <parameter name="config">
      <shortdesc lang="en">Path to the myapp configuration file</shortdesc>
      <content type="string" default="/etc/myapp/myapp.conf"/>
    </parameter>
    <parameter name="pid_file">
      <shortdesc lang="en">Path to the myapp PID file</shortdesc>
      <content type="string" default="/var/run/myapp.pid"/>
    </parameter>
  </parameters>
  <actions>
    <action name="start" timeout="60s"/>
    <action name="stop" timeout="60s"/>
    <action name="monitor" timeout="20s" interval="30s"/>
    <action name="meta-data" timeout="5s"/>
  </actions>
</resource-agent>
END
    return $OCF_SUCCESS
}

myapp_start() {
    # Nothing to do if the application is already running
    myapp_monitor
    if [ $? -eq $OCF_SUCCESS ]; then
        return $OCF_SUCCESS
    fi
    /usr/bin/myapp -c $MYAPP_CONFIG &
    echo $! > $MYAPP_PID
    if myapp_monitor; then
        return $OCF_SUCCESS
    else
        return $OCF_ERR_GENERIC
    fi
}

myapp_stop() {
    if [ -f $MYAPP_PID ]; then
        pid=$(cat $MYAPP_PID)
        kill $pid
        rm -f $MYAPP_PID
    fi
    return $OCF_SUCCESS
}

myapp_monitor() {
    if [ -f $MYAPP_PID ]; then
        pid=$(cat $MYAPP_PID)
        if ps -p $pid > /dev/null 2>&1; then
            return $OCF_SUCCESS
        fi
    fi
    return $OCF_NOT_RUNNING
}

case "$1" in
    start)     myapp_start;;
    stop)      myapp_stop;;
    monitor)   myapp_monitor;;
    meta-data) myapp_meta_data;;
    *)         echo "Usage: $0 {start|stop|monitor|meta-data}"
               exit $OCF_ERR_UNIMPLEMENTED;;
esac
exit $?
EOF
# Make executable and register
sudo chmod +x /usr/lib/ocf/resource.d/custom/myapp
sudo pcs resource describe ocf:custom:myapp
Cluster Policies and Rules
Advanced Cluster Policies:
# Configure cluster policies
sudo pcs property set symmetric-cluster=true
sudo pcs property set start-failure-is-fatal=false
sudo pcs property set stop-orphan-resources=true
sudo pcs property set stop-orphan-actions=true
# Maintenance policies
sudo pcs property set maintenance-mode=false
sudo pcs property set enable-startup-probes=true
Resource Stickiness and Preferences:
# Configure resource stickiness
sudo pcs resource defaults resource-stickiness=100
# Node preference scoring
sudo pcs constraint location WebServer rule score=100 \
datacenter eq primary
# Time-based constraints
sudo pcs constraint location WebServer rule score=-INFINITY \
date gt 2024-12-31
Monitoring and Alerting Integration
Advanced Monitoring Configuration:
# Configure cluster monitoring resource
sudo pcs resource create cluster-mon ClusterMon \
extra_options="-r -n" \
htmlfile="/var/www/html/cluster.html" \
op monitor interval=15s
# Email notification integration (MailTo agent)
sudo pcs resource create mail-alert ocf:heartbeat:MailTo \
email="admin@example.com" \
subject="Cluster Alert" \
op monitor interval=10s
Custom Monitoring Scripts:
# Create monitoring resource agent
sudo pcs resource create custom-monitor systemd:custom-monitor \
op monitor interval=30s timeout=20s \
op start timeout=60s interval=0s \
op stop timeout=60s interval=0s
Security Hardening
Cluster Security Configuration:
# Set a descriptive cluster name and regenerate the Corosync authentication key
sudo pcs property set cluster-name=production-cluster
sudo corosync-keygen -l
# Enable encryption by adding these options inside the totem {} section
# of /etc/corosync/corosync.conf, then restart corosync on all nodes
crypto_cipher: aes256
crypto_hash: sha256
Access Control and Authentication:
# Configure role-based access
sudo pcs acl role create monitor-role \
read xpath //crm_config \
read xpath //nodes \
read xpath //resources \
read xpath //status
# Assign roles to users
sudo pcs acl user create monitor-user monitor-role
Frequently Asked Questions
What is the difference between Pacemaker and Keepalived?
Pacemaker provides full cluster resource management: constraint-based policies, complex dependencies, many resource agent types, enterprise-grade fencing, and multi-node clusters. Keepalived is a lightweight alternative focused on IP failover and simple service monitoring; it uses the VRRP protocol for IP address management and requires far less configuration.
How many nodes can Pacemaker support in a cluster?
Pacemaker officially supports clusters of 2 to 32 nodes, though practical limits depend on network configuration and resource requirements. Larger clusters need careful tuning of heartbeat intervals and communication timeouts; most production environments run 2-8 node clusters to balance performance and management complexity.
What happens during a split-brain scenario?
During split-brain scenarios, cluster partitions believe other nodes have failed and attempt to acquire exclusive resource control. Furthermore, STONITH fencing prevents data corruption by forcibly shutting down suspected failed nodes. Additionally, quorum mechanisms ensure only the partition with majority votes continues operating, while minority partitions enter standby mode until connectivity restores.
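For two-node clusters, the corosync votequorum options below (already shown in the quorum configuration earlier in this guide) are the usual guard rails; wait_for_all additionally keeps a rebooted node from starting resources before it has seen its peer:
quorum {
    provider: corosync_votequorum
    expected_votes: 2
    two_node: 1
    wait_for_all: 1
}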
Can Pacemaker work with Docker containers?
Pacemaker container integration supports Docker through custom resource agents and orchestration frameworks. Furthermore, Pacemaker can manage container lifecycle events including startup, monitoring, and failover scenarios. However, Kubernetes typically provides better native container orchestration capabilities, while Pacemaker excels at managing traditional applications and infrastructure services.
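Recent Pacemaker releases can also run containers directly as bundle resources. A minimal sketch is shown below; the image name, IP range, and paths are placeholders, and the exact options depend on your pcs and Pacemaker versions:
# Illustrative Docker bundle resource
sudo pcs resource bundle create httpd-bundle \
container docker image=registry.example.com/httpd:latest replicas=2 \
network ip-range-start=192.168.1.200 \
port-map port=80 \
storage-map source-dir=/srv/www target-dir=/var/www/html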
How does Pacemaker handle shared storage?
Pacemaker shared storage management requires careful coordination to prevent data corruption and ensure consistent access. Furthermore, Pacemaker integrates with DRBD, SAN, and NAS systems through specialized resource agents. Additionally, proper storage fencing mechanisms prevent multiple nodes from accessing storage simultaneously during failure scenarios.
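As an example of that coordination, a DRBD-backed filesystem is usually modeled as a master/slave DRBD resource plus a Filesystem resource that may only run where DRBD is promoted. The sketch below assumes a DRBD resource named r0 and the ocf:linbit:drbd agent (packaging varies by distribution):
# DRBD device managed as a master/slave resource
sudo pcs resource create DrbdData ocf:linbit:drbd drbd_resource=r0 \
op monitor interval=30s
sudo pcs resource master DrbdClone DrbdData \
master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
# Filesystem that follows the DRBD master
sudo pcs resource create DrbdFS Filesystem \
device="/dev/drbd0" directory="/data" fstype="xfs"
sudo pcs constraint order promote DrbdClone then start DrbdFS
sudo pcs constraint colocation add DrbdFS with master DrbdClone INFINITY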
What is the recommended heartbeat interval?
Heartbeat interval configuration typically ranges from 1-3 seconds depending on network latency and cluster requirements. Furthermore, shorter intervals provide faster failure detection but increase network overhead. Additionally, production environments often configure token timeout of 3000ms with token retransmits of 10 for optimal balance between responsiveness and stability.
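In corosync.conf terms, that recommendation corresponds to a totem section along these lines (the same values used in the tuning example earlier in this guide):
totem {
    token: 3000
    token_retransmits_before_loss_const: 10
}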
How do I backup Pacemaker configuration?
Pacemaker configuration backup involves exporting the Cluster Information Base (CIB) and related configuration files:
# Export CIB configuration
sudo pcs cluster cib cluster-backup.xml
# Backup important configuration files
sudo tar -czf cluster-config-backup.tar.gz \
/etc/corosync/corosync.conf \
/etc/corosync/authkey \
cluster-backup.xml
Can I run active/active configuration with Pacemaker?
Active/active clustering with Pacemaker depends on application support for concurrent access. Furthermore, applications requiring shared storage typically need clone resources or master/slave configurations. Additionally, stateless applications like web servers easily support active/active deployment, while databases often require master/slave or active/passive configurations.
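For a stateless service, active/active usually means a cloned resource. A minimal sketch for Apache is shown below (configuration file path as used earlier in this guide); traffic distribution is left to an external load balancer or a cloned IP:
# Run Apache on every node as a clone
sudo pcs resource create WebFarm apache \
configfile="/etc/httpd/conf/httpd.conf" \
op monitor interval=20s \
--clone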
How do I upgrade Pacemaker in production?
Production Pacemaker upgrades require careful planning and rolling upgrade procedures:
# Rolling upgrade procedure
sudo pcs node maintenance node1
sudo yum update pacemaker corosync
sudo pcs node unmaintenance node1
sudo pcs node maintenance node2
sudo yum update pacemaker corosync
sudo pcs node unmaintenance node2
What monitoring tools integrate with Pacemaker?
Pacemaker monitoring integration supports various enterprise monitoring solutions including Nagios, Zabbix, Prometheus, and SNMP-based systems. Furthermore, native tools like crm_mon provide real-time cluster status information. Additionally, most monitoring platforms offer Pacemaker-specific plugins for comprehensive cluster health monitoring.
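For a quick Nagios/Icinga-style check, crm_mon's one-line output is easy to wrap; a minimal sketch:
#!/bin/bash
# Simple cluster check suitable for Nagios-style monitoring
output=$(crm_mon --simple-status 2>&1)
if echo "$output" | grep -q "^CLUSTER OK"; then
    echo "OK - $output"
    exit 0
else
    echo "CRITICAL - $output"
    exit 2
fi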
Additional Resources
Related LinuxTips.pro Articles
- Post #79: SonarQube Code Quality Analysis on Linux
- Post #80: Nexus Repository Manager on Linux
- Post #82: GlusterFS Distributed File System
- Post #83: Keepalived VRRP for High Availability
Prerequisites:
- Intermediate Linux system administration experience
- Understanding of networking concepts and TCP/IP
- Familiarity with systemd service management
- Basic knowledge of virtualization or hardware management
- Experience with package management and firewall configuration
Learning Outcomes: After completing this guide, readers will understand Pacemaker cluster architecture, successfully configure two-node high availability clusters, implement proper STONITH fencing mechanisms, create and manage cluster resources with constraints, perform comprehensive failover testing, and troubleshoot common clustering issues effectively.
Estimated Reading Time: 45-60 minutes
Difficulty Level: Advanced
Last Updated: November 2025