HPC Administration Tutorial - Part 3
Created: 2025-03-16 17:02:37 | Last updated: 2025-03-16 17:02:37 | Status: Public
Monitoring, Troubleshooting, Performance Tuning, and Documentation
This third part of our HPC administration tutorial focuses on operational aspects of managing an HPC cluster. Learning to effectively monitor, troubleshoot, optimize, and document your HPC environment is essential for maintaining reliable research computing services.
Table of Contents
- Monitoring and Troubleshooting
- Performance Tuning
- Documentation
Monitoring and Troubleshooting
System Monitoring Setup
- Basic Monitoring Tools:
# Install essential monitoring tools
sudo apt install -y htop iotop sysstat nmon dstat
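Once installed, these tools are handy for quick interactive spot checks on a node; a few typical invocations (output formats vary slightly by version):
# CPU utilization, 5 one-second samples
sar -u 1 5
# Extended per-device I/O statistics, 3 samples
iostat -x 1 3
# Live view of processes currently doing I/O (needs root)
sudo iotop -o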
- Setting up Prometheus and Grafana:
Install Prometheus:
# Download Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.43.0/prometheus-2.43.0.linux-armv7.tar.gz
tar xvf prometheus-2.43.0.linux-armv7.tar.gz
sudo mv prometheus-2.43.0.linux-armv7 /opt/prometheus
# Create service file
sudo nano /etc/systemd/system/prometheus.service
Add to service file:
[Unit]
Description=Prometheus Monitoring System
Documentation=https://prometheus.io/docs/introduction/overview/
After=network-online.target
[Service]
User=pi
ExecStart=/opt/prometheus/prometheus --config.file=/opt/prometheus/prometheus.yml
[Install]
WantedBy=multi-user.target
Configure Prometheus:
sudo nano /opt/prometheus/prometheus.yml
Basic configuration:
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node_exporter'
    static_configs:
      - targets: ['pi-head:9100', 'pi-compute-01:9100', 'pi-compute-02:9100']
Install Node Exporter on all nodes:
wget https://github.com/prometheus/node_exporter/releases/download/v1.5.0/node_exporter-1.5.0.linux-armv7.tar.gz
tar xvf node_exporter-1.5.0.linux-armv7.tar.gz
sudo mv node_exporter-1.5.0.linux-armv7/node_exporter /usr/local/bin/
sudo nano /etc/systemd/system/node_exporter.service
Add to service file:
[Unit]
Description=Node Exporter
After=network-online.target
[Service]
User=pi
ExecStart=/usr/local/bin/node_exporter
[Install]
WantedBy=multi-user.target
Start services:
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter
# On head node only
sudo systemctl enable prometheus
sudo systemctl start prometheus
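Before adding dashboards, it is worth a quick check that the exporters and Prometheus answer on the default ports configured above:
# Each node should serve plain-text metrics on port 9100
curl -s http://localhost:9100/metrics | head -n 5
# On the head node, check Prometheus health and that all scrape targets are up
curl -s http://localhost:9090/-/healthy
curl -s http://localhost:9090/api/v1/targets | grep -o '"health":"[^"]*"'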
Install Grafana:
# Download and install
wget https://dl.grafana.com/oss/release/grafana-9.3.6.linux-armv7.tar.gz
tar xvf grafana-9.3.6.linux-armv7.tar.gz
sudo mv grafana-9.3.6 /opt/grafana
# Create service file
sudo nano /etc/systemd/system/grafana-server.service
Add to service file:
[Unit]
Description=Grafana instance
After=network-online.target
[Service]
User=pi
ExecStart=/opt/grafana/bin/grafana-server -homepath /opt/grafana
[Install]
WantedBy=multi-user.target
Start Grafana:
sudo systemctl daemon-reload
sudo systemctl enable grafana-server
sudo systemctl start grafana-server
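Grafana listens on port 3000 by default. A quick health check, after which you can log in at http://pi-head:3000 (initial credentials admin/admin) and add a Prometheus data source pointing at http://pi-head:9090:
curl -s http://localhost:3000/api/health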
- Slurm Monitoring:
Install Slurm exporter:
# Clone repository
git clone https://github.com/vpenso/prometheus-slurm-exporter.git
cd prometheus-slurm-exporter
# Build (requires Go)
sudo apt install -y golang
make build
# Install
sudo cp prometheus-slurm-exporter /usr/local/bin/
# Create service
sudo nano /etc/systemd/system/slurm-exporter.service
Add to service file:
[Unit]
Description=Slurm Exporter
After=network-online.target
[Service]
User=pi
ExecStart=/usr/local/bin/prometheus-slurm-exporter
[Install]
WantedBy=multi-user.target
Update Prometheus config to scrape Slurm metrics:
sudo nano /opt/prometheus/prometheus.yml
Add to scrape_configs:
  - job_name: 'slurm'
    static_configs:
      - targets: ['pi-head:8080']
Start service:
sudo systemctl daemon-reload
sudo systemctl enable slurm-exporter
sudo systemctl start slurm-exporter
sudo systemctl restart prometheus
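If the exporter is working, Slurm metrics (all prefixed slurm_) should now be visible on port 8080, and after the restart they should appear in Prometheus as well:
# Metrics straight from the exporter
curl -s http://localhost:8080/metrics | grep '^slurm_' | head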
Logging and Log Analysis
- Centralized Logging:
# Install Loki (log aggregation)
wget https://github.com/grafana/loki/releases/download/v2.7.1/loki-linux-arm.zip
unzip loki-linux-arm.zip
sudo mv loki-linux-arm /usr/local/bin/loki
# Create config file
sudo mkdir -p /opt/loki
sudo nano /opt/loki/config.yml
Basic Loki config:
auth_enabled: false

server:
  http_listen_port: 3100

ingester:
  lifecycler:
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
    final_sleep: 0s
  chunk_idle_period: 5m
  chunk_retain_period: 30s

schema_config:
  configs:
    - from: 2020-05-15
      store: boltdb
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 168h

storage_config:
  boltdb:
    directory: /opt/loki/index
  filesystem:
    directory: /opt/loki/chunks

limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h
Create service file:
sudo nano /etc/systemd/system/loki.service
Add to service file:
[Unit]
Description=Loki log aggregation system
After=network-online.target
[Service]
User=pi
ExecStart=/usr/local/bin/loki -config.file=/opt/loki/config.yml
[Install]
WantedBy=multi-user.target
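Start Loki on the head node and confirm it is ready before pointing Promtail at it:
sudo systemctl daemon-reload
sudo systemctl enable loki
sudo systemctl start loki
# Loki answers on this endpoint once startup is complete
curl -s http://localhost:3100/ready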
Install Promtail on all nodes:
wget https://github.com/grafana/loki/releases/download/v2.7.1/promtail-linux-arm.zip
unzip promtail-linux-arm.zip
sudo mv promtail-linux-arm /usr/local/bin/promtail
# Create config
sudo mkdir -p /opt/promtail
sudo nano /opt/promtail/config.yml
Promtail config:
server:
  http_listen_port: 9080

positions:
  filename: /opt/promtail/positions.yaml

clients:
  - url: http://pi-head:3100/loki/api/v1/push

scrape_configs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: syslog
          host: HOSTNAME # Replaced with the actual hostname by the deployment script below
          __path__: /var/log/syslog

  - job_name: slurm
    static_configs:
      - targets:
          - localhost
        labels:
          job: slurm
          host: HOSTNAME # Replaced with the actual hostname by the deployment script below
          __path__: /var/log/slurm/*.log
Create deployment script:
nano /shared/admin/deploy_promtail.sh
Script content:
#!/bin/bash
# Deploy Promtail config and systemd unit to all cluster nodes

# Generate the systemd unit file once
cat > /tmp/promtail.service << EOF
[Unit]
Description=Promtail log collector
After=network-online.target

[Service]
User=pi
ExecStart=/usr/local/bin/promtail -config.file=/opt/promtail/config.yml

[Install]
WantedBy=multi-user.target
EOF

# Deploy to the compute nodes
for node in pi-compute-01 pi-compute-02; do
    echo "Deploying to $node..."
    # Copy the config with this node's hostname filled in
    sed "s/HOSTNAME/$node/" /opt/promtail/config.yml > /tmp/promtail_config.yml
    scp /tmp/promtail_config.yml pi@$node:/tmp/config.yml
    ssh pi@$node "sudo mkdir -p /opt/promtail && sudo mv /tmp/config.yml /opt/promtail/"
    # Install and start the service
    scp /tmp/promtail.service pi@$node:/tmp/
    ssh pi@$node "sudo mv /tmp/promtail.service /etc/systemd/system/ && sudo systemctl daemon-reload && sudo systemctl enable promtail && sudo systemctl start promtail"
done

# Set up the head node itself (the generated unit file is still in /tmp)
sed "s/HOSTNAME/pi-head/" /opt/promtail/config.yml > /tmp/promtail_config.yml
sudo mv /tmp/promtail_config.yml /opt/promtail/config.yml
sudo cp /tmp/promtail.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable promtail
sudo systemctl start promtail
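Make the script executable, run it from the head node, then confirm Promtail is active everywhere and that Loki is receiving labelled streams (a quick check, assuming the hostnames used above):
chmod +x /shared/admin/deploy_promtail.sh
/shared/admin/deploy_promtail.sh
# Promtail should report "active" on each node
for node in pi-head pi-compute-01 pi-compute-02; do
    echo -n "$node: "
    ssh $node "systemctl is-active promtail"
done
# Loki should now list the job and host labels from the Promtail config
curl -s http://pi-head:3100/loki/api/v1/labels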
- Log Rotation:
sudo nano /etc/logrotate.d/slurm
Add:
/var/log/slurm/*.log {
    weekly
    rotate 4
    compress
    missingok
    notifempty
    create 0640 slurm slurm
}
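Logrotate's debug mode shows what it would rotate without touching anything, which is a handy test for a new rule:
sudo logrotate -d /etc/logrotate.d/slurm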
Troubleshooting Methodology
- Create Troubleshooting Flowcharts: Keep simple decision flowcharts for recurring problems. For node connectivity issues, for example, work outward from the node: power and link lights, ping, SSH, then the Slurm node state (sinfo/scontrol).
- System Health Check Script:
nano /shared/admin/health_check.sh
Script content:
#!/bin/bash
# HPC Cluster Health Check Script
REPORT_FILE="/shared/admin/reports/health_$(date +%Y%m%d_%H%M%S).txt"
mkdir -p "$(dirname "$REPORT_FILE")"

echo "HPC Cluster Health Check - $(date)" > "$REPORT_FILE"
echo "=============================" >> "$REPORT_FILE"

# Check system load
echo -e "\n== System Load ==" >> "$REPORT_FILE"
uptime >> "$REPORT_FILE"

# Check memory
echo -e "\n== Memory Usage ==" >> "$REPORT_FILE"
free -h >> "$REPORT_FILE"

# Check disk space
echo -e "\n== Disk Space ==" >> "$REPORT_FILE"
df -h >> "$REPORT_FILE"

# Check network connectivity
echo -e "\n== Network Connectivity ==" >> "$REPORT_FILE"
for node in pi-head pi-compute-01 pi-compute-02; do
    if ping -c 1 "$node" > /dev/null 2>&1; then
        echo "$node: OK" >> "$REPORT_FILE"
    else
        echo "$node: FAIL" >> "$REPORT_FILE"
    fi
done

# Check Slurm status
echo -e "\n== Slurm Status ==" >> "$REPORT_FILE"
sinfo >> "$REPORT_FILE"

# Check running jobs
echo -e "\n== Running Jobs ==" >> "$REPORT_FILE"
squeue >> "$REPORT_FILE"

# Check Munge status
echo -e "\n== Munge Status ==" >> "$REPORT_FILE"
for node in pi-head pi-compute-01 pi-compute-02; do
    if ssh "$node" "munge -n | unmunge" &> /dev/null; then
        echo "$node: OK" >> "$REPORT_FILE"
    else
        echo "$node: FAIL" >> "$REPORT_FILE"
    fi
done

# Check NFS mounts
echo -e "\n== NFS Mounts ==" >> "$REPORT_FILE"
for node in pi-compute-01 pi-compute-02; do
    if ssh "$node" "df -h | grep -q /shared"; then
        echo "$node: OK" >> "$REPORT_FILE"
    else
        echo "$node: FAIL" >> "$REPORT_FILE"
    fi
done

echo -e "\nReport saved to $REPORT_FILE"
Make executable:
chmod +x /shared/admin/health_check.sh
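To run the check automatically, add a cron entry on the head node; a sketch for a daily 07:00 run (passwordless SSH to the compute nodes is assumed, as elsewhere in this tutorial):
# Edit the crontab for the admin (pi) user
crontab -e
# Add a line like this:
0 7 * * * /shared/admin/health_check.sh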
- Common Slurm Troubleshooting Commands:
# Check Slurm daemon status
sudo systemctl status slurmctld
sudo systemctl status slurmd
# Check Slurm configuration
scontrol show config
# Print this node's hardware as detected by slurmd (useful for checking node definitions in slurm.conf)
sudo slurmd -C
# Check node status
sinfo -N
scontrol show node
# Job debugging
scontrol show job JOB_ID
sacct -j JOB_ID --format=JobID,JobName,MaxRSS,MaxVMSize,State,ExitCode
# View detailed logs
sudo cat /var/log/slurm/slurmctld.log
sudo cat /var/log/slurm/slurmd.log
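A common outcome of troubleshooting is a node that Slurm has drained or marked down; once the root cause is fixed, check the recorded reason and return the node to service (substitute the real node name):
# Show drain/down reasons
sinfo -R
# Bring the node back into service
sudo scontrol update NodeName=pi-compute-01 State=RESUME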
Performance Tuning
Node-Level Performance Optimization
- CPU Frequency Management:
# Install cpufrequtils
sudo apt install -y cpufrequtils
# View available governors
cpufreq-info
# Set performance governor
sudo cpufreq-set -g performance
# Make permanent
sudo nano /etc/default/cpufrequtils
Add:
GOVERNOR="performance"
- Memory Management:
# View memory info
cat /proc/meminfo
# Adjust swappiness (lower values prioritize RAM over swap)
sudo sysctl vm.swappiness=10
# Make permanent
sudo nano /etc/sysctl.conf
Add:
vm.swappiness = 10
- I/O Tuning:
# View the current I/O scheduler (on a Raspberry Pi the SD card appears as mmcblk0; use sda for USB/SATA disks)
cat /sys/block/mmcblk0/queue/scheduler
# Set the mq-deadline scheduler
echo mq-deadline | sudo tee /sys/block/mmcblk0/queue/scheduler
# Make permanent with a udev rule (Raspberry Pi OS boots without GRUB)
sudo nano /etc/udev/rules.d/60-io-scheduler.rules
Add:
ACTION=="add|change", KERNEL=="mmcblk0", ATTR{queue/scheduler}="mq-deadline"
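A rough before/after comparison of local write throughput can be made with dd (this writes a 256 MB test file to local storage, so make sure the space is available):
# Sequential write test; conv=fdatasync forces data to disk before reporting
dd if=/dev/zero of=/tmp/ddtest bs=1M count=256 conv=fdatasync
rm /tmp/ddtest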
Network Performance
- Network Interface Tuning:
# Check link speed
ethtool eth0
# Increase buffer sizes
sudo nano /etc/sysctl.conf
Add:
# Increase TCP max buffer size
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
# Increase default buffer size
net.core.rmem_default = 262144
net.core.wmem_default = 262144
# Increase TCP buffer limits
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
Apply changes:
sudo sysctl -p
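To see whether the buffer changes help in practice, iperf3 gives a simple node-to-node throughput measurement (install it on both ends first):
# On both nodes
sudo apt install -y iperf3
# On pi-head: start a server
iperf3 -s
# On a compute node: run a 10-second test against the head node
iperf3 -c pi-head -t 10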
- NFS Performance:
# Check current NFS stats
nfsstat
# Update mount options
sudo nano /etc/fstab
Optimize NFS options:
pi-head:/shared /shared nfs rsize=1048576,wsize=1048576,noatime,actimeo=600,nofail 0 0
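fstab changes only apply on the next mount, so remount the share on each compute node and confirm the negotiated options (the unmount will fail if files under /shared are in use):
sudo umount /shared && sudo mount /shared
# Show the options actually in effect for each NFS mount
nfsstat -m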
Slurm Performance Optimization
- Scheduler Tuning:
sudo nano /etc/slurm/slurm.conf
Add/modify:
# Scheduler optimization
SchedulerType=sched/backfill
SchedulerParameters=bf_max_job_user=30,bf_resolution=180
# Prolog/Epilog scripts (Alloc runs the prolog as soon as resources are allocated to the job)
PrologFlags=Alloc
Prolog=/shared/admin/slurm_prolog.sh
Epilog=/shared/admin/slurm_epilog.sh
- Create Prolog/Epilog Scripts:
nano /shared/admin/slurm_prolog.sh
Script content:
#!/bin/bash
# Slurm prolog script - runs before job starts
# Clean /tmp
find /tmp -user $SLURM_JOB_USER -mtime +3 -delete
# Create job-specific temp directory
JOB_TMP_DIR="/scratch/$SLURM_JOB_USER/$SLURM_JOB_ID"
mkdir -p $JOB_TMP_DIR
chown $SLURM_JOB_USER:$SLURM_JOB_USER $JOB_TMP_DIR
# Log job start
logger -t slurm-prolog "Job $SLURM_JOB_ID started for user $SLURM_JOB_USER"
Epilog script:
nano /shared/admin/slurm_epilog.sh
Script content:
#!/bin/bash
# Slurm epilog script - runs after job completes
# Clean up job-specific temp directory
JOB_TMP_DIR="/scratch/$SLURM_JOB_USER/$SLURM_JOB_ID"
if [ -d "$JOB_TMP_DIR" ]; then
rm -rf $JOB_TMP_DIR
fi
# Log job end
logger -t slurm-epilog "Job $SLURM_JOB_ID completed for user $SLURM_JOB_USER"
Make scripts executable:
chmod +x /shared/admin/slurm_prolog.sh
chmod +x /shared/admin/slurm_epilog.sh
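slurm.conf must be identical on every node, and the prolog/epilog settings only take effect once the daemons reread it; a small sketch of pushing the change out and restarting the daemons (node names as used throughout this tutorial):
# Copy the updated config to the compute nodes and restart slurmd there
for node in pi-compute-01 pi-compute-02; do
    scp /etc/slurm/slurm.conf pi@$node:/tmp/slurm.conf
    ssh pi@$node "sudo mv /tmp/slurm.conf /etc/slurm/slurm.conf && sudo systemctl restart slurmd"
done
# Restart the controller on the head node
sudo systemctl restart slurmctld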
- Job Performance Collection:
sudo nano /etc/slurm/slurm.conf
Add:
# Job accounting
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30
Create job efficiency report script:
nano /shared/admin/job_efficiency.sh
Script content:
#!/bin/bash
# Report on job resource efficiency
if [ -z "$1" ]; then
    echo "Usage: $0 <job_id>"
    exit 1
fi
JOB_ID=$1

# Get job details
sacct -j $JOB_ID --format=JobID,JobName,NNodes,NCPUS,TotalCPU,Elapsed,MaxRSS,MaxVMSize,State

# Calculate CPU efficiency
TOTAL_CPU=$(sacct -j $JOB_ID --format=TotalCPU --noheader | head -1)
ELAPSED=$(sacct -j $JOB_ID --format=Elapsed --noheader | head -1)
NCPUS=$(sacct -j $JOB_ID --format=NCPUS --noheader | head -1)

# Convert HH:MM:SS to seconds (assumes times are reported in HH:MM:SS form)
ELAPSED_SEC=$(echo $ELAPSED | awk -F: '{ print ($1 * 3600) + ($2 * 60) + $3 }')
TOTAL_CPU_SEC=$(echo $TOTAL_CPU | awk -F: '{ print ($1 * 3600) + ($2 * 60) + $3 }')

# Calculate efficiency as a percentage
EFFICIENCY=$(echo "scale=2; 100 * $TOTAL_CPU_SEC / ($ELAPSED_SEC * $NCPUS)" | bc)
echo ""
echo "CPU Efficiency: $EFFICIENCY%"

# Memory efficiency (scontrol only knows about recent jobs; the .batch step carries the memory high-water mark)
MEM_REQ=$(scontrol show job $JOB_ID | grep -oP 'MinMemoryCPU=\K\d+')
MEM_USED=$(sacct -j ${JOB_ID}.batch --format=MaxRSS --noheader --units=M | head -1 | sed 's/M//')
MEM_EFFICIENCY=$(echo "scale=2; 100 * $MEM_USED / $MEM_REQ" | bc)
echo "Memory Efficiency: $MEM_EFFICIENCY%"