HPC Administration Tutorial - Part 3

Created: 2025-03-16 17:02:37 | Last updated: 2025-03-16 17:02:37 | Status: Public

Monitoring, Troubleshooting, Performance Tuning, and Documentation

This third part of our HPC administration tutorial focuses on operational aspects of managing an HPC cluster. Learning to effectively monitor, troubleshoot, optimize, and document your HPC environment is essential for maintaining reliable research computing services.

Table of Contents

  - Monitoring and Troubleshooting
  - Performance Tuning
  - Documentation

Monitoring and Troubleshooting

System Monitoring Setup

  1. Basic Monitoring Tools:
   # Install essential monitoring tools
   sudo apt install -y htop iotop sysstat nmon dstat
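
These tools cover quick, interactive checks before any dashboards exist. A few illustrative invocations (sample counts and intervals are arbitrary):

   # CPU utilization, 5 samples at 1-second intervals
   sar -u 1 5

   # Extended per-device I/O statistics
   iostat -x 1 3

   # Combined CPU, disk, network, and memory view every 5 seconds
   dstat -cdnm 5
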
  2. Setting up Prometheus and Grafana:

Install Prometheus:

   # Download Prometheus
   wget https://github.com/prometheus/prometheus/releases/download/v2.43.0/prometheus-2.43.0.linux-armv7.tar.gz
   tar xvf prometheus-2.43.0.linux-armv7.tar.gz
   sudo mv prometheus-2.43.0.linux-armv7 /opt/prometheus
   sudo chown -R pi:pi /opt/prometheus   # the service below runs as pi and needs to write its data here

   # Create service file
   sudo nano /etc/systemd/system/prometheus.service

Add to service file:

   [Unit]
   Description=Prometheus Monitoring System
   Documentation=https://prometheus.io/docs/introduction/overview/
   After=network-online.target

   [Service]
   User=pi
   WorkingDirectory=/opt/prometheus
   ExecStart=/opt/prometheus/prometheus --config.file=/opt/prometheus/prometheus.yml --storage.tsdb.path=/opt/prometheus/data

   [Install]
   WantedBy=multi-user.target

Configure Prometheus:

   sudo nano /opt/prometheus/prometheus.yml

Basic configuration:

   global:
     scrape_interval: 15s

   scrape_configs:
     - job_name: 'prometheus'
       static_configs:
         - targets: ['localhost:9090']

     - job_name: 'node_exporter'
       static_configs:
         - targets: ['pi-head:9100', 'pi-compute-01:9100', 'pi-compute-02:9100']
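
Before starting the service, it is worth validating the file with promtool, which ships in the same Prometheus tarball:

   /opt/prometheus/promtool check config /opt/prometheus/prometheus.yml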

Install Node Exporter on all nodes:

   wget https://github.com/prometheus/node_exporter/releases/download/v1.5.0/node_exporter-1.5.0.linux-armv7.tar.gz
   tar xvf node_exporter-1.5.0.linux-armv7.tar.gz
   sudo mv node_exporter-1.5.0.linux-armv7/node_exporter /usr/local/bin/

   sudo nano /etc/systemd/system/node_exporter.service

Add to service file:

   [Unit]
   Description=Node Exporter
   After=network-online.target

   [Service]
   User=pi
   ExecStart=/usr/local/bin/node_exporter

   [Install]
   WantedBy=multi-user.target

Start services:

   sudo systemctl daemon-reload
   sudo systemctl enable node_exporter
   sudo systemctl start node_exporter

   # On head node only
   sudo systemctl enable prometheus
   sudo systemctl start prometheus
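
A quick sanity check that everything is answering on the default ports used above:

   # Prometheus liveness endpoint (on the head node)
   curl -s http://localhost:9090/-/healthy

   # Raw metrics from a compute node's Node Exporter
   curl -s http://pi-compute-01:9100/metrics | head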

Install Grafana:

   # Download and install
   wget https://dl.grafana.com/oss/release/grafana-9.3.6.linux-armv7.tar.gz
   tar xvf grafana-9.3.6.linux-armv7.tar.gz
   sudo mv grafana-9.3.6 /opt/grafana
   sudo chown -R pi:pi /opt/grafana   # grafana-server runs as pi and writes its database under this path

   # Create service file
   sudo nano /etc/systemd/system/grafana-server.service

Add to service file:

   [Unit]
   Description=Grafana instance
   After=network-online.target

   [Service]
   User=pi
   ExecStart=/opt/grafana/bin/grafana-server -homepath /opt/grafana

   [Install]
   WantedBy=multi-user.target

Start Grafana:

   sudo systemctl daemon-reload
   sudo systemctl enable grafana-server
   sudo systemctl start grafana-server
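
Grafana listens on port 3000 with default credentials admin/admin. You can add the Prometheus data source through the web UI or, if you prefer to script it, via Grafana's HTTP API, roughly like this:

   curl -s -X POST http://admin:admin@pi-head:3000/api/datasources \
     -H "Content-Type: application/json" \
     -d '{"name":"Prometheus","type":"prometheus","url":"http://pi-head:9090","access":"proxy","isDefault":true}'
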
  3. Slurm Monitoring:

Install Slurm exporter:

   # Clone repository
   git clone https://github.com/vpenso/prometheus-slurm-exporter.git
   cd prometheus-slurm-exporter

   # Build (requires Go)
   sudo apt install -y golang
   make build

   # Install
   sudo cp prometheus-slurm-exporter /usr/local/bin/

   # Create service
   sudo nano /etc/systemd/system/slurm-exporter.service

Add to service file:

   [Unit]
   Description=Slurm Exporter
   After=network-online.target

   [Service]
   User=pi
   ExecStart=/usr/local/bin/prometheus-slurm-exporter

   [Install]
   WantedBy=multi-user.target

Update Prometheus config to scrape Slurm metrics:

   sudo nano /opt/prometheus/prometheus.yml

Add to scrape_configs:

   - job_name: 'slurm'
     static_configs:
       - targets: ['pi-head:8080']

Start service:

   sudo systemctl daemon-reload
   sudo systemctl enable slurm-exporter
   sudo systemctl start slurm-exporter
   sudo systemctl restart prometheus
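
To confirm the exporter is publishing on its default port (metric names carry a slurm_ prefix):

   curl -s http://pi-head:8080/metrics | grep '^slurm_' | head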

Logging and Log Analysis

  1. Centralized Logging:
   # Install Loki (log aggregation)
   wget https://github.com/grafana/loki/releases/download/v2.7.1/loki-linux-arm.zip
   unzip loki-linux-arm.zip
   chmod a+x loki-linux-arm
   sudo mv loki-linux-arm /usr/local/bin/loki

   # Create config and data directories (the service below runs as pi)
   sudo mkdir -p /opt/loki
   sudo chown -R pi:pi /opt/loki
   sudo nano /opt/loki/config.yml

Basic Loki config:

   auth_enabled: false

   server:
     http_listen_port: 3100

   ingester:
     lifecycler:
       ring:
         kvstore:
           store: inmemory
         replication_factor: 1
       final_sleep: 0s
     chunk_idle_period: 5m
     chunk_retain_period: 30s

   schema_config:
     configs:
       - from: 2020-05-15
         store: boltdb
         object_store: filesystem
         schema: v11
         index:
           prefix: index_
           period: 168h

   storage_config:
     boltdb:
       directory: /opt/loki/index

     filesystem:
       directory: /opt/loki/chunks

   limits_config:
     enforce_metric_name: false
     reject_old_samples: true
     reject_old_samples_max_age: 168h

Create service file:

   sudo nano /etc/systemd/system/loki.service

Add to service file:

   [Unit]
   Description=Loki log aggregation system
   After=network-online.target

   [Service]
   User=pi
   ExecStart=/usr/local/bin/loki -config.file=/opt/loki/config.yml

   [Install]
   WantedBy=multi-user.target

Install Promtail on all nodes:

   wget https://github.com/grafana/loki/releases/download/v2.7.1/promtail-linux-arm.zip
   unzip promtail-linux-arm.zip
   chmod a+x promtail-linux-arm
   sudo mv promtail-linux-arm /usr/local/bin/promtail

   # Create config directory (promtail runs as pi and writes its positions file here)
   sudo mkdir -p /opt/promtail
   sudo chown pi:pi /opt/promtail
   sudo nano /opt/promtail/config.yml

Promtail config:

   server:
     http_listen_port: 9080

   positions:
     filename: /opt/promtail/positions.yaml

   clients:
     - url: http://pi-head:3100/loki/api/v1/push

   scrape_configs:
     - job_name: system
       static_configs:
       - targets:
           - localhost
         labels:
           job: syslog
           host: HOSTNAME  # Replace with actual hostname in script
           __path__: /var/log/syslog

     - job_name: slurm
       static_configs:
       - targets:
           - localhost
         labels:
           job: slurm
           host: HOSTNAME  # Replace with actual hostname in script
           __path__: /var/log/slurm/*.log

Create deployment script:

   nano /shared/admin/deploy_promtail.sh

Script content:

   #!/bin/bash
   # Deploy Promtail to compute nodes

   for node in pi-compute-01 pi-compute-02; do
     echo "Deploying to $node..."

     # Copy config with hostname set
     cat /opt/promtail/config.yml | sed "s/HOSTNAME/$node/" > /tmp/promtail_config.yml
     scp /tmp/promtail_config.yml pi@$node:/tmp/config.yml

     ssh pi@$node "sudo mkdir -p /opt/promtail && sudo mv /tmp/config.yml /opt/promtail/ && sudo chown -R pi:pi /opt/promtail"

     # Create service file
     cat > /tmp/promtail.service << EOF
   [Unit]
   Description=Promtail log collector
   After=network-online.target

   [Service]
   User=pi
   ExecStart=/usr/local/bin/promtail -config.file=/opt/promtail/config.yml

   [Install]
   WantedBy=multi-user.target
   EOF

     scp /tmp/promtail.service pi@$node:/tmp/
     ssh pi@$node "sudo mv /tmp/promtail.service /etc/systemd/system/ && sudo systemctl daemon-reload && sudo systemctl enable promtail && sudo systemctl start promtail"
   done

   # Setup head node
   cat /opt/promtail/config.yml | sed "s/HOSTNAME/pi-head/" > /tmp/promtail_config.yml
   sudo mv /tmp/promtail_config.yml /opt/promtail/config.yml
   sudo cp /tmp/promtail.service /etc/systemd/system/
   sudo systemctl daemon-reload
   sudo systemctl enable promtail
   sudo systemctl start promtail
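
Once Loki and Promtail are running, a quick check that logs are flowing, using Loki's standard HTTP API on the head node:

   # Loki readiness
   curl -s http://pi-head:3100/ready

   # Labels Promtail has pushed so far (should include "job" and "host")
   curl -s http://pi-head:3100/loki/api/v1/labels
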
  2. Log Rotation:
   sudo nano /etc/logrotate.d/slurm

Add:

   /var/log/slurm/*.log {
       weekly
       rotate 4
       compress
       missingok
       notifempty
       create 0640 slurm slurm
   }
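
Dry-run the new policy to catch syntax errors before the weekly rotation fires:

   sudo logrotate -d /etc/logrotate.d/slurm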

Troubleshooting Methodology

  1. Create Troubleshooting Flowcharts:

Node connectivity issues:

   flowchart TD
     A[Node Connectivity Issue] --> B{Can ping node?}
     B -->|Yes| C{SSH works?}
     B -->|No| D[Check physical network]
     D --> E[Check switch connection]
     D --> F[Verify IP settings]
     C -->|Yes| G{Slurm status?}
     C -->|No| H[Check SSH service]
     H --> I[Check SSH config]
     H --> J[Check firewall]
     G -->|DOWN| K[Check slurmd service]
     G -->|DRAIN| L[Node in drain state]
     G -->|Up| M[Connectivity OK]
     K --> N[Check slurmd logs]
     K --> O[Verify munge authentication]
     L --> P[Check for drain reason]

  2. System Health Check Script:
   nano /shared/admin/health_check.sh

Script content:

   #!/bin/bash
   # HPC Cluster Health Check Script

   REPORT_FILE="/shared/admin/reports/health_$(date +%Y%m%d_%H%M%S).txt"
   mkdir -p $(dirname $REPORT_FILE)

   echo "HPC Cluster Health Check - $(date)" > $REPORT_FILE
   echo "=============================" >> $REPORT_FILE

   # Check system load
   echo -e "\n== System Load ==" >> $REPORT_FILE
   uptime >> $REPORT_FILE

   # Check memory
   echo -e "\n== Memory Usage ==" >> $REPORT_FILE
   free -h >> $REPORT_FILE

   # Check disk space
   echo -e "\n== Disk Space ==" >> $REPORT_FILE
   df -h >> $REPORT_FILE

   # Check network connectivity
   echo -e "\n== Network Connectivity ==" >> $REPORT_FILE
   for node in pi-head pi-compute-01 pi-compute-02; do
     ping -c 1 $node > /dev/null
     if [ $? -eq 0 ]; then
       echo "$node: OK" >> $REPORT_FILE
     else
       echo "$node: FAIL" >> $REPORT_FILE
     fi
   done

   # Check Slurm status
   echo -e "\n== Slurm Status ==" >> $REPORT_FILE
   sinfo >> $REPORT_FILE

   # Check running jobs
   echo -e "\n== Running Jobs ==" >> $REPORT_FILE
   squeue >> $REPORT_FILE

   # Check Munge status
   echo -e "\n== Munge Status ==" >> $REPORT_FILE
   for node in pi-head pi-compute-01 pi-compute-02; do
     ssh $node "munge -n | unmunge" &> /dev/null
     if [ $? -eq 0 ]; then
       echo "$node: OK" >> $REPORT_FILE
     else
       echo "$node: FAIL" >> $REPORT_FILE
     fi
   done

   # Check NFS mounts
   echo -e "\n== NFS Mounts ==" >> $REPORT_FILE
   for node in pi-compute-01 pi-compute-02; do
     ssh $node "df -h | grep -q /shared"
     if [ $? -eq 0 ]; then
       echo "$node: OK" >> $REPORT_FILE
     else
       echo "$node: FAIL" >> $REPORT_FILE
     fi
   done

   echo -e "\nReport saved to $REPORT_FILE"

Make executable:

   chmod +x /shared/admin/health_check.sh
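
To run the check automatically, add a cron entry on the head node; the schedule below (daily at 06:00) is just an example:

   # crontab -e
   0 6 * * * /shared/admin/health_check.sh
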
  3. Common Slurm Troubleshooting Commands:
   # Check Slurm daemon status
   sudo systemctl status slurmctld
   sudo systemctl status slurmd

   # Check Slurm configuration
   scontrol show config

   # Print the node's detected hardware configuration (compare against slurm.conf node definitions)
   sudo slurmd -C

   # Check node status
   sinfo -N
   scontrol show node

   # Job debugging
   scontrol show job JOB_ID
   sacct -j JOB_ID --format=JobID,JobName,MaxRSS,MaxVMSize,State,ExitCode

   # View detailed logs
   sudo cat /var/log/slurm/slurmctld.log
   sudo cat /var/log/slurm/slurmd.log
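
Two commands worth keeping at hand once a problem is fixed (the node name below is an example):

   # Show the reason nodes were drained or marked down
   sinfo -R

   # Return a repaired node to service
   sudo scontrol update NodeName=pi-compute-01 State=RESUME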

Performance Tuning

Node-Level Performance Optimization

  1. CPU Frequency Management:
   # Install cpufrequtils
   sudo apt install -y cpufrequtils

   # View available governors
   cpufreq-info

   # Set performance governor
   sudo cpufreq-set -g performance

   # Make permanent
   sudo nano /etc/default/cpufrequtils

Add:

   GOVERNOR="performance"
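
Confirm the governor is active on every core:

   cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
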
  2. Memory Management:
   # View memory info
   cat /proc/meminfo

   # Adjust swappiness (lower values prioritize RAM over swap)
   sudo sysctl vm.swappiness=10

   # Make permanent
   sudo nano /etc/sysctl.conf

Add:

   vm.swappiness = 10
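
Reload the file and confirm the value stuck:

   sudo sysctl -p
   cat /proc/sys/vm/swappiness
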
  3. I/O Tuning:
   # View current I/O scheduler (use mmcblk0 instead of sda if booting from an SD card)
   cat /sys/block/sda/queue/scheduler

   # Set the deadline scheduler (named mq-deadline on current kernels)
   echo mq-deadline | sudo tee /sys/block/sda/queue/scheduler

   # Make permanent (Raspberry Pi OS does not use GRUB, so use a udev rule)
   sudo nano /etc/udev/rules.d/60-io-scheduler.rules

Add:

   ACTION=="add|change", KERNEL=="sd[a-z]|mmcblk[0-9]", ATTR{queue/scheduler}="mq-deadline"

Reload the rules:

   sudo udevadm control --reload-rules
   sudo udevadm trigger

Network Performance

  1. Network Interface Tuning:
   # Check link speed
   ethtool eth0

   # Increase buffer sizes
   sudo nano /etc/sysctl.conf

Add:

   # Increase TCP max buffer size
   net.core.rmem_max = 16777216
   net.core.wmem_max = 16777216

   # Increase default buffer size
   net.core.rmem_default = 262144
   net.core.wmem_default = 262144

   # Increase TCP buffer limits
   net.ipv4.tcp_rmem = 4096 87380 16777216
   net.ipv4.tcp_wmem = 4096 65536 16777216

Apply changes:

   sudo sysctl -p
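
To see whether the tuning helps, measure node-to-node throughput with iperf3 (package name as in Debian/Raspberry Pi OS):

   # On the head node
   sudo apt install -y iperf3
   iperf3 -s

   # On a compute node, in a second terminal
   sudo apt install -y iperf3
   iperf3 -c pi-head -t 10
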
  2. NFS Performance:
   # Check current NFS stats
   nfsstat

   # Update mount options
   sudo nano /etc/fstab

Optimize NFS options:

   pi-head:/shared /shared nfs rsize=1048576,wsize=1048576,noatime,actimeo=600,nofail 0 0
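
Re-mount the share and confirm the options the client actually negotiated (assumes nothing is using /shared at the time):

   sudo umount /shared && sudo mount /shared
   nfsstat -m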

Slurm Performance Optimization

  1. Scheduler Tuning:
   sudo nano /etc/slurm/slurm.conf

Add/modify:

   # Scheduler optimization
   SchedulerType=sched/backfill
   SchedulerParameters=bf_max_job_user=30,bf_resolution=180

   # Prolog/Epilog scripts (Alloc runs the prolog as soon as the allocation is granted)
   PrologFlags=Alloc
   Prolog=/shared/admin/slurm_prolog.sh
   Epilog=/shared/admin/slurm_epilog.sh

  2. Create Prolog/Epilog Scripts:
   nano /shared/admin/slurm_prolog.sh

Script content:

   #!/bin/bash
   # Slurm prolog script - runs before job starts

   # Clean /tmp
   find /tmp -user $SLURM_JOB_USER -mtime +3 -delete

   # Create job-specific temp directory
   JOB_TMP_DIR="/scratch/$SLURM_JOB_USER/$SLURM_JOB_ID"
   mkdir -p $JOB_TMP_DIR
   chown $SLURM_JOB_USER:$SLURM_JOB_USER $JOB_TMP_DIR

   # Log job start
   logger -t slurm-prolog "Job $SLURM_JOB_ID started for user $SLURM_JOB_USER"

Epilog script:

   nano /shared/admin/slurm_epilog.sh

Script content:

   #!/bin/bash
   # Slurm epilog script - runs after job completes

   # Clean up job-specific temp directory
   JOB_TMP_DIR="/scratch/$SLURM_JOB_USER/$SLURM_JOB_ID"
   if [ -d "$JOB_TMP_DIR" ]; then
     rm -rf $JOB_TMP_DIR
   fi

   # Log job end
   logger -t slurm-epilog "Job $SLURM_JOB_ID completed for user $SLURM_JOB_USER"

Make scripts executable:

   chmod +x /shared/admin/slurm_prolog.sh
   chmod +x /shared/admin/slurm_epilog.sh
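
After slurm.conf has been updated on all nodes, tell the daemons to re-read it:

   sudo scontrol reconfigure
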
  3. Job Performance Collection:
   sudo nano /etc/slurm/slurm.conf

Add:

   # Job accounting
   JobAcctGatherType=jobacct_gather/linux
   JobAcctGatherFrequency=30

Create job efficiency report script:

   nano /shared/admin/job_efficiency.sh

Script content:

   #!/bin/bash
   # Report on job resource efficiency

   if [ -z "$1" ]; then
     echo "Usage: $0 <job_id>"
     exit 1
   fi

   JOB_ID=$1

   # Get job details
   sacct -j $JOB_ID --format=JobID,JobName,NNodes,NCPUS,TotalCPU,Elapsed,MaxRSS,MaxVMSize,State

   # Calculate CPU efficiency
   TOTAL_CPU=$(sacct -j $JOB_ID --format=TotalCPU --noheader | head -1)
   ELAPSED=$(sacct -j $JOB_ID --format=Elapsed --noheader | head -1)
   NCPUS=$(sacct -j $JOB_ID --format=NCPUS --noheader | head -1)

   # Convert HH:MM:SS to seconds (assumes the job ran for under a day;
   # sacct reports longer times as D-HH:MM:SS)
   ELAPSED_SEC=$(echo $ELAPSED | awk -F: '{ print ($1 * 3600) + ($2 * 60) + $3 }')
   TOTAL_CPU_SEC=$(echo $TOTAL_CPU | awk -F: '{ print ($1 * 3600) + ($2 * 60) + $3 }')

   # Calculate efficiency as percentage
   EFFICIENCY=$(echo "scale=2; 100 * $TOTAL_CPU_SEC / ($ELAPSED_SEC * $NCPUS)" | bc)

   echo ""
   echo "CPU Efficiency: $EFFICIENCY%"

   # Memory efficiency (assumes memory was requested per CPU; look for
   # MinMemoryNode instead if your jobs request memory per node)
   MEM_REQ=$(scontrol show job $JOB_ID | grep -oP 'MinMemoryCPU=\K\d+')
   # Use the largest MaxRSS across job steps; the allocation line is often blank
   MEM_USED=$(sacct -j $JOB_ID --format=MaxRSS --noheader --units=M | tr -d ' M' | sort -n | tail -1)

   MEM_EFFICIENCY=$(echo "scale=2; 100 * $MEM_USED / $MEM_REQ" | bc)

   echo "Memory Efficiency: $MEM_EFFICIENCY%"