HPC Administration Tutorial - Part 1

Created: 2025-03-16 17:02:07 | Last updated: 2025-03-16 17:02:07 | Status: Public

Introduction, Architecture Planning, and Deployment

This tutorial series is designed to help you develop skills relevant to a High-Performance Computing (HPC) System Administrator role in a research environment. While we’ll be using a small-scale Raspberry Pi 5 cluster, the concepts and practices apply to larger enterprise HPC environments.

Table of Contents

  1. Introduction to HPC Administration
  2. Cluster Architecture and Planning
  3. Installation and Deployment
  4. Next Steps

Introduction to HPC Administration

The Role of an HPC Administrator

As an HPC system administrator, you’ll be responsible for configuring, deploying, and supporting HPC clusters that enable critical research. This includes:

  1. System Administration: Managing hardware, software, networking, and security
  2. Problem Solving: Diagnosing and resolving complex technical issues
  3. User Support: Helping researchers effectively utilize HPC resources
  4. Performance Optimization: Ensuring systems operate at peak efficiency
  5. Infrastructure Planning: Contributing to future system specifications and upgrades

HPC in Research Environments

Research computing has unique requirements compared to enterprise IT:

  • High computational demands: Simulations, data analysis, AI/ML workloads
  • Specialized software stacks: Domain-specific applications and libraries
  • Diverse user needs: Supporting multiple research disciplines
  • Resource sharing: Efficient allocation across various research groups
  • Data-intensive workflows: Managing, processing, and storing large datasets

Skills Development Approach

This tutorial uses a small-scale Raspberry Pi cluster to develop foundational skills that scale to enterprise environments:

  • Learning by doing: Hands-on experience with real hardware
  • Problem-based learning: Troubleshooting actual system issues
  • Progressive complexity: Building from basic to advanced concepts
  • Documentation practice: Creating clear technical documentation
  • Research mindset: Exploring new technologies and approaches

Cluster Architecture and Planning

Requirements Analysis

Before building any HPC system, you must understand the requirements:

  1. Computational needs:
    - Types of workloads (CPU-bound, memory-bound, I/O-intensive)
    - Parallel computing requirements (MPI, shared memory, accelerators)
    - Job characteristics (long-running, high-throughput, interactive)

  2. User requirements:
    - Number of concurrent users
    - Software environment needs
    - Data storage and transfer requirements

  3. Resource constraints:
    - Budget limitations
    - Power and cooling capacity
    - Physical space
    - Network infrastructure

Hardware Selection

For our Raspberry Pi 5 cluster, we’re using:

  • 3× Raspberry Pi 5 (8GB RAM model)
  • Gigabit Ethernet switch
  • Power supplies
  • Storage devices

In enterprise environments, you would consider:

  • Compute nodes: Server-grade CPUs, memory capacity, accelerators (GPUs)
  • Network fabric: InfiniBand, high-speed Ethernet, or specialized interconnects
  • Storage systems: Parallel file systems, tiered storage, archival solutions
  • Management infrastructure: Out-of-band management, monitoring systems

Logical Architecture

Our Raspberry Pi cluster uses a simple architecture:

graph TD
    subgraph Logical Architecture
        A[Head Node<br>pi-head] --> B[Compute Node<br>pi-compute-01]
        A --> C[Compute Node<br>pi-compute-02]
    end
    subgraph Services
        A --> D[NFS Server]
        A --> E[Slurm Controller]
        A --> F[User Authentication]
        A --> G[Monitoring]
        B --> H[Slurm Compute Daemon]
        C --> I[Slurm Compute Daemon]
    end

Enterprise HPC clusters follow similar patterns but with more complexity:

  • Dedicated login nodes
  • Multiple management nodes for redundancy
  • Specialized data transfer nodes
  • Visualization nodes
  • Different node types optimized for specific workloads

Network Architecture

Our simple network layout:

graph TD
    A[Gigabit Switch] --- B[pi-head<br>192.168.1.100]
    A --- C[pi-compute-01<br>192.168.1.101]
    A --- D[pi-compute-02<br>192.168.1.102]

In enterprise settings, you would design:

  • Multiple networks (management, storage, computation)
  • High-bandwidth, low-latency interconnects
  • Network security zones
  • External connectivity with appropriate controls

Installation and Deployment

Operating System Installation

For our Raspberry Pi cluster:

  1. Base OS Installation:
   # Flash Debian Bookworm to microSD cards using Raspberry Pi Imager
   # Enable SSH during initial setup
  2. System Updates:
   sudo apt update
   sudo apt upgrade -y
  3. Essential Packages:
   sudo apt install -y vim git htop ntp rsync build-essential python3-pip

Enterprise OS Deployment

In larger environments, you would:

  • Use automated provisioning tools (Kickstart, Cobbler, etc.)
  • Implement PXE boot infrastructure
  • Create standardized OS images
  • Use configuration management tools (Ansible, Puppet, etc.)
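For a flavor of what that looks like, here is a minimal sketch of an Ansible inventory and ad-hoc command; the inventory filename and group name are illustrative, not part of this cluster's setup:

   # inventory.ini (illustrative)
   #   [compute]
   #   pi-compute-01
   #   pi-compute-02

   # Install a package on every host in the compute group
   ansible compute -i inventory.ini -m apt -a "name=htop state=present" --become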

Network Configuration

  1. Hostname Configuration:
   # On head node
   sudo hostnamectl set-hostname pi-head

   # On compute nodes
   sudo hostnamectl set-hostname pi-compute-01
   sudo hostnamectl set-hostname pi-compute-02
  2. Static IP Assignment:
   sudo nano /etc/network/interfaces.d/eth0

Add to the file:

   auto eth0
   iface eth0 inet static
       address 192.168.1.100  # Adjust for each node
       netmask 255.255.255.0
       gateway 192.168.1.1    # Your network gateway
       dns-nameservers 8.8.8.8 8.8.4.4
Note: if your image defaults to NetworkManager (as Raspberry Pi OS Bookworm does), /etc/network/interfaces is ignored; set the static address with nmtui or nmcli instead.

  3. Host File Configuration:
   sudo nano /etc/hosts

Add:

   192.168.1.100 pi-head
   192.168.1.101 pi-compute-01
   192.168.1.102 pi-compute-02
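With addresses and host entries in place, a quick check from the head node confirms that each node resolves and answers (assuming the addresses above):

   getent hosts pi-compute-01 pi-compute-02
   ping -c 2 pi-compute-01
   ping -c 2 pi-compute-02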

Cluster Management Tools

For our Raspberry Pi cluster, we’ll use a simple toolset:

  1. Cluster Shell:
   sudo apt install -y clustershell

Configure node groups:

   sudo mkdir -p /etc/clustershell/groups.d
   sudo nano /etc/clustershell/groups.d/local.cfg

Add (ClusterShell group files use INI-style sections; local is the default group source):

   [local]
   all: pi-head,pi-compute-[01-02]
   compute: pi-compute-[01-02]
  2. Shared SSH Access:
   # Generate SSH key on head node
   ssh-keygen -t ed25519

   # Distribute to compute nodes
   ssh-copy-id pi@pi-compute-01
   ssh-copy-id pi@pi-compute-02
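With the node groups defined and keys distributed, ClusterShell can run commands across the cluster from the head node, for example:

   # Run a command on every compute node
   clush -g compute uptime

   # Merge identical output across nodes
   clush -g compute -b uname -r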

Slurm Installation

Slurm is the industry-standard HPC job scheduler:

  1. Install Dependencies:
   # On all nodes
   sudo apt install -y munge libmunge-dev

   # On head node
   sudo apt install -y slurm-wlm

   # On compute nodes
   sudo apt install -y slurmd slurm-client
  2. Munge Configuration (authentication):
   # On head node
   sudo systemctl enable munge
   sudo systemctl start munge

   # Copy key to compute nodes
   sudo scp /etc/munge/munge.key pi@pi-compute-01:/tmp/
   sudo scp /etc/munge/munge.key pi@pi-compute-02:/tmp/

On each compute node:

   sudo cp /tmp/munge.key /etc/munge/
   sudo chown munge:munge /etc/munge/munge.key
   sudo chmod 400 /etc/munge/munge.key
   sudo systemctl enable munge
   sudo systemctl start munge
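Before configuring Slurm, verify that a credential generated on the head node is accepted by the compute nodes; a common check is:

   # From the head node
   munge -n | ssh pi-compute-01 unmunge
   munge -n | ssh pi-compute-02 unmunge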
  3. Slurm Configuration:
   sudo nano /etc/slurm/slurm.conf

Basic configuration:

   ClusterName=pi-cluster
   SlurmctldHost=pi-head

   # Control
   SlurmctldPidFile=/var/run/slurm/slurmctld.pid
   SlurmdPidFile=/var/run/slurm/slurmd.pid
   SlurmdSpoolDir=/var/spool/slurmd
   StateSaveLocation=/var/spool/slurmctld

   # Authentication
   AuthType=auth/munge
   CryptoType=crypto/munge

   # Node configurations
   NodeName=pi-head CPUs=4 RealMemory=7000 State=UNKNOWN
   NodeName=pi-compute-[01-02] CPUs=4 RealMemory=7000 State=UNKNOWN

   # Partition configuration
   PartitionName=main Nodes=pi-head,pi-compute-[01-02] Default=YES MaxTime=INFINITE State=UP
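Two practical points before starting the services: every node needs an identical copy of slurm.conf, and the spool/state directories referenced above must exist. A sketch of one way to do this, reusing the ClusterShell and /tmp staging pattern from the munge step (paths follow the configuration above; adjust if yours differ):

   # On the head node: create the directories named in slurm.conf
   sudo mkdir -p /var/spool/slurmctld /var/spool/slurmd /var/run/slurm
   sudo chown slurm:slurm /var/spool/slurmctld /var/run/slurm
   # (/var/run is cleared at reboot; a tmpfiles.d rule or the packaged default
   #  PID locations avoid recreating these by hand)

   # On the compute nodes (assumes passwordless sudo for the remote user)
   clush -g compute 'sudo mkdir -p /var/spool/slurmd /var/run/slurm'

   # Distribute the head node's slurm.conf
   clush -g compute --copy /etc/slurm/slurm.conf --dest /tmp/
   clush -g compute 'sudo mv /tmp/slurm.conf /etc/slurm/slurm.conf'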
  4. Start Slurm Services:
   # On head node
   sudo systemctl enable slurmctld
   sudo systemctl start slurmctld

   # On compute nodes
   sudo systemctl enable slurmd
   sudo systemctl start slurmd
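If everything started cleanly, the controller should see all three nodes. If a node shows as down or unknown after a configuration change, it can usually be returned to service by hand:

   # On the head node
   sinfo -N -l     # per-node view of the state reported by slurmctld

   # Clear a node stuck in a down/unknown state (adjust the node name)
   sudo scontrol update NodeName=pi-compute-01 State=RESUME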

Storage Configuration

  1. NFS Setup (on head node):
   sudo apt install -y nfs-kernel-server

   # Create shared directories
   sudo mkdir -p /shared /shared/apps /shared/data

   sudo nano /etc/exports

Add:

   /shared 192.168.1.0/24(rw,sync,no_subtree_check)
   /home   192.168.1.0/24(rw,sync,no_subtree_check)

Activate:

   sudo exportfs -a
   sudo systemctl restart nfs-kernel-server
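You can confirm the exports are visible before configuring the clients:

   # On the head node
   sudo exportfs -v        # list active exports and their options
   showmount -e pi-head    # what clients will see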
  2. NFS Client Setup (on compute nodes):
   sudo apt install -y nfs-common

   # Create mount points
   sudo mkdir -p /shared

   # Mount NFS shares
   sudo mount pi-head:/shared /shared
   sudo mount pi-head:/home /home

   # Make persistent
   sudo nano /etc/fstab

Add:

   pi-head:/shared /shared nfs defaults 0 0
   pi-head:/home   /home   nfs defaults 0 0
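After updating /etc/fstab, verify that the shares mount cleanly on each compute node, for example via ClusterShell from the head node:

   # Mount everything listed in fstab and show the result on each compute node
   clush -g compute -b 'sudo mount -a && df -h /shared /home'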

Software Environment

  1. Environment Modules:
   sudo apt install -y environment-modules
   sudo mkdir -p /shared/modulefiles
   # /shared was created as root during NFS setup; give the admin user
   # ownership of the software areas so the following steps work without sudo
   sudo chown -R "$USER": /shared/modulefiles /shared/apps

Create a sample module:

   nano /shared/modulefiles/python-ml

Add:

   #%Module1.0

   proc ModulesHelp { } {
       puts stderr "Python ML environment"
   }

   module-whatis "Python machine learning environment"

   prepend-path PATH /shared/apps/python-ml/bin
  2. Deploy Software:
   mkdir -p /shared/apps/python-ml/bin

   # Create virtual environment
   python3 -m venv /shared/apps/python-ml

   # Activate and install packages
   source /shared/apps/python-ml/bin/activate
   pip install numpy scipy scikit-learn pandas matplotlib
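One gap to close before testing: the module command only searches directories listed in MODULEPATH, and /shared/modulefiles is not included by default. A sketch, assuming the Debian environment-modules packaging (verify the config path on your installation):

   # For the current shell session
   module use /shared/modulefiles

   # Persist it system-wide on each node; Debian's environment-modules package
   # typically reads extra directories from this file (confirm the path)
   echo '/shared/modulefiles' | sudo tee -a /etc/environment-modules/modulespath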

System Testing

Verify your deployment:

# Check Slurm
sinfo

# Check NFS
df -h

# Test module system
module avail
module load python-ml

Submit a test job:

# Create job script
cat > test.sh << 'EOF'
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --output=test_%j.out
#SBATCH --nodes=2
#SBATCH --ntasks=2

hostname
sleep 10
srun hostname
EOF

# Submit job
sbatch test.sh

# Check status
squeue
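When the job completes, the output file written by the --output directive shows which nodes ran the tasks:

# Job output (%j in the script expands to the job ID)
cat test_*.out

# Detailed job information while the record is still held by the controller
scontrol show job <jobid>    # replace <jobid> with the ID printed by sbatch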

Next Steps

In the next tutorial, we’ll cover:
- Account management and security
- Resource management and scheduling
- Storage solutions and data management

This foundation will prepare you for more advanced HPC administration tasks.