HPC Administration Tutorial - Part 1
Created: 2025-03-16 17:02:07 | Last updated: 2025-03-16 17:02:07 | Status: Public
Introduction, Architecture Planning, and Deployment
This tutorial series is designed to help you develop skills relevant to a High-Performance Computing (HPC) System Administrator role in a research environment. While we’ll be using a small-scale Raspberry Pi 5 cluster, the concepts and practices apply to larger enterprise HPC environments.
Table of Contents
- Introduction to HPC Administration
- Cluster Architecture and Planning
- Installation and Deployment
Introduction to HPC Administration
The Role of an HPC Administrator
As an HPC system administrator, you’ll be responsible for configuring, deploying, and supporting HPC clusters that enable critical research. This includes:
- System Administration: Managing hardware, software, networking, and security
- Problem Solving: Diagnosing and resolving complex technical issues
- User Support: Helping researchers effectively utilize HPC resources
- Performance Optimization: Ensuring systems operate at peak efficiency
- Infrastructure Planning: Contributing to future system specifications and upgrades
HPC in Research Environments
Research computing has unique requirements compared to enterprise IT:
- High computational demands: Simulations, data analysis, AI/ML workloads
- Specialized software stacks: Domain-specific applications and libraries
- Diverse user needs: Supporting multiple research disciplines
- Resource sharing: Efficient allocation across various research groups
- Data-intensive workflows: Managing, processing, and storing large datasets
Skills Development Approach
This tutorial uses a small-scale Raspberry Pi cluster to develop foundational skills that scale to enterprise environments:
- Learning by doing: Hands-on experience with real hardware
- Problem-based learning: Troubleshooting actual system issues
- Progressive complexity: Building from basic to advanced concepts
- Documentation practice: Creating clear technical documentation
- Research mindset: Exploring new technologies and approaches
Cluster Architecture and Planning
Requirements Analysis
Before building any HPC system, you must understand the requirements:
- Computational needs:
  - Types of workloads (CPU-bound, memory-bound, I/O-intensive)
  - Parallel computing requirements (MPI, shared memory, accelerators)
  - Job characteristics (long-running, high-throughput, interactive)
- User requirements:
  - Number of concurrent users
  - Software environment needs
  - Data storage and transfer requirements
- Resource constraints:
  - Budget limitations
  - Power and cooling capacity
  - Physical space
  - Network infrastructure
Hardware Selection
For our Raspberry Pi 5 cluster, we’re using:
- 3× Raspberry Pi 5 (8GB RAM model)
- Gigabit Ethernet switch
- Power supplies
- Storage devices
In enterprise environments, you would consider:
- Compute nodes: Server-grade CPUs, memory capacity, accelerators (GPUs)
- Network fabric: InfiniBand, high-speed Ethernet, or specialized interconnects
- Storage systems: Parallel file systems, tiered storage, archival solutions
- Management infrastructure: Out-of-band management, monitoring systems
Logical Architecture
Our Raspberry Pi cluster uses a simple architecture:
graph TD
subgraph Cluster
A[Head Node: pi-head] --> B[Compute Node: pi-compute-01]
A --> C[Compute Node: pi-compute-02]
end
subgraph Services
A --> D[NFS Server]
A --> E[Slurm Controller]
A --> F[User Authentication]
A --> G[Monitoring]
B --> H[Slurm Compute Daemon]
C --> I[Slurm Compute Daemon]
end
Enterprise HPC clusters follow similar patterns but with more complexity:
- Dedicated login nodes
- Multiple management nodes for redundancy
- Specialized data transfer nodes
- Visualization nodes
- Different node types optimized for specific workloads
Network Architecture
Our simple network layout:
graph TD
A[Gigabit Ethernet Switch] --- B[pi-head: 192.168.1.100]
A --- C[pi-compute-01: 192.168.1.101]
A --- D[pi-compute-02: 192.168.1.102]
In enterprise settings, you would design:
- Multiple networks (management, storage, computation)
- High-bandwidth, low-latency interconnects
- Network security zones
- External connectivity with appropriate controls
Installation and Deployment
Operating System Installation
For our Raspberry Pi cluster:
- Base OS Installation:
# Flash Debian Bookworm to microSD cards using Raspberry Pi Imager
# Enable SSH during initial setup
- System Updates:
sudo apt update
sudo apt upgrade -y
- Essential Packages:
sudo apt install -y vim git htop ntp rsync build-essential python3-pip
Enterprise OS Deployment
In larger environments, you would:
- Use automated provisioning tools (Kickstart, Cobbler, etc.)
- Implement PXE boot infrastructure
- Create standardized OS images
- Use configuration management tools (Ansible, Puppet, etc.)
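To make the configuration-management idea concrete on our small cluster, here is a minimal sketch using Ansible in ad-hoc mode from the head node. The inventory file name is arbitrary, and it assumes the ansible package is installed on the head node and that passwordless SSH to the compute nodes has been set up (covered later in this tutorial):
# Hypothetical inventory describing the compute nodes (file name is arbitrary)
cat > ~/cluster.ini << 'EOF'
[compute]
pi-compute-01
pi-compute-02
EOF
# Ad-hoc run: ensure htop is present on every compute node
# --become assumes passwordless sudo for the pi user on the targets
ansible compute -i ~/cluster.ini -m apt -a "name=htop state=present" --become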
Network Configuration
- Hostname Configuration:
# On head node
sudo hostnamectl set-hostname pi-head
# On compute nodes
sudo hostnamectl set-hostname pi-compute-01
sudo hostnamectl set-hostname pi-compute-02
- Static IP Assignment:
sudo nano /etc/network/interfaces.d/eth0
Add to the file:
auto eth0
iface eth0 inet static
address 192.168.1.100 # Adjust for each node
netmask 255.255.255.0
gateway 192.168.1.1 # Your network gateway
dns-nameservers 8.8.8.8 8.8.4.4
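Note that recent Raspberry Pi OS / Debian Bookworm images usually manage networking with NetworkManager, in which case the ifupdown file above may be ignored. A hedged equivalent using nmcli; the connection name "Wired connection 1" is the common default but may differ on your system:
# Adjust the address for each node, as above
sudo nmcli con mod "Wired connection 1" ipv4.method manual \
  ipv4.addresses 192.168.1.100/24 ipv4.gateway 192.168.1.1 \
  ipv4.dns "8.8.8.8 8.8.4.4"
sudo nmcli con up "Wired connection 1"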
- Host File Configuration:
sudo nano /etc/hosts
Add:
192.168.1.100 pi-head
192.168.1.101 pi-compute-01
192.168.1.102 pi-compute-02
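A quick check from the head node that name resolution and connectivity work (assumes the compute nodes are powered on and configured):
ping -c 2 pi-compute-01
ping -c 2 pi-compute-02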
Cluster Management Tools
For our Raspberry Pi cluster, we’ll use a simple toolset:
- Cluster Shell:
sudo apt install -y clustershell
Configure node groups:
sudo mkdir -p /etc/clustershell/groups.d
sudo nano /etc/clustershell/groups.d/local.cfg
Add:
[local]
all: pi-head,pi-compute-[01-02]
compute: pi-compute-[01-02]
- Shared SSH Access:
# Generate SSH key on head node
ssh-keygen -t ed25519
# Distribute to compute nodes
ssh-copy-id pi@pi-compute-01
ssh-copy-id pi@pi-compute-02
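With the keys distributed and the node groups defined above, you can run commands across the cluster from the head node, for example:
# Kernel version and load on the compute nodes; -b merges identical output
clush -g compute -b uname -r
clush -g compute uptime
# To include the head node itself in the "all" group, also run: ssh-copy-id pi@pi-head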
Slurm Installation
Slurm is the industry-standard HPC job scheduler:
- Install Dependencies:
# On all nodes
sudo apt install -y munge libmunge-dev
# On head node
sudo apt install -y slurm-wlm
# On compute nodes
sudo apt install -y slurmd slurm-client
- Munge Configuration (authentication):
# On head node
sudo systemctl enable munge
sudo systemctl start munge
# Copy key to compute nodes
sudo scp /etc/munge/munge.key pi@pi-compute-01:/tmp/
sudo scp /etc/munge/munge.key pi@pi-compute-02:/tmp/
On each compute node:
sudo cp /tmp/munge.key /etc/munge/
sudo chown munge:munge /etc/munge/munge.key
sudo chmod 400 /etc/munge/munge.key
sudo systemctl enable munge
sudo systemctl start munge
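A common quick test that all nodes share the same key is to decode a head-node credential on each compute node:
# Run on the head node: encode locally, decode remotely
munge -n | ssh pi-compute-01 unmunge
munge -n | ssh pi-compute-02 unmunge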
- Slurm Configuration:
sudo nano /etc/slurm/slurm.conf
Basic configuration:
ClusterName=pi-cluster
SlurmctldHost=pi-head
# Control
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmdPidFile=/var/run/slurm/slurmd.pid
SlurmdSpoolDir=/var/spool/slurmd
StateSaveLocation=/var/spool/slurmctld
# Authentication
AuthType=auth/munge
CryptoType=crypto/munge
# Node configurations
NodeName=pi-head CPUs=4 RealMemory=7000 State=UNKNOWN
NodeName=pi-compute-[01-02] CPUs=4 RealMemory=7000 State=UNKNOWN
# Partition configuration
PartitionName=main Nodes=pi-head,pi-compute-[01-02] Default=YES MaxTime=INFINITE State=UP
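Slurm expects an identical slurm.conf on every node, and the spool/state directories referenced above need to exist (the Debian packages may or may not create them). A minimal sketch, assuming the packages created a slurm system user:
# On the head node: state directory for slurmctld
sudo mkdir -p /var/spool/slurmctld
sudo chown slurm:slurm /var/spool/slurmctld
# Copy the configuration to the compute nodes
sudo scp /etc/slurm/slurm.conf pi@pi-compute-01:/tmp/
sudo scp /etc/slurm/slurm.conf pi@pi-compute-02:/tmp/
# On each compute node
sudo mkdir -p /var/spool/slurmd
sudo cp /tmp/slurm.conf /etc/slurm/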
- Start Slurm Services:
# On head node
sudo systemctl enable slurmctld
sudo systemctl start slurmctld
# On compute nodes
sudo systemctl enable slurmd
sudo systemctl start slurmd
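A quick sanity check from the head node that the controller is responding and the nodes have registered:
# Is slurmctld answering?
scontrol ping
# All three nodes should eventually show an idle state
sinfo -N -l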
Storage Configuration
- NFS Setup (on head node):
sudo apt install -y nfs-kernel-server
# Create shared directories
sudo mkdir -p /shared /shared/apps /shared/data
# Let the regular admin user manage software under /shared (adjust if your user is not "pi")
sudo chown -R pi:pi /shared
sudo nano /etc/exports
Add:
/shared 192.168.1.0/24(rw,sync,no_subtree_check)
/home 192.168.1.0/24(rw,sync,no_subtree_check)
Activate:
sudo exportfs -a
sudo systemctl restart nfs-kernel-server
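You can confirm the exports on the head node before configuring the clients:
# List active exports
sudo exportfs -v
showmount -e pi-head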
- NFS Client Setup (on compute nodes):
sudo apt install -y nfs-common
# Create mount points
sudo mkdir -p /shared
# Mount NFS shares
sudo mount pi-head:/shared /shared
sudo mount pi-head:/home /home
# Make persistent
sudo nano /etc/fstab
Add:
pi-head:/shared /shared nfs defaults 0 0
pi-head:/home /home nfs defaults 0 0
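After editing fstab on each compute node, confirm the entries parse and mount cleanly:
# Re-mount everything listed in /etc/fstab
sudo mount -a
# Both NFS shares should appear here
df -h /shared /home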
Software Environment
- Environment Modules:
sudo apt install -y environment-modules
mkdir -p /shared/modulefiles
Create a sample module:
nano /shared/modulefiles/python-ml
Add:
#%Module1.0
proc ModulesHelp { } {
puts stderr "Python ML environment"
}
module-whatis "Python machine learning environment"
prepend-path PATH /shared/apps/python-ml/bin
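Environment Modules will only see this file if /shared/modulefiles is on its search path. One simple way to add it (system-wide approaches, such as editing the modules configuration, also work):
# Make the shared modulefiles visible in the current shell
module use /shared/modulefiles
# Persist it for future logins
echo "module use /shared/modulefiles" >> ~/.bashrc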
- Deploy Software:
mkdir -p /shared/apps/python-ml/bin
# Create virtual environment
python3 -m venv /shared/apps/python-ml
# Activate and install packages
source /shared/apps/python-ml/bin/activate
pip install numpy scipy scikit-learn pandas matplotlib
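A quick smoke test that the module and the virtual environment line up:
# In a fresh shell (so the venv is not already activated)
module load python-ml
which python3   # should resolve to /shared/apps/python-ml/bin/python3
python3 -c "import numpy, sklearn; print(numpy.__version__)"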
System Testing
Verify your deployment:
# Check Slurm
sinfo
# Check NFS
df -h
# Test module system
module avail
module load python-ml
Submit a test job:
# Create job script
cat > test.sh << 'EOF'
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --output=test_%j.out
#SBATCH --nodes=2
#SBATCH --ntasks=2
hostname
sleep 10
srun hostname
EOF
# Submit job
sbatch test.sh
# Check status
squeue
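When the job completes, the output file shows where each task ran:
# Output lands in the submission directory
cat test_*.out
# Detailed record while the job is still in Slurm's memory
# (replace <jobid> with the ID printed by sbatch)
scontrol show job <jobid>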
Next Steps
In the next tutorial, we’ll cover:
- Account management and security
- Resource management and scheduling
- Storage solutions and data management
This foundation will prepare you for more advanced HPC administration tasks.