Building a Raspberry Pi 5 HPC Cluster with Slurm
Created: 2025-03-22 22:19:50 | Last updated: 2025-03-23 18:01:06 | Status: Public
This tutorial guides you through setting up a small High-Performance Computing (HPC) cluster using 3 Raspberry Pi 5 devices with 8GB RAM each, running Debian Bookworm, and connected via a dedicated Gigabit switch.
Table of Contents
- Prerequisites
- Cluster Architecture
- Initial Setup
- Network Configuration
- SSH Access to Compute Nodes
- Shared Storage with NFS
- User Management
- Installing Slurm
- Slurm Configuration
- Testing Your Cluster
- Troubleshooting
Prerequisites
Hardware Requirements:
- 3× Raspberry Pi 5 (8GB RAM)
- 3× Power supplies (30W USB-C recommended)
- 3× microSD cards (32GB+ recommended)
- 1× Gigabit Ethernet switch
- Ethernet cables
- Optional: USB SSD for shared storage
Software Requirements:
- Debian Bookworm OS
- SSH enabled on all nodes
- Basic Linux knowledge
Cluster Architecture
Our cluster will consist of:
- 1 head node (controller + compute capability) with dual network connectivity
- 2 compute nodes on a private network
The head node will have:
- WiFi connection to your home network (192.168.x.x)
- Ethernet connection to a private cluster network (10.0.0.x)
The compute nodes will only have:
- Ethernet connection to the private cluster network (10.0.0.x)
Network Diagram
              ┌───────────────┐
              │  Home Router  │
              │  192.168.x.x  │
              └───────┬───────┘
                      │ WiFi
              ┌───────┴───────────┐
              │     Head Node     │
              │      pi-head      │
              │ WiFi: 192.168.x.x │
              │ eth0: 10.0.0.1    │
              └───────┬───────────┘
                      │ Ethernet
              ┌───────┴───────────┐
              │  Gigabit Switch   │
              │  private network  │
              │    10.0.0.0/24    │
              └───┬───────────┬───┘
                  │           │
      ┌───────────┴───┐   ┌───┴───────────┐
      │ pi-compute01  │   │ pi-compute02  │
      │   10.0.0.2    │   │   10.0.0.3    │
      └───────────────┘   └───────────────┘
Initial Setup
1. Prepare the OS
For each Raspberry Pi:
- Flash Debian Bookworm to each microSD card
- Boot each Pi and complete initial setup
- Update the system:
sudo apt update
sudo apt upgrade -y
- Install essential packages:
sudo apt install -y vim git htop ntp build-essential
2. Configure Hostname and Hosts Files
For the head node:
sudo hostnamectl set-hostname pi-head
For compute nodes:
# On first compute node
sudo hostnamectl set-hostname pi-compute01
# On second compute node
sudo hostnamectl set-hostname pi-compute02
Edit /etc/hosts on each node to include all nodes:
sudo nano /etc/hosts
Add the following lines to each node:
10.0.0.1 pi-head
10.0.0.2 pi-compute01
10.0.0.3 pi-compute02
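Once the entries are in place, name resolution can be checked immediately with getent; pinging by name will only work after the private network is configured in the next section. A minimal check:
# Each hostname should resolve to its 10.0.0.x address
getent hosts pi-head pi-compute01 pi-compute02
# After the network configuration below, the names should also be reachable
ping -c 2 pi-compute01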
Network Configuration
1. Set Up Dual-Network on Head Node
On the head node, configure WiFi for external access and Ethernet for cluster communication:
- Ensure WiFi is connected to your home network via NetworkManager.
- Configure a static IP for the Ethernet interface using systemd-networkd:
sudo mkdir -p /etc/systemd/network/
sudo nano /etc/systemd/network/20-wired.network
Add the following (if your Ethernet interface is not named eth0, check the actual name with ip link and adjust the Match line):
[Match]
Name=eth0
[Network]
Address=10.0.0.1/24
- Enable IP forwarding so compute nodes can reach the internet through the head node (pair this with the NAT rules in the Troubleshooting section):
echo "net.ipv4.ip_forward=1" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
- Apply the configuration:
sudo systemctl enable systemd-networkd
sudo systemctl start systemd-networkd
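To confirm the static address took effect, a quick check (assuming the interface is named eth0):
# eth0 should carry 10.0.0.1 and be managed by systemd-networkd
ip addr show eth0
networkctl status eth0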
2. Set Up Ethernet-Only Private Network on Compute Nodes
On each compute node:
- Disable WiFi:
sudo nmcli radio wifi off
sudo systemctl disable wpa_supplicant.service
- Configure static IP for Ethernet using systemd-networkd:
sudo mkdir -p /etc/systemd/network/
sudo nano /etc/systemd/network/20-wired.network
Add the following (adjust the IP address for each compute node, and the interface name if it is not eth0):
# For pi-compute01
[Match]
Name=eth0
[Network]
Address=10.0.0.2/24
Gateway=10.0.0.1
DNS=8.8.8.8 8.8.4.4
# For pi-compute02
[Match]
Name=eth0
[Network]
Address=10.0.0.3/24
Gateway=10.0.0.1
DNS=8.8.8.8 8.8.4.4
- Apply the configuration:
sudo systemctl enable systemd-networkd
sudo systemctl start systemd-networkd
- Verify connectivity:
ping 10.0.0.1
3. Configure SSH Key Authentication
On the head node, generate SSH keys:
ssh-keygen -t ed25519 -C "cluster-key"
Copy the key to each compute node:
ssh-copy-id pi@10.0.0.2
ssh-copy-id pi@10.0.0.3
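You can confirm key-based login works without a password prompt using a short loop (assumes the default pi user):
# BatchMode fails instead of prompting, so any output means key auth works
for ip in 10.0.0.2 10.0.0.3; do
    ssh -o BatchMode=yes pi@"$ip" hostname
done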
SSH Access to Compute Nodes
1. SSH via Head Node (Jump Host)
The compute nodes sit on the private network, so to reach them from your personal computer you go through the head node, which acts as a jump host.
On your personal computer, set up an SSH config file for easier access:
nano ~/.ssh/config
Add:
Host pi-head
HostName 192.168.x.x # Replace with your head node's WiFi IP
User pi # Replace with your username
Host pi-compute01
HostName 10.0.0.2
User pi # Replace with your username
ProxyJump pi-head
Host pi-compute02
HostName 10.0.0.3
User pi # Replace with your username
ProxyJump pi-head
Now you can directly SSH to any node:
ssh pi-head
ssh pi-compute01
ssh pi-compute02
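The ProxyJump entries apply to scp and sftp as well, so files can be copied straight from your personal computer to a compute node; myscript.sh below is just a placeholder name:
# Transparently routed through pi-head
scp ./myscript.sh pi-compute01:~/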
2. Test Connectivity
After setting up the network:
- From head node, verify you can reach compute nodes:
ping 10.0.0.2
ping 10.0.0.3
ssh pi@10.0.0.2
ssh pi@10.0.0.3
- From compute nodes, verify they can reach the head node and internet:
ping 10.0.0.1
ping 8.8.8.8
Shared Storage with NFS
1. Set Up NFS Server (Head Node)
Install NFS server:
sudo apt install -y nfs-kernel-server
Create a shared directory:
sudo mkdir -p /shared
sudo chmod 777 /shared # For tutorial purposes; use proper permissions in production
Configure exports:
sudo nano /etc/exports
Add the following:
/shared 10.0.0.0/24(rw,sync,no_subtree_check,no_root_squash)
/home 10.0.0.0/24(rw,sync,no_subtree_check)
Apply the configuration:
sudo exportfs -a
sudo systemctl restart nfs-kernel-server
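Before configuring the clients, you can confirm the exports are active on the head node:
# List exported directories with their options
sudo exportfs -v
# Show what NFS clients will see
showmount -e localhost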
2. Set Up NFS Clients (Compute Nodes)
On each compute node:
sudo apt install -y nfs-common
sudo mkdir -p /shared
# Add mount entries to fstab
sudo nano /etc/fstab
Add these lines:
10.0.0.1:/shared /shared nfs defaults 0 0
10.0.0.1:/home /home nfs defaults 0 0
After modifying fstab, reload the daemon and mount the shares:
sudo systemctl daemon-reload
sudo mount -a
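A quick check from a compute node confirms the shares are mounted and writable:
# Both mounts should list 10.0.0.1 as the source
df -h /shared /home
# Should succeed and be visible from the head node as well
touch /shared/nfs_write_test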
User Management
1. Create Cluster User
Create the same user on every node. NFS shares the home directory, but each node keeps its own account database, and Slurm requires the submitting user to exist with an identical UID/GID on all nodes. The UID below (2000) is just an example; any unused value works as long as it matches everywhere:
# On every node (head and both compute nodes)
sudo adduser --uid 2000 hpcuser
# Grant sudo on the head node only, if desired
sudo usermod -aG sudo hpcuser
2. Test User Accessibility on Compute Nodes
To test that the hpcuser is accessible on compute nodes after NFS home is mounted:
# On the head node, create a test file in hpcuser's home directory
sudo -u hpcuser touch /home/hpcuser/test_file
# SSH to a compute node
ssh 10.0.0.2
# Check if the test file exists and is accessible
ls -la /home/hpcuser/test_file
# Try to switch to the hpcuser account
su - hpcuser
# Verify you can create files as this user
touch ~/test_from_compute
exit
# Return to head node and verify the file is visible
ssh 10.0.0.1
ls -la /home/hpcuser/test_from_compute
If all tests pass, your NFS home directory and user setup are working correctly.
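Because Slurm and NFS both identify users by numeric UID, it is also worth confirming that hpcuser has the same UID and GID everywhere; a minimal check from the head node:
# The uid= and gid= values must match on all three nodes
id hpcuser
for node in pi-compute01 pi-compute02; do
    ssh pi@"$node" id hpcuser
done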
Installing Slurm
1. Install Dependencies (All Nodes)
On all nodes:
sudo apt install -y slurmd slurm-client munge libmunge-dev
On the head node also install:
sudo apt install -y slurmctld slurm-wlm-basic-plugins
2. Configure Munge Authentication (All Nodes)
On the head node:
# Create a munge key
sudo /usr/sbin/create-munge-key -r
sudo systemctl enable munge
sudo systemctl start munge
# Copy the key to a location accessible via NFS
sudo cp /etc/munge/munge.key /shared/
sudo chmod 400 /shared/munge.key
On compute nodes:
# Stop munge if running
sudo systemctl stop munge
# Copy the key
sudo cp /shared/munge.key /etc/munge/
sudo chown munge:munge /etc/munge/munge.key
sudo chmod 400 /etc/munge/munge.key
# Restart munge
sudo systemctl enable munge
sudo systemctl start munge
Test munge on all nodes:
munge -n | unmunge
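Testing locally only proves munge runs; to confirm the nodes actually share the same key, decode a credential generated on the head node on each compute node:
# A successful decode on the remote side means the keys match
munge -n | ssh pi-compute01 unmunge
munge -n | ssh pi-compute02 unmunge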
Slurm Configuration
1. Create Slurm Configuration File
On the head node, create the configuration:
sudo nano /etc/slurm/slurm.conf
Use this base configuration (adjusted for our network):
# slurm.conf
ClusterName=pi-cluster
SlurmctldHost=pi-head
# Authentication/security
AuthType=auth/munge
CryptoType=crypto/munge
MpiDefault=none
# Scheduling
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
# Performance
SlurmctldDebug=info
SlurmdDebug=info
JobAcctGatherType=jobacct_gather/none
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log
# Daemon user and state/spool locations
SlurmUser=slurm
StateSaveLocation=/var/spool/slurmctld
SlurmdSpoolDir=/var/spool/slurmd
# Process management
ProctrackType=proctrack/linuxproc
TaskPlugin=task/none
# Node configurations - using private network IPs
NodeName=pi-head CPUs=4 RealMemory=7000 State=UNKNOWN
NodeName=pi-compute01 CPUs=4 RealMemory=7000 State=UNKNOWN
NodeName=pi-compute02 CPUs=4 RealMemory=7000 State=UNKNOWN
# Partition configuration
PartitionName=main Nodes=pi-head,pi-compute01,pi-compute02 Default=YES MaxTime=INFINITE State=UP
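The CPUs and RealMemory values above are sensible for a 4-core, 8GB Pi 5, but each node can report its own detected hardware, which you can paste directly into the NodeName lines:
# Prints a NodeName=... line with the CPUs, cores and RealMemory slurmd detects
sudo slurmd -C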
Create the log and state directories (the state directory must be writable by the slurm user):
sudo mkdir -p /var/log/slurm /var/spool/slurmctld
sudo chown slurm:slurm /var/log/slurm /var/spool/slurmctld
2. Distribute Configuration
Copy to all nodes:
sudo cp /etc/slurm/slurm.conf /shared/
On compute nodes, copy the file into place and create the log directory:
sudo cp /shared/slurm.conf /etc/slurm/
sudo mkdir -p /var/log/slurm
3. Start Slurm Services
On the head node (it is also a compute node in the partition, so it needs both daemons):
sudo systemctl enable slurmctld slurmd
sudo systemctl start slurmctld slurmd
On compute nodes:
sudo systemctl enable slurmd
sudo systemctl start slurmd
Testing Your Cluster
1. Check Cluster Status
On the head node:
sinfo
You should see all nodes in your cluster listed.
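If a node shows up as down or unknown on first start (common while the daemons settle), check that slurmd is running on it and then return it to service manually:
# Clear the down/unknown state once slurmd on the node is healthy
sudo scontrol update NodeName=pi-compute01,pi-compute02 State=RESUME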
2. Run a Test Job
Create a test job script:
nano ~/test_job.sh
Add the following:
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --output=test_%j.out
#SBATCH --error=test_%j.err
#SBATCH --ntasks=4
#SBATCH --time=00:05:00
hostname
sleep 10
echo "This is a test job running on $(hostname)"
srun hostname
Make it executable:
chmod +x ~/test_job.sh
Submit the job:
sbatch ~/test_job.sh
Check job status:
squeue
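When the job completes, its output lands in the file named by the --output directive (test_<jobid>.out in the submission directory); while the job is still queued or running, scontrol gives more detail:
# Replace <jobid> with the ID reported by sbatch
cat test_<jobid>.out
scontrol show job <jobid>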
3. Run a Simple MPI Job
Install MPI on all nodes (the compute nodes need the Open MPI runtime libraries to run the binary):
sudo apt install -y openmpi-bin libopenmpi-dev
Create an MPI test program:
nano ~/mpi_hello.c
Add the following:
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>
int main(int argc, char** argv) {
int world_size, world_rank;
char hostname[256];
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &world_size);
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
gethostname(hostname, sizeof(hostname));
printf("Hello from processor %s, rank %d out of %d processors\n",
hostname, world_rank, world_size);
MPI_Finalize();
return 0;
}
Compile:
mpicc -o mpi_hello mpi_hello.c
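Before involving Slurm, it is worth a quick local smoke test on the head node to confirm the binary and the Open MPI runtime work:
# Run 4 ranks locally on the head node's 4 cores
mpirun -np 4 ./mpi_hello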
Create a submission script:
nano ~/mpi_job.sh
Add the following:
#!/bin/bash
#SBATCH --job-name=mpi_test
#SBATCH --output=mpi_%j.out
#SBATCH --error=mpi_%j.err
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --time=00:05:00
# module load mpi/openmpi   # Only needed if you use environment modules
# Check which MPI plugins your Slurm build provides with: srun --mpi=list
# and use the pmix entry it reports (pmix, pmix_v3, pmix_v4, ...).
# If no pmix plugin is listed, you can try: mpirun ./mpi_hello
srun --mpi=pmix_v3 ./mpi_hello
Submit the job:
sbatch ~/mpi_job.sh
Troubleshooting
Common Issues and Solutions
- Network Connectivity Issues:
# Check if nodes can ping each other
ping 10.0.0.1
ping 10.0.0.2
ping 10.0.0.3
# Check network interface status
ip addr show
# Check systemd-networkd status
sudo systemctl status systemd-networkd
# Restart networking if needed
sudo systemctl restart systemd-networkd
- SSH Connection Problems:
# Check SSH service status
sudo systemctl status ssh
# Check SSH config for errors
sudo sshd -t
# Check SSH key permissions
ls -la ~/.ssh/
chmod 600 ~/.ssh/id_ed25519
chmod 644 ~/.ssh/id_ed25519.pub
- Nodes showing DOWN state:
# Check slurmd logs
sudo systemctl status slurmd
cat /var/log/slurm/slurmd.log
# Restart slurmd
sudo systemctl restart slurmd
- Internet Access from Compute Nodes:
# On head node, enable NAT if needed
sudo iptables -t nat -A POSTROUTING -o wlan0 -j MASQUERADE
sudo iptables -A FORWARD -i eth0 -o wlan0 -j ACCEPT
sudo iptables -A FORWARD -i wlan0 -o eth0 -m state --state RELATED,ESTABLISHED -j ACCEPT
# Make iptables persistent
sudo apt install -y iptables-persistent
sudo netfilter-persistent save
- NFS issues:
# Check mounts
df -h
# Remount if needed
sudo systemctl daemon-reload
sudo mount -a
This tutorial provides the foundation for your Raspberry Pi 5 HPC cluster with a dual-network setup. From here, you can expand by adding more nodes, configuring GPU resources if available, implementing more sophisticated job scheduling policies, or adding monitoring tools like Ganglia or Prometheus.
Happy cluster computing!