Building a Raspberry Pi 5 HPC Cluster with Slurm
Created: 2025-03-22 22:19:50 | Last updated: 2025-03-23 18:01:06 | Status: Public
This tutorial guides you through setting up a small High-Performance Computing (HPC) cluster using 3 Raspberry Pi 5 devices with 8GB RAM each, running Debian Bookworm, and connected via a dedicated Gigabit switch.
Table of Contents
- Prerequisites
- Cluster Architecture
- Initial Setup
- Network Configuration
- SSH Access to Compute Nodes
- Shared Storage with NFS
- User Management
- Installing Slurm
- Slurm Configuration
- Testing Your Cluster
- Troubleshooting
Prerequisites
Hardware Requirements:
- 3× Raspberry Pi 5 (8GB RAM)
- 3× Power supplies (30W USB-C recommended)
- 3× microSD cards (32GB+ recommended)
- 1× Gigabit Ethernet switch
- Ethernet cables
- Optional: USB SSD for shared storage
Software Requirements:
- Debian Bookworm OS
- SSH enabled on all nodes
- Basic Linux knowledge
Cluster Architecture
Our cluster will consist of:
- 1 head node (controller + compute capability) with dual network connectivity
- 2 compute nodes on a private network
The head node will have:
- WiFi connection to your home network (192.168.x.x)
- Ethernet connection to a private cluster network (10.0.0.x)
The compute nodes will only have:
- Ethernet connection to the private cluster network (10.0.0.x)
Network Diagram
              ┌───────────────┐
              │  Home Router  │
              │  192.168.x.x  │
              └───────┬───────┘
                      │ WiFi
              ┌───────┴───────────┐
              │     Head Node     │
              │      pi-head      │
              │ WiFi: 192.168.x.x │
              │ eth0: 10.0.0.1    │
              └───────┬───────────┘
                      │ Ethernet
              ┌───────┴───────────┐
              │  Gigabit Switch   │
              │  private network  │
              │    10.0.0.0/24    │
              └───┬───────────┬───┘
                  │           │
      ┌───────────┴───┐   ┌───┴───────────┐
      │ pi-compute01  │   │ pi-compute02  │
      │   10.0.0.2    │   │   10.0.0.3    │
      └───────────────┘   └───────────────┘
Initial Setup
1. Prepare the OS
For each Raspberry Pi:
- Flash Debian Bookworm to each microSD card
- Boot each Pi and complete initial setup
- Update the system:
sudo apt update
sudo apt upgrade -y
- Install essential packages:
sudo apt install -y vim git htop ntp build-essential
2. Configure Hostname and Hosts Files
For the head node:
sudo hostnamectl set-hostname pi-head
For compute nodes:
# On first compute node
sudo hostnamectl set-hostname pi-compute01
# On second compute node
sudo hostnamectl set-hostname pi-compute02
Edit /etc/hosts on each node to include all nodes:
sudo nano /etc/hosts
Add the following lines to each node:
10.0.0.1 pi-head
10.0.0.2 pi-compute01
10.0.0.3 pi-compute02
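Once the entries are in place, name resolution can be checked immediately with getent; pinging by name will only work after the private network is configured in the next section. A minimal check:
# Each hostname should resolve to its 10.0.0.x address
getent hosts pi-head pi-compute01 pi-compute02
# After the network configuration below, the names should also be reachable
ping -c 2 pi-compute01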
Network Configuration
1. Set Up Dual-Network on Head Node
On the head node, configure WiFi for external access and Ethernet for cluster communication:
- Ensure WiFi is connected to your home network via NetworkManager.
- Configure a static IP for the Ethernet interface using systemd-networkd:
sudo mkdir -p /etc/systemd/network/
sudo nano /etc/systemd/network/20-wired.network
Add the following (if your Ethernet interface is not named eth0, check the actual name with ip link and adjust the Match line):
[Match]
Name=eth0
[Network]
Address=10.0.0.1/24
- Enable IP forwarding so compute nodes can reach the internet through the head node (pair this with the NAT rules in the Troubleshooting section):
echo "net.ipv4.ip_forward=1" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
- Apply the configuration:
sudo systemctl enable systemd-networkd
sudo systemctl start systemd-networkd
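To confirm the static address took effect, a quick check (assuming the interface is named eth0):
# eth0 should carry 10.0.0.1 and be managed by systemd-networkd
ip addr show eth0
networkctl status eth0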
2. Set Up Ethernet-Only Private Network on Compute Nodes
On each compute node:
- Disable WiFi:
sudo nmcli radio wifi off
sudo systemctl disable wpa_supplicant.service
- Configure static IP for Ethernet using systemd-networkd:
sudo mkdir -p /etc/systemd/network/
sudo nano /etc/systemd/network/20-wired.network
Add the following (adjust the IP address for each compute node, and the interface name if it is not eth0):
# For pi-compute01
[Match]
Name=eth0
[Network]
Address=10.0.0.2/24
Gateway=10.0.0.1
DNS=8.8.8.8 8.8.4.4
# For pi-compute02
[Match]
Name=eth0
[Network]
Address=10.0.0.3/24
Gateway=10.0.0.1
DNS=8.8.8.8 8.8.4.4
- Apply the configuration:
sudo systemctl enable systemd-networkd
sudo systemctl start systemd-networkd
- Verify connectivity:
ping 10.0.0.1
3. Configure SSH Key Authentication
On the head node, generate SSH keys:
ssh-keygen -t ed25519 -C "cluster-key"
Copy the key to each compute node:
ssh-copy-id pi@10.0.0.2
ssh-copy-id pi@10.0.0.3
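You can confirm key-based login works without a password prompt using a short loop (assumes the default pi user):
# BatchMode fails instead of prompting, so any output means key auth works
for ip in 10.0.0.2 10.0.0.3; do
    ssh -o BatchMode=yes pi@"$ip" hostname
done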
SSH Access to Compute Nodes
1. SSH via Head Node (Jump Host)
The compute nodes sit on the private network, so to reach them from your personal computer you go through the head node, which acts as a jump host.
On your personal computer, set up an SSH config file for easier access:
nano ~/.ssh/config
Add:
Host pi-head
HostName 192.168.x.x # Replace with your head node's WiFi IP
User pi # Replace with your username
Host pi-compute01
HostName 10.0.0.2
User pi # Replace with your username
ProxyJump pi-head
Host pi-compute02
HostName 10.0.0.3
User pi # Replace with your username
ProxyJump pi-head
Now you can directly SSH to any node:
ssh pi-head
ssh pi-compute01
ssh pi-compute02
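The ProxyJump entries apply to scp and sftp as well, so files can be copied straight from your personal computer to a compute node; myscript.sh below is just a placeholder name:
# Transparently routed through pi-head
scp ./myscript.sh pi-compute01:~/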
2. Test Connectivity
After setting up the network:
- From head node, verify you can reach compute nodes:
ping 10.0.0.2
ping 10.0.0.3
ssh pi@10.0.0.2
ssh pi@10.0.0.3
- From compute nodes, verify they can reach the head node and internet:
ping 10.0.0.1
ping 8.8.8.8
Shared Storage with NFS
1. Set Up NFS Server (Head Node)
Install NFS server:
sudo apt install -y nfs-kernel-server
Create a shared directory:
sudo mkdir -p /shared
sudo chmod 777 /shared # For tutorial purposes; use proper permissions in production
Configure exports:
sudo nano /etc/exports
Add the following:
/shared 10.0.0.0/24(rw,sync,no_subtree_check,no_root_squash)
/home 10.0.0.0/24(rw,sync,no_subtree_check)
Apply the configuration:
sudo exportfs -a
sudo systemctl restart nfs-kernel-server
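Before configuring the clients, you can confirm the exports are active on the head node:
# List exported directories with their options
sudo exportfs -v
# Show what NFS clients will see
showmount -e localhost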
2. Set Up NFS Clients (Compute Nodes)
On each compute node:
sudo apt install -y nfs-common
sudo mkdir -p /shared
# Add mount entries to fstab
sudo nano /etc/fstab
Add these lines:
10.0.0.1:/shared /shared nfs defaults 0 0
10.0.0.1:/home /home nfs defaults 0 0
After modifying fstab, reload the daemon and mount the shares:
sudo systemctl daemon-reload
sudo mount -a
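A quick check from a compute node confirms the shares are mounted and writable:
# Both mounts should list 10.0.0.1 as the source
df -h /shared /home
# Should succeed and be visible from the head node as well
touch /shared/nfs_write_test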
User Management
1. Create Cluster User
Create the same user on every node. NFS shares the home directory, but each node keeps its own account database, and Slurm requires the submitting user to exist with an identical UID/GID on all nodes. The UID below (2000) is just an example; any unused value works as long as it matches everywhere:
# On every node (head and both compute nodes)
sudo adduser --uid 2000 hpcuser
# Grant sudo on the head node only, if desired
sudo usermod -aG sudo hpcuser
2. Test User Accessibility on Compute Nodes
To test that the hpcuser is accessible on compute nodes after NFS home is mounted:
# On the head node, create a test file in hpcuser's home directory
sudo -u hpcuser touch /home/hpcuser/test_file
# SSH to a compute node
ssh 10.0.0.2
# Check if the test file exists and is accessible
ls -la /home/hpcuser/test_file
# Try to switch to the hpcuser account
su - hpcuser
# Verify you can create files as this user
touch ~/test_from_compute
exit
# Return to head node and verify the file is visible
ssh 10.0.0.1
ls -la /home/hpcuser/test_from_compute
If all tests pass, your NFS home directory and user setup are working correctly.
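Because Slurm and NFS both identify users by numeric UID, it is also worth confirming that hpcuser has the same UID and GID everywhere; a minimal check from the head node:
# The uid= and gid= values must match on all three nodes
id hpcuser
for node in pi-compute01 pi-compute02; do
    ssh pi@"$node" id hpcuser
done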
Installing Slurm
1. Install Dependencies (All Nodes)
On all nodes:
sudo apt install -y slurmd slurm-client munge libmunge-dev
On the head node also install:
sudo apt install -y slurmctld slurm-wlm-basic-plugins
2. Configure Munge Authentication (All Nodes)
On the head node:
# Create a munge key
sudo /usr/sbin/create-munge-key -r
sudo systemctl enable munge
sudo systemctl start munge
# Copy the key to a location accessible via NFS
sudo cp /etc/munge/munge.key /shared/
sudo chmod 400 /shared/munge.key
On compute nodes:
# Stop munge if running
sudo systemctl stop munge
# Copy the key
sudo cp /shared/munge.key /etc/munge/
sudo chown munge:munge /etc/munge/munge.key
sudo chmod 400 /etc/munge/munge.key
# Restart munge
sudo systemctl enable munge
sudo systemctl start munge
Test munge on all nodes:
munge -n | unmunge
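Testing locally only proves munge runs; to confirm the nodes actually share the same key, decode a credential generated on the head node on each compute node:
# A successful decode on the remote side means the keys match
munge -n | ssh pi-compute01 unmunge
munge -n | ssh pi-compute02 unmunge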
Slurm Configuration
1. Create Slurm Configuration File
On the head node, create the configuration:
sudo nano /etc/slurm/slurm.conf
Use this base configuration (adjusted for our network):
# slurm.conf
ClusterName=pi-cluster
SlurmctldHost=pi-head
# Authentication/security
AuthType=auth/munge
CryptoType=crypto/munge
MpiDefault=none
# Scheduling
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
# Performance
SlurmctldDebug=info
SlurmdDebug=info
JobAcctGatherType=jobacct_gather/none
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log
# Daemon user and state/spool locations
SlurmUser=slurm
StateSaveLocation=/var/spool/slurmctld
SlurmdSpoolDir=/var/spool/slurmd
# Process management
ProctrackType=proctrack/linuxproc
TaskPlugin=task/none
# Node configurations - using private network IPs
NodeName=pi-head CPUs=4 RealMemory=7000 State=UNKNOWN
NodeName=pi-compute01 CPUs=4 RealMemory=7000 State=UNKNOWN
NodeName=pi-compute02 CPUs=4 RealMemory=7000 State=UNKNOWN
# Partition configuration
PartitionName=main Nodes=pi-head,pi-compute01,pi-compute02 Default=YES MaxTime=INFINITE State=UP
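The CPUs and RealMemory values above are sensible for a 4-core, 8GB Pi 5, but each node can report its own detected hardware, which you can paste directly into the NodeName lines:
# Prints a NodeName=... line with the CPUs, cores and RealMemory slurmd detects
sudo slurmd -C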
Create the log and state directories (the state directory must be writable by the slurm user):
sudo mkdir -p /var/log/slurm /var/spool/slurmctld
sudo chown slurm:slurm /var/log/slurm /var/spool/slurmctld
2. Distribute Configuration
Copy to all nodes:
sudo cp /etc/slurm/slurm.conf /shared/
On compute nodes, copy the file into place and create the log directory:
sudo cp /shared/slurm.conf /etc/slurm/
sudo mkdir -p /var/log/slurm
3. Start Slurm Services
On the head node (it is also a compute node in the partition, so it needs both daemons):
sudo systemctl enable slurmctld slurmd
sudo systemctl start slurmctld slurmd
On compute nodes:
sudo systemctl enable slurmd
sudo systemctl start slurmd
Testing Your Cluster
1. Check Cluster Status
On the head node:
sinfo
You should see all nodes in your cluster listed.
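If a node shows up as down or unknown on first start (common while the daemons settle), check that slurmd is running on it and then return it to service manually:
# Clear the down/unknown state once slurmd on the node is healthy
sudo scontrol update NodeName=pi-compute01,pi-compute02 State=RESUME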
2. Run a Test Job
Create a test job script:
nano ~/test_job.sh
Add the following:
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --output=test_%j.out
#SBATCH --error=test_%j.err
#SBATCH --ntasks=4
#SBATCH --time=00:05:00
hostname
sleep 10
echo "This is a test job running on $(hostname)"
srun hostname
Make it executable:
chmod +x ~/test_job.sh
Submit the job:
sbatch ~/test_job.sh
Check job status:
squeue
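When the job completes, its output lands in the file named by the --output directive (test_<jobid>.out in the submission directory); while the job is still queued or running, scontrol gives more detail:
# Replace <jobid> with the ID reported by sbatch
cat test_<jobid>.out
scontrol show job <jobid>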
3. Run a Simple MPI Job
Install MPI on all nodes (the compute nodes need the Open MPI runtime libraries to run the binary):
sudo apt install -y openmpi-bin libopenmpi-dev
Create an MPI test program:
nano ~/mpi_hello.c
Add the following:
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>
int main(int argc, char** argv) {
int world_size, world_rank;
char hostname[256];
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &world_size);
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
gethostname(hostname, sizeof(hostname));
printf("Hello from processor %s, rank %d out of %d processors\n",
hostname, world_rank, world_size);
MPI_Finalize();
return 0;
}
Compile:
mpicc -o mpi_hello mpi_hello.c
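Before involving Slurm, it is worth a quick local smoke test on the head node to confirm the binary and the Open MPI runtime work:
# Run 4 ranks locally on the head node's 4 cores
mpirun -np 4 ./mpi_hello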
Create a submission script:
nano ~/mpi_job.sh
Add the following:
#!/bin/bash
#SBATCH --job-name=mpi_test
#SBATCH --output=mpi_%j.out
#SBATCH --error=mpi_%j.err
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --time=00:05:00
# module load mpi/openmpi   # Only needed if you use environment modules
# Check which MPI plugins your Slurm build provides with: srun --mpi=list
# and use the pmix entry it reports (pmix, pmix_v3, pmix_v4, ...).
# If no pmix plugin is listed, you can try: mpirun ./mpi_hello
srun --mpi=pmix_v3 ./mpi_hello
Submit the job:
sbatch ~/mpi_job.sh
Troubleshooting
Common Issues and Solutions
- Network Connectivity Issues:
# Check if nodes can ping each other
ping 10.0.0.1
ping 10.0.0.2
ping 10.0.0.3
# Check network interface status
ip addr show
# Check systemd-networkd status
sudo systemctl status systemd-networkd
# Restart networking if needed
sudo systemctl restart systemd-networkd
- SSH Connection Problems:
# Check SSH service status
sudo systemctl status ssh
# Check SSH config for errors
sudo sshd -t
# Check SSH key permissions
ls -la ~/.ssh/
chmod 600 ~/.ssh/id_ed25519
chmod 644 ~/.ssh/id_ed25519.pub
- Nodes showing DOWN state:
# Check slurmd logs
sudo systemctl status slurmd
cat /var/log/slurm/slurmd.log
# Restart slurmd
sudo systemctl restart slurmd
- Internet Access from Compute Nodes:
# On head node, enable NAT if needed
sudo iptables -t nat -A POSTROUTING -o wlan0 -j MASQUERADE
sudo iptables -A FORWARD -i eth0 -o wlan0 -j ACCEPT
sudo iptables -A FORWARD -i wlan0 -o eth0 -m state --state RELATED,ESTABLISHED -j ACCEPT
# Make iptables persistent
sudo apt install -y iptables-persistent
sudo netfilter-persistent save
- NFS issues:
# Check mounts
df -h
# Remount if needed
sudo systemctl daemon-reload
sudo mount -a
This tutorial provides the foundation for your Raspberry Pi 5 HPC cluster with a dual-network setup. From here, you can expand by adding more nodes, configuring GPU resources if available, implementing more sophisticated job scheduling policies, or adding monitoring tools like Ganglia or Prometheus.
Happy cluster computing!