Tutorial: Setting Up a 3-Node Raspberry Pi 5 SLURM Cluster
Created: 2025-04-05 16:10:47 | Last updated: 2025-04-05 16:10:47 | Status: Public
This tutorial guides you through setting up a small High-Performance Computing (HPC) cluster using three Raspberry Pi 5 devices, SLURM Workload Manager, and a specific network configuration involving both Wi-Fi and a private Ethernet network.
Cluster Configuration:
- Nodes: 3 x Raspberry Pi 5 (8GB RAM recommended)
- OS: Raspberry Pi OS Bookworm (64-bit recommended)
- Boot: From SSDs
- Cluster User: `cuser`
- Networking:
  - `pi-head`:
    - WLAN (`wlan0`): Connects to your main router via Wi-Fi, gets `192.168.1.20` via DHCP reservation (Gateway: `192.168.1.1`). Provides internet access.
    - Ethernet (`eth0`): Connects to private switch, static IP `10.0.0.1/24`.
  - `pi-cp01`:
    - Ethernet (`eth0`): Connects to private switch, static IP `10.0.0.2/24`. Gateway via `pi-head` (`10.0.0.1`).
  - `pi-cp02`:
    - Ethernet (`eth0`): Connects to private switch, static IP `10.0.0.3/24`. Gateway via `pi-head` (`10.0.0.1`).
- SLURM: Basic setup (`slurmctld`, `slurmd`, `munge`).
Prerequisites
- Hardware:
- 3 x Raspberry Pi 5 (8GB RAM)
- 3 x NVMe SSDs (or SATA SSDs with appropriate adapters) compatible with RPi 5 boot.
- 3 x Reliable Power Supplies for RPi 5 (5V/5A recommended).
- 1 x Gigabit Ethernet Switch (unmanaged is fine).
- 3 x Ethernet Cables.
- Access to your existing Wi-Fi network and router admin interface (for DHCP reservation).
- Software:
- Raspberry Pi Imager tool.
- Raspberry Pi OS Bookworm (64-bit recommended) flashed onto each SSD.
- Initial Setup:
  - Ensure each Pi boots correctly from its SSD.
  - Complete the initial Raspberry Pi OS setup wizard (create the initial user - this is NOT `cuser` yet; set locale, keyboard, etc.).
  - Enable SSH on each Pi: `sudo raspi-config` -> Interface Options -> SSH -> Enable.
  - Connect `pi-head` to your Wi-Fi network.
  - Configure the DHCP reservation on your OpenWRT router to assign `192.168.1.20` to `pi-head`'s WLAN MAC address. Verify `pi-head` gets this IP (`ip a show wlan0`).
  - Physically connect all three Pis to the Gigabit switch using Ethernet cables.
Phase 1: Basic OS Configuration & Hostnames
(Perform these steps on each Pi, adjusting hostnames accordingly. You’ll need SSH access.)
- Login: SSH into each Pi using the initial user you created during setup.
- Set Hostnames:
- On the first Pi (intended as head node):
sudo hostnamectl set-hostname pi-head
* On the second Pi (compute node 1):
sudo hostnamectl set-hostname pi-cp01
* On the third Pi (compute node 2):
sudo hostnamectl set-hostname pi-cp02
* Reboot each Pi (`sudo reboot`) or log out and log back in for the change to take effect in your shell prompt and network identity.
- Update System (`pi-head` only for now):
  - Ensure `pi-head` has internet via Wi-Fi.
  - SSH into `pi-head`:
sudo apt update
sudo apt full-upgrade -y
sudo apt install -y vim git build-essential # Essential tools
  - *Note: We will update `pi-cp01` and `pi-cp02` after setting up network routing.*
Phase 2: Network Configuration
We will use `nmcli`, the command-line tool for NetworkManager, which is standard on Raspberry Pi OS Bookworm.
- Verify `wlan0` on `pi-head`:
  - SSH into `pi-head`.
  - Run `ip addr show wlan0`. Confirm it shows `inet 192.168.1.20/XX` (netmask might vary).
  - Run `ip route`. Confirm the default route is via `192.168.1.1 dev wlan0`.
- Configure `eth0` on `pi-head` (Static Private IP):
  - SSH into `pi-head`.
  - Identify the Ethernet interface name (usually `eth0` or similar). Use `ip a`. We assume `eth0` below.
  - Add a NetworkManager connection profile for `eth0` with the static IP:
# Replace 'eth0' if your interface name is different
sudo nmcli connection add type ethernet con-name 'static-eth0' ifname eth0 ip4 10.0.0.1/24
# Critical: Prevent this from becoming the default route or having a gateway
sudo nmcli connection modify 'static-eth0' ipv4.gateway ''
sudo nmcli connection modify 'static-eth0' ipv4.never-default yes
sudo nmcli connection modify 'static-eth0' connection.autoconnect yes
# Bring the connection up (might happen automatically)
sudo nmcli connection up 'static-eth0'
* **Verify:**
* `ip addr show eth0`: Should show `inet 10.0.0.1/24`.
* `ip route`: Should *still* show the default route via `192.168.1.1 dev wlan0`. No default route via `10.0.0.1` should exist.
- Configure `eth0` on `pi-cp01` (Static Private IP & Gateway):
  - SSH into `pi-cp01`. If you cannot SSH over the private network yet, connect a monitor/keyboard temporarily, or temporarily connect its `eth0` to your main network to get an IP via DHCP for initial access.
  - Add the static IP configuration, setting `pi-head` as the gateway:
# Replace 'eth0' if needed
sudo nmcli connection add type ethernet con-name 'static-eth0' ifname eth0 ip4 10.0.0.2/24 gw4 10.0.0.1
# Set DNS - will be routed via pi-head. Use public servers.
sudo nmcli connection modify 'static-eth0' ipv4.dns "8.8.8.8 1.1.1.1"
sudo nmcli connection modify 'static-eth0' ipv4.ignore-auto-dns yes # Ensure our manual DNS is used
sudo nmcli connection modify 'static-eth0' connection.autoconnect yes
# Bring the connection up
sudo nmcli connection up 'static-eth0'
* **Verify:**
* `ip addr show eth0`: Should show `inet 10.0.0.2/24`.
* `ip route`: Should show `default via 10.0.0.1 dev eth0`.
- Configure `eth0` on `pi-cp02` (Static Private IP & Gateway):
  - SSH into `pi-cp02` (same initial access caveats as `pi-cp01`).
  - Add the static IP configuration:
# Replace 'eth0' if needed
sudo nmcli connection add type ethernet con-name 'static-eth0' ifname eth0 ip4 10.0.0.3/24 gw4 10.0.0.1
# Set DNS
sudo nmcli connection modify 'static-eth0' ipv4.dns "8.8.8.8 1.1.1.1"
sudo nmcli connection modify 'static-eth0' ipv4.ignore-auto-dns yes
sudo nmcli connection modify 'static-eth0' connection.autoconnect yes
# Bring the connection up
sudo nmcli connection up 'static-eth0'
* **Verify:**
* `ip addr show eth0`: Should show `inet 10.0.0.3/24`.
* `ip route`: Should show `default via 10.0.0.1 dev eth0`.
- Enable IP Forwarding and NAT on `pi-head`:
  - SSH into `pi-head`.
  - Enable kernel IP forwarding:
echo 'net.ipv4.ip_forward=1' | sudo tee /etc/sysctl.d/99-ip_forward.conf
sudo sysctl -p /etc/sysctl.d/99-ip_forward.conf
* Set up `iptables` rules for NAT (masquerading). **Carefully replace `wlan0` and `eth0` if your interface names differ!**
# Flush existing rules if needed (use with caution if you have other rules)
# sudo iptables -t nat -F
# sudo iptables -F FORWARD
# Rule: Allow traffic originating from private network (eth0) to go out via public network (wlan0)
sudo iptables -t nat -A POSTROUTING -o wlan0 -j MASQUERADE
# Rule: Allow established connections back in from wlan0 to eth0
sudo iptables -A FORWARD -i wlan0 -o eth0 -m state --state RELATED,ESTABLISHED -j ACCEPT
# Rule: Allow new connections from eth0 to wlan0
sudo iptables -A FORWARD -i eth0 -o wlan0 -j ACCEPT
* Make `iptables` rules persistent across reboots:
sudo apt update # Ensure pi-head has internet
sudo apt install -y iptables-persistent
# During installation, it will ask if you want to save current IPv4/IPv6 rules. Select "Yes" for both.
# To save rules manually later: sudo netfilter-persistent save
- Test Network Connectivity:
  - From `pi-head`:
ping -c 3 10.0.0.2
ping -c 3 10.0.0.3
* From `pi-cp01`:
ping -c 3 10.0.0.1 # Ping head node
ping -c 3 8.8.8.8 # Ping Google DNS (tests routing + internet)
ping -c 3 google.com # Ping domain name (tests routing + internet + DNS)
* From `pi-cp02`:
ping -c 3 10.0.0.1
ping -c 3 8.8.8.8
ping -c 3 google.com
* **Troubleshooting:** If pings fail:
* Check `ip a` and `ip route` on all nodes.
* Check physical cable connections.
* Check `iptables -t nat -L -n -v` and `iptables -L FORWARD -n -v` on `pi-head`.
* Check `sysctl net.ipv4.ip_forward` on `pi-head` (should be `1`).
* Check DNS settings (`cat /etc/resolv.conf`) on `pi-cp01`, `pi-cp02`.
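The per-host checks above can be swept in one pass with a short loop (a sketch; adjust the address list to your network):

```shell
# Ping each cluster IP plus one external address; prints OK/FAIL per host.
for ip in 10.0.0.1 10.0.0.2 10.0.0.3 8.8.8.8; do
    if ping -c 1 -W 2 "$ip" > /dev/null 2>&1; then
        echo "$ip OK"
    else
        echo "$ip FAIL"
    fi
done
```

Run it from each node in turn; any `FAIL` line tells you immediately which hop (private network, NAT, or internet) to debug first.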
- Update Compute Nodes:
  - Now that `pi-cp01` and `pi-cp02` should have internet access, update them:
# On pi-cp01 AND pi-cp02
sudo apt update
sudo apt full-upgrade -y
sudo apt install -y vim git build-essential # Install tools here too
Phase 3: Common Cluster Environment Setup
(Perform steps on all nodes unless specified)
- Configure Hostname Resolution (`/etc/hosts`):
  - Edit the hosts file on all three nodes: `sudo vim /etc/hosts`
  - Ensure the following lines exist (add them if missing, below the `127.0.0.1 localhost` line):
127.0.1.1 <current_hostname> # This line is usually added by hostnamectl
# Cluster Nodes
10.0.0.1 pi-head
10.0.0.2 pi-cp01
10.0.0.3 pi-cp02
* **Test:** From any node, ping the others by hostname (e.g., `ping -c 1 pi-cp01` from `pi-head`).
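Resolution can also be checked non-interactively; `getent hosts` reports what the resolver (including `/etc/hosts`) returns for each name. A small sketch, run on any node:

```shell
# Look up each cluster hostname; prints the mapped address,
# or flags names the resolver cannot find.
for h in pi-head pi-cp01 pi-cp02; do
    getent hosts "$h" || echo "$h: not resolvable"
done
```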
- Create Common Cluster User (`cuser`):
  - Crucially, `cuser` must have the same User ID (UID) and Group ID (GID) on all nodes.
  - On `pi-head` first:
sudo adduser cuser
# Follow prompts to set password etc.
# Note the UID and GID displayed (e.g., uid=1001(cuser) gid=1001(cuser) groups=...)
# Optional: Add cuser to the sudo group if needed for administration tasks
# sudo usermod -aG sudo cuser
* **On `pi-cp01` and `pi-cp02`:**
* Get the UID and GID from `pi-head`. Use `id cuser` on `pi-head`. Let's assume it was `1001` for both UID and GID. **Replace `1001` below if yours is different.**
# Create the group first with the specific GID
sudo groupadd -g 1001 cuser
# Create the user with the specific UID and GID
sudo useradd -u 1001 -g 1001 -m -s /bin/bash cuser
# Set the password for the new user
sudo passwd cuser
# Optional: Add to sudo group (use the same groups as on pi-head if needed)
# sudo usermod -aG sudo cuser
* **Verify:** Run `id cuser` on **all three** nodes. Ensure the UID and GID match exactly.
- Setup Passwordless SSH for `cuser`:
  - Log in as `cuser` on `pi-head`. You can use `su - cuser` if logged in as another user, or SSH directly: `ssh cuser@pi-head`.
  - Generate an SSH key pair (run as `cuser`):
# Accept default file location (~/.ssh/id_rsa), press Enter for empty passphrase
ssh-keygen -t rsa -b 4096
* **Copy the public key to all nodes (including `pi-head` itself):**
# Run as cuser from pi-head
ssh-copy-id cuser@pi-head
ssh-copy-id cuser@pi-cp01
ssh-copy-id cuser@pi-cp02
# Enter the password for 'cuser' when prompted for each node
* **Test:** Still as `cuser` on `pi-head`, try SSHing to each node without a password:
ssh pi-head date
ssh pi-cp01 date
ssh pi-cp02 date
# The first time connecting to each might ask "Are you sure you want to continue connecting (yes/no/[fingerprint])?". Type 'yes'.
# If it prompts for a password after the first connection, the key setup failed. Check permissions in ~/.ssh directories.
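A common cause of that password prompt is over-permissive modes on `~/.ssh`, which `sshd` rejects silently. A hypothetical helper to reset the expected permissions (run as `cuser` on the affected node):

```shell
# Reset ~/.ssh to the permissions sshd requires: 700 on the directory,
# 600 on private material, 644 on the public key.
fix_ssh_perms() {
    d="${1:-$HOME/.ssh}"
    mkdir -p "$d"
    chmod 700 "$d"
    [ -f "$d/authorized_keys" ] && chmod 600 "$d/authorized_keys"
    [ -f "$d/id_rsa" ] && chmod 600 "$d/id_rsa"
    [ -f "$d/id_rsa.pub" ] && chmod 644 "$d/id_rsa.pub"
    return 0
}

fix_ssh_perms
```

Also check that the home directory itself is not group- or world-writable; with `StrictModes` (the default) `sshd` refuses keys in that case too.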
- Install and Configure NFS (Shared Filesystem):
  - We'll share `/clusterfs` from `pi-head` with all nodes.
  - On `pi-head` (NFS Server):
sudo apt update
sudo apt install -y nfs-kernel-server
sudo mkdir -p /clusterfs
# Option 1: Allow anyone to write (simple for cluster user)
sudo chown nobody:nogroup /clusterfs
sudo chmod 777 /clusterfs
# Option 2: Restrict to cuser (better security, requires consistent UID/GID)
# sudo chown cuser:cuser /clusterfs
# sudo chmod 770 /clusterfs # Or 750 if group members only need read
# Edit the NFS exports file
sudo vim /etc/exports
# Add this line to allow access from the private 10.0.0.x network:
# Use 'no_root_squash' carefully if you need root access over NFS
/clusterfs 10.0.0.0/24(rw,sync,no_subtree_check)
# Activate the exports
sudo exportfs -ra
# Restart and enable the NFS server service
sudo systemctl restart nfs-kernel-server
sudo systemctl enable nfs-kernel-server
* **On `pi-cp01` and `pi-cp02` (NFS Clients):**
sudo apt update
sudo apt install -y nfs-common
sudo mkdir -p /clusterfs
# Add the mount to /etc/fstab for automatic mounting on boot
sudo vim /etc/fstab
# Add this line at the end:
pi-head:/clusterfs /clusterfs nfs defaults,auto,nofail 0 0
# Mount all filesystems defined in fstab (including the new one)
sudo mount -a
# Verify the mount was successful
df -h | grep /clusterfs
# Check mount options (optional)
mount | grep /clusterfs
* From `pi-head` as `cuser`: `touch /clusterfs/test_head.txt`
* From `pi-cp01` as `cuser`: `ls /clusterfs` (should see `test_head.txt`)
* From `pi-cp02` as `cuser`: `touch /clusterfs/test_cp02.txt`
* From `pi-head` as `cuser`: `ls /clusterfs` (should see both files)
- Install and Configure NTP (Time Synchronization): Accurate time is essential for SLURM.
  - Install `chrony` on all nodes:
sudo apt update
sudo apt install -y chrony
* Ensure it's enabled and running on **all nodes**:
sudo systemctl enable --now chrony
* `chrony` will automatically use internet time sources. Since all nodes now have internet (directly or via `pi-head`), this should work.
* **Verify sync status** (might take a minute or two after starting):
# Run on all nodes
chronyc sources
# Look for lines starting with '^*' (synced server) or '^+' (acceptable server).
timedatectl status | grep "NTP service"
# Should show 'active'.
Phase 4: Install and Configure SLURM & Munge
- Install Munge (Authentication Service): Munge provides secure authentication between SLURM daemons. Install on all nodes.
sudo apt update
sudo apt install -y munge libmunge-dev libmunge2
- Create and Distribute Munge Key: A shared secret key must be identical on all nodes.
  - On `pi-head` ONLY:
# Stop munge service if running
sudo systemctl stop munge
# Create the key (as root or using sudo)
sudo dd if=/dev/urandom of=/etc/munge/munge.key bs=1 count=1024
# Set correct ownership and permissions
sudo chown munge:munge /etc/munge/munge.key
sudo chmod 400 /etc/munge/munge.key
* **Securely copy the key from `pi-head` to the compute nodes.** Writing directly into `/etc/munge/` over SSH would require remote root, so stage the key in `/tmp` first (replace `<user>` with your initial admin user on each compute node):
# Run these commands on pi-head
sudo scp /etc/munge/munge.key <user>@pi-cp01:/tmp/munge.key
sudo scp /etc/munge/munge.key <user>@pi-cp02:/tmp/munge.key
# Enter the SSH password for <user> on pi-cp01/pi-cp02 when prompted.
* **On `pi-cp01` and `pi-cp02`:**
# Ensure munge is stopped
sudo systemctl stop munge
# Move the staged key into place, then set correct ownership and permissions
sudo mv /tmp/munge.key /etc/munge/munge.key
sudo chown munge:munge /etc/munge/munge.key
sudo chmod 400 /etc/munge/munge.key
- Start and Enable Munge Service: On all nodes:
sudo systemctl start munge
sudo systemctl enable munge
# Verify status
sudo systemctl status munge
- Test Munge Communication:
  - From `pi-head`:
# Test local encoding/decoding
munge -n | unmunge
# Test head -> cp01
munge -n | ssh pi-cp01 unmunge
# Test head -> cp02
munge -n | ssh pi-cp02 unmunge
# Test cp01 -> head (round trip)
ssh pi-cp01 munge -n | unmunge
* All tests should return a `STATUS: Success (...)` line. If not, check `munge.key` consistency, permissions, and `munged` service status. Also check `/var/log/munge/munged.log`.
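To rule out a key mismatch specifically, compare a fingerprint of `munge.key` across nodes. A sketch (the commented `ssh` loop assumes sudo rights on each node, since the key is mode 400):

```shell
# Identical key files produce identical SHA-256 hashes, so comparing
# fingerprints across nodes proves (or disproves) key consistency.
key_fingerprint() {
    sha256sum "$1" | awk '{print $1}'
}

# On the cluster, something like:
#   for h in pi-head pi-cp01 pi-cp02; do
#       ssh "$h" sudo sha256sum /etc/munge/munge.key
#   done
# All three hashes must match exactly.
```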
- Install SLURM: Install the SLURM workload manager packages on all nodes.
sudo apt update
sudo apt install -y slurm-wlm slurm-wlm-doc # slurm-wlm pulls in slurmd, slurmctld etc.
- Configure SLURM (`slurm.conf`):
  - Create the configuration file on `pi-head` first. A minimal configuration is below.
  - Check whether a default cluster name is already set (optional):
# On pi-head
scontrol show config | grep ClusterName # Find default if any
* Edit the main config file: `sudo vim /etc/slurm/slurm.conf`
* Replace the **entire content** with the following.
* **Adjust `RealMemory`**: `free -m` shows total memory in MiB. Leave some (~200-300MB) for the OS. For an 8GB Pi (approx 7850MB usable), `7600` is a safe starting point.
* **CPUs**: RPi 5 has 4 cores.
# /etc/slurm/slurm.conf
# Basic SLURM configuration for pi-cluster
ClusterName=pi-cluster
SlurmctldHost=pi-head #(Or use IP 10.0.0.1)
# SlurmctldHost=pi-head(10.0.0.1) # Optional: Specify both
MpiDefault=none
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/run/slurmctld.pid
SlurmdPidFile=/run/slurmd.pid
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
StateSaveLocation=/var/spool/slurmctld
SlurmdSpoolDir=/var/spool/slurmd
SwitchType=switch/none
TaskPlugin=task/cgroup
# LOGGING
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log
JobCompType=jobcomp/none # No job completion logging for basic setup
# TIMERS
SlurmctldTimeout=120
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_tres # Use cons_tres for memory tracking
SelectTypeParameters=CR_Core_Memory # Track Cores and Memory
# NODES - Adjust RealMemory based on your Pi 5 8GB (~7600 is conservative)
NodeName=pi-head NodeAddr=10.0.0.1 CPUs=4 RealMemory=7600 State=UNKNOWN
NodeName=pi-cp01 NodeAddr=10.0.0.2 CPUs=4 RealMemory=7600 State=UNKNOWN
NodeName=pi-cp02 NodeAddr=10.0.0.3 CPUs=4 RealMemory=7600 State=UNKNOWN
# PARTITION
PartitionName=rpi_part Nodes=pi-head,pi-cp01,pi-cp02 Default=YES MaxTime=INFINITE State=UP Oversubscribe=NO
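The `RealMemory` arithmetic can be sketched as a tiny helper (hypothetical; reserves 250 MiB for the OS by default, in line with the 200-300 MB guidance above):

```shell
# Derive a conservative RealMemory (MiB) from the total reported by `free -m`.
realmem() {
    total_mib="$1"
    reserve_mib="${2:-250}"
    echo $(( total_mib - reserve_mib ))
}

realmem 7850    # an 8GB Pi 5 reporting ~7850 MiB -> 7600
# In practice: realmem "$(free -m | awk '/^Mem:/ {print $2}')"
```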
* Create the SLURM log and spool directories **on all nodes**:
sudo mkdir -p /var/log/slurm /var/spool/slurmctld /var/spool/slurmd
# SLURM typically runs daemons as 'slurm' user/group created during package install
sudo chown slurm:slurm /var/log/slurm /var/spool/slurmctld /var/spool/slurmd
sudo chmod 755 /var/log/slurm /var/spool/slurmctld /var/spool/slurmd
# Verify user exists
id slurm
* Copy `slurm.conf` to the compute nodes. `/etc/slurm` is root-owned, so stage via `/tmp` and move into place (replace `<user>` with your admin user on the compute nodes):
# On pi-head
scp /etc/slurm/slurm.conf <user>@pi-cp01:/tmp/slurm.conf
scp /etc/slurm/slurm.conf <user>@pi-cp02:/tmp/slurm.conf
# Then on pi-cp01 AND pi-cp02
sudo mv /tmp/slurm.conf /etc/slurm/slurm.conf
- Configure Cgroup Plugin (`cgroup.conf`): Needed to enforce resource constraints (`ProctrackType=proctrack/cgroup`, `TaskPlugin=task/cgroup`, `SelectType=select/cons_tres`).
  - Create `/etc/slurm/cgroup.conf` on `pi-head` first: `sudo vim /etc/slurm/cgroup.conf`
  - Add the following content:
# /etc/slurm/cgroup.conf
CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm/cgroup"
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes
# If using systemd, TaskAffinity should generally be no
TaskAffinity=no
* Create the `CgroupReleaseAgentDir` **on all nodes**:
sudo mkdir -p /etc/slurm/cgroup
sudo chown slurm:slurm /etc/slurm/cgroup # Or root:root might be needed depending on systemd interactions
* Copy `cgroup.conf` to the compute nodes (`/etc/slurm` is root-owned, so stage via `/tmp`):
# On pi-head
scp /etc/slurm/cgroup.conf <user>@pi-cp01:/tmp/cgroup.conf
scp /etc/slurm/cgroup.conf <user>@pi-cp02:/tmp/cgroup.conf
# Then on pi-cp01 AND pi-cp02
sudo mv /tmp/cgroup.conf /etc/slurm/cgroup.conf
- Start SLURM Services:
  - On `pi-head` (Controller):
sudo systemctl enable slurmctld.service
sudo systemctl start slurmctld.service
# Check status immediately
sudo systemctl status slurmctld.service
journalctl -u slurmctld.service | tail -n 20 # Check logs
* **On ALL nodes (Compute Daemons - including `pi-head`):**
sudo systemctl enable slurmd.service
sudo systemctl start slurmd.service
# Check status
sudo systemctl status slurmd.service
# Check logs on each node
tail -n 20 /var/log/slurm/slurmd.log
- Verify SLURM Cluster Status:
  - Wait ~10-15 seconds for the nodes to register, then run on `pi-head`:
sinfo
# Expected output:
# PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
# rpi_part* up infinite 3 idle pi-head,pi-cp0[1-2]
# (State might be 'unk' or 'down' initially, or 'mix' if nodes are registering)
scontrol show node
# Check details for each node. Look for 'State=IDLE'. If 'State=DOWN' or 'State=DRAINED', check logs:
# - /var/log/slurm/slurmctld.log on pi-head
# - /var/log/slurm/slurmd.log on the affected node(s)
# If nodes are down/drained due to initial errors that are now fixed:
# sudo scontrol update nodename=pi-head,pi-cp01,pi-cp02 state=resume
* **Common causes of `DOWN`/`DRAINED` nodes:**
* Time synchronization errors between nodes. (Fix with `chrony`)
* Munge authentication errors. (Check `munge.key` and `munged` service)
* Firewall blocking ports `6817` (slurmctld) or `6818` (slurmd). (`iptables` on pi-head shouldn't block the 10.0.0.x network, but check if `ufw` or other firewalls are active).
* Incorrect hostnames or IP addresses in `slurm.conf` (`SlurmctldHost`, `NodeName`, `NodeAddr`). Use the `10.0.0.x` addresses for `NodeAddr`.
* Incorrect permissions or non-existent spool/log directories (`/var/spool/slurm*`, `/var/log/slurm`).
* `slurmd` fails to start due to resource limits or cgroup issues. Check `journalctl -u slurmd` and `dmesg`.
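For the cgroup case, one quick check is whether the kernel actually exposes the memory controller; on Raspberry Pi OS this has historically required adding `cgroup_enable=memory` to the boot `cmdline.txt` (verify against current documentation before editing boot files). A sketch:

```shell
# cgroup v2 lists active controllers in a single file; SLURM's cgroup
# plugins need at least the cpu and memory controllers.
if grep -qw memory /sys/fs/cgroup/cgroup.controllers 2>/dev/null; then
    echo "memory controller available"
else
    echo "memory controller missing"
fi
```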
Phase 5: Testing the SLURM Cluster
(Run these commands as `cuser` on `pi-head`)
- Login as `cuser`:
su - cuser
# Or: ssh cuser@pi-head
cd /clusterfs # Work in the shared filesystem if desired
- Run a Simple Command Interactively:
srun hostname
# Runs 'hostname' on one available node in the default partition.
- Run Command on Specific Number of Nodes:
# Run hostname on 2 different nodes, 1 task per node
srun --nodes=2 --ntasks-per-node=1 hostname | sort
# Should show two different hostnames (e.g., pi-cp01, pi-cp02 or pi-head, pi-cp01)
- Submit a Simple Batch Job:
  - Create a job script file, e.g., `/clusterfs/cuser/hello.sh` (ensure `/clusterfs/cuser` exists and is writable by `cuser`):
#!/bin/bash
#SBATCH --job-name=hello # Job name
#SBATCH --output=hello_job_%j.out # Standard output file (%j = job ID)
#SBATCH --error=hello_job_%j.err # Standard error file
#SBATCH --nodes=3 # Request all 3 nodes
#SBATCH --ntasks-per-node=2 # Request 2 tasks (processes) per node (total 6)
#SBATCH --cpus-per-task=1 # Request 1 CPU core per task
#SBATCH --partition=rpi_part # Specify partition (optional if default)
#SBATCH --time=00:05:00 # Time limit (5 minutes)
echo "Job running on nodes:"
srun hostname | sort # Use srun within sbatch to launch parallel tasks
echo "Tasks started at: $(date)"
sleep 20 # Simulate some work
echo "Tasks finished at: $(date)"
* Make the script executable: `chmod +x /clusterfs/cuser/hello.sh`
* Submit the job from the directory containing the script:
sbatch hello.sh
# Should print: Submitted batch job <JOB_ID>
* Check the queue:
squeue
# Shows running or pending jobs
watch squeue # Monitor queue updates
sinfo
# Should show nodes in 'alloc' or 'mix' state.
* Once the job finishes (disappears from `squeue`), check the output files (`hello_job_<JOB_ID>.out` and `.err`) in the submission directory:
cat hello_job_<JOB_ID>.out
# With --nodes=3 and --ntasks-per-node=2, srun launches 6 tasks in total,
# so each of pi-head, pi-cp01, pi-cp02 should appear twice.
Congratulations!
You should now have a functional 3-node Raspberry Pi 5 SLURM cluster. The compute nodes (`pi-cp01`, `pi-cp02`) use the head node (`pi-head`) as a gateway for internet access, while all cluster communication happens over the private `10.0.0.x` network.
Next Steps & Considerations
- Install MPI: Install OpenMPI or MPICH (`sudo apt install -y openmpi-bin libopenmpi-dev` on all nodes) to run parallel MPI applications. Update SLURM's `MpiDefault=pmix` or configure MPI properly if needed.
- Shared Software Stack: Install compilers, libraries, and applications needed for your HPC tasks onto the shared NFS filesystem (`/clusterfs`) so they are accessible from all nodes without installing them everywhere. Module systems like Lmod can help manage this.
- Monitoring: Set up monitoring tools like `htop` and `glances`, or more comprehensive systems like Prometheus + Grafana or Ganglia, to observe cluster load and resource usage.
- SLURM Tuning: Explore more advanced `slurm.conf` options: resource limits (memory, cores per job/user), Quality of Service (QoS), fair-share scheduling, job arrays.
- SLURM Accounting: For tracking resource usage over time, set up the SLURM accounting database (`slurmdbd`), which requires installing and configuring a database such as MariaDB/MySQL.
- Security: Review `iptables` rules, harden SSH (`/etc/ssh/sshd_config`), and consider user permissions carefully.
- Backup: Back up your `slurm.conf`, `munge.key`, and important data on `/clusterfs`.
Enjoy your mini HPC cluster!