Tutorial: Setting Up a 3-Node Raspberry Pi 5 SLURM Cluster

Created: 2025-04-05 16:10:47 | Last updated: 2025-04-05 16:10:47 | Status: Public

This tutorial guides you through setting up a small High-Performance Computing (HPC) cluster using three Raspberry Pi 5 devices, SLURM Workload Manager, and a specific network configuration involving both Wi-Fi and a private Ethernet network.

Cluster Configuration:

  • Nodes: 3 x Raspberry Pi 5 (8GB RAM recommended)
  • OS: Raspberry Pi OS Bookworm (64-bit recommended)
  • Boot: From SSDs
  • Cluster User: cuser
  • Networking:
    • pi-head:
      • WLAN (wlan0): Connects to your main router via Wi-Fi, gets 192.168.1.20 via DHCP reservation (Gateway: 192.168.1.1). Provides internet access.
      • Ethernet (eth0): Connects to private switch, static IP 10.0.0.1/24.
    • pi-cp01:
      • Ethernet (eth0): Connects to private switch, static IP 10.0.0.2/24. Gateway via pi-head (10.0.0.1).
    • pi-cp02:
      • Ethernet (eth0): Connects to private switch, static IP 10.0.0.3/24. Gateway via pi-head (10.0.0.1).
  • SLURM: Basic setup (slurmctld, slurmd, munge).

Prerequisites

  1. Hardware:
    • 3 x Raspberry Pi 5 (8GB RAM)
    • 3 x NVMe SSDs (or SATA SSDs with appropriate adapters) compatible with RPi 5 boot.
    • 3 x Reliable Power Supplies for RPi 5 (5V/5A recommended).
    • 1 x Gigabit Ethernet Switch (unmanaged is fine).
    • 3 x Ethernet Cables.
    • Access to your existing Wi-Fi network and router admin interface (for DHCP reservation).
  2. Software:
    • Raspberry Pi Imager tool.
    • Raspberry Pi OS Bookworm (64-bit recommended) flashed onto each SSD.
  3. Initial Setup:
    • Ensure each Pi boots correctly from its SSD.
    • Complete the initial Raspberry Pi OS setup wizard (create the initial user - this is NOT cuser yet, set locale, keyboard, etc.).
    • Enable SSH on each Pi: sudo raspi-config -> Interface Options -> SSH -> Enable.
    • Connect pi-head to your Wi-Fi network.
    • Configure a DHCP reservation on your router (OpenWrt in this example) to assign 192.168.1.20 to pi-head’s WLAN MAC address. Verify pi-head gets this IP (ip a show wlan0).
    • Physically connect all three Pis to the Gigabit switch using Ethernet cables.

Phase 1: Basic OS Configuration & Hostnames

(Perform these steps on each Pi, adjusting hostnames accordingly. You’ll need SSH access.)

  1. Login: SSH into each Pi using the initial user you created during setup.
  2. Set Hostnames:
    • On the first Pi (intended as head node):
        sudo hostnamectl set-hostname pi-head
    • On the second Pi (compute node 1):
        sudo hostnamectl set-hostname pi-cp01
    • On the third Pi (compute node 2):
        sudo hostnamectl set-hostname pi-cp02
    • Reboot each Pi (sudo reboot) or log out and log back in for the change to take effect in your shell prompt and network identity.
  3. Update System (pi-head only for now):
    • Ensure pi-head has internet via Wi-Fi.
    • SSH into pi-head:
        sudo apt update
        sudo apt full-upgrade -y
        sudo apt install -y vim git build-essential # Essential tools
    • Note: We will update pi-cp01 and pi-cp02 after setting up network routing.

Phase 2: Network Configuration

We will use nmcli, the command-line tool for NetworkManager, which is standard on Raspberry Pi OS Bookworm.

  1. Verify wlan0 on pi-head:

    • SSH into pi-head.
    • Run ip addr show wlan0. Confirm it shows inet 192.168.1.20/XX (netmask might vary).
    • Run ip route. Confirm the default route is via 192.168.1.1 dev wlan0.
  2. Configure eth0 on pi-head (Static Private IP):

    • SSH into pi-head.
    • Identify the Ethernet interface name (usually eth0 or similar). Use ip a. Assume eth0.
    • Add a NetworkManager connection profile for eth0 with the static IP:
        # Replace 'eth0' if your interface name is different
        sudo nmcli connection add type ethernet con-name 'static-eth0' ifname eth0 ip4 10.0.0.1/24
        # Critical: Prevent this from becoming the default route or having a gateway
        sudo nmcli connection modify 'static-eth0' ipv4.gateway ''
        sudo nmcli connection modify 'static-eth0' ipv4.never-default yes
        sudo nmcli connection modify 'static-eth0' connection.autoconnect yes
        # Bring the connection up (might happen automatically)
        sudo nmcli connection up 'static-eth0'
    • Verify:
      • ip addr show eth0: Should show inet 10.0.0.1/24.
      • ip route: Should still show the default route via 192.168.1.1 dev wlan0. No default route via 10.0.0.1 should exist.
  3. Configure eth0 on pi-cp01 (Static Private IP & Gateway):
    • SSH into pi-cp01. If you cannot SSH yet over the private network, connect a monitor/keyboard temporarily, or temporarily connect its eth0 to your main network to get an IP via DHCP for initial access.
    • Add the static IP configuration, setting pi-head as the gateway:
        # Replace 'eth0' if needed
        sudo nmcli connection add type ethernet con-name 'static-eth0' ifname eth0 ip4 10.0.0.2/24 gw4 10.0.0.1
        # Set DNS - will be routed via pi-head. Use public servers.
        sudo nmcli connection modify 'static-eth0' ipv4.dns "8.8.8.8 1.1.1.1"
        sudo nmcli connection modify 'static-eth0' ipv4.ignore-auto-dns yes # Ensure our manual DNS is used
        sudo nmcli connection modify 'static-eth0' connection.autoconnect yes
        # Bring the connection up
        sudo nmcli connection up 'static-eth0'
    • Verify:
      • ip addr show eth0: Should show inet 10.0.0.2/24.
      • ip route: Should show default via 10.0.0.1 dev eth0.
  4. Configure eth0 on pi-cp02 (Static Private IP & Gateway):
    • SSH into pi-cp02 (similar potential initial access caveats as pi-cp01).
    • Add the static IP configuration:
        # Replace 'eth0' if needed
        sudo nmcli connection add type ethernet con-name 'static-eth0' ifname eth0 ip4 10.0.0.3/24 gw4 10.0.0.1
        # Set DNS
        sudo nmcli connection modify 'static-eth0' ipv4.dns "8.8.8.8 1.1.1.1"
        sudo nmcli connection modify 'static-eth0' ipv4.ignore-auto-dns yes
        sudo nmcli connection modify 'static-eth0' connection.autoconnect yes
        # Bring the connection up
        sudo nmcli connection up 'static-eth0'
    • Verify:
      • ip addr show eth0: Should show inet 10.0.0.3/24.
      • ip route: Should show default via 10.0.0.1 dev eth0.
  5. Enable IP Forwarding and NAT on pi-head:
    • SSH into pi-head.
    • Enable kernel IP forwarding:
        echo 'net.ipv4.ip_forward=1' | sudo tee /etc/sysctl.d/99-ip_forward.conf
        sudo sysctl -p /etc/sysctl.d/99-ip_forward.conf
    • Set up iptables rules for NAT (masquerading). Carefully replace wlan0 and eth0 if your interface names differ!
        # Flush existing rules if needed (use with caution if you have other rules)
        # sudo iptables -t nat -F
        # sudo iptables -F FORWARD
        # Rule: Allow traffic originating from private network (eth0) to go out via public network (wlan0)
        sudo iptables -t nat -A POSTROUTING -o wlan0 -j MASQUERADE
        # Rule: Allow established connections back in from wlan0 to eth0
        sudo iptables -A FORWARD -i wlan0 -o eth0 -m state --state RELATED,ESTABLISHED -j ACCEPT
        # Rule: Allow new connections from eth0 to wlan0
        sudo iptables -A FORWARD -i eth0 -o wlan0 -j ACCEPT
    • Make the iptables rules persistent across reboots:
        sudo apt update # Ensure pi-head has internet
        sudo apt install -y iptables-persistent
        # During installation, it will ask if you want to save current IPv4/IPv6 rules. Select "Yes" for both.
        # To save rules manually later: sudo netfilter-persistent save
  6. Test Network Connectivity:
    • From pi-head:
        ping -c 3 10.0.0.2
        ping -c 3 10.0.0.3
    • From pi-cp01:
        ping -c 3 10.0.0.1 # Ping head node
        ping -c 3 8.8.8.8  # Ping Google DNS (tests routing + internet)
        ping -c 3 google.com # Ping domain name (tests routing + internet + DNS)
    • From pi-cp02:
        ping -c 3 10.0.0.1
        ping -c 3 8.8.8.8
        ping -c 3 google.com
    • Troubleshooting: If pings fail:
      • Check ip a and ip route on all nodes.
      • Check physical cable connections.
      • Check iptables -t nat -L -n -v and iptables -L FORWARD -n -v on pi-head.
      • Check sysctl net.ipv4.ip_forward on pi-head (should be 1).
      • Check DNS settings (cat /etc/resolv.conf) on pi-cp01 and pi-cp02.
  7. Update Compute Nodes:
    • Now that pi-cp01 and pi-cp02 should have internet access, update them:
        # On pi-cp01 AND pi-cp02
        sudo apt update
        sudo apt full-upgrade -y
        sudo apt install -y vim git build-essential # Install tools here too

Phase 3: Common Cluster Environment Setup

(Perform steps on all nodes unless specified)

  1. Configure Hostname Resolution (/etc/hosts):
    • Edit the hosts file on all three nodes: sudo vim /etc/hosts
    • Ensure the following lines exist (add them if missing, below the 127.0.0.1 localhost line):
        127.0.1.1       <current_hostname> # This line should already be present (added by the OS installer/imager); leave it as-is

        # Cluster Nodes
        10.0.0.1    pi-head
        10.0.0.2    pi-cp01
        10.0.0.3    pi-cp02
    • Test: From any node, ping the others by hostname (e.g., ping -c 1 pi-cp01 from pi-head), or use the quick loop below.
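    • A small convenience sketch (using only the hostnames defined above) to check all three from any node in one go:
        for h in pi-head pi-cp01 pi-cp02; do
            ping -c 1 -W 2 "$h" > /dev/null && echo "$h: reachable" || echo "$h: FAILED"
        done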
  2. Create Common Cluster User (cuser):
    • Crucially, cuser must have the same User ID (UID) and Group ID (GID) on all nodes.
    • On pi-head first:
        sudo adduser cuser
        # Follow prompts to set password etc.
        # Note the UID and GID displayed (e.g., uid=1001(cuser) gid=1001(cuser) groups=...)
        # Optional: Add cuser to the sudo group if needed for administration tasks
        # sudo usermod -aG sudo cuser
    • On pi-cp01 and pi-cp02:
      • Get the UID and GID from pi-head (run id cuser there). Let's assume it was 1001 for both UID and GID; replace 1001 below if yours is different.
        # Create the group first with the specific GID
        sudo groupadd -g 1001 cuser
        # Create the user with the specific UID and GID
        sudo useradd -u 1001 -g 1001 -m -s /bin/bash cuser
        # Set the password for the new user
        sudo passwd cuser
        # Optional: Add to sudo group (use the same groups as on pi-head if needed)
        # sudo usermod -aG sudo cuser
    • Verify: Run id cuser on all three nodes and ensure the UID and GID match exactly; the quick check below does this from pi-head in one pass.
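    • A convenience sketch for the check (run from pi-head as your initial admin user; it assumes that username exists on every node, and it will still prompt for each node's password since passwordless SSH is only set up in the next step):
        for h in pi-head pi-cp01 pi-cp02; do
            echo -n "$h: "
            ssh "$h" id cuser
        done
        # All three lines should report identical uid= and gid= values.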
  3. Setup Passwordless SSH for cuser:
    • Log in as cuser on pi-head. You can use su - cuser if logged in as another user, or SSH directly: ssh cuser@pi-head.
    • Generate SSH key pair (run as cuser):
        # Accept default file location (~/.ssh/id_rsa), press Enter for empty passphrase
        ssh-keygen -t rsa -b 4096
    • Copy the public key to all nodes (including pi-head itself):
        # Run as cuser from pi-head
        ssh-copy-id cuser@pi-head
        ssh-copy-id cuser@pi-cp01
        ssh-copy-id cuser@pi-cp02
        # Enter the password for 'cuser' when prompted for each node
    • Test: Still as cuser on pi-head, try SSHing to each node without a password:
        ssh pi-head date
        ssh pi-cp01 date
        ssh pi-cp02 date
        # The first time connecting to each might ask "Are you sure you want to continue connecting (yes/no/[fingerprint])?". Type 'yes'.
        # If it prompts for a password after the first connection, the key setup failed. Check permissions in ~/.ssh directories.
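    • Or check all three in one loop (a quick sketch; BatchMode makes ssh fail instead of falling back to a password prompt, so a broken key setup shows up as an error rather than a hang):
        for h in pi-head pi-cp01 pi-cp02; do
            ssh -o BatchMode=yes "$h" 'echo "$(hostname): key-based login OK"'
        done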
  4. Install and Configure NFS (Shared Filesystem):
    • We’ll share /clusterfs from pi-head to be used by all nodes.
    • On pi-head (NFS Server):
        sudo apt update
        sudo apt install -y nfs-kernel-server
        sudo mkdir -p /clusterfs
        # Option 1: Allow anyone to write (simple for cluster user)
        sudo chown nobody:nogroup /clusterfs
        sudo chmod 777 /clusterfs
        # Option 2: Restrict to cuser (better security, requires consistent UID/GID)
        # sudo chown cuser:cuser /clusterfs
        # sudo chmod 770 /clusterfs # Or 750 if group members only need read
        # Edit the NFS exports file
        sudo vim /etc/exports
        # Add this line to allow access from the private 10.0.0.x network:
        # Use 'no_root_squash' carefully if you need root access over NFS
        /clusterfs    10.0.0.0/24(rw,sync,no_subtree_check)
        # Activate the exports
        sudo exportfs -ra
        # Restart and enable the NFS server service
        sudo systemctl restart nfs-kernel-server
        sudo systemctl enable nfs-kernel-server
    • On pi-cp01 and pi-cp02 (NFS Clients):
        sudo apt update
        sudo apt install -y nfs-common
        sudo mkdir -p /clusterfs
        # Add the mount to /etc/fstab for automatic mounting on boot
        sudo vim /etc/fstab
        # Add this line at the end:
        pi-head:/clusterfs    /clusterfs   nfs    defaults,auto,nofail    0    0
        # Mount all filesystems defined in fstab (including the new one)
        sudo mount -a
        # Verify the mount was successful
        df -h | grep /clusterfs
        # Check mount options (optional)
        mount | grep /clusterfs
    • Test the share:
      • From pi-head as cuser: touch /clusterfs/test_head.txt
      • From pi-cp01 as cuser: ls /clusterfs (should see test_head.txt)
      • From pi-cp02 as cuser: touch /clusterfs/test_cp02.txt
      • From pi-head as cuser: ls /clusterfs (should see both files)
  5. Install and Configure NTP (Time Synchronization): Accurate time is essential for SLURM.
    • Install chrony on all nodes:
        sudo apt update
        sudo apt install -y chrony
    • Ensure it is enabled and running on all nodes:
        sudo systemctl enable --now chrony
    • chrony will automatically use internet time sources. Since all nodes now have internet access (directly or via pi-head), this should work.
    • Verify sync status (it might take a minute or two after starting):
        # Run on all nodes
        chronyc sources
        # Look for lines starting with '^*' (synced server) or '^+' (acceptable server).
        timedatectl status | grep "NTP service"
        # Should show 'active'.
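    • As a rough cross-node check, you can also compare the clocks directly. A small convenience sketch (run as cuser from pi-head, using the passwordless SSH set up earlier):
        # The epoch-second values printed should match to within about a second
        for h in pi-head pi-cp01 pi-cp02; do
            echo -n "$h: "
            ssh "$h" date +%s
        done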

Phase 4: Install and Configure SLURM & Munge

  1. Install Munge (Authentication Service): Munge provides secure authentication between SLURM daemons. Install on all nodes.
    sudo apt update
    sudo apt install -y munge libmunge-dev libmunge2
  2. Create and Distribute Munge Key: A shared secret key must be identical on all nodes.
    • On pi-head ONLY:
        # Stop munge service if running
        sudo systemctl stop munge
        # Create the key (as root or using sudo)
        sudo dd if=/dev/urandom of=/etc/munge/munge.key bs=1 count=1024
        # Set correct ownership and permissions
        sudo chown munge:munge /etc/munge/munge.key
        sudo chmod 400 /etc/munge/munge.key
    • Securely copy the key from pi-head to the compute nodes. Copying straight into /etc/munge over SSH would require root on the remote side, so stage the key through /tmp instead:
        # Run these commands on pi-head as your initial (sudo-capable) user
        sudo cp /etc/munge/munge.key /tmp/munge.key
        sudo chown "$USER" /tmp/munge.key
        scp /tmp/munge.key pi-cp01:/tmp/munge.key
        scp /tmp/munge.key pi-cp02:/tmp/munge.key
        rm /tmp/munge.key
        # Enter the SSH password for your user on pi-cp01/pi-cp02 when prompted.
    • On pi-cp01 and pi-cp02:
        # Ensure munge is stopped
        sudo systemctl stop munge
        # Move the copied key into place, then set correct ownership and permissions
        sudo mv /tmp/munge.key /etc/munge/munge.key
        sudo chown munge:munge /etc/munge/munge.key
        sudo chmod 400 /etc/munge/munge.key
  3. Start and Enable Munge Service: On all nodes:
    sudo systemctl start munge
    sudo systemctl enable munge
    # Verify status
    sudo systemctl status munge
  4. Test Munge Communication:
    • From pi-head:
        # Test local encoding/decoding
        munge -n | unmunge
        # Test head -> cp01
        munge -n | ssh pi-cp01 unmunge
        # Test head -> cp02
        munge -n | ssh pi-cp02 unmunge
        # Test cp01 -> head (round trip)
        ssh pi-cp01 munge -n | unmunge
    • All tests should return a STATUS: Success (...) line. If not, check munge.key consistency, permissions, and the munged service status. Also check /var/log/munge/munged.log.
  5. Install SLURM: Install the SLURM workload manager packages on all nodes.
    sudo apt update
    sudo apt install -y slurm-wlm slurm-wlm-doc # slurm-wlm pulls in slurmd, slurmctld etc.
  6. Configure SLURM (slurm.conf):
    • Create the configuration file on pi-head first. A minimal configuration is below.
    • Optionally, print each node's hardware as SLURM detects it, to help fill in the node definitions below:
        # Run on each node
        slurmd -C   # Prints NodeName, CPUs, RealMemory, etc.
    • Edit the main config file: sudo vim /etc/slurm/slurm.conf
    • Replace the entire content with the following.
      • Adjust RealMemory: free -m shows total memory in MiB. Leave some (~200-300 MB) for the OS. For an 8GB Pi (approx. 7850 MB usable), 7600 is a safe starting point.
      • CPUs: the RPi 5 has 4 cores.
        # /etc/slurm/slurm.conf
        # Basic SLURM configuration for pi-cluster
        ClusterName=pi-cluster
        SlurmctldHost=pi-head #(Or use IP 10.0.0.1)
        # SlurmctldHost=pi-head(10.0.0.1) # Optional: Specify both
        MpiDefault=none
        ProctrackType=proctrack/cgroup
        ReturnToService=1
        SlurmctldPidFile=/run/slurmctld.pid
        SlurmdPidFile=/run/slurmd.pid
        SlurmctldPort=6817
        SlurmdPort=6818
        AuthType=auth/munge
        StateSaveLocation=/var/spool/slurmctld
        SlurmdSpoolDir=/var/spool/slurmd
        SwitchType=switch/none
        TaskPlugin=task/cgroup
        # LOGGING
        SlurmctldLogFile=/var/log/slurm/slurmctld.log
        SlurmdLogFile=/var/log/slurm/slurmd.log
        JobCompType=jobcomp/none # No job completion logging for basic setup
        # TIMERS
        SlurmctldTimeout=120
        SlurmdTimeout=300
        InactiveLimit=0
        MinJobAge=300
        KillWait=30
        Waittime=0
        # SCHEDULING
        SchedulerType=sched/backfill
        SelectType=select/cons_tres # Use cons_tres for memory tracking
        SelectTypeParameters=CR_Core_Memory # Track Cores and Memory
        # NODES - Adjust RealMemory based on your Pi 5 8GB (~7600 is conservative)
        NodeName=pi-head NodeAddr=10.0.0.1 CPUs=4 RealMemory=7600 State=UNKNOWN
        NodeName=pi-cp01 NodeAddr=10.0.0.2 CPUs=4 RealMemory=7600 State=UNKNOWN
        NodeName=pi-cp02 NodeAddr=10.0.0.3 CPUs=4 RealMemory=7600 State=UNKNOWN
        # PARTITION
        PartitionName=rpi_part Nodes=pi-head,pi-cp01,pi-cp02 Default=YES MaxTime=INFINITE State=UP Oversubscribe=NO
    • Create the SLURM log and spool directories on all nodes:
        sudo mkdir -p /var/log/slurm /var/spool/slurmctld /var/spool/slurmd
        # SLURM runs its daemons as the 'slurm' user/group created during package install
        sudo chown slurm:slurm /var/log/slurm /var/spool/slurmctld /var/spool/slurmd
        sudo chmod 755 /var/log/slurm /var/spool/slurmctld /var/spool/slurmd
        # Verify the user exists
        id slurm
    • Copy slurm.conf from pi-head to the compute nodes. As with the munge key, stage it through /tmp, since writing directly to /etc/slurm over SSH requires root on the remote side:
        # On pi-head, as your initial (sudo-capable) user; you will be prompted for passwords
        scp /etc/slurm/slurm.conf pi-cp01:/tmp/slurm.conf
        scp /etc/slurm/slurm.conf pi-cp02:/tmp/slurm.conf
        ssh -t pi-cp01 sudo mv /tmp/slurm.conf /etc/slurm/slurm.conf
        ssh -t pi-cp02 sudo mv /tmp/slurm.conf /etc/slurm/slurm.conf
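    • Optionally, confirm the copied file is identical everywhere before starting any daemons. A quick sketch (run as cuser from pi-head; slurm.conf is world-readable by default):
        for h in pi-head pi-cp01 pi-cp02; do
            ssh "$h" md5sum /etc/slurm/slurm.conf
        done
        # All three checksums should match.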
  7. Configure Cgroup Plugin (cgroup.conf): Needed for resource constraints (ProctrackType=proctrack/cgroup, TaskPlugin=task/cgroup, SelectType=select/cons_tres).
    • Create /etc/slurm/cgroup.conf on pi-head first: sudo vim /etc/slurm/cgroup.conf
    • Add the following content. Note: older guides also set CgroupReleaseAgentDir and TaskAffinity here, but recent SLURM releases have dropped those options (they are ignored or rejected), so they are omitted:
        # /etc/slurm/cgroup.conf
        CgroupAutomount=yes
        ConstrainCores=yes
        ConstrainDevices=yes
        ConstrainRAMSpace=yes
    • No separate release-agent directory is needed with current SLURM releases (the old CgroupReleaseAgentDir mechanism is gone), so there is nothing extra to create on the nodes.
    • Copy cgroup.conf to the compute nodes (staged through /tmp as before):
        # On pi-head
        scp /etc/slurm/cgroup.conf pi-cp01:/tmp/cgroup.conf
        scp /etc/slurm/cgroup.conf pi-cp02:/tmp/cgroup.conf
        ssh -t pi-cp01 sudo mv /tmp/cgroup.conf /etc/slurm/cgroup.conf
        ssh -t pi-cp02 sudo mv /tmp/cgroup.conf /etc/slurm/cgroup.conf
  8. Start SLURM Services:
    • On pi-head (Controller):
        sudo systemctl enable slurmctld.service
        sudo systemctl start slurmctld.service
        # Check status immediately
        sudo systemctl status slurmctld.service
        journalctl -u slurmctld.service | tail -n 20 # Check logs
    • On ALL nodes (compute daemons, including pi-head):
        sudo systemctl enable slurmd.service
        sudo systemctl start slurmd.service
        # Check status
        sudo systemctl status slurmd.service
        # Check logs on each node
        tail -n 20 /var/log/slurm/slurmd.log
  9. Verify SLURM Cluster Status:
    • Wait ~10-15 seconds for nodes to register. Run on pi-head:
        sinfo
        # Expected output:
        # PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
        # rpi_part*    up   infinite      3   idle pi-head,pi-cp0[1-2]
        # (State might be 'unk' or 'down' initially, or 'mix' if nodes are registering)

        scontrol show node
        # Check details for each node. Look for 'State=IDLE'. If 'State=DOWN' or 'State=DRAINED', check logs:
        # - /var/log/slurm/slurmctld.log on pi-head
        # - /var/log/slurm/slurmd.log on the affected node(s)

        # If nodes are down/drained due to initial errors that are now fixed:
        # sudo scontrol update nodename=pi-head,pi-cp01,pi-cp02 state=resume
    • Common causes of nodes showing DOWN or DRAINED:
      • Time synchronization errors between nodes (fix with chrony).
      • Munge authentication errors (check munge.key and the munged service).
      • Firewall blocking ports 6817 (slurmctld) or 6818 (slurmd). The iptables rules on pi-head shouldn't block the 10.0.0.x network, but check whether ufw or another firewall is active.
      • Incorrect hostnames or IP addresses in slurm.conf (SlurmctldHost, NodeName, NodeAddr). Use the 10.0.0.x addresses for NodeAddr.
      • Incorrect permissions or non-existent spool/log directories (/var/spool/slurm*, /var/log/slurm).
      • slurmd fails to start due to resource limits or cgroup issues. Check journalctl -u slurmd and dmesg.
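    • For a quick health sweep across all nodes, a small convenience sketch (run as cuser from pi-head, relying on the passwordless SSH set up earlier):
        for h in pi-head pi-cp01 pi-cp02; do
            echo "=== $h ==="
            ssh "$h" "systemctl is-active munge slurmd; date"
        done
        # The controller runs only on pi-head:
        systemctl is-active slurmctld
        # Any 'inactive' or 'failed' result points at the daemon (and node) to investigate first.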

Phase 5: Testing the SLURM Cluster

(Run these commands as cuser on pi-head)

  1. Login as cuser:
    su - cuser
    # Or: ssh cuser@pi-head
    cd /clusterfs # Work in the shared filesystem if desired
  2. Run a Simple Command Interactively:
    srun hostname
    # Runs 'hostname' on one available node in the default partition.
  3. Run Command on Specific Number of Nodes:
    # Run hostname on 2 different nodes, 1 task per node
    srun --nodes=2 --ntasks-per-node=1 hostname | sort
    # Should show two different hostnames (e.g., pi-cp01, pi-cp02 or pi-head, pi-cp01)
  4. Submit a Simple Batch Job:
    • Create a job script file, e.g., /clusterfs/cuser/hello.sh (ensure /clusterfs/cuser exists and is writable by cuser):
        #!/bin/bash
        #SBATCH --job-name=hello       # Job name
        #SBATCH --output=hello_job_%j.out # Standard output file (%j = job ID)
        #SBATCH --error=hello_job_%j.err  # Standard error file
        #SBATCH --nodes=3                 # Request all 3 nodes
        #SBATCH --ntasks-per-node=2       # Request 2 tasks (processes) per node (total 6)
        #SBATCH --cpus-per-task=1         # Request 1 CPU core per task
        #SBATCH --partition=rpi_part      # Specify partition (optional if default)
        #SBATCH --time=00:05:00           # Time limit (5 minutes)

        echo "Job running on nodes:"
        srun hostname | sort # Use srun within sbatch to launch parallel tasks

        echo "Tasks started at: $(date)"
        sleep 20 # Simulate some work
        echo "Tasks finished at: $(date)"
    • Make the script executable: chmod +x /clusterfs/cuser/hello.sh
    • Submit the job from the directory containing the script:
        sbatch hello.sh
        # Should print: Submitted batch job <JOB_ID>
    • Check the queue:
        squeue
        # Shows running or pending jobs
        watch squeue # Monitor queue updates
        sinfo
        # Should show nodes in 'alloc' or 'mix' state.
    • Once the job finishes (disappears from squeue), check the output files (hello_job_<JOB_ID>.out and .err) in the submission directory:
        cat hello_job_<JOB_ID>.out
        # With --nodes=3 and --ntasks-per-node=2 (6 tasks total), the srun step launches 6 tasks,
        # so the output should list pi-head, pi-cp01, and pi-cp02 twice each.

Congratulations!

You should now have a functional 3-node Raspberry Pi 5 SLURM cluster. The compute nodes (pi-cp01, pi-cp02) use the head node (pi-head) as a gateway for internet access, while all cluster communication happens over the private 10.0.0.x network.

Next Steps & Considerations

  • Install MPI: Install OpenMPI or MPICH (sudo apt install -y openmpi-bin libopenmpi-dev on all nodes) to run parallel MPI applications; a minimal smoke test is sketched after this list. Update SLURM’s MpiDefault=pmix or configure MPI properly if needed.
  • Shared Software Stack: Install compilers, libraries, and applications needed for your HPC tasks onto the shared NFS filesystem (/clusterfs) so they are accessible from all nodes without needing installation everywhere. Modules systems like Lmod can help manage this.
  • Monitoring: Set up monitoring tools like htop, glances, or more comprehensive systems like Prometheus + Grafana or Ganglia to observe cluster load and resource usage.
  • SLURM Tuning: Explore more advanced slurm.conf options: resource limits (memory, cores per job/user), Quality of Service (QoS), fair-share scheduling, job arrays.
  • SLURM Accounting: For tracking resource usage over time, set up the SLURM accounting database (slurmdbd) which requires installing and configuring a database (like MariaDB/MySQL).
  • Security: Review iptables rules, harden SSH (/etc/ssh/sshd_config), and consider user permissions carefully.
  • Backup: Back up your slurm.conf, munge.key, and important data on /clusterfs.
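
To try the MPI route mentioned above, a minimal smoke test (a sketch only; it assumes OpenMPI has been installed on all nodes as suggested and that /clusterfs/cuser exists) is to let the MPI launcher start a trivial command across the whole allocation. As cuser on pi-head, create /clusterfs/cuser/mpi_smoke.sh with:

    #!/bin/bash
    #SBATCH --job-name=mpi_smoke
    #SBATCH --output=mpi_smoke_%j.out
    #SBATCH --nodes=3
    #SBATCH --ntasks-per-node=2
    #SBATCH --time=00:02:00

    # OpenMPI's mpirun should detect the SLURM allocation and start one 'hostname'
    # per task across the three nodes. If it does not, try 'srun --mpi=pmix hostname'
    # instead; which launcher works depends on how the Debian packages were built.
    mpirun hostname | sort | uniq -c

Submit it with sbatch /clusterfs/cuser/mpi_smoke.sh; the output file should show pi-head, pi-cp01, and pi-cp02 twice each. Once that works, real MPI programs compiled with mpicc and stored on /clusterfs can be launched the same way.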

Enjoy your mini HPC cluster!