HPC Administration Tutorial - Part 4
Created: 2025-03-16 17:03:13 | Last updated: 2025-03-16 17:03:13 | Status: Public
Documentation Best Practices
This fourth part of our HPC administration tutorial focuses on effective documentation practices. Good documentation is crucial for maintaining institutional knowledge, enabling smooth operations, and ensuring user productivity in an HPC environment.
Table of Contents
- Documentation Best Practices
- System Documentation
- User Documentation
- Knowledge Management
- Documentation Tools and Platforms
Documentation Best Practices
System Documentation
- Hardware Inventory:
Create a comprehensive hardware inventory document:
nano /shared/docs/hardware_inventory.md
Template:
# HPC Cluster Hardware Inventory
Last updated: YYYY-MM-DD
## Head Node (pi-head)
- **Model**: Raspberry Pi 5
- **CPU**: Broadcom BCM2712 (4 cores)
- **RAM**: 8GB LPDDR5
- **Storage**: 64GB microSD + 500GB USB SSD
- **Network**: Gigabit Ethernet
- **MAC Address**: XX:XX:XX:XX:XX:XX
- **Serial Number**: XXXXXXXX
- **Purchase Date**: YYYY-MM-DD
- **Warranty Expiration**: YYYY-MM-DD
## Compute Node 1 (pi-compute-01)
- **Model**: Raspberry Pi 5
- **CPU**: Broadcom BCM2712 (4 cores)
- **RAM**: 8GB LPDDR5
- **Storage**: 32GB microSD
- **Network**: Gigabit Ethernet
- **MAC Address**: XX:XX:XX:XX:XX:XX
- **Serial Number**: XXXXXXXX
- **Purchase Date**: YYYY-MM-DD
- **Warranty Expiration**: YYYY-MM-DD
## Compute Node 2 (pi-compute-02)
- **Model**: Raspberry Pi 5
- **CPU**: Broadcom BCM2712 (4 cores)
- **RAM**: 8GB LPDDR5
- **Storage**: 32GB microSD
- **Network**: Gigabit Ethernet
- **MAC Address**: XX:XX:XX:XX:XX:XX
- **Serial Number**: XXXXXXXX
- **Purchase Date**: YYYY-MM-DD
- **Warranty Expiration**: YYYY-MM-DD
## Network Equipment
- **Switch Model**: XXXXX
- **Ports**: 8-port Gigabit
- **Location**: XXXXX
- **IP Address**: 192.168.1.XXX
- **Purchase Date**: YYYY-MM-DD
## Storage Systems
- **Device**: 500GB USB SSD
- **Connected to**: Head Node
- **Filesystem**: ext4
- **Mount Point**: /shared
- **Purpose**: Shared storage for home directories and applications
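To help populate the MAC address and serial number fields above, the details can be read directly from each node. This is a minimal sketch assuming Raspberry Pi OS defaults (serial number exposed in /proc/cpuinfo, wired interface named eth0); verify the paths on your nodes:
# Run on each node to collect inventory details
grep Serial /proc/cpuinfo              # Board serial number
cat /sys/class/net/eth0/address        # MAC address of the wired interface
free -h | grep Mem                     # Installed memory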
- Network Documentation:
nano /shared/docs/network_architecture.md
Template:
# HPC Cluster Network Architecture
Last updated: YYYY-MM-DD
## Network Topology
graph TD
    Switch[Gigabit Switch] --- HeadNode[Head Node<br/>pi-head<br/>192.168.1.100]
    Switch --- Compute1[Compute Node 1<br/>pi-compute-01<br/>192.168.1.101]
    Switch --- Compute2[Compute Node 2<br/>pi-compute-02<br/>192.168.1.102]
## IP Address Allocation
| Hostname | IP Address | MAC Address | Purpose |
|----------|------------|-------------|---------|
| pi-head | 192.168.1.100 | XX:XX:XX:XX:XX:XX | Head/management node |
| pi-compute-01 | 192.168.1.101 | XX:XX:XX:XX:XX:XX | Compute node |
| pi-compute-02 | 192.168.1.102 | XX:XX:XX:XX:XX:XX | Compute node |
## Network Services
| Service | Port | Nodes | Description |
|---------|------|-------|-------------|
| SSH | 22 | All | Secure Shell |
| NFS | 2049 | pi-head (server) | Network File System |
| Slurm | 6817-6819 | All | Slurm communication |
| Prometheus | 9090 | pi-head | Monitoring system |
| Node Exporter | 9100 | All | System metrics |
| Grafana | 3000 | pi-head | Monitoring dashboard |
## Firewall Rules
| Source | Destination | Port | Protocol | Action | Purpose |
|--------|-------------|------|----------|--------|---------|
| 192.168.1.0/24 | Any | 22 | TCP | ALLOW | SSH access |
| 192.168.1.0/24 | Any | 6817-6819 | TCP | ALLOW | Slurm |
| 192.168.1.0/24 | pi-head | 2049 | TCP/UDP | ALLOW | NFS |
| Any | Any | Any | Any | DENY | Default rule |
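If the cluster uses ufw, the rules in the table above could be expressed roughly as follows. This is a sketch only and assumes the head node is the NFS target and the cluster subnet is 192.168.1.0/24; adapt it to whatever firewall tooling you actually run:
# Default deny, then allow SSH, Slurm, and NFS from the cluster subnet
sudo ufw default deny incoming
sudo ufw allow from 192.168.1.0/24 to any port 22 proto tcp
sudo ufw allow from 192.168.1.0/24 to any port 6817:6819 proto tcp
sudo ufw allow from 192.168.1.0/24 to any port 2049
sudo ufw enable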
- Installation and Configuration Guide:
nano /shared/docs/installation_guide.md
Template:
# HPC Cluster Installation and Configuration Guide
Last updated: YYYY-MM-DD
## 1. Base Installation
### Operating System
- OS: Debian Bookworm
- Installation method: Raspberry Pi Imager
- Initial configuration:
- Enabled SSH
- Set hostname
- Updated system packages
### Network Configuration
Static IP configuration in `/etc/network/interfaces.d/eth0`:
auto eth0
iface eth0 inet static
address 192.168.1.XXX
netmask 255.255.255.0
gateway 192.168.1.1
dns-nameservers 8.8.8.8 8.8.4.4
## 2. Slurm Installation
### Package Installation
sudo apt install -y slurmd slurm-client slurm-wlm munge libmunge-dev
### Configuration Files
Location: `/etc/slurm/slurm.conf`
Key parameters:
- ClusterName: pi-cluster
- SlurmctldHost: pi-head
- NodeName definition with CPUs and Memory
- Partition configuration
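For reference, the corresponding lines in slurm.conf might look something like this. It is a sketch based on the node specs recorded above; the CPUs and RealMemory values should match what `slurmd -C` reports on your compute nodes:
ClusterName=pi-cluster
SlurmctldHost=pi-head
NodeName=pi-compute-[01-02] CPUs=4 RealMemory=8000 State=UNKNOWN
PartitionName=main Nodes=pi-compute-[01-02] Default=YES MaxTime=INFINITE State=UP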
## 3. Storage Configuration
### NFS Server Setup (Head Node)
Installed packages:
sudo apt install -y nfs-kernel-server
Exports defined in `/etc/exports`:
/shared 192.168.1.0/24(rw,sync,no_subtree_check)
/home 192.168.1.0/24(rw,sync,no_subtree_check)
### NFS Client Setup (Compute Nodes)
Installed packages:
sudo apt install -y nfs-common
Mount entries in `/etc/fstab`:
pi-head:/shared /shared nfs rsize=1048576,wsize=1048576,noatime,nodiratime 0 0
pi-head:/home /home nfs rsize=1048576,wsize=1048576,noatime,nodiratime 0 0
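After adding the fstab entries, the mounts can be checked from each compute node (assuming the head node exports are already active):
sudo mount -a            # Mount everything listed in /etc/fstab
df -h /shared /home      # Confirm both NFS mounts are present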
User Documentation
- User Guide:
nano /shared/docs/user_guide.md
Template:
# HPC Cluster User Guide
Last updated: YYYY-MM-DD
## Getting Started
### Account Setup
To request an account, please contact the system administrator.
### Connecting to the Cluster
Connect using SSH:
ssh username@pi-head
### Storage Locations
- `/home/username`: Personal home directory
- `/shared/apps`: Software applications
- `/shared/data`: Shared datasets
- `/scratch`: Temporary high-speed storage (files deleted after 7 days)
## Running Jobs
### Submitting Batch Jobs
Create a job script:
#!/bin/bash
#SBATCH --job-name=myJob
#SBATCH --output=myJob_%j.out
#SBATCH --error=myJob_%j.err
#SBATCH --ntasks=4
#SBATCH --time=01:00:00
# Your commands here
echo "Running on $(hostname)"
sleep 60
Submit the job:
sbatch myjob.sh
### Monitoring Jobs
Check job status:
squeue -u username
View detailed job information:
scontrol show job JOB_ID
Cancel a job:
scancel JOB_ID
### Interactive Jobs
Request an interactive session:
srun --pty bash -i
## Software Environment
### Available Software
Use the module system to access software:
module avail # List available software
module load python-ml # Load a specific module
module list # List loaded modules
module unload python-ml # Unload a module
### Installing Custom Software
For personal use, install in your home directory:
pip install --user package_name
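For projects that need isolated dependencies, a per-user virtual environment is another option. The paths below are only illustrative:
python3 -m venv ~/envs/myproject        # Create a personal virtual environment
source ~/envs/myproject/bin/activate    # Activate it
pip install package_name                # Installs into the environment, not system-wide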
## Getting Help
For assistance, contact:
- Email: admin@example.com
- Submit a ticket: http://helpdesk.example.com
- Software Documentation:
nano /shared/docs/software_modules.md
Template:
# Available Software Modules
Last updated: YYYY-MM-DD
## How to Use Modules
The cluster uses Environment Modules to manage software. Basic commands:
module avail # List available modules
module load <module_name>    # Load a module
module list # List loaded modules
module unload <module_name>  # Unload a module
module purge # Unload all modules
## Available Modules
| Module Name | Version | Description | Example Usage |
|-------------|---------|-------------|--------------|
| python-ml | 3.9.2 | Python with ML libraries | `module load python-ml` |
| openmpi | 4.1.1 | MPI implementation | `module load openmpi` |
## Python ML Module
The python-ml module provides Python with common machine learning libraries:
- numpy
- scipy
- scikit-learn
- pandas
- matplotlib
Example usage:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
# Your code here
## OpenMPI Module
The OpenMPI module provides MPI libraries for parallel programming.
Example compilation:
mpicc -o mpi_program mpi_program.c
Example execution:
srun -n 4 ./mpi_program
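MPI programs can also be run through the batch system. A job script for the program above might look like this (a sketch assuming the openmpi module listed earlier):
#!/bin/bash
#SBATCH --job-name=mpi_test
#SBATCH --ntasks=4
#SBATCH --time=00:10:00
module load openmpi
srun ./mpi_program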
Knowledge Management
- Incident Response Documentation:
nano /shared/docs/incident_response.md
Template:
# Incident Response Procedures
Last updated: YYYY-MM-DD
## Incident Categories
1. **System Outage**
- Head node failure
- Compute node failure
- Network failure
- Storage failure
2. **Performance Issues**
- Job scheduling delays
- Slow file access
- Network latency
3. **Security Incidents**
- Unauthorized access
- Unusual system behavior
- Resource abuse
## Response Procedures
### Head Node Failure
1. **Assessment**
- Check physical status (power, connections)
- Check remote access capability
- Verify if services are running
2. **Recovery Steps**
- Restart the node if possible
- Check system logs: `sudo journalctl -xb`
- Verify Slurm controller status: `sudo systemctl status slurmctld`
- Restart services as needed
3. **Post-Incident**
- Document the cause
- Update monitoring thresholds if needed
- Review backup procedures
### Storage Access Issues
1. **Assessment**
- Check NFS server status: `sudo systemctl status nfs-kernel-server`
- Verify mount points on compute nodes: `df -h`
- Check disk space and inode usage: `df -h` and `df -i`
2. **Recovery Steps**
- Restart NFS service if needed: `sudo systemctl restart nfs-kernel-server`
- Remount filesystems on clients: `sudo mount -a`
- Clear space if filesystems are full
3. **Post-Incident**
- Review storage allocation
- Implement quota adjustments if needed
- Add monitoring for space utilization
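The assessment commands above can be combined into a quick health check run from the head node. This is a sketch that reuses the node names from this guide:
#!/bin/bash
# Quick NFS health check: server status, exports, and client mounts
systemctl is-active nfs-kernel-server
showmount -e pi-head
for node in pi-compute-01 pi-compute-02; do
    ssh "$node" "df -h /shared /home"
done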
## Incident Reporting Template
Incident ID: INC-YYYYMMDD-XX
Date/Time: YYYY-MM-DD HH:MM
Type: [System Outage/Performance/Security]
Description: Brief description of the incident
Impact:
- Services affected
- Users affected
- Duration of impact
Root Cause:
- Detailed analysis of what happened
Resolution:
- Steps taken to resolve
- Time to resolution
Preventive Measures:
- Actions to prevent recurrence
- Change Management Documentation:
nano /shared/docs/change_management.md
Template:
# Change Management Procedures
Last updated: YYYY-MM-DD
## Change Request Process
1. **Submit Change Request**
- Complete the change request form
- Include purpose, scope, impact, and rollback plan
2. **Review and Approval**
- Technical review
- Impact assessment
- Schedule determination
3. **Implementation**
- Notify affected users
- Implement change
- Verify success
4. **Documentation**
- Update system documentation
- Record lessons learned
## Change Request Form
Change ID: CHG-YYYYMMDD-XX
Requestor: [Name]
Date Submitted: YYYY-MM-DD
Description:
[Detailed description of the change]
Purpose:
[Why this change is needed]
Scope:
[Systems/services affected]
Impact:
[Expected user impact, downtime]
Implementation Plan:
[Step-by-step implementation procedure]
Rollback Plan:
[How to revert if problems occur]
Testing Plan:
[How to verify success]
Schedule:
Proposed date/time: YYYY-MM-DD HH:MM
Estimated duration: XX hours/minutes
## Maintenance Window Policy
Regular maintenance windows:
- First Sunday of each month, 08:00-12:00
- Users notified 5 days in advance
- Emergency maintenance requires 24-hour notice when possible
Documentation Tools and Platforms
- Implementing a Documentation System:
nano /shared/docs/setup_mkdocs.sh
Script content:
#!/bin/bash
# Setup MkDocs for cluster documentation
# Install MkDocs
pip install mkdocs mkdocs-material
# Create documentation project
mkdir -p /shared/docs/mkdocs
cd /shared/docs/mkdocs
# Initialize MkDocs
mkdocs new .
# Configure MkDocs
cat > mkdocs.yml << EOF
site_name: Pi HPC Cluster Documentation
theme:
  name: material
  palette:
    primary: blue
    accent: light blue
nav:
  - Home: index.md
  - System:
      - Hardware: system/hardware.md
      - Network: system/network.md
      - Installation: system/installation.md
  - User Guide:
      - Getting Started: user/getting_started.md
      - Running Jobs: user/running_jobs.md
      - Software: user/software.md
  - Administration:
      - Monitoring: admin/monitoring.md
      - Incidents: admin/incidents.md
      - Changes: admin/changes.md
EOF
# Create directory structure
mkdir -p docs/{system,user,admin}
# Create index page
cat > docs/index.md << EOF
# Pi HPC Cluster Documentation
Welcome to the documentation for our Raspberry Pi HPC Cluster.
## Overview
This documentation covers:
- System architecture and configuration
- User guide for running jobs
- Administration procedures
## Quick Links
- [Hardware Information](system/hardware.md)
- [Getting Started for Users](user/getting_started.md)
- [Monitoring](admin/monitoring.md)
EOF
# Build the docs
mkdocs build
# Set up a simple web server
sudo apt install -y apache2
sudo ln -sf /shared/docs/mkdocs/site /var/www/html/docs
echo "Documentation system set up at http://pi-head/docs"
Make executable:
chmod +x /shared/docs/setup_mkdocs.sh
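While editing pages, MkDocs can also serve a live preview directly, without rebuilding and copying to the web server. The bind address and port below are examples:
cd /shared/docs/mkdocs
mkdocs serve --dev-addr 0.0.0.0:8000   # Preview at http://pi-head:8000 while editing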
- Creating Documentation Templates:
mkdir -p /shared/docs/templates
nano /shared/docs/templates/procedure_template.md
Template content:
# [Procedure Name]
Last updated: YYYY-MM-DD
Author: [Name]
## Purpose
[Brief description of what this procedure accomplishes]
## Prerequisites
- [Required access, tools, or information]
- [Preconditions that must be met]
## Procedure Steps
1. [First step]
# Example command
command argument
2. [Second step]
- [Sub-step A]
- [Sub-step B]
3. [Third step]
## Verification
[How to verify successful completion]
## Troubleshooting
| Issue | Possible Cause | Resolution |
|-------|----------------|------------|
| [Problem] | [Cause] | [Fix] |
## References
- [Link to related documentation]
- [External references]
- Version Control for Documentation:
nano /shared/docs/setup_git_docs.sh
Script content:
#!/bin/bash
# Set up Git repository for documentation
# Install Git
sudo apt install -y git
# Initialize repository
cd /shared/docs
git init
# Create .gitignore
cat > .gitignore << EOF
# Ignore MkDocs build directory
mkdocs/site/
# Ignore temporary files
*~
*.swp
*.bak
EOF
# Initial commit
git add .
git config --local user.name "HPC Admin"
git config --local user.email "admin@example.com"
git commit -m "Initial documentation commit"
# Create a documentation update script
cat > update_docs.sh << 'EOF'
#!/bin/bash
# Script to update documentation
cd /shared/docs
# Add all changes
git add .
# Commit with timestamp and message
git commit -m "Documentation update $(date '+%Y-%m-%d %H:%M:%S'): $1"
# Rebuild MkDocs if it exists
if [ -d "mkdocs" ]; then
cd mkdocs
mkdocs build
fi
echo "Documentation updated successfully"
EOF
chmod +x update_docs.sh
echo "Git repository for documentation initialized"