HPC Administration Tutorial - Part 4
Created: 2025-03-16 17:03:13 | Last updated: 2025-03-16 17:03:13 | Status: Public
Documentation Best Practices
This fourth part of our HPC administration tutorial focuses on effective documentation practices. Good documentation is crucial for maintaining institutional knowledge, enabling smooth operations, and ensuring user productivity in an HPC environment.
Table of Contents
- Documentation Best Practices
- System Documentation
- User Documentation
- Knowledge Management
- Documentation Tools and Platforms
Documentation Best Practices
System Documentation
- Hardware Inventory:
Create a comprehensive hardware inventory document:
nano /shared/docs/hardware_inventory.md
Template:
# HPC Cluster Hardware Inventory
Last updated: YYYY-MM-DD
## Head Node (pi-head)
- **Model**: Raspberry Pi 5
- **CPU**: Broadcom BCM2712 (4 cores)
- **RAM**: 8GB LPDDR5
- **Storage**: 64GB microSD + 500GB USB SSD
- **Network**: Gigabit Ethernet
- **MAC Address**: XX:XX:XX:XX:XX:XX
- **Serial Number**: XXXXXXXX
- **Purchase Date**: YYYY-MM-DD
- **Warranty Expiration**: YYYY-MM-DD
## Compute Node 1 (pi-compute-01)
- **Model**: Raspberry Pi 5
- **CPU**: Broadcom BCM2712 (4 cores)
- **RAM**: 8GB LPDDR5
- **Storage**: 32GB microSD
- **Network**: Gigabit Ethernet
- **MAC Address**: XX:XX:XX:XX:XX:XX
- **Serial Number**: XXXXXXXX
- **Purchase Date**: YYYY-MM-DD
- **Warranty Expiration**: YYYY-MM-DD
## Compute Node 2 (pi-compute-02)
- **Model**: Raspberry Pi 5
- **CPU**: Broadcom BCM2712 (4 cores)
- **RAM**: 8GB LPDDR5
- **Storage**: 32GB microSD
- **Network**: Gigabit Ethernet
- **MAC Address**: XX:XX:XX:XX:XX:XX
- **Serial Number**: XXXXXXXX
- **Purchase Date**: YYYY-MM-DD
- **Warranty Expiration**: YYYY-MM-DD
## Network Equipment
- **Switch Model**: XXXXX
- **Ports**: 8-port Gigabit
- **Location**: XXXXX
- **IP Address**: 192.168.1.XXX
- **Purchase Date**: YYYY-MM-DD
## Storage Systems
- **Device**: 500GB USB SSD
- **Connected to**: Head Node
- **Filesystem**: ext4
- **Mount Point**: /shared
- **Purpose**: Shared storage for home directories and applications
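To help populate the MAC address and serial number fields above, the details can be read directly from each node. This is a minimal sketch assuming Raspberry Pi OS defaults (serial number exposed in /proc/cpuinfo, wired interface named eth0); verify the paths on your nodes:
# Run on each node to collect inventory details
grep Serial /proc/cpuinfo              # Board serial number
cat /sys/class/net/eth0/address        # MAC address of the wired interface
free -h | grep Mem                     # Installed memory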
- Network Documentation:
nano /shared/docs/network_architecture.md
Template:
# HPC Cluster Network Architecture
Last updated: YYYY-MM-DD
## Network Topology
graph TD
    Switch[Gigabit Switch] --- HeadNode[Head Node<br/>pi-head<br/>192.168.1.100]
    Switch --- Compute1[Compute Node 1<br/>pi-compute-01<br/>192.168.1.101]
    Switch --- Compute2[Compute Node 2<br/>pi-compute-02<br/>192.168.1.102]
## IP Address Allocation
| Hostname | IP Address | MAC Address | Purpose |
|----------|------------|-------------|---------|
| pi-head | 192.168.1.100 | XX:XX:XX:XX:XX:XX | Head/management node |
| pi-compute-01 | 192.168.1.101 | XX:XX:XX:XX:XX:XX | Compute node |
| pi-compute-02 | 192.168.1.102 | XX:XX:XX:XX:XX:XX | Compute node |
## Network Services
| Service | Port | Nodes | Description |
|---------|------|-------|-------------|
| SSH | 22 | All | Secure Shell |
| NFS | 2049 | pi-head (server) | Network File System |
| Slurm | 6817-6819 | All | Slurm communication |
| Prometheus | 9090 | pi-head | Monitoring system |
| Node Exporter | 9100 | All | System metrics |
| Grafana | 3000 | pi-head | Monitoring dashboard |
## Firewall Rules
| Source | Destination | Port | Protocol | Action | Purpose |
|--------|-------------|------|----------|--------|---------|
| 192.168.1.0/24 | Any | 22 | TCP | ALLOW | SSH access |
| 192.168.1.0/24 | Any | 6817-6819 | TCP | ALLOW | Slurm |
| 192.168.1.0/24 | pi-head | 2049 | TCP/UDP | ALLOW | NFS |
| Any | Any | Any | Any | DENY | Default rule |
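If the cluster uses ufw, the rules in the table above could be expressed roughly as follows. This is a sketch only and assumes the head node is the NFS target and the cluster subnet is 192.168.1.0/24; adapt it to whatever firewall tooling you actually run:
# Default deny, then allow SSH, Slurm, and NFS from the cluster subnet
sudo ufw default deny incoming
sudo ufw allow from 192.168.1.0/24 to any port 22 proto tcp
sudo ufw allow from 192.168.1.0/24 to any port 6817:6819 proto tcp
sudo ufw allow from 192.168.1.0/24 to any port 2049
sudo ufw enable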
- Installation and Configuration Guide:
nano /shared/docs/installation_guide.md
Template:
# HPC Cluster Installation and Configuration Guide
Last updated: YYYY-MM-DD
## 1. Base Installation
### Operating System
- OS: Debian Bookworm
- Installation method: Raspberry Pi Imager
- Initial configuration:
- Enabled SSH
- Set hostname
- Updated system packages
### Network Configuration
Static IP configuration in `/etc/network/interfaces.d/eth0`:
auto eth0
iface eth0 inet static
address 192.168.1.XXX
netmask 255.255.255.0
gateway 192.168.1.1
dns-nameservers 8.8.8.8 8.8.4.4
## 2. Slurm Installation
### Package Installation
sudo apt install -y slurmd slurm-client slurm-wlm munge libmunge-dev
### Configuration Files
Location: `/etc/slurm/slurm.conf`
Key parameters:
- ClusterName: pi-cluster
- SlurmctldHost: pi-head
- NodeName definition with CPUs and Memory
- Partition configuration
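For reference, the corresponding lines in slurm.conf might look something like this. It is a sketch based on the node specs recorded above; the CPUs and RealMemory values should match what `slurmd -C` reports on your compute nodes:
ClusterName=pi-cluster
SlurmctldHost=pi-head
NodeName=pi-compute-[01-02] CPUs=4 RealMemory=8000 State=UNKNOWN
PartitionName=main Nodes=pi-compute-[01-02] Default=YES MaxTime=INFINITE State=UP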
## 3. Storage Configuration
### NFS Server Setup (Head Node)
Installed packages:
sudo apt install -y nfs-kernel-server
Exports defined in `/etc/exports`:
/shared 192.168.1.0/24(rw,sync,no_subtree_check)
/home 192.168.1.0/24(rw,sync,no_subtree_check)
### NFS Client Setup (Compute Nodes)
Installed packages:
sudo apt install -y nfs-common
Mount entries in `/etc/fstab`:
pi-head:/shared /shared nfs rsize=1048576,wsize=1048576,noatime,nodiratime 0 0
pi-head:/home /home nfs rsize=1048576,wsize=1048576,noatime,nodiratime 0 0
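After adding the fstab entries, the mounts can be checked from each compute node (assuming the head node exports are already active):
sudo mount -a            # Mount everything listed in /etc/fstab
df -h /shared /home      # Confirm both NFS mounts are present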
User Documentation
- User Guide:
nano /shared/docs/user_guide.md
Template:
# HPC Cluster User Guide
Last updated: YYYY-MM-DD
## Getting Started
### Account Setup
To request an account, please contact the system administrator.
### Connecting to the Cluster
Connect using SSH:
ssh username@pi-head
### Storage Locations
- `/home/username`: Personal home directory
- `/shared/apps`: Software applications
- `/shared/data`: Shared datasets
- `/scratch`: Temporary high-speed storage (files deleted after 7 days)
## Running Jobs
### Submitting Batch Jobs
Create a job script:
#!/bin/bash
#SBATCH --job-name=myJob
#SBATCH --output=myJob_%j.out
#SBATCH --error=myJob_%j.err
#SBATCH --ntasks=4
#SBATCH --time=01:00:00
# Your commands here
echo "Running on $(hostname)"
sleep 60
Submit the job:
sbatch myjob.sh
### Monitoring Jobs
Check job status:
squeue -u username
View detailed job information:
scontrol show job JOB_ID
Cancel a job:
scancel JOB_ID
### Interactive Jobs
Request an interactive session:
srun --pty bash -i
## Software Environment
### Available Software
Use the module system to access software:
module avail # List available software
module load python-ml # Load a specific module
module list # List loaded modules
module unload python-ml # Unload a module
### Installing Custom Software
For personal use, install in your home directory:
pip install --user package_name
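For projects that need isolated dependencies, a per-user virtual environment is another option. The paths below are only illustrative:
python3 -m venv ~/envs/myproject        # Create a personal virtual environment
source ~/envs/myproject/bin/activate    # Activate it
pip install package_name                # Installs into the environment, not system-wide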
## Getting Help
For assistance, contact:
- Email: admin@example.com
- Submit a ticket: http://helpdesk.example.com
- Software Documentation:
nano /shared/docs/software_modules.md
Template:
# Available Software Modules
Last updated: YYYY-MM-DD
## How to Use Modules
The cluster uses Environment Modules to manage software. Basic commands:
module avail # List available modules
module load <module_name>    # Load a module
module list # List loaded modules
module unload <module_name>  # Unload a module
module purge # Unload all modules
## Available Modules
| Module Name | Version | Description | Example Usage |
|-------------|---------|-------------|--------------|
| python-ml | 3.9.2 | Python with ML libraries | `module load python-ml` |
| openmpi | 4.1.1 | MPI implementation | `module load openmpi` |
## Python ML Module
The python-ml module provides Python with common machine learning libraries:
- numpy
- scipy
- scikit-learn
- pandas
- matplotlib
Example usage:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
# Your code here
## OpenMPI Module
The OpenMPI module provides MPI libraries for parallel programming.
Example compilation:
mpicc -o mpi_program mpi_program.c
Example execution:
srun -n 4 ./mpi_program
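MPI programs can also be run through the batch system. A job script for the program above might look like this (a sketch assuming the openmpi module listed earlier):
#!/bin/bash
#SBATCH --job-name=mpi_test
#SBATCH --ntasks=4
#SBATCH --time=00:10:00
module load openmpi
srun ./mpi_program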
Knowledge Management
- Incident Response Documentation:
nano /shared/docs/incident_response.md
Template:
# Incident Response Procedures
Last updated: YYYY-MM-DD
## Incident Categories
1. **System Outage**
- Head node failure
- Compute node failure
- Network failure
- Storage failure
2. **Performance Issues**
- Job scheduling delays
- Slow file access
- Network latency
3. **Security Incidents**
- Unauthorized access
- Unusual system behavior
- Resource abuse
## Response Procedures
### Head Node Failure
1. **Assessment**
- Check physical status (power, connections)
- Check remote access capability
- Verify if services are running
2. **Recovery Steps**
- Restart the node if possible
- Check system logs: `sudo journalctl -xb`
- Verify Slurm controller status: `sudo systemctl status slurmctld`
- Restart services as needed
3. **Post-Incident**
- Document the cause
- Update monitoring thresholds if needed
- Review backup procedures
### Storage Access Issues
1. **Assessment**
- Check NFS server status: `sudo systemctl status nfs-kernel-server`
- Verify mount points on compute nodes: `df -h`
- Check disk space and inode usage: `df -h` and `df -i`
2. **Recovery Steps**
- Restart NFS service if needed: `sudo systemctl restart nfs-kernel-server`
- Remount filesystems on clients: `sudo mount -a`
- Clear space if filesystems are full
3. **Post-Incident**
- Review storage allocation
- Implement quota adjustments if needed
- Add monitoring for space utilization
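The assessment commands above can be combined into a quick health check run from the head node. This is a sketch that reuses the node names from this guide:
#!/bin/bash
# Quick NFS health check: server status, exports, and client mounts
systemctl is-active nfs-kernel-server
showmount -e pi-head
for node in pi-compute-01 pi-compute-02; do
    ssh "$node" "df -h /shared /home"
done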
## Incident Reporting Template
Incident ID: INC-YYYYMMDD-XX
Date/Time: YYYY-MM-DD HH:MM
Type: [System Outage/Performance/Security]
Description: Brief description of the incident
Impact:
- Services affected
- Users affected
- Duration of impact
Root Cause:
- Detailed analysis of what happened
Resolution:
- Steps taken to resolve
- Time to resolution
Preventive Measures:
- Actions to prevent recurrence
- Change Management Documentation:
nano /shared/docs/change_management.md
Template:
# Change Management Procedures
Last updated: YYYY-MM-DD
## Change Request Process
1. **Submit Change Request**
- Complete the change request form
- Include purpose, scope, impact, and rollback plan
2. **Review and Approval**
- Technical review
- Impact assessment
- Schedule determination
3. **Implementation**
- Notify affected users
- Implement change
- Verify success
4. **Documentation**
- Update system documentation
- Record lessons learned
## Change Request Form
Change ID: CHG-YYYYMMDD-XX
Requestor: [Name]
Date Submitted: YYYY-MM-DD
Description:
[Detailed description of the change]
Purpose:
[Why this change is needed]
Scope:
[Systems/services affected]
Impact:
[Expected user impact, downtime]
Implementation Plan:
[Step-by-step implementation procedure]
Rollback Plan:
[How to revert if problems occur]
Testing Plan:
[How to verify success]
Schedule:
Proposed date/time: YYYY-MM-DD HH:MM
Estimated duration: XX hours/minutes
## Maintenance Window Policy
Regular maintenance windows:
- First Sunday of each month, 08:00-12:00
- Users notified 5 days in advance
- Emergency maintenance requires 24-hour notice when possible
Documentation Tools and Platforms
- Implementing a Documentation System:
nano /shared/docs/setup_mkdocs.sh
Script content:
#!/bin/bash
# Setup MkDocs for cluster documentation
# Install MkDocs
pip install mkdocs mkdocs-material
# Create documentation project
mkdir -p /shared/docs/mkdocs
cd /shared/docs/mkdocs
# Initialize MkDocs
mkdocs new .
# Configure MkDocs
cat > mkdocs.yml << EOF
site_name: Pi HPC Cluster Documentation
theme:
  name: material
  palette:
    primary: blue
    accent: light blue
nav:
  - Home: index.md
  - System:
      - Hardware: system/hardware.md
      - Network: system/network.md
      - Installation: system/installation.md
  - User Guide:
      - Getting Started: user/getting_started.md
      - Running Jobs: user/running_jobs.md
      - Software: user/software.md
  - Administration:
      - Monitoring: admin/monitoring.md
      - Incidents: admin/incidents.md
      - Changes: admin/changes.md
EOF
# Create directory structure
mkdir -p docs/{system,user,admin}
# Create index page
cat > docs/index.md << EOF
# Pi HPC Cluster Documentation
Welcome to the documentation for our Raspberry Pi HPC Cluster.
## Overview
This documentation covers:
- System architecture and configuration
- User guide for running jobs
- Administration procedures
## Quick Links
- [Hardware Information](system/hardware.md)
- [Getting Started for Users](user/getting_started.md)
- [Monitoring](admin/monitoring.md)
EOF
# Build the docs
mkdocs build
# Set up a simple web server
sudo apt install -y apache2
sudo ln -sf /shared/docs/mkdocs/site /var/www/html/docs
echo "Documentation system set up at http://pi-head/docs"
Make executable:
chmod +x /shared/docs/setup_mkdocs.sh
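While editing pages, MkDocs can also serve a live preview directly, without rebuilding and copying to the web server. The bind address and port below are examples:
cd /shared/docs/mkdocs
mkdocs serve --dev-addr 0.0.0.0:8000   # Preview at http://pi-head:8000 while editing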
- Creating Documentation Templates:
mkdir -p /shared/docs/templates
nano /shared/docs/templates/procedure_template.md
Template content:
# [Procedure Name]
Last updated: YYYY-MM-DD
Author: [Name]
## Purpose
[Brief description of what this procedure accomplishes]
## Prerequisites
- [Required access, tools, or information]
- [Preconditions that must be met]
## Procedure Steps
1. [First step]
# Example command
command argument
2. [Second step]
- [Sub-step A]
- [Sub-step B]
3. [Third step]
## Verification
[How to verify successful completion]
## Troubleshooting
| Issue | Possible Cause | Resolution |
|-------|----------------|------------|
| [Problem] | [Cause] | [Fix] |
## References
- [Link to related documentation]
- [External references]
- Version Control for Documentation:
nano /shared/docs/setup_git_docs.sh
Script content:
#!/bin/bash
# Set up Git repository for documentation
# Install Git
sudo apt install -y git
# Initialize repository
cd /shared/docs
git init
# Create .gitignore
cat > .gitignore << EOF
# Ignore MkDocs build directory
mkdocs/site/
# Ignore temporary files
*~
*.swp
*.bak
EOF
# Initial commit
git add .
git config --local user.name "HPC Admin"
git config --local user.email "admin@example.com"
git commit -m "Initial documentation commit"
# Create a documentation update script
cat > update_docs.sh << 'EOF'
#!/bin/bash
# Script to update documentation
cd /shared/docs
# Add all changes
git add .
# Commit with timestamp and message
git commit -m "Documentation update $(date '+%Y-%m-%d %H:%M:%S'): $1"
# Rebuild MkDocs if it exists
if [ -d "mkdocs" ]; then
cd mkdocs
mkdocs build
fi
echo "Documentation updated successfully"
EOF
chmod +x update_docs.sh
echo "Git repository for documentation initialized"