HPC Administration Tutorial - Part 2
Created: 2025-03-16 17:02:17 | Last updated: 2025-03-16 17:02:17 | Status: Public
Account Management, Resource Scheduling, and Storage Solutions
This second part of our HPC administration tutorial covers crucial aspects of managing users, resources, and data in your Raspberry Pi HPC cluster. These skills directly translate to enterprise HPC environments.
Table of Contents
- Account Management and Security
- Resource Management and Scheduling
- Storage Solutions
- Next Steps
Account Management and Security
User Account Management
Managing user accounts is a core responsibility for HPC administrators:
- Creating User Accounts:
# Create a new user on the head node
sudo adduser researcher1
# Add to relevant groups
sudo usermod -aG users researcher1
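On a multi-node cluster the account must also exist on every compute node with the same UID, unless you run a central directory service such as LDAP. A minimal sketch, assuming compute nodes named pi-compute-01/02 (as elsewhere in this tutorial) and passwordless root SSH from the head node:
# Replicate a head-node account on the compute nodes, reusing its UID
# so file ownership on NFS-mounted directories matches
USERNAME=researcher1
USER_UID=$(id -u "$USERNAME")
for node in pi-compute-01 pi-compute-02; do
ssh "root@$node" "adduser --uid $USER_UID --disabled-password --gecos '' $USERNAME"
done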
- Batch User Creation:
For managing multiple accounts, create a script:
nano /shared/admin/add_users.sh
Script content:
#!/bin/bash
# Script to add multiple users to the HPC cluster
USERLIST=$1
if [ ! -f "$USERLIST" ]; then
echo "Usage: $0 userlist.txt"
exit 1
fi
while read -r username fullname; do
echo "Creating user: $username ($fullname)"
# adduser creates the home directory; redirect stdin so it cannot consume the list
sudo adduser --gecos "$fullname" --disabled-password "$username" < /dev/null
# Set a temporary password and force a change at first login
echo "$username:TemporaryPass123" | sudo chpasswd
sudo chage -d 0 "$username"
sudo usermod -aG users "$username"
done < "$USERLIST"
echo "User creation complete."
Make it executable:
chmod +x /shared/admin/add_users.sh
Create a user list (e.g. /shared/admin/users.txt). read takes the first word as the username and the rest of the line as the full name, so no quotes are needed:
researcher1 Jane Researcher
researcher2 John Scientist
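Run it against the list:
/shared/admin/add_users.sh /shared/admin/users.txt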
- User Quota Management:
# Install quota tools
sudo apt install -y quota quotatool
# Enable quotas on filesystem
sudo nano /etc/fstab
Update the home mount with quota options:
/dev/sda1 /home ext4 defaults,usrquota,grpquota 0 1
Remount, initialize the quota files, then set limits:
sudo mount -o remount /home
sudo quotacheck -cum /home
sudo quotaon -v /home
# Soft limit 5 GiB, hard limit 6 GiB (setquota takes 1 KiB blocks)
sudo setquota -u researcher1 5242880 6291456 0 0 /home
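Verify the limits:
# Per-user view with human-readable sizes
quota -s -u researcher1
# Filesystem-wide summary
sudo repquota -s /home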
Security Implementation
- SSH Hardening:
sudo nano /etc/ssh/sshd_config
Add/modify these lines (on Debian-based systems the admin group is sudo rather than wheel):
PermitRootLogin no
PasswordAuthentication no
AllowGroups users sudo
Restart SSH (the unit is named ssh on Debian/Raspberry Pi OS):
sudo systemctl restart ssh
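Caution: with PasswordAuthentication no, any user without a working key is locked out, so verify key-based logins before restarting. A typical setup from a user's workstation (pi-head is the head node's hostname):
# Generate a key pair once per workstation
ssh-keygen -t ed25519
# Copy the public key to the head node while passwords still work
ssh-copy-id researcher1@pi-head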
- Firewall Configuration:
# Install and enable firewall
sudo apt install -y ufw
# Configure rules
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow from 192.168.1.0/24 to any port 22
# Slurm daemons (slurmd, slurmctld, slurmdbd); ufw requires a protocol for port ranges
sudo ufw allow from 192.168.1.0/24 to any port 6817:6819 proto tcp
# Enable firewall
sudo ufw enable
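Confirm the resulting rule set:
sudo ufw status verbose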
- Accounting and Auditing:
sudo apt install -y auditd
sudo systemctl enable auditd
sudo systemctl start auditd
# Configure audit rules
sudo nano /etc/audit/rules.d/hpc.rules
Add rules:
# Monitor commands executed as root
-a always,exit -F arch=b64 -F euid=0 -S execve -k rootcmd
# Log all executions of sudo
-a always,exit -F path=/usr/bin/sudo -F perm=x -k sudo_log
Load new rules:
sudo service auditd restart
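Confirm the rules are active and test a search by key:
# List loaded rules
sudo auditctl -l
# Show today's sudo invocations
sudo ausearch -k sudo_log --start today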
- Regular Updates:
Create an update script:
nano /shared/admin/update_system.sh
Script content:
#!/bin/bash
# System update script (runs from root's crontab)
LOG="/var/log/system_updates.log"
echo "$(date): Beginning system update" >> "$LOG"
apt-get update >> "$LOG" 2>&1
# apt-get has a stable CLI for scripting; run upgrades non-interactively
DEBIAN_FRONTEND=noninteractive apt-get -y upgrade >> "$LOG" 2>&1
echo "$(date): Update complete" >> "$LOG"
Make it executable, then schedule it from root's crontab:
chmod +x /shared/admin/update_system.sh
sudo crontab -e
Add weekly updates:
0 2 * * 0 /shared/admin/update_system.sh
Resource Management and Scheduling
Slurm Configuration and Policy Management
- Advanced Slurm Configuration:
sudo nano /etc/slurm/slurm.conf
Enhance with QoS and account configurations:
# QoS settings
PriorityType=priority/multifactor
PriorityWeightQOS=10000
# Job priorities
PriorityWeightAge=1000
PriorityWeightFairshare=5000
PriorityWeightJobSize=1000
PriorityWeightPartition=1000
# Accounting
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=localhost
AccountingStoragePort=6819
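These settings assume slurmdbd is installed and pointed at a MySQL/MariaDB database; a quick sanity check before proceeding:
# slurmdbd must be running for sacctmgr to work
systemctl status slurmdbd
# The cluster should be registered in the accounting database
sacctmgr show cluster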
- Implementing Fair Share:
# First, create accounts (run as root or a Slurm administrator)
sacctmgr add account biology
sacctmgr add account chemistry
# Associate users with accounts
sacctmgr add user researcher1 account=biology
sacctmgr add user researcher2 account=chemistry
# Set fairshare
sacctmgr modify account biology set fairshare=10
sacctmgr modify account chemistry set fairshare=20
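Confirm the share tree (shows normalized shares and usage per account and user):
sshare -a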
- Quality of Service (QoS) Setup:
# Create different QoS levels
sacctmgr add qos high priority=1000
sacctmgr add qos normal priority=100
sacctmgr add qos low priority=10
# Assign QoS to accounts
sacctmgr modify account biology set qos=normal,high
sacctmgr modify account chemistry set qos=normal
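Verify the QoS definitions:
sacctmgr show qos format=Name,Priority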
- Creating Job Submission Templates:
mkdir -p /shared/templates
CPU job template:
cat > /shared/templates/cpu_job.sh << 'EOF'
#!/bin/bash
#SBATCH --job-name=cpu_job
#SBATCH --output=cpu_job_%j.out
#SBATCH --error=cpu_job_%j.err
#SBATCH --ntasks=4
#SBATCH --time=01:00:00
#SBATCH --qos=normal
# Add your commands below
echo "Running on $(hostname)"
sleep 60
EOF
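Users can copy the template and adapt it:
cp /shared/templates/cpu_job.sh my_job.sh
sbatch my_job.sh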
Practical Job Management
- Job Submission and Monitoring:
# Submit a job
sbatch job_script.sh
# Check job status
squeue
# View detailed job info
scontrol show job JOB_ID
# Cancel a job
scancel JOB_ID
- Node Management:
# View node status
sinfo -N
# Detailed node information
scontrol show node pi-compute-01
# Drain a node (prevent new jobs)
scontrol update nodename=pi-compute-01 state=drain reason="maintenance"
# Resume a node
scontrol update nodename=pi-compute-01 state=resume
- Reservation Management:
# Create a maintenance reservation (Slurm requires a user or account list)
scontrol create reservation reservationname=maintenance user=root flags=maint starttime=2025-12-01T08:00:00 endtime=2025-12-01T12:00:00 nodes=pi-compute-[01-02]
# View reservations
scontrol show res
# Delete a reservation
scontrol delete reservationname=maintenance
- Interactive Jobs:
# Request an interactive session
srun --pty bash -i
# Interactive session with specific resources
srun --nodes=1 --ntasks-per-node=2 --time=01:00:00 --pty bash -i
Resource Limits and Policies
- Memory Limits:
# Update slurm.conf to enable memory enforcement
sudo nano /etc/slurm/slurm.conf
Add/modify (cons_tres tracks CPUs and memory; restart slurmctld and slurmd afterwards):
SelectType=select/cons_tres
SelectTypeParameters=CR_CPU_Memory
Submit a memory-constrained job:
sbatch --mem=2G memory_job.sh
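memory_job.sh here stands for any batch script; a minimal sketch that simply reports the limit Slurm granted is:
#!/bin/bash
#SBATCH --job-name=memtest
#SBATCH --ntasks=1
#SBATCH --time=00:05:00
# Print the memory allocation Slurm recorded for this job
scontrol show job "$SLURM_JOB_ID" | grep -i mem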
- Time Limits:
# DefaultTime and MaxTime are per-partition parameters in slurm.conf
# (there is no global default), so set them on the partition definitions:
PartitionName=debug Nodes=pi-compute-[01-02] Default=NO DefaultTime=00:10:00 MaxTime=01:00:00 State=UP
PartitionName=main Nodes=pi-compute-[01-02] Default=YES DefaultTime=01:00:00 MaxTime=24:00:00 State=UP
- Resource Protection:
# Limit concurrent jobs per user
sacctmgr modify qos normal set MaxJobsPerUser=5
# Limit total CPU count per user
sacctmgr modify qos normal set MaxTRESPerUser=cpu=8
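Confirm the limits took effect:
sacctmgr show qos normal format=Name,MaxJobsPU,MaxTRESPU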
Storage Solutions
NFS Advanced Configuration
- Performance Tuning:
sudo nano /etc/exports
Optimize exports (async boosts throughput but risks data loss on a server crash; it contradicts sync, so list only one):
/shared 192.168.1.0/24(rw,async,no_subtree_check,no_root_squash)
Re-export:
sudo exportfs -ra
On clients:
sudo nano /etc/fstab
Optimize mounts (noatime implies nodiratime, so one option suffices):
pi-head:/shared /shared nfs rsize=1048576,wsize=1048576,noatime 0 0
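Remount to pick up the new options, then confirm what the client negotiated:
sudo umount /shared && sudo mount /shared
nfsstat -m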
- Access Control:
# Create project directories
sudo mkdir -p /shared/projects/{project1,project2}
# Set appropriate permissions
sudo chown :biology /shared/projects/project1
sudo chown :chemistry /shared/projects/project2
sudo chmod 2770 /shared/projects/project1
sudo chmod 2770 /shared/projects/project2
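The setgid bit (the leading 2 in 2770) makes new files inherit the project group so collaborators can share data; verify with:
ls -ld /shared/projects/project1
# Expect drwxrws--- with group biology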
Data Management
- Scratch Space Management:
# Create dedicated scratch space
sudo mkdir -p /scratch
sudo chmod 1777 /scratch
# Add cleanup script
cat > /shared/admin/clean_scratch.sh << 'EOF'
#!/bin/bash
# Remove files not accessed in 7 days (switch to -mtime if /scratch is mounted noatime)
find /scratch -type f -atime +7 -delete
# Remove empty directories left behind, but never /scratch itself
find /scratch -mindepth 1 -type d -empty -delete
EOF
# Make it executable and set up a cron job
chmod +x /shared/admin/clean_scratch.sh
sudo crontab -e
Add:
0 2 * * * /shared/admin/clean_scratch.sh
- Backup Strategy:
# Install rsync
sudo apt install -y rsync
# Create backup script
cat > /shared/admin/backup_home.sh << 'EOF'
#!/bin/bash
DATE=$(date +%Y%m%d)
BACKUP_DIR="/shared/backups"
SOURCE_DIR="/home"
# Create backup directory
mkdir -p "$BACKUP_DIR"
# Incremental backup: unchanged files are hard-linked against the previous snapshot
rsync -a --link-dest="$BACKUP_DIR/latest" "$SOURCE_DIR" "$BACKUP_DIR/$DATE"
# Atomically repoint "latest" at the new snapshot
ln -sfn "$BACKUP_DIR/$DATE" "$BACKUP_DIR/latest"
EOF
# Make it executable and schedule daily backups
chmod +x /shared/admin/backup_home.sh
sudo crontab -e
Add:
0 1 * * * /shared/admin/backup_home.sh
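Because each snapshot is a complete directory tree (hard-linked where files are unchanged), restoring is a plain copy; for example, from a hypothetical snapshot date:
cp -a /shared/backups/20250316/home/researcher1/lost_file /home/researcher1/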
- Data Transfer Tools:
# Install the Globus CLI (for transfers to and from external sites)
pip install globus-cli
# Configure a data transfer node
mkdir -p /shared/data_transfer
# Create example transfer script
cat > /shared/admin/transfer_example.sh << 'EOF'
#!/bin/bash
SOURCE_DIR="/shared/projects/project1"
DEST_DIR="/shared/data_transfer/outgoing"
mkdir -p "$DEST_DIR"
# Create archive
TAR_FILE="project1_$(date +%Y%m%d).tar.gz"
tar -czf "$DEST_DIR/$TAR_FILE" -C "$SOURCE_DIR" .
# Example rsync to external system
# rsync -avz $DEST_DIR/$TAR_FILE user@remote:/path/to/destination/
EOF
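With the Globus CLI, a transfer between two registered endpoints looks roughly like this (the endpoint UUIDs are placeholders you would look up for your own endpoints):
globus login
globus transfer SRC_ENDPOINT_UUID:/outgoing/file.tar.gz DST_ENDPOINT_UUID:/incoming/file.tar.gz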
Storage Monitoring
- Quota Reporting:
# Create quota report script (run it as root; repquota requires root)
cat > /shared/admin/quota_report.sh << 'EOF'
#!/bin/bash
REPORT_FILE="/shared/admin/reports/quota_$(date +%Y%m%d).txt"
mkdir -p $(dirname $REPORT_FILE)
echo "Storage Quota Report - $(date)" > $REPORT_FILE
echo "=============================" >> $REPORT_FILE
echo -e "\nUser Quotas:" >> $REPORT_FILE
repquota -a >> $REPORT_FILE
echo -e "\nDirectory Sizes:" >> $REPORT_FILE
du -sh /home/* 2>/dev/null >> $REPORT_FILE
du -sh /shared/projects/* 2>/dev/null >> $REPORT_FILE
echo "Report generated at $REPORT_FILE"
EOF
chmod +x /shared/admin/quota_report.sh
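Schedule it like the other maintenance scripts if you want regular reports:
sudo crontab -e
# Weekly report, Mondays at 06:00
0 6 * * 1 /shared/admin/quota_report.sh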
- Storage Health Monitoring:
# Install monitoring tools
sudo apt install -y smartmontools
# Create storage health check script
cat > /shared/admin/check_storage.sh << 'EOF'
#!/bin/bash
LOG_FILE="/shared/admin/logs/storage_health.log"
mkdir -p $(dirname $LOG_FILE)
echo "Storage Health Check - $(date)" >> $LOG_FILE
# Check disk space
df -h >> $LOG_FILE
# Check SMART health for USB-attached disks (SD cards do not expose SMART)
for disk in /dev/sd?; do
if [ -e "$disk" ]; then
echo -e "\nSMART status for $disk:" >> $LOG_FILE
smartctl -H $disk >> $LOG_FILE 2>&1
fi
done
EOF
chmod +x /shared/admin/check_storage.sh
# Schedule daily checks
sudo crontab -e
Add:
0 3 * * * /shared/admin/check_storage.sh
Next Steps
In the next tutorial, we’ll cover:
- Monitoring and troubleshooting
- Performance tuning
- Documentation best practices
These skills will further enhance your ability to maintain a reliable and efficient HPC environment.