HPC Administration Tutorial - Part 2

Created: 2025-03-16 17:02:17 | Last updated: 2025-03-16 17:02:17 | Status: Public

Account Management, Resource Scheduling, and Storage Solutions

This second part of our HPC administration tutorial covers crucial aspects of managing users, resources, and data in your Raspberry Pi HPC cluster. These skills directly translate to enterprise HPC environments.

Table of Contents

- Account Management and Security
- Resource Management and Scheduling
- Storage Solutions
- Next Steps

Account Management and Security

User Account Management

Managing user accounts is a core responsibility for HPC administrators:

  1. Creating User Accounts:
   # Create a new user on the head node
   sudo adduser researcher1

   # Add to relevant groups
   sudo usermod -aG users researcher1
  2. Batch User Creation:
    For managing multiple accounts, create a script:
   nano /shared/admin/add_users.sh

Script content:

   #!/bin/bash
   # Script to add multiple users to the HPC cluster

   USERLIST=$1

   if [ ! -f "$USERLIST" ]; then
       echo "Usage: $0 userlist.txt"
       exit 1
   fi

   while read username fullname; do
       echo "Creating user: $username ($fullname)"
       sudo adduser --gecos "$fullname" --disabled-password "$username"
       echo "$username:TemporaryPass123" | sudo chpasswd
       sudo usermod -aG users "$username"
       sudo mkdir -p /home/$username
       sudo chown $username:$username /home/$username
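       # Optional: force a password change at first login, since the temporary
       # password above is the same for every new account
       sudo chage -d 0 "$username"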
   done < "$USERLIST"

   echo "User creation complete."

Make it executable:

   chmod +x /shared/admin/add_users.sh

Create a user list (saved, for example, as /shared/admin/userlist.txt); the first field is the username and the rest of the line is the full name:

   researcher1 Jane Researcher
   researcher2 John Scientist
  3. User Quota Management:
   # Install quota tools
   sudo apt install -y quota quotatool

   # Enable quotas on filesystem
   sudo nano /etc/fstab

Update the home mount with quota options:

   /dev/sda1 /home ext4 defaults,usrquota,grpquota 0 1
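
After updating fstab, remount /home and initialize the quota files (this assumes /home is a separate ext4 partition, as in the line above):

   sudo mount -o remount /home
   sudo quotacheck -cugm /home
   sudo quotaon -v /home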

Set quotas:

   # Soft limit ~5 GB, hard limit ~6 GB (block limits are given in 1 KiB units)
   sudo setquota -u researcher1 5242880 6291456 0 0 /home

Security Implementation

  1. SSH Hardening:
   sudo nano /etc/ssh/sshd_config

Add/modify these lines (make sure key-based logins already work for your admin account before disabling password authentication):

   PermitRootLogin no
   PasswordAuthentication no
   AllowGroups users sudo
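
Before restarting, check the configuration for syntax errors (sshd exits non-zero if the file is invalid):

   sudo sshd -t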

Restart SSH:

   sudo systemctl restart sshd
  2. Firewall Configuration:
   # Install and enable firewall
   sudo apt install -y ufw

   # Configure rules
   sudo ufw default deny incoming
   sudo ufw default allow outgoing
   sudo ufw allow from 192.168.1.0/24 to any port 22
   # Slurm ports (slurmctld/slurmd/slurmdbd); ufw requires a protocol with port ranges
   sudo ufw allow from 192.168.1.0/24 to any port 6817:6819 proto tcp

   # Enable firewall
   sudo ufw enable
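
Check the active rule set:

   sudo ufw status verbose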
  3. Accounting and Auditing:
   sudo apt install -y auditd
   sudo systemctl enable auditd
   sudo systemctl start auditd

   # Configure audit rules
   sudo nano /etc/audit/rules.d/hpc.rules

Add rules:

   # Monitor command execution by root (arch=b64 matches 64-bit Raspberry Pi OS)
   -a always,exit -F arch=b64 -F euid=0 -S execve -k rootcmd

   # Log all executions of the sudo binary
   -a always,exit -F path=/usr/bin/sudo -F perm=x -k sudo_log

Load the new rules:

   sudo augenrules --load
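
Sudo invocations can then be searched by the key defined above:

   sudo ausearch -k sudo_log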
  4. Regular Updates:
    Create an update script:
   nano /shared/admin/update_system.sh

Script content:

   #!/bin/bash
   # System update script

   LOG="/var/log/system_updates.log"

   echo "$(date): Beginning system update" >> $LOG
   apt-get update >> $LOG 2>&1
   DEBIAN_FRONTEND=noninteractive apt-get -y upgrade >> $LOG 2>&1
   echo "$(date): Update complete" >> $LOG

Set up a cron job:

   sudo crontab -e

Add weekly updates:

   0 2 * * 0 /shared/admin/update_system.sh

Resource Management and Scheduling

Slurm Configuration and Policy Management

  1. Advanced Slurm Configuration:
   sudo nano /etc/slurm/slurm.conf

Enhance with QoS and account configurations:

   # QoS settings
   PriorityType=priority/multifactor
   PriorityWeightQOS=10000

   # Job priorities
   PriorityWeightAge=1000
   PriorityWeightFairshare=5000
   PriorityWeightJobSize=1000
   PriorityWeightPartition=1000

   # Accounting
   AccountingStorageType=accounting_storage/slurmdbd
   AccountingStorageHost=localhost
   AccountingStoragePort=6819
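
These accounting settings assume slurmdbd is already installed and running on the head node. After editing slurm.conf, make sure every node has the same copy, then restart the Slurm daemons:

   # On the head node
   sudo systemctl restart slurmctld

   # On each compute node
   sudo systemctl restart slurmd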
  2. Implementing Fair Share:
   # First, create accounts
   sacctmgr add account biology
   sacctmgr add account chemistry

   # Associate users with accounts
   sacctmgr add user researcher1 account=biology
   sacctmgr add user researcher2 account=chemistry

   # Set fairshare
   sacctmgr modify account biology set fairshare=10
   sacctmgr modify account chemistry set fairshare=20
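
Verify the resulting associations:

   sacctmgr show assoc format=Account,User,Fairshare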
  3. Quality of Service (QoS) Setup:
   # Create different QoS levels
   sacctmgr add qos high priority=1000
   sacctmgr add qos normal priority=100
   sacctmgr add qos low priority=10

   # Assign QoS to accounts
   sacctmgr modify account biology set qos=normal,high
   sacctmgr modify account chemistry set qos=normal
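
List the QoS definitions to confirm the priorities:

   sacctmgr show qos format=Name,Priority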
  4. Creating Job Submission Templates:
   mkdir -p /shared/templates

CPU job template:

   cat > /shared/templates/cpu_job.sh << 'EOF'
   #!/bin/bash
   #SBATCH --job-name=cpu_job
   #SBATCH --output=cpu_job_%j.out
   #SBATCH --error=cpu_job_%j.err
   #SBATCH --ntasks=4
   #SBATCH --time=01:00:00
   #SBATCH --qos=normal

   # Add your commands below
   echo "Running on $(hostname)"
   sleep 60

   EOF
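
Users can copy a template into their own directory, adjust it, and submit it:

   cp /shared/templates/cpu_job.sh ~/my_job.sh
   sbatch ~/my_job.sh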

Practical Job Management

  1. Job Submission and Monitoring:
   # Submit a job
   sbatch job_script.sh

   # Check job status
   squeue

   # View detailed job info
   scontrol show job JOB_ID

   # Cancel a job
   scancel JOB_ID
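
Finished jobs no longer appear in squeue; with the slurmdbd accounting configured above they can still be inspected:

   sacct -j JOB_ID --format=JobID,JobName,State,Elapsed,MaxRSS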
  2. Node Management:
   # View node status
   sinfo -N

   # Detailed node information
   scontrol show node pi-compute-01

   # Drain a node (prevent new jobs)
   scontrol update nodename=pi-compute-01 state=drain reason="maintenance"

   # Resume a node
   scontrol update nodename=pi-compute-01 state=resume
  3. Reservation Management:
   # Create a reservation
   scontrol create reservation name=maintenance start=2023-12-01T08:00:00 end=2023-12-01T12:00:00 nodes=pi-compute-[01-02]

   # View reservations
   scontrol show res

   # Delete a reservation
   scontrol delete reservationname=maintenance
  4. Interactive Jobs:
   # Request an interactive session
   srun --pty bash -i

   # Interactive session with specific resources
   srun --nodes=1 --ntasks-per-node=2 --time=01:00:00 --pty bash -i

Resource Limits and Policies

  1. Memory Limits:
   # Update slurm.conf to enable memory enforcement
   sudo nano /etc/slurm/slurm.conf

Add/modify (CR_CPU_Memory only takes effect with a consumable-resource select plugin):

   SelectType=select/cons_tres
   SelectTypeParameters=CR_CPU_Memory

Submit a memory-constrained job:

   sbatch --mem=2G memory_job.sh
  2. Time Limits:
   # DefaultTime and MaxTime are per-partition parameters in slurm.conf
   PartitionName=debug Nodes=pi-compute-[01-02] Default=NO DefaultTime=00:10:00 MaxTime=01:00:00 State=UP
   PartitionName=main Nodes=pi-compute-[01-02] Default=YES DefaultTime=01:00:00 MaxTime=24:00:00 State=UP
  3. Resource Protection:
   # Limit concurrent jobs per user
   sacctmgr modify qos normal set MaxJobsPerUser=5

   # Limit total CPU count per user
   sacctmgr modify qos normal set MaxTRESPerUser=cpu=8
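
These limits are only enforced when accounting enforcement is enabled in slurm.conf; the sacctmgr query confirms what was set:

   # Required in slurm.conf for QoS and association limits to take effect
   AccountingStorageEnforce=associations,limits,qos

   # Confirm the configured limits
   sacctmgr show qos format=Name,MaxJobsPU,MaxTRESPU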

Storage Solutions

NFS Advanced Configuration

  1. Performance Tuning:
   sudo nano /etc/exports

Optimize the export (async improves throughput but risks data loss on an unclean shutdown; it replaces sync rather than being combined with it):

   /shared 192.168.1.0/24(rw,async,no_subtree_check,no_root_squash)
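
Re-export the filesystems so the new options take effect:

   sudo exportfs -ra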

On clients:

   sudo nano /etc/fstab

Optimize mounts:

   pi-head:/shared /shared nfs rsize=1048576,wsize=1048576,noatime,nodiratime 0 0
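
Remount on each client to apply the new options (NFS options such as rsize/wsize only change on a fresh mount):

   sudo umount /shared && sudo mount /shared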
  2. Access Control:
   # Create project directories
   sudo mkdir -p /shared/projects/{project1,project2}
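
   # The Slurm accounts created earlier are not Unix groups; create matching
   # groups and add the users before using them in chown below
   sudo groupadd -f biology
   sudo groupadd -f chemistry
   sudo usermod -aG biology researcher1
   sudo usermod -aG chemistry researcher2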

   # Set appropriate permissions
   sudo chown :biology /shared/projects/project1
   sudo chown :chemistry /shared/projects/project2

   sudo chmod 2770 /shared/projects/project1
   sudo chmod 2770 /shared/projects/project2

Data Management

  1. Scratch Space Management:
   # Create dedicated scratch space
   sudo mkdir -p /scratch
   sudo chmod 1777 /scratch

   # Add cleanup script
   cat > /shared/admin/clean_scratch.sh << 'EOF'
   #!/bin/bash

   # Find and remove files older than 7 days
   find /scratch -type f -atime +7 -delete

   # Clean empty directories
   find /scratch -type d -empty -delete
   EOF
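
   # Make the cleanup script executable so cron can run it
   chmod +x /shared/admin/clean_scratch.sh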

   # Set up cron job
   sudo crontab -e

Add:

   0 2 * * * /shared/admin/clean_scratch.sh
  2. Backup Strategy:
   # Install rsync
   sudo apt install -y rsync

   # Create backup script
   cat > /shared/admin/backup_home.sh << 'EOF'
   #!/bin/bash

   DATE=$(date +%Y%m%d)
   BACKUP_DIR="/shared/backups"
   SOURCE_DIR="/home"

   # Create backup directory
   mkdir -p $BACKUP_DIR

   # Run incremental backup
   rsync -avz --link-dest=$BACKUP_DIR/latest $SOURCE_DIR $BACKUP_DIR/$DATE

   # Update latest link
   rm -f $BACKUP_DIR/latest
   ln -s $BACKUP_DIR/$DATE $BACKUP_DIR/latest
   EOF
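
   # Make the backup script executable so cron can run it
   chmod +x /shared/admin/backup_home.sh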

   # Schedule daily backups
   sudo crontab -e

Add:

   0 1 * * * /shared/admin/backup_home.sh
  3. Data Transfer Tools:
   # Install the Globus CLI (pipx keeps it isolated from the system Python packages)
   sudo apt install -y pipx
   pipx install globus-cli

   # Configure a data transfer node
   mkdir -p /shared/data_transfer

   # Create example transfer script
   cat > /shared/admin/transfer_example.sh << 'EOF'
   #!/bin/bash

   SOURCE_DIR="/shared/projects/project1"
   DEST_DIR="/shared/data_transfer/outgoing"

   # Make sure the outgoing directory exists
   mkdir -p "$DEST_DIR"

   # Create archive
   TAR_FILE="project1_$(date +%Y%m%d).tar.gz"
   tar -czf "$DEST_DIR/$TAR_FILE" -C "$SOURCE_DIR" .

   # Example rsync to external system
   # rsync -avz $DEST_DIR/$TAR_FILE user@remote:/path/to/destination/
   EOF

Storage Monitoring

  1. Quota Reporting:
   # Create quota report script
   cat > /shared/admin/quota_report.sh << 'EOF'
   #!/bin/bash

   REPORT_FILE="/shared/admin/reports/quota_$(date +%Y%m%d).txt"
   mkdir -p $(dirname $REPORT_FILE)

   echo "Storage Quota Report - $(date)" > $REPORT_FILE
   echo "=============================" >> $REPORT_FILE

   echo -e "\nUser Quotas:" >> $REPORT_FILE
   repquota -a >> $REPORT_FILE

   echo -e "\nDirectory Sizes:" >> $REPORT_FILE
   du -sh /home/* 2>/dev/null >> $REPORT_FILE
   du -sh /shared/projects/* 2>/dev/null >> $REPORT_FILE

   echo "Report generated at $REPORT_FILE"
   EOF

   chmod +x /shared/admin/quota_report.sh
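
   # Run the report with root privileges so repquota can read the quota files
   sudo /shared/admin/quota_report.sh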
  2. Storage Health Monitoring:
   # Install monitoring tools
   sudo apt install -y smartmontools

   # Create storage health check script
   cat > /shared/admin/check_storage.sh << 'EOF'
   #!/bin/bash

   LOG_FILE="/shared/admin/logs/storage_health.log"
   mkdir -p $(dirname $LOG_FILE)

   echo "Storage Health Check - $(date)" >> $LOG_FILE

   # Check disk space
   df -h >> $LOG_FILE

   # Check disk health if physical disks are present
   for disk in /dev/sd?; do
     if [ -e "$disk" ]; then
       echo -e "\nSMART status for $disk:" >> $LOG_FILE
       smartctl -H $disk >> $LOG_FILE 2>&1
     fi
   done
   EOF

   chmod +x /shared/admin/check_storage.sh

   # Schedule daily checks
   sudo crontab -e

Add:

   0 3 * * * /shared/admin/check_storage.sh

Next Steps

In the next tutorial, we’ll cover:
- Monitoring and troubleshooting
- Performance tuning
- Documentation best practices

These skills will further enhance your ability to maintain a reliable and efficient HPC environment.