HPC Administration Tutorial - Part 5
Created: 2025-03-16 17:03:26 | Last updated: 2025-03-16 17:03:26 | Status: Public
Professional Skill Development
This final part of our HPC administration tutorial focuses on developing professional skills that are essential for success in HPC system administration, particularly in research environments.
Table of Contents
- Professional Skill Development
- Technical Skill Development
- Research Environment Competencies
- Communication Skills
- Career Growth
Professional Skill Development
Technical Skill Development
- Learning Paths for HPC Administration:
Foundational Skills:
- Linux system administration (Red Hat/CentOS/Rocky Linux preferred)
- Networking fundamentals (TCP/IP, routing, firewalls)
- Storage systems (NFS, parallel file systems, RAID)
- Scripting (Bash, Python)
- Configuration management (Ansible, Puppet)
HPC-Specific Skills:
- Resource managers and job schedulers (Slurm, PBS Pro, LSF)
- High-performance networking (InfiniBand, RoCE)
- Parallel programming models (MPI, OpenMP)
- Container technologies for HPC (Singularity/Apptainer, Charliecloud)
- Performance monitoring and analysis tools
Advanced Skills:
- GPU computing and administration
- Cloud integration for hybrid HPC
- Automation and orchestration
- Security in HPC environments
- Machine learning operations (MLOps)
- Hands-on Projects for Skill Building:
Project 1: Performance Benchmarking
# Create benchmark script
nano /shared/admin/projects/benchmark.sh
Script content:
#!/bin/bash
# HPC Cluster Benchmarking Script

RESULTS_DIR="/shared/admin/projects/benchmark_results/$(date +%Y%m%d)"
mkdir -p "$RESULTS_DIR"

# CPU Performance - STREAM
cd /tmp
git clone https://github.com/jeffhammond/STREAM.git
cd STREAM
gcc -O3 -fopenmp stream.c -o stream
./stream > "$RESULTS_DIR/stream_results.txt"

# MPI Performance - OSU Micro-Benchmarks
cd /tmp
wget http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-5.9.tar.gz
tar xzf osu-micro-benchmarks-5.9.tar.gz
cd osu-micro-benchmarks-5.9
./configure CC=mpicc CXX=mpicxx
make
cd mpi/pt2pt
mpirun -n 2 ./osu_latency > "$RESULTS_DIR/mpi_latency.txt"
mpirun -n 2 ./osu_bw > "$RESULTS_DIR/mpi_bandwidth.txt"

# I/O Performance - IOR
cd /tmp
git clone https://github.com/hpc/ior.git
cd ior
./bootstrap
./configure
make
mpirun -n 4 ./src/ior -a POSIX -b 1g -t 4m -i 5 -v -C > "$RESULTS_DIR/ior_results.txt"

echo "Benchmarking complete. Results saved to $RESULTS_DIR"
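Raw benchmark output is easier to track over time if you reduce it to a few numbers. The sketch below pulls the best-rate figures out of a STREAM results file; it assumes STREAM's standard output layout (a `Function    Best Rate MB/s ...` header followed by `Copy:`, `Scale:`, `Add:`, `Triad:` rows), which can vary slightly between versions.

```python
import re

def parse_stream(text):
    """Extract the best-rate MB/s for each STREAM kernel from stream output."""
    rates = {}
    for line in text.splitlines():
        # Lines look like: "Triad:           9004.8     0.026751 ..."
        m = re.match(r'^(Copy|Scale|Add|Triad):\s+([\d.]+)', line.strip())
        if m:
            rates[m.group(1)] = float(m.group(2))
    return rates

# Abbreviated sample of STREAM output for illustration
sample = """Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           11341.8     0.014179     0.014107     0.014258
Triad:           9004.8     0.026751     0.026652     0.026899
"""
print(parse_stream(sample))
```

Feeding each day's `stream_results.txt` through a parser like this makes regressions after firmware or kernel updates easy to spot.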
Project 2: Auto-scaling Compute Resources
nano /shared/admin/projects/autoscale.py
Script content:
#!/usr/bin/env python3
"""
HPC Cluster Auto-scaling Script

This script monitors the Slurm queue and dynamically adjusts
the availability of compute nodes based on demand.
"""
import subprocess
import time
import logging

# Configure logging
logging.basicConfig(
    filename='/shared/admin/logs/autoscale.log',
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)

# Configuration
IDLE_THRESHOLD = 30   # minutes before powering down idle nodes
CHECK_INTERVAL = 5    # minutes between checks
MIN_NODES = 1         # minimum nodes to keep running

# Slurm does not expose an idle-duration field, so track when this
# script first observed each node idle and approximate from that.
idle_since = {}


def get_queue_status():
    """Return the number of pending jobs in the Slurm queue"""
    try:
        result = subprocess.run(['squeue', '--noheader', '--states=PENDING', '--format=%i'],
                                stdout=subprocess.PIPE, text=True)
        jobs = [j for j in result.stdout.strip().split('\n') if j]
        return len(jobs)
    except Exception as e:
        logging.error(f"Error getting queue status: {e}")
        return 0


def get_node_status():
    """Get current node status"""
    try:
        result = subprocess.run(['sinfo', '--noheader', '--format=%n %t'],
                                stdout=subprocess.PIPE, text=True)
        nodes = [n for n in result.stdout.strip().split('\n') if n]
        idle_nodes = [n.split()[0] for n in nodes if 'idle' in n]
        active_nodes = [n.split()[0] for n in nodes if 'alloc' in n or 'mix' in n]
        down_nodes = [n.split()[0] for n in nodes if 'down' in n or 'drain' in n]
        return {
            'idle': idle_nodes,
            'active': active_nodes,
            'down': down_nodes
        }
    except Exception as e:
        logging.error(f"Error getting node status: {e}")
        return {'idle': [], 'active': [], 'down': []}


def node_idle_time(node):
    """Approximate how long (in minutes) a node has been idle.

    This is measured from when this script first saw the node idle,
    so it resets if the script restarts.
    """
    now = time.time()
    if node not in idle_since:
        idle_since[node] = now
    return (now - idle_since[node]) / 60


def suspend_node(node):
    """Suspend an idle node"""
    try:
        logging.info(f"Suspending node {node}")
        subprocess.run(['scontrol', 'update', f'NodeName={node}', 'State=DRAIN',
                        'Reason=Auto-suspended due to inactivity'])
        # In a real environment, you might use IPMI or similar to power down
        # subprocess.run(['ipmitool', '-H', f'{node}-ipmi', '-U', 'admin', '-P', 'password',
        #                 'power', 'off'])
        return True
    except Exception as e:
        logging.error(f"Error suspending node {node}: {e}")
        return False


def resume_node(node):
    """Resume a suspended node"""
    try:
        logging.info(f"Resuming node {node}")
        # In a real environment, you might use IPMI or similar to power up
        # subprocess.run(['ipmitool', '-H', f'{node}-ipmi', '-U', 'admin', '-P', 'password',
        #                 'power', 'on'])
        # Give the node time to boot
        time.sleep(60)
        # Update Slurm
        subprocess.run(['scontrol', 'update', f'NodeName={node}', 'State=RESUME'])
        return True
    except Exception as e:
        logging.error(f"Error resuming node {node}: {e}")
        return False


def main():
    """Main loop"""
    logging.info("Starting HPC auto-scaling service")
    while True:
        job_count = get_queue_status()
        node_status = get_node_status()
        logging.info(f"Current status: {job_count} pending jobs, "
                     f"{len(node_status['active'])} active nodes, "
                     f"{len(node_status['idle'])} idle nodes, "
                     f"{len(node_status['down'])} down nodes")
        # Forget idle timestamps for nodes that are no longer idle
        for node in list(idle_since):
            if node not in node_status['idle']:
                del idle_since[node]
        # Resume a suspended node if jobs are waiting
        if job_count > 0:
            for node in node_status['down']:
                if 'Auto-suspended' in subprocess.run(
                        ['scontrol', 'show', 'node', node],
                        stdout=subprocess.PIPE, text=True).stdout:
                    resume_node(node)
                    break
        # Suspend long-idle nodes, but keep at least MIN_NODES available
        available = len(node_status['active']) + len(node_status['idle'])
        for node in node_status['idle']:
            if available <= MIN_NODES:
                break
            if node_idle_time(node) > IDLE_THRESHOLD:
                if suspend_node(node):
                    available -= 1
        # Sleep until next check
        time.sleep(CHECK_INTERVAL * 60)


if __name__ == "__main__":
    main()
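The resume/suspend policy in the main loop is easier to reason about (and unit-test) when separated from the Slurm calls. Below is a minimal sketch of that decision logic as a pure function; the function name, signature, and the one-node-per-pass ramp-up are illustrative choices, not part of the script above.

```python
def plan_scaling(pending_jobs, idle_minutes, active, min_nodes=1, idle_threshold=30):
    """Return (resume_count, nodes_to_suspend) for one scheduling pass.

    pending_jobs: number of jobs waiting in the queue
    idle_minutes: dict mapping idle node name -> minutes idle
    active: number of currently allocated nodes
    """
    # Resume one node per pass while work is waiting (conservative ramp-up)
    resume = 1 if pending_jobs > 0 else 0
    # Suspend the longest-idle nodes first, never dropping below min_nodes,
    # and never while jobs are still pending
    available = active + len(idle_minutes)
    suspend = []
    for node, minutes in sorted(idle_minutes.items(), key=lambda kv: -kv[1]):
        if available - len(suspend) <= min_nodes:
            break
        if minutes > idle_threshold and resume == 0:
            suspend.append(node)
    return resume, suspend

print(plan_scaling(0, {'node01': 45, 'node02': 10}, active=2))
```

Keeping the policy pure means it can be exercised against recorded cluster states before it ever touches `scontrol`.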
Project 3: User Support Ticket System
nano /shared/admin/projects/ticket_system.py
Script content:
#!/usr/bin/env python3
"""
Simple Ticket System for HPC User Support

This script provides a basic command-line ticket system
for tracking and managing user support requests.
"""
import os
import json
import datetime
import argparse
import uuid

# Configuration
TICKET_DB = "/shared/admin/tickets/tickets.json"
os.makedirs(os.path.dirname(TICKET_DB), exist_ok=True)

# Initialize empty database if it doesn't exist
if not os.path.exists(TICKET_DB):
    with open(TICKET_DB, 'w') as f:
        json.dump([], f)


def load_tickets():
    """Load tickets from database"""
    with open(TICKET_DB, 'r') as f:
        return json.load(f)


def save_tickets(tickets):
    """Save tickets to database"""
    with open(TICKET_DB, 'w') as f:
        json.dump(tickets, f, indent=2)


def create_ticket(args):
    """Create a new ticket"""
    tickets = load_tickets()
    # Generate a short ticket ID
    ticket_id = str(uuid.uuid4())[:8]
    # Create ticket
    now = datetime.datetime.now().isoformat()
    ticket = {
        'id': ticket_id,
        'user': args.user,
        'subject': args.subject,
        'description': args.description,
        'status': 'open',
        'priority': args.priority,
        'created': now,
        'updated': now,
        'assigned_to': '',
        'comments': []
    }
    tickets.append(ticket)
    save_tickets(tickets)
    print(f"Ticket {ticket_id} created successfully")


def list_tickets(args):
    """List tickets"""
    tickets = load_tickets()
    # Filter tickets
    if args.status:
        tickets = [t for t in tickets if t['status'] == args.status]
    if args.user:
        tickets = [t for t in tickets if t['user'] == args.user]
    if args.assigned:
        tickets = [t for t in tickets if t['assigned_to'] == args.assigned]
    # Sort tickets
    if args.sort == 'priority':
        priority_order = {'low': 0, 'medium': 1, 'high': 2, 'critical': 3}
        tickets.sort(key=lambda t: priority_order.get(t['priority'], 0), reverse=True)
    elif args.sort == 'date':
        tickets.sort(key=lambda t: t['created'])
    # Display tickets
    if not tickets:
        print("No tickets found")
        return
    print(f"Found {len(tickets)} ticket(s):")
    for t in tickets:
        print(f"[{t['id']}] [{t['status']}] [{t['priority']}] {t['subject']} - {t['user']}")


def view_ticket(args):
    """View details of a ticket"""
    tickets = load_tickets()
    # Find ticket
    ticket = next((t for t in tickets if t['id'] == args.id), None)
    if not ticket:
        print(f"Ticket {args.id} not found")
        return
    # Display ticket details
    print(f"Ticket: {ticket['id']}")
    print(f"Subject: {ticket['subject']}")
    print(f"User: {ticket['user']}")
    print(f"Status: {ticket['status']}")
    print(f"Priority: {ticket['priority']}")
    print(f"Created: {ticket['created']}")
    print(f"Updated: {ticket['updated']}")
    print(f"Assigned to: {ticket['assigned_to'] or 'Unassigned'}")
    print(f"\nDescription:\n{ticket['description']}")
    if ticket['comments']:
        print("\nComments:")
        for comment in ticket['comments']:
            print(f"[{comment['date']}] {comment['author']}:")
            print(f"  {comment['text']}")


def update_ticket(args):
    """Update a ticket"""
    tickets = load_tickets()
    # Find ticket
    ticket_idx = next((i for i, t in enumerate(tickets) if t['id'] == args.id), None)
    if ticket_idx is None:
        print(f"Ticket {args.id} not found")
        return
    # Update fields
    if args.status:
        tickets[ticket_idx]['status'] = args.status
    if args.priority:
        tickets[ticket_idx]['priority'] = args.priority
    if args.assign:
        tickets[ticket_idx]['assigned_to'] = args.assign
    if args.comment:
        comment = {
            'author': os.environ.get('USER', 'admin'),
            'date': datetime.datetime.now().isoformat(),
            'text': args.comment
        }
        tickets[ticket_idx]['comments'].append(comment)
    tickets[ticket_idx]['updated'] = datetime.datetime.now().isoformat()
    save_tickets(tickets)
    print(f"Ticket {args.id} updated successfully")


def main():
    """Main function"""
    parser = argparse.ArgumentParser(description='HPC Support Ticket System')
    subparsers = parser.add_subparsers(dest='command', help='Commands')
    # Create command
    create_parser = subparsers.add_parser('create', help='Create a new ticket')
    create_parser.add_argument('--user', required=True, help='User submitting the ticket')
    create_parser.add_argument('--subject', required=True, help='Ticket subject')
    create_parser.add_argument('--description', required=True, help='Ticket description')
    create_parser.add_argument('--priority', default='medium',
                               choices=['low', 'medium', 'high', 'critical'],
                               help='Ticket priority')
    # List command
    list_parser = subparsers.add_parser('list', help='List tickets')
    list_parser.add_argument('--status', help='Filter by status')
    list_parser.add_argument('--user', help='Filter by user')
    list_parser.add_argument('--assigned', help='Filter by assigned user')
    list_parser.add_argument('--sort', default='date', choices=['date', 'priority'],
                             help='Sort tickets')
    # View command
    view_parser = subparsers.add_parser('view', help='View ticket details')
    view_parser.add_argument('id', help='Ticket ID')
    # Update command
    update_parser = subparsers.add_parser('update', help='Update a ticket')
    update_parser.add_argument('id', help='Ticket ID')
    update_parser.add_argument('--status', choices=['open', 'in_progress', 'resolved', 'closed'],
                               help='Update status')
    update_parser.add_argument('--priority', choices=['low', 'medium', 'high', 'critical'],
                               help='Update priority')
    update_parser.add_argument('--assign', help='Assign to user')
    update_parser.add_argument('--comment', help='Add a comment')
    args = parser.parse_args()
    if args.command == 'create':
        create_ticket(args)
    elif args.command == 'list':
        list_tickets(args)
    elif args.command == 'view':
        view_ticket(args)
    elif args.command == 'update':
        update_ticket(args)
    else:
        parser.print_help()


if __name__ == "__main__":
    main()
Make scripts executable:
chmod +x /shared/admin/projects/benchmark.sh
chmod +x /shared/admin/projects/autoscale.py
chmod +x /shared/admin/projects/ticket_system.py
Research Environment Competencies
- Understanding Research Computing Needs:
Key Research Computing Characteristics:
- Domain-specific workloads: Different research domains have distinct computational patterns
- Varying time scales: Jobs ranging from minutes to weeks
- Data intensity: Managing and processing large datasets
- Iterative workflows: Researchers often refine models and analyses
- Specialized software: Custom codes and specialized applications
Common Research Domains and Their Requirements:
| Domain | Typical Workload Characteristics | Common Software | Storage Needs |
|---|---|---|---|
| Computational Fluid Dynamics | MPI-heavy, long-running | OpenFOAM, ANSYS Fluent | Medium datasets, high I/O |
| Genomics | High throughput, embarrassingly parallel | BWA, BLAST, Bowtie | Very large datasets, sequential I/O |
| Machine Learning | GPU-accelerated, memory-intensive | TensorFlow, PyTorch | Medium to large datasets |
| Molecular Dynamics | Highly parallel, GPU-accelerated | GROMACS, NAMD, LAMMPS | Small input, large output |
| Climate Modeling | Long-running, MPI-intensive | WRF, CESM | Extremely large datasets |
Effective Research Support Practices:
- Regular consultations with researchers
- Attendance at departmental seminars
- User surveys and feedback collection
- Collaboration on research publications (methods sections)
- Knowledge transfer through workshops and training
- Balancing Resources and Priorities:
Resource Allocation Policies:
nano /shared/admin/policies/resource_allocation.md
Content:
# HPC Resource Allocation Policy

## Allocation Types

1. **Standard Allocation**
   - Available to all authorized users
   - Default CPU time: 1,000 core-hours per month
   - Default storage: 100GB home, 1TB project space
   - Job limits: 4 concurrent jobs, max 24-hour runtime
2. **Priority Allocation**
   - Available to funded projects
   - CPU time based on project requirements
   - Additional storage and longer job limits
   - Higher job priority (higher QoS)
3. **Urgent Allocation**
   - For time-sensitive needs (e.g., publication deadlines)
   - Limited duration (typically 1-2 weeks)
   - Requires justification and approval

## Allocation Request Process

1. Submit request form detailing:
   - Project description and research goals
   - Computational requirements
   - Software dependencies
   - Storage needs
   - Timeline
2. Review by HPC administrators
3. Allocation activation

## Usage Monitoring

- Monthly usage reports sent to users
- Quarterly reviews of allocation utilization
- Adjustments based on usage patterns

## Renewal Process

- Standard allocations renew automatically
- Priority allocations require annual renewal
- Renewal includes:
  - Summary of research outcomes
  - Updated resource requirements
  - Publication acknowledgments
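Monthly usage reports like those mentioned in the policy are usually assembled from `sacct` accounting data. The sketch below converts Slurm's elapsed-time format (`[DD-]HH:MM:SS`, as reported in the standard `Elapsed` field) into core-hours; the function names and the `(elapsed, alloc_cpus)` input shape are illustrative assumptions, not a Slurm API.

```python
def elapsed_to_hours(elapsed):
    """Convert a Slurm elapsed string ('[DD-]HH:MM:SS') to hours."""
    days = 0
    if '-' in elapsed:
        d, elapsed = elapsed.split('-')
        days = int(d)
    h, m, s = (int(x) for x in elapsed.split(':'))
    return days * 24 + h + m / 60 + s / 3600

def core_hours(jobs):
    """Sum core-hours over (elapsed, alloc_cpus) pairs, e.g. parsed from sacct."""
    return sum(elapsed_to_hours(e) * cpus for e, cpus in jobs)

# Two jobs: one day on 16 cores, plus 30 minutes on 4 cores
print(core_hours([("1-00:00:00", 16), ("00:30:00", 4)]))
```

Comparing these totals against each allocation's monthly core-hour budget gives the numbers for the usage reports.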
Fairshare Implementation:
nano /shared/admin/scripts/setup_fairshare.sh
Script content:
#!/bin/bash
# Configure Slurm fairshare for research groups
# Note: -i (immediate) skips sacctmgr's interactive confirmation prompts

# Create accounts for research groups
sacctmgr -i add account biology Description="Biology Department"
sacctmgr -i add account chemistry Description="Chemistry Department"
sacctmgr -i add account physics Description="Physics Department"

# Set fairshare values (higher values get a larger share of priority)
# Actual values would depend on funding levels, etc.
sacctmgr -i modify account where name=biology set fairshare=10
sacctmgr -i modify account where name=chemistry set fairshare=15
sacctmgr -i modify account where name=physics set fairshare=10

# Create QoS levels ("normal" usually exists by default)
sacctmgr -i add qos high Priority=2000
sacctmgr -i add qos urgent Priority=10000 MaxWall=24:00:00 MaxJobsPerUser=2

# Associate accounts with QoS
sacctmgr -i modify account where name=biology set qos=normal,high
sacctmgr -i modify account where name=chemistry set qos=normal,high,urgent
sacctmgr -i modify account where name=physics set qos=normal,high

# Sample user setup
sacctmgr -i add user alice Account=biology
sacctmgr -i add user bob Account=chemistry
sacctmgr -i add user charlie Account=physics

echo "Fairshare configuration complete"
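To see what those fairshare values actually do, it helps to work the formula. Slurm's classic fairshare algorithm computes a priority factor F = 2^(-U/S), where U is an account's normalized usage and S its normalized share: an account that has used none of its share gets F = 1.0, and one that has consumed exactly its share gets F = 0.5. A simplified sketch (real Slurm also applies usage decay and, optionally, the Fair Tree algorithm):

```python
def fairshare_factor(norm_usage, norm_shares):
    """Classic Slurm fairshare factor: F = 2^(-usage/shares)."""
    if norm_shares == 0:
        return 0.0
    return 2 ** (-norm_usage / norm_shares)

# Chemistry's fairshare=15 of the 35 total shares above -> S = 15/35
# An unused allocation gets the full factor; usage equal to share halves it
print(fairshare_factor(0.0, 15 / 35), fairshare_factor(15 / 35, 15 / 35))
```

This factor is then scaled by `PriorityWeightFairshare` and combined with age, job size, and QoS factors in the multifactor priority plugin.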
- Research Software Support:
Software Installation Workflow:
nano /shared/admin/procedures/software_installation.md
Content:
# Research Software Installation Procedure

## Request Phase

1. Receive software installation request
   - Software name and version
   - Research purpose
   - External dependencies
   - Number of potential users
2. Evaluate request
   - License compatibility with HPC use
   - Resource requirements
   - Community support level
   - Installation complexity

## Installation Planning

1. Choose installation method:
   - Native compilation
   - Package manager (apt, yum)
   - Environment modules
   - Container (Singularity/Apptainer)
2. Create installation script template
3. Test in development environment

## Installation Process

1. **Native Compilation Example**:

```bash
# Create build directory
mkdir -p /tmp/build
cd /tmp/build

# Download source
wget https://example.com/software-1.0.tar.gz
tar xzf software-1.0.tar.gz
cd software-1.0

# Configure and build
./configure --prefix=/shared/apps/software/1.0
make -j4
make install

# Create environment module
mkdir -p /shared/modulefiles/software
cat > /shared/modulefiles/software/1.0 << EOF
#%Module1.0
proc ModulesHelp { } {
    puts stderr "Software 1.0"
}
module-whatis "Software 1.0"
set prefix /shared/apps/software/1.0
prepend-path PATH \$prefix/bin
prepend-path LD_LIBRARY_PATH \$prefix/lib
EOF
```

2. **Container Installation Example**:

```bash
# Pull Singularity container
singularity pull docker://example/software:1.0

# Move to shared location
mv software_1.0.sif /shared/containers/

# Create wrapper script
cat > /shared/apps/bin/software << EOF
#!/bin/bash
singularity exec /shared/containers/software_1.0.sif software "\$@"
EOF
chmod +x /shared/apps/bin/software

# Create environment module
mkdir -p /shared/modulefiles/software
cat > /shared/modulefiles/software/1.0 << EOF
#%Module1.0
proc ModulesHelp { } {
    puts stderr "Software 1.0 (container)"
}
module-whatis "Software 1.0 (container)"
prepend-path PATH /shared/apps/bin
EOF
```

## Validation and Documentation

1. Test the installation with basic functionality
2. Create documentation:
   - Usage examples
   - Known limitations
   - Dependency information
3. Announce to users:
   - Email notification
   - Documentation link
   - Training opportunity if needed

## Maintenance

1. Track software versions and update schedule
2. Monitor usage statistics
3. Plan deprecation for outdated versions
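Writing modulefiles by hand, as in the installation examples above, gets repetitive once you support dozens of packages; a small generator keeps them consistent. A sketch, assuming the Tcl modulefile layout shown in the procedure; the function name and its parameters are placeholders:

```python
def make_modulefile(name, version, prefix):
    """Render a minimal Tcl environment modulefile as a string."""
    return f"""#%Module1.0
proc ModulesHelp {{ }} {{
    puts stderr "{name} {version}"
}}
module-whatis "{name} {version}"
set prefix {prefix}
prepend-path PATH $prefix/bin
prepend-path LD_LIBRARY_PATH $prefix/lib
"""

text = make_modulefile("software", "1.0", "/shared/apps/software/1.0")
print(text)
```

An installation script could write this string to `/shared/modulefiles/<name>/<version>` instead of embedding a heredoc in every build recipe.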
Communication Skills
- User Support Best Practices:
Support Standard Operating Procedure:
nano /shared/admin/procedures/support_sop.md
Content:
# HPC User Support: Standard Operating Procedure

## Support Channels

- Email support: hpc-help@example.com
- Ticket system: https://example.com/help
- Office hours: Tuesday and Thursday, 10am-12pm
- Slack channel: #hpc-support

## Response Time Standards

| Issue Type | Response Time | Resolution Target |
|------------|---------------|-------------------|
| System outage | 1 hour | 4 hours |
| Job failures | 4 hours | 24 hours |
| Software issues | 8 hours | 48 hours |
| General questions | 24 hours | 72 hours |

## Ticket Workflow

1. **Ticket Creation**
   - Automated acknowledgment sent to user
   - Initial categorization and priority assignment
2. **Triage Process**
   - Review issue details
   - Verify if duplicate/related tickets exist
   - Assign to appropriate team member
3. **Investigation**
   - Gather necessary information
   - Reproduce issue when possible
   - Check system logs and job history
4. **Resolution**
   - Implement solution
   - Document fix
   - Verify solution with user
5. **Follow-up**
   - Close ticket once user confirms resolution
   - Add to knowledge base if appropriate

## Communication Guidelines

- Use clear, non-technical language unless the user is technical
- Provide step-by-step instructions
- Include command examples when relevant
- Explain "why", not just "what", when appropriate
- Manage expectations about timelines

## Difficult Situations

- For frustrated users, acknowledge their frustration
- For highly technical issues, involve specialists early
- For recurring problems, develop a permanent solution
- For feature requests, add to the development backlog

## Knowledge Sharing

- Document all solutions in the internal knowledge base
- Discuss common issues in weekly team meetings
- Identify trends for proactive improvements
- Publish FAQs based on common questions
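The response-time standards in the SOP can be enforced mechanically, for example by the ticket system from Project 3. Below is a hedged sketch of a first-response breach check; the hour values mirror the SOP table, while the issue-type keys, function name, and naive-datetime handling are illustrative simplifications.

```python
import datetime

RESPONSE_HOURS = {          # first-response targets from the SLA table
    'system_outage': 1,
    'job_failure': 4,
    'software_issue': 8,
    'general_question': 24,
}

def response_overdue(issue_type, created, now, responded=False):
    """True if a ticket has passed its first-response target without a response."""
    if responded:
        return False
    deadline = created + datetime.timedelta(hours=RESPONSE_HOURS[issue_type])
    return now > deadline

# A job-failure ticket opened at 09:00 breaches its 4-hour target by 14:00
created = datetime.datetime(2025, 3, 16, 9, 0)
print(response_overdue('job_failure', created, datetime.datetime(2025, 3, 16, 14, 0)))
```

Running a check like this on a schedule and flagging overdue tickets in the triage queue turns the SLA table from a promise into a monitored metric.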
- Technical Documentation Skills:
Documentation Style Guide:
nano /shared/admin/policies/documentation_style.md
Content:
# HPC Documentation Style Guide

## General Principles

- **Accuracy**: Technical information must be verified
- **Clarity**: Use simple language where possible
- **Completeness**: Include all necessary information
- **Consistency**: Follow standard formats
- **Currency**: Regular updates and reviews

## Document Structure

Every document should include:

1. **Header**
   - Title
   - Last updated date
   - Author/maintainer
2. **Purpose Statement**
   - Brief description of what the document covers
3. **Main Content**
   - Organized into logical sections
   - Use headings and subheadings (H1, H2, H3)
4. **Examples**
   - Include practical examples
   - Use realistic scenarios
5. **References**
   - Links to related documentation
   - External resources when appropriate

## Formatting Guidelines

- Use Markdown for all documentation
- Code blocks: use triple backticks with a language tag
- Commands: use `monospace` for inline commands
- Variables: use italics or [placeholders]
- Important notes: use blockquotes or highlight boxes

## Writing Style

- Use active voice
- Write in present tense
- Keep sentences and paragraphs short
- Avoid jargon unless necessary
- Define acronyms on first use

## Code Examples

- Include comments explaining complex commands
- Show expected output when helpful
- Indicate optional parameters
- Use placeholder values with clear naming

## Screenshots and Diagrams

- Use screenshots sparingly and only when necessary
- Label key elements in diagrams
- Ensure diagrams are accessible and legible
- Include alt text for images

## Document Review Process

1. Technical review by peer
2. Clarity review by someone outside the team
3. Final review by documentation owner
4. Quarterly reviews for accuracy
- Training and Knowledge Transfer:
Workshop Planning Guide:
nano /shared/admin/templates/workshop_template.md
Content: