HPC Administration Tutorial - Part 5

Created: 2025-03-16 17:03:26 | Last updated: 2025-03-16 17:03:26 | Status: Public

Professional Skill Development

This final part of our HPC administration tutorial focuses on developing professional skills that are essential for success in HPC system administration, particularly in research environments.

Table of Contents

Professional Skill Development

Technical Skill Development

Research Environment Competencies

Communication Skills

Technical Skill Development

  1. Learning Paths for HPC Administration:

Foundational Skills:
- Linux system administration (RedHat/CentOS/Rocky Linux preferred)
- Networking fundamentals (TCP/IP, routing, firewalls)
- Storage systems (NFS, parallel file systems, RAID)
- Scripting (Bash, Python)
- Configuration management (Ansible, Puppet)

HPC-Specific Skills:
- Resource managers and job schedulers (Slurm, PBS Pro, LSF)
- High-performance networking (InfiniBand, RoCE)
- Parallel programming models (MPI, OpenMP)
- Container technologies for HPC (Singularity/Apptainer, Charliecloud)
- Performance monitoring and analysis tools

Advanced Skills:
- GPU computing and administration
- Cloud integration for hybrid HPC
- Automation and orchestration
- Security in HPC environments
- Machine learning operations (MLOps)

  2. Hands-on Projects for Skill Building:

Project 1: Performance Benchmarking

   # Create benchmark script
   nano /shared/admin/projects/benchmark.sh

Script content:

   #!/bin/bash
   # HPC Cluster Benchmarking Script
   set -euo pipefail

   RESULTS_DIR="/shared/admin/projects/benchmark_results/$(date +%Y%m%d)"
   mkdir -p "$RESULTS_DIR"

   # CPU Performance - STREAM (memory bandwidth)
   cd /tmp
   git clone https://github.com/jeffhammond/STREAM.git
   cd STREAM
   gcc -O3 -fopenmp stream.c -o stream
   ./stream > "$RESULTS_DIR/stream_results.txt"

   # MPI Performance - OSU Micro-Benchmarks (point-to-point latency and bandwidth)
   cd /tmp
   wget http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-5.9.tar.gz
   tar xzf osu-micro-benchmarks-5.9.tar.gz
   cd osu-micro-benchmarks-5.9
   ./configure CC=mpicc CXX=mpicxx
   make
   cd mpi/pt2pt
   mpirun -n 2 ./osu_latency > "$RESULTS_DIR/mpi_latency.txt"
   mpirun -n 2 ./osu_bw > "$RESULTS_DIR/mpi_bandwidth.txt"

   # I/O Performance - IOR (parallel file system throughput)
   cd /tmp
   git clone https://github.com/hpc/ior.git
   cd ior
   ./bootstrap
   ./configure
   make
   mpirun -n 4 ./src/ior -a POSIX -b 1g -t 4m -i 5 -v -C > "$RESULTS_DIR/ior_results.txt"

   echo "Benchmarking complete. Results saved to $RESULTS_DIR"
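
Raw benchmark output is easier to track over time once parsed. A minimal sketch that extracts the best-rate column from STREAM's summary table (assuming the standard STREAM output format; the sample values are hypothetical):

```python
import re

def parse_stream(output):
    """Extract best-rate MB/s per kernel from STREAM's summary table."""
    rates = {}
    for line in output.splitlines():
        # Summary lines look like: 'Triad:   21000.0   0.0017   0.0016   0.0018'
        m = re.match(r'^(Copy|Scale|Add|Triad):\s+([\d.]+)', line)
        if m:
            rates[m.group(1)] = float(m.group(2))
    return rates

sample = """Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           23456.7     0.0014      0.0013      0.0015
Triad:          21000.0     0.0017      0.0016      0.0018"""
print(parse_stream(sample))  # {'Copy': 23456.7, 'Triad': 21000.0}
```

Storing these parsed numbers per date makes regressions after firmware or kernel updates easy to spot.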

Project 2: Auto-scaling Compute Resources

   nano /shared/admin/projects/autoscale.py

Script content:

   #!/usr/bin/env python3
   """
   HPC Cluster Auto-scaling Script

   This script monitors the Slurm queue and dynamically adjusts 
   the availability of compute nodes based on demand.
   """

   import subprocess
   import time
   import logging
   import os

   # Configure logging
   logging.basicConfig(
       filename='/shared/admin/logs/autoscale.log',
       level=logging.INFO,
       format='%(asctime)s - %(levelname)s - %(message)s'
   )

   # Configuration
   IDLE_THRESHOLD = 30  # minutes before powering down idle nodes
   CHECK_INTERVAL = 5   # minutes between checks
   MIN_NODES = 1        # minimum nodes to keep running

   def get_queue_status():
       """Get current Slurm queue status"""
       try:
           result = subprocess.run(['squeue', '--noheader', '--format=%i'], 
                                 stdout=subprocess.PIPE, text=True)
           jobs = result.stdout.strip().split('\n')
           # Filter out empty strings
           jobs = [j for j in jobs if j]
           return len(jobs)
       except Exception as e:
           logging.error(f"Error getting queue status: {e}")
           return 0

   def get_node_status():
       """Get current node status"""
       try:
           result = subprocess.run(['sinfo', '--noheader', '--format=%n %t'], 
                                 stdout=subprocess.PIPE, text=True)
           nodes = result.stdout.strip().split('\n')
           idle_nodes = [n.split()[0] for n in nodes if 'idle' in n]
           active_nodes = [n.split()[0] for n in nodes if 'alloc' in n or 'mix' in n]
           down_nodes = [n.split()[0] for n in nodes if 'down' in n or 'drain' in n]

           return {
               'idle': idle_nodes,
               'active': active_nodes,
               'down': down_nodes
           }
       except Exception as e:
           logging.error(f"Error getting node status: {e}")
           return {'idle': [], 'active': [], 'down': []}

   # Slurm does not report per-node idle duration directly, so record the
   # first time each node is observed idle and measure from there.
   idle_since = {}

   def node_idle_time(node):
       """Get how long a node has been continuously idle, in minutes"""
       try:
           result = subprocess.run(['sinfo', '--noheader', '-n', node, '--format=%t'],
                                   stdout=subprocess.PIPE, text=True)
           state = result.stdout.strip()
           if 'idle' not in state:
               # Node is busy again; forget its idle timestamp
               idle_since.pop(node, None)
               return 0
           first_seen = idle_since.setdefault(node, time.time())
           return (time.time() - first_seen) / 60
       except Exception as e:
           logging.error(f"Error getting idle time for node {node}: {e}")
           return 0

   def suspend_node(node):
       """Suspend an idle node"""
       try:
           logging.info(f"Suspending node {node}")
           # Pass the reason as a single argument; embedded quotes would
           # become part of the reason string itself
           subprocess.run(['scontrol', 'update', f'NodeName={node}', 'State=DRAIN',
                           'Reason=Auto-suspended due to inactivity'])
           # In a real environment, you might use IPMI or similar to power down
           # subprocess.run(['ipmitool', '-H', f'{node}-ipmi', '-U', 'admin', '-P', 'password',
           #               'power', 'off'])
           return True
       except Exception as e:
           logging.error(f"Error suspending node {node}: {e}")
           return False

   def resume_node(node):
       """Resume a suspended node"""
       try:
           logging.info(f"Resuming node {node}")
           # In a real environment, you might use IPMI or similar to power up
           # subprocess.run(['ipmitool', '-H', f'{node}-ipmi', '-U', 'admin', '-P', 'password', 
           #               'power', 'on'])

           # Give the node time to boot
           time.sleep(60)

           # Update Slurm
           subprocess.run(['scontrol', 'update', f'NodeName={node}', 'State=RESUME'])
           return True
       except Exception as e:
           logging.error(f"Error resuming node {node}: {e}")
           return False

   def main():
       """Main loop"""
       logging.info("Starting HPC auto-scaling service")

       while True:
           job_count = get_queue_status()
           node_status = get_node_status()

           logging.info(f"Current status: {job_count} jobs, " +
                       f"{len(node_status['active'])} active nodes, " +
                       f"{len(node_status['idle'])} idle nodes, " +
                       f"{len(node_status['down'])} down nodes")

           # Check if we need to resume nodes due to pending jobs
           if job_count > len(node_status['active']):
               for node in node_status['down']:
                   if 'Auto-suspended' in subprocess.run(
                       ['scontrol', 'show', 'node', node], 
                       stdout=subprocess.PIPE, text=True).stdout:
                       resume_node(node)
                       break

           # Check if we should suspend idle nodes
           active_and_idle_count = len(node_status['active']) + len(node_status['idle'])
           if active_and_idle_count > MIN_NODES:
               for node in node_status['idle']:
                   if node_idle_time(node) > IDLE_THRESHOLD:
                       suspend_node(node)

           # Sleep until next check
           time.sleep(CHECK_INTERVAL * 60)

   if __name__ == "__main__":
       main()
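
Because the script shells out to `sinfo`, its parsing logic is easiest to verify offline. A small sketch that mirrors the classification in `get_node_status()` against canned output (node names are hypothetical):

```python
def classify_nodes(sinfo_output):
    """Classify nodes from 'sinfo --noheader --format=%n %t' style output.

    Mirrors the state matching in get_node_status() so it can be
    sanity-checked without a live cluster.
    """
    status = {'idle': [], 'active': [], 'down': []}
    for line in sinfo_output.strip().splitlines():
        name, state = line.split()
        if 'idle' in state:
            status['idle'].append(name)
        elif 'alloc' in state or 'mix' in state:
            status['active'].append(name)
        elif 'down' in state or 'drain' in state:
            status['down'].append(name)
    return status

sample = "node01 idle\nnode02 alloc\nnode03 mix\nnode04 drain"
print(classify_nodes(sample))
```

Substring matching on the compact state code also catches suffixed states such as `idle~` or `drng`, which is why the script does not compare states for equality.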

Project 3: User Support Ticket System

   nano /shared/admin/projects/ticket_system.py

Script content:

   #!/usr/bin/env python3
   """
   Simple Ticket System for HPC User Support

   This script provides a basic command-line ticket system
   for tracking and managing user support requests.
   """

   import os
   import json
   import datetime
   import argparse
   import uuid

   # Configuration
   TICKET_DB = "/shared/admin/tickets/tickets.json"
   os.makedirs(os.path.dirname(TICKET_DB), exist_ok=True)

   # Initialize empty database if it doesn't exist
   if not os.path.exists(TICKET_DB):
       with open(TICKET_DB, 'w') as f:
           json.dump([], f)

   def load_tickets():
       """Load tickets from database"""
       with open(TICKET_DB, 'r') as f:
           return json.load(f)

   def save_tickets(tickets):
       """Save tickets to database"""
       with open(TICKET_DB, 'w') as f:
           json.dump(tickets, f, indent=2)

   def create_ticket(args):
       """Create a new ticket"""
       tickets = load_tickets()

       # Generate ticket ID
       ticket_id = str(uuid.uuid4())[:8]

       # Create ticket
       ticket = {
           'id': ticket_id,
           'user': args.user,
           'subject': args.subject,
           'description': args.description,
           'status': 'open',
           'priority': args.priority,
           'created': datetime.datetime.now().isoformat(),
           'updated': datetime.datetime.now().isoformat(),
           'assigned_to': '',
           'comments': []
       }

       tickets.append(ticket)
       save_tickets(tickets)

       print(f"Ticket {ticket_id} created successfully")

   def list_tickets(args):
       """List tickets"""
       tickets = load_tickets()

       # Filter tickets
       if args.status:
           tickets = [t for t in tickets if t['status'] == args.status]
       if args.user:
           tickets = [t for t in tickets if t['user'] == args.user]
       if args.assigned:
           tickets = [t for t in tickets if t['assigned_to'] == args.assigned]

       # Sort tickets
       if args.sort == 'priority':
           priority_order = {'low': 0, 'medium': 1, 'high': 2, 'critical': 3}
           tickets.sort(key=lambda t: priority_order.get(t['priority'], 0), reverse=True)
       elif args.sort == 'date':
           tickets.sort(key=lambda t: t['created'])

       # Display tickets
       if not tickets:
           print("No tickets found")
           return

       print(f"Found {len(tickets)} ticket(s):")
       for t in tickets:
           print(f"[{t['id']}] [{t['status']}] [{t['priority']}] {t['subject']} - {t['user']}")

   def view_ticket(args):
       """View details of a ticket"""
       tickets = load_tickets()

       # Find ticket
       ticket = next((t for t in tickets if t['id'] == args.id), None)
       if not ticket:
           print(f"Ticket {args.id} not found")
           return

       # Display ticket details
       print(f"Ticket: {ticket['id']}")
       print(f"Subject: {ticket['subject']}")
       print(f"User: {ticket['user']}")
       print(f"Status: {ticket['status']}")
       print(f"Priority: {ticket['priority']}")
       print(f"Created: {ticket['created']}")
       print(f"Updated: {ticket['updated']}")
       print(f"Assigned to: {ticket['assigned_to'] or 'Unassigned'}")
       print(f"\nDescription:\n{ticket['description']}")

       if ticket['comments']:
           print("\nComments:")
           for comment in ticket['comments']:
               print(f"[{comment['date']}] {comment['author']}:")
               print(f"  {comment['text']}")

   def update_ticket(args):
       """Update a ticket"""
       tickets = load_tickets()

       # Find ticket
       ticket_idx = next((i for i, t in enumerate(tickets) if t['id'] == args.id), None)
       if ticket_idx is None:
           print(f"Ticket {args.id} not found")
           return

       # Update fields
       if args.status:
           tickets[ticket_idx]['status'] = args.status
       if args.priority:
           tickets[ticket_idx]['priority'] = args.priority
       if args.assign:
           tickets[ticket_idx]['assigned_to'] = args.assign
       if args.comment:
           comment = {
               'author': os.environ.get('USER', 'admin'),
               'date': datetime.datetime.now().isoformat(),
               'text': args.comment
           }
           tickets[ticket_idx]['comments'].append(comment)

       tickets[ticket_idx]['updated'] = datetime.datetime.now().isoformat()

       save_tickets(tickets)
       print(f"Ticket {args.id} updated successfully")

   def main():
       """Main function"""
       parser = argparse.ArgumentParser(description='HPC Support Ticket System')
       subparsers = parser.add_subparsers(dest='command', help='Commands')

       # Create command
       create_parser = subparsers.add_parser('create', help='Create a new ticket')
       create_parser.add_argument('--user', required=True, help='User submitting the ticket')
       create_parser.add_argument('--subject', required=True, help='Ticket subject')
       create_parser.add_argument('--description', required=True, help='Ticket description')
       create_parser.add_argument('--priority', default='medium', 
                               choices=['low', 'medium', 'high', 'critical'], 
                               help='Ticket priority')

       # List command
       list_parser = subparsers.add_parser('list', help='List tickets')
       list_parser.add_argument('--status', help='Filter by status')
       list_parser.add_argument('--user', help='Filter by user')
       list_parser.add_argument('--assigned', help='Filter by assigned user')
       list_parser.add_argument('--sort', default='date', choices=['date', 'priority'], 
                             help='Sort tickets')

       # View command
       view_parser = subparsers.add_parser('view', help='View ticket details')
       view_parser.add_argument('id', help='Ticket ID')

       # Update command
       update_parser = subparsers.add_parser('update', help='Update a ticket')
       update_parser.add_argument('id', help='Ticket ID')
       update_parser.add_argument('--status', choices=['open', 'in_progress', 'resolved', 'closed'], 
                               help='Update status')
       update_parser.add_argument('--priority', choices=['low', 'medium', 'high', 'critical'], 
                               help='Update priority')
       update_parser.add_argument('--assign', help='Assign to user')
       update_parser.add_argument('--comment', help='Add a comment')

       args = parser.parse_args()

       if args.command == 'create':
           create_ticket(args)
       elif args.command == 'list':
           list_tickets(args)
       elif args.command == 'view':
           view_ticket(args)
       elif args.command == 'update':
           update_ticket(args)
       else:
           parser.print_help()

   if __name__ == "__main__":
       main()
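
The sorting rules in `list_tickets()` are worth understanding on their own: priorities are ranked via a lookup table, and ISO-8601 `created` strings sort chronologically as plain strings, which is why `--sort date` needs no date parsing. A minimal sketch with hypothetical tickets:

```python
# Same ranking table as ticket_system.py
priority_order = {'low': 0, 'medium': 1, 'high': 2, 'critical': 3}

tickets = [
    {'id': 'a1', 'priority': 'low', 'created': '2025-03-01T09:00:00'},
    {'id': 'b2', 'priority': 'critical', 'created': '2025-03-02T09:00:00'},
    {'id': 'c3', 'priority': 'high', 'created': '2025-03-01T12:00:00'},
]

# --sort priority: highest priority first
tickets.sort(key=lambda t: priority_order.get(t['priority'], 0), reverse=True)
print([t['id'] for t in tickets])  # ['b2', 'c3', 'a1']
```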

Make scripts executable:

   chmod +x /shared/admin/projects/benchmark.sh
   chmod +x /shared/admin/projects/autoscale.py
   chmod +x /shared/admin/projects/ticket_system.py

Research Environment Competencies

  1. Understanding Research Computing Needs:

Key Research Computing Characteristics:
- Domain-specific workloads: Different research domains have distinct computational patterns
- Varying time scales: Jobs ranging from minutes to weeks
- Data intensity: Managing and processing large datasets
- Iterative workflows: Researchers often refine models and analyses
- Specialized software: Custom codes and specialized applications

Common Research Domains and Their Requirements:

| Domain | Typical Workload Characteristics | Common Software | Storage Needs |
|--------|----------------------------------|-----------------|---------------|
| Computational Fluid Dynamics | MPI-heavy, long-running | OpenFOAM, ANSYS Fluent | Medium datasets, high I/O |
| Genomics | High-throughput, embarrassingly parallel | BWA, BLAST, Bowtie | Very large datasets, sequential I/O |
| Machine Learning | GPU-accelerated, memory-intensive | TensorFlow, PyTorch | Medium to large datasets |
| Molecular Dynamics | Highly parallel, GPU-accelerated | GROMACS, NAMD, LAMMPS | Small input, large output |
| Climate Modeling | Long-running, MPI-intensive | WRF, CESM | Extremely large datasets |

Effective Research Support Practices:
- Regular consultations with researchers
- Attendance at departmental seminars
- User surveys and feedback collection
- Collaboration on research publications (methods sections)
- Knowledge transfer through workshops and training

  2. Balancing Resources and Priorities:

Resource Allocation Policies:

   nano /shared/admin/policies/resource_allocation.md

Content:

   # HPC Resource Allocation Policy

   ## Allocation Types

   1. **Standard Allocation**
     - Available to all authorized users
     - Default CPU time: 1,000 core-hours per month
     - Default storage: 100GB home, 1TB project space
     - Job limits: 4 concurrent jobs, max 24-hour runtime

   2. **Priority Allocation**
     - Available to funded projects
     - CPU time based on project requirements
     - Additional storage and longer job limits
     - Higher job priority (higher QoS)

   3. **Urgent Allocation**
     - For time-sensitive needs (e.g., publication deadlines)
     - Limited duration (typically 1-2 weeks)
     - Requires justification and approval

   ## Allocation Request Process

   1. Submit request form detailing:
      - Project description and research goals
      - Computational requirements
      - Software dependencies
      - Storage needs
      - Timeline

   2. Review by HPC administrators

   3. Allocation activation

   ## Usage Monitoring

   - Monthly usage reports sent to users
   - Quarterly reviews of allocation utilization
   - Adjustments based on usage patterns

   ## Renewal Process

   - Standard allocations renew automatically
   - Priority allocations require annual renewal
   - Renewal includes:
     - Summary of research outcomes
     - Updated resource requirements
     - Publication acknowledgments
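
The core-hour caps in the policy above are simple to audit. A minimal sketch (hypothetical job records; a real report would pull elapsed times and CPU counts from `sacct`):

```python
def core_hours(jobs):
    """Total core-hours from (elapsed_seconds, ncpus) job records."""
    return sum(elapsed * ncpus for elapsed, ncpus in jobs) / 3600

# Hypothetical month: a 2-hour job on 16 cores and a 30-minute job on 8 cores
jobs = [(7200, 16), (1800, 8)]
used = core_hours(jobs)
print(used)  # 36.0
used_fraction = used / 1000  # against the 1,000 core-hour standard allocation
```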

Fairshare Implementation:

   nano /shared/admin/scripts/setup_fairshare.sh

Script content:

   #!/bin/bash
   # Configure Slurm fairshare for research groups

   # Create accounts for research groups
   sacctmgr add account biology description="Biology Department"
   sacctmgr add account chemistry description="Chemistry Department"
   sacctmgr add account physics description="Physics Department"

   # Set fairshare values (higher values get higher priority)
   # Actual values would depend on funding levels, etc.
   sacctmgr modify account biology set fairshare=10
   sacctmgr modify account chemistry set fairshare=15
   sacctmgr modify account physics set fairshare=10

   # Create QoS levels
   sacctmgr add qos normal
   sacctmgr add qos high priority=2000
   sacctmgr add qos urgent priority=10000 MaxWall=24:00:00 MaxJobsPerUser=2

   # Associate accounts with QoS
   sacctmgr modify account biology set qos=normal,high
   sacctmgr modify account chemistry set qos=normal,high,urgent
   sacctmgr modify account physics set qos=normal,high

   # Sample user setup
   sacctmgr add user alice account=biology
   sacctmgr add user bob account=chemistry
   sacctmgr add user charlie account=physics

   echo "Fairshare configuration complete"
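
To build intuition for what those fairshare values do, here is a simplified form of the classic fairshare factor, 2^(-usage/shares). Slurm's actual multifactor priority plugin adds usage decay and, under Fair Tree, hierarchical ranking, so treat this as a sketch of the core idea only:

```python
def fairshare_factor(norm_usage, norm_shares):
    """Simplified classic fairshare factor: 2^(-usage/shares).

    Groups that have consumed more than their allotted share see their
    factor (and hence their job priority) fall below 0.5.
    """
    return 2 ** (-norm_usage / norm_shares)

# chemistry holds 15 of the 35 total shares configured above
shares = 15 / 35
print(round(fairshare_factor(0.5, shares), 3))  # used more than its share -> low factor
print(round(fairshare_factor(0.2, shares), 3))  # used less than its share -> high factor
```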

  3. Research Software Support:

Software Installation Workflow:

   nano /shared/admin/procedures/software_installation.md

Content:

   # Research Software Installation Procedure

   ## Request Phase

   1. Receive software installation request
      - Software name and version
      - Research purpose
      - External dependencies
      - Number of potential users

   2. Evaluate request
      - License compatibility with HPC use
      - Resource requirements
      - Community support level
      - Installation complexity

   ## Installation Planning

   1. Choose installation method:
      - Native compilation
      - Package manager (apt, yum)
      - Environment modules
      - Container (Singularity/Apptainer)

   2. Create installation script template

   3. Test in development environment

   ## Installation Process

   1. **Native Compilation Example**:

      # Create build directory
      mkdir -p /tmp/build
      cd /tmp/build

      # Download source
      wget https://example.com/software-1.0.tar.gz
      tar xzf software-1.0.tar.gz
      cd software-1.0

      # Configure and build
      ./configure --prefix=/shared/apps/software/1.0
      make -j4
      make install

      # Create environment module
      mkdir -p /shared/modulefiles/software
      cat > /shared/modulefiles/software/1.0 << EOF
      #%Module1.0

      proc ModulesHelp { } {
          puts stderr "Software 1.0"
      }

      module-whatis "Software 1.0"

      set prefix /shared/apps/software/1.0

      prepend-path PATH \$prefix/bin
      prepend-path LD_LIBRARY_PATH \$prefix/lib
      EOF

   2. **Container Installation Example**:

      # Pull the container image
      singularity pull docker://example/software:1.0

      # Move to shared location
      mv software_1.0.sif /shared/containers/

      # Create wrapper script
      cat > /shared/apps/bin/software << EOF
      #!/bin/bash
      singularity exec /shared/containers/software_1.0.sif software "\$@"
      EOF

      chmod +x /shared/apps/bin/software

      # Create environment module
      mkdir -p /shared/modulefiles/software
      cat > /shared/modulefiles/software/1.0 << EOF
      #%Module1.0

      proc ModulesHelp { } {
          puts stderr "Software 1.0 (container)"
      }

      module-whatis "Software 1.0 (container)"

      prepend-path PATH /shared/apps/bin
      EOF

   ## Validation and Documentation

   1. Test the installation with basic functionality

   2. Create documentation:
      - Usage examples
      - Known limitations
      - Dependency information

   3. Announce to users:
      - Email notification
      - Documentation link
      - Training opportunity if needed

   ## Maintenance

   1. Track software versions and update schedule

   2. Monitor usage statistics

   3. Plan deprecation for outdated versions
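
Modulefiles like the ones in the examples above follow a fixed template, so generating them from a script reduces copy-paste errors. A minimal sketch (the `modulefile` helper is illustrative, not part of the procedure):

```python
def modulefile(name, version, prefix):
    """Render a minimal Tcl environment modulefile, same shape as above."""
    return f"""#%Module1.0

proc ModulesHelp {{ }} {{
    puts stderr "{name} {version}"
}}

module-whatis "{name} {version}"

set prefix {prefix}

prepend-path PATH $prefix/bin
prepend-path LD_LIBRARY_PATH $prefix/lib
"""

text = modulefile("software", "1.0", "/shared/apps/software/1.0")
# Write the result to e.g. /shared/modulefiles/software/1.0
print(text)
```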

Communication Skills

  1. User Support Best Practices:

Support Standard Operating Procedure:

   nano /shared/admin/procedures/support_sop.md

Content:

   # HPC User Support: Standard Operating Procedure

   ## Support Channels

   - Email support: hpc-help@example.com
   - Ticket system: https://example.com/help
   - Office hours: Tuesday and Thursday, 10am-12pm
   - Slack channel: #hpc-support

   ## Response Time Standards

   | Issue Type | Response Time | Resolution Target |
   |------------|---------------|-------------------|
   | System outage | 1 hour | 4 hours |
   | Job failures | 4 hours | 24 hours |
   | Software issues | 8 hours | 48 hours |
   | General questions | 24 hours | 72 hours |

   ## Ticket Workflow

   1. **Ticket Creation**
      - Automated acknowledgment sent to user
      - Initial categorization and priority assignment

   2. **Triage Process**
      - Review issue details
      - Verify if duplicate/related tickets exist
      - Assign to appropriate team member

   3. **Investigation**
      - Gather necessary information
      - Reproduce issue when possible
      - Check system logs and job history

   4. **Resolution**
      - Implement solution
      - Document fix
      - Verify solution with user

   5. **Follow-up**
      - Close ticket once user confirms resolution
      - Add to knowledge base if appropriate

   ## Communication Guidelines

   - Use clear, non-technical language unless user is technical
   - Provide step-by-step instructions
   - Include command examples when relevant
   - Explain "why" not just "what" when appropriate
   - Manage expectations about timelines

   ## Difficult Situations

   - For frustrated users, acknowledge their frustration
   - For highly technical issues, involve specialists early
   - For recurring problems, develop a permanent solution
   - For feature requests, add to development backlog

   ## Knowledge Sharing

   - Document all solutions in internal knowledge base
   - Discuss common issues in weekly team meetings
   - Identify trends for proactive improvements
   - Publish FAQs based on common questions
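
The response-time standards in the SOP table can be enforced programmatically, for example when generating overdue-ticket reports. A minimal sketch using the hours from the table (the helper name is hypothetical):

```python
import datetime

# First-response standards from the SOP table above, in hours
RESPONSE_HOURS = {
    'system outage': 1,
    'job failures': 4,
    'software issues': 8,
    'general questions': 24,
}

def response_deadline(issue_type, created):
    """First-response deadline for a ticket, per the SOP table."""
    hours = RESPONSE_HOURS[issue_type.lower()]
    return created + datetime.timedelta(hours=hours)

created = datetime.datetime(2025, 3, 17, 9, 0)
print(response_deadline('Job failures', created))  # 2025-03-17 13:00:00
```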

  2. Technical Documentation Skills:

Documentation Style Guide:

   nano /shared/admin/policies/documentation_style.md

Content:

   # HPC Documentation Style Guide

   ## General Principles

   - **Accuracy**: Technical information must be verified
   - **Clarity**: Use simple language where possible
   - **Completeness**: Include all necessary information
   - **Consistency**: Follow standard formats
   - **Currency**: Regular updates and reviews

   ## Document Structure

   Every document should include:

   1. **Header**
      - Title
      - Last updated date
      - Author/maintainer

   2. **Purpose Statement**
      - Brief description of what the document covers

   3. **Main Content**
      - Organized into logical sections
      - Use headings and subheadings (H1, H2, H3)

   4. **Examples**
      - Include practical examples
      - Use realistic scenarios

   5. **References**
      - Links to related documentation
      - External resources when appropriate

   ## Formatting Guidelines

   - Use Markdown for all documentation
   - Code blocks: Use triple backticks with language
   - Commands: Use `monospace` for inline commands
   - Variables: Use italics or [placeholders]
   - Important notes: Use blockquotes or highlight boxes

   ## Writing Style

   - Use active voice
   - Write in present tense
   - Keep sentences and paragraphs short
   - Avoid jargon unless necessary
   - Define acronyms on first use

   ## Code Examples

   - Include comments explaining complex commands
   - Show expected output when helpful
   - Indicate optional parameters
   - Use placeholder values with clear naming

   ## Screenshots and Diagrams

   - Use screenshots sparingly and only when necessary
   - Label key elements in diagrams
   - Ensure diagrams are accessible and legible
   - Include alt text for images

   ## Document Review Process

   1. Technical review by peer
   2. Clarity review by someone outside the team
   3. Final review by documentation owner
   4. Quarterly reviews for accuracy

  3. Training and Knowledge Transfer:

Workshop Planning Guide:

   nano /shared/admin/templates/workshop_template.md

Content: