Slurm Administration & Systems Architecture (Santa Rosa) Job at Midjourney, Santa Rosa, CA

  • Midjourney
  • Santa Rosa, CA

Job Description

Overview

We are seeking a highly skilled HPC/AI/ML Cluster Engineer to support the design, deployment, and ongoing operations of large-scale HPC environments powered by Slurm. This role centers on cluster engineering, administration, and performance optimization, with emphasis on GPU-accelerated computing, advanced networking, and workload scheduling. You will work closely with our researchers, vendors, and partners to manage the Slurm clusters used for AI/ML workloads.

Responsibilities

Cluster Engineering & Deployment

  • Participate in the design and bring-up of bare metal HPC/AI/ML environments.
  • Architect compute node definitions (NUMA, GRES GPU topologies, CPU pinning) and Slurm partitioning strategies for diverse workloads.
  • Integrate heterogeneous hardware platforms into cohesive scheduling environments.
  • Develop provisioning and imaging workflows (Ansible, MAAS, cloud-init, CI/CD pipelines) for reproducible cluster build-out.
  • Coordinate communications between vendors, researchers, and other partners during cluster bring-up and operation.

Slurm Management

  • Configure and operate the Slurm Workload Manager.
  • Build custom Slurm plugins and scripts (epilog/prolog, pam_slurm_adopt) to extend functionality and integrate with authentication and monitoring.
  • Manage federated Slurm setups across multi-site or hybrid cloud environments.

System Administration & Monitoring

  • Administer Linux HPC environments, including network configuration, storage integration, and kernel tuning for HPC workloads.
  • Deploy and maintain observability stacks for system health, GPU metrics, and job monitoring.
  • Automate failure detection, node health checks, and job cleanup to ensure high uptime and reliability.
  • Manage security and access control (LDAP/SSSD, VPN, PAM, SSH session auditing).

User & Stakeholder Support

  • Assist cluster users with developing workflows that make efficient use of compute resources.
  • Containerize HPC applications with Docker/Podman/Enroot-Pyxis and integrate GPU-aware runtimes into Slurm jobs.
  • Automate cost accounting and cluster usage reporting.

Qualifications

  • 7+ years of experience in HPC cluster administration and engineering, with deep knowledge of Slurm.
  • Familiarity with common AI/ML software package dependencies and workflows.
  • Expert in Slurm configuration, partition design, QoS/preemption policies, and GRES GPU scheduling.
  • Strong background in Linux system administration, networking, and performance tuning for HPC environments.
  • Hands-on experience with parallel file systems, advanced networking (InfiniBand, RoCE, 100/200 GbE), and monitoring stacks.
  • Proficient with automation tools (Ansible, Terraform, CI/CD pipelines) and version control.
  • Demonstrated ability to operate GPU-accelerated clusters at scale.

Job Tags

Part time
