Site Reliability Engineer (SRE) - AI Infrastructure (San Francisco) Job at Hamilton Barnes Associates Limited, San Francisco, CA

  • Hamilton Barnes Associates Limited
  • San Francisco, CA

Job Description

Are you looking for an exciting new opportunity?

Join a stealth-mode hyperscale data center startup building a next-generation AI and cloud platform designed for startups and advanced research, powered by thousands of H100, H200, and B200 GPUs available on demand. Their platform supports everything from rapid experimentation to full-scale model training and inference, with flexible orchestration via Slurm, Kubernetes, or direct SSH access.

This is a rare opportunity to work at the intersection of hyperscale infrastructure and AI, shaping the operational backbone of one of the largest GPU clusters in private deployment. If you want to build and operate infrastructure for frontier AI workloads, automate systems at petascale, and be part of a founding engineering team, this is the place to do it.

Responsibilities

  • Design, deploy, and maintain large-scale GPU clusters (H100/H200/B200) for training and inference workloads.
  • Build automation pipelines for provisioning, scaling, and monitoring compute resources across Slurm and Kubernetes environments.
  • Develop observability, alerting, and auto-healing systems for high-availability GPU workloads.
  • Collaborate with ML, networking, and platform teams to optimize resource scheduling, GPU utilization, and data flow.
  • Implement infrastructure-as-code, CI/CD pipelines, and reliability standards across thousands of nodes.
  • Diagnose performance bottlenecks and drive continuous improvements in reliability, latency, and throughput.

Skills / Must Have

  • 7+ years of experience in SRE, DevOps, or Infrastructure Engineering roles supporting large-scale compute environments.
  • Strong hands-on experience with Kubernetes and Slurm for cluster orchestration and workload management.
  • Deep knowledge of Linux systems, networking, and GPU infrastructure (NVIDIA H100/H200/B200 preferred).
  • Proficiency in Python, Go, or Bash for automation, tooling, and performance tuning.
  • Experience with observability stacks (Prometheus, Grafana, Loki) and incident response frameworks.
  • Familiarity with high-performance computing (HPC) or AI/ML training infrastructure at scale.
  • Background in reliability engineering, distributed systems, or hardware acceleration environments is a strong plus.

Benefits

  • Equity

Salary

  • $300,000 gross per year

Job Tags

Full time, Flexible hours
