Site Reliability Engineer (SRE) - AI Infrastructure (San Francisco) Job at Hamilton Barnes Associates Limited, San Francisco, CA

  • Hamilton Barnes Associates Limited
  • San Francisco, CA

Job Description

Are you looking for an exciting new opportunity?

Join a stealth-mode hyperscale data center startup building a next-generation AI and cloud platform designed for startups and advanced research, powered by thousands of H100, H200, and B200 GPUs available on demand. Their platform supports everything from rapid experimentation to full-scale model training and inference, with flexible orchestration via Slurm, Kubernetes, or direct SSH access.

This is a rare opportunity to work at the intersection of hyperscale infrastructure and AI, shaping the operational backbone of one of the largest GPU clusters in private deployment. If you want to build and operate infrastructure for frontier AI workloads, automate systems at petascale, and be part of a founding engineering team, this is the place to do it.

Responsibilities

  • Design, deploy, and maintain large-scale GPU clusters (H100/H200/B200) for training and inference workloads.
  • Build automation pipelines for provisioning, scaling, and monitoring compute resources across Slurm and Kubernetes environments.
  • Develop observability, alerting, and auto-healing systems for high-availability GPU workloads.
  • Collaborate with ML, networking, and platform teams to optimize resource scheduling, GPU utilization, and data flow.
  • Implement infrastructure-as-code, CI/CD pipelines, and reliability standards across thousands of nodes.
  • Diagnose performance bottlenecks and drive continuous improvements in reliability, latency, and throughput.

Skills / Must Have

  • 7+ years of experience in SRE, DevOps, or Infrastructure Engineering roles supporting large-scale compute environments.
  • Strong hands-on experience with Kubernetes and Slurm for cluster orchestration and workload management.
  • Deep knowledge of Linux systems, networking, and GPU infrastructure (NVIDIA H100/H200/B200 preferred).
  • Proficiency in Python, Go, or Bash for automation, tooling, and performance tuning.
  • Experience with observability stacks (Prometheus, Grafana, Loki) and incident response frameworks.
  • Familiarity with high-performance computing (HPC) or AI/ML training infrastructure at scale.
  • Background in reliability engineering, distributed systems, or hardware acceleration environments is a strong plus.

Benefits

  • Equity

Salary

  • $300,000 gross per year

Job Tags

Full time, Flexible hours
