Principal Site Reliability Engineer

Crusoe

Administration, Software Engineering
San Francisco, CA, USA
USD 261k-326k / year + Equity
Posted on Dec 13, 2025

Location

San Francisco, CA - US

Employment Type

Full time

Location Type

On-site

Department

Cloud Engineering

Crusoe's mission is to accelerate the abundance of energy and intelligence. We’re crafting the engine that powers a world where people can create ambitiously with AI — without sacrificing scale, speed, or sustainability.

Be a part of the AI revolution with sustainable technology at Crusoe. Here, you'll drive meaningful innovation, make a tangible impact, and join a team that’s setting the pace for responsible, transformative cloud infrastructure.

About This Role

As a Principal Site Reliability Engineer, you will play a critical role in designing and operating a next-generation NeoCloud built for AI, GPU, and high-performance workloads. This role sits at the intersection of infrastructure architecture, reliability engineering, and technical leadership. You’ll set reliability strategy, influence platform design, and ensure the cloud scales safely, efficiently, and predictably as customer demand accelerates.

You are a hands-on technical leader who thrives in complex distributed systems, drives clarity in ambiguous environments, and raises the bar for operational excellence across the organization.

What You’ll Be Working On

  • Define and own the reliability architecture for a NeoCloud platform supporting GPU-dense, latency-sensitive, and large-scale distributed workloads

  • Design and evolve SLOs, SLIs, and error budgets that meaningfully balance reliability, velocity, and customer experience

  • Lead incident response strategy for high-severity events, including root cause analysis and long-term remediation

  • Architect and improve observability systems (metrics, logs, tracing) to support rapid detection and diagnosis at scale

  • Partner with Infrastructure, Networking, Hardware, and Platform teams to influence system design before production issues occur

  • Drive automation across provisioning, deployment, capacity management, and failure recovery

  • Establish best practices for on-call health, operational readiness, and production change management

  • Serve as a technical authority and mentor for senior and staff-level engineers across the SRE and infrastructure org

What You’ll Bring to the Team

  • 10+ years of experience operating and scaling large-scale distributed systems in production environments

  • Deep expertise in SRE principles: reliability modeling, incident management, toil reduction, and systems thinking

  • Strong background in cloud or infrastructure platforms (public cloud, private cloud, or NeoCloud environments)

  • Hands-on experience with Kubernetes and containerized workloads at scale

  • Proficiency in one or more programming languages (Go, Python, Rust, or similar) with production-grade code ownership

  • Strong understanding of Linux systems, networking fundamentals, and performance bottlenecks

  • Proven ability to lead through influence — setting direction across teams without direct authority

  • Exceptional communication skills, especially during high-stakes incidents and cross-functional decision-making

Bonus Points

  • Experience supporting GPU-based, AI/ML, or HPC workloads

  • Familiarity with bare-metal provisioning, hardware lifecycle management, or data center operations

  • Experience building or scaling a NeoCloud or cloud-adjacent platform from early growth to maturity

  • Background in capacity planning for GPU, storage, or high-throughput networking environments

  • Passion for sustainable infrastructure or next-generation cloud architectures

Benefits

  • Industry-competitive pay

  • Restricted Stock Units in a fast-growing, well-funded technology company

  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents

  • Employer contributions to HSA accounts

  • Paid parental leave

  • Paid life insurance, short-term and long-term disability

  • Teladoc

  • 401(k) with a 100% match up to 4% of salary

  • Generous paid time off and holiday schedule

  • Cell phone reimbursement

  • Tuition reimbursement

  • Subscription to the Calm app

  • MetLife Legal

  • Company-paid commuter benefit of $300 per month

Compensation

Compensation will be paid in the range of $261,000 - $326,000 + Bonus. Restricted Stock Units are included in all offers. Compensation is determined by the applicant’s education, experience, knowledge, skills, and abilities, as well as internal equity and alignment with market data.

Crusoe is an Equal Opportunity Employer. Employment decisions are made without regard to race, color, religion, disability, genetic information, pregnancy, citizenship, marital status, sex/gender, sexual preference/orientation, gender identity, age, veteran status, national origin, or any other status protected by law or regulation.