[Remote] Staff Site Reliability Engineer
Note: This is a remote position open to candidates in the USA. Domino Data Lab builds software that helps AI-driven organizations improve their data science and AI solutions. They are looking for a Staff Site Reliability Engineer to lead the development of AI-assisted reliability tooling, improve observability, mentor engineers, and strengthen operational practices.
Responsibilities
- Lead the development of Domino's internal AI-assisted reliability tooling, including systems that analyze tickets, logs, traces, and documentation to help teams resolve outages faster with less recurring toil
- Improve observability coverage and signal quality for our most critical customer-facing systems, so engineers have better data to work with throughout the development and support lifecycle
- Own incident response end-to-end, from detection to remediation, and leave each problem space better documented, better understood, and less likely to recur
- Guide the development of customer and user-facing observability tools within our products
- Define and mature SLO/SLI frameworks for priority services, turning abstract reliability goals into measurable, actionable standards
- Scale cloud operations practices for Domino’s single-tenant SaaS offering, and work with engineering teams to improve the reliability and repeatability of customer deployments and upgrades
- Mentor other engineers and shape how SRE is practiced at Domino, including incident response workflows, operational readiness expectations, and post-incident learning culture
Skills
- Deep experience in Site Reliability Engineering, platform engineering, or a software engineering role with genuine, hands-on operational ownership
- Fluency with Kubernetes, Linux, cloud platforms, and observability tooling, and the ability to use them to investigate complex, real-world production problems
- A strong ability to identify and close reliability gaps in technical products, tools, and processes
- Strong software engineering skills in Python or Go, with a track record of building internal tools or services that people actually rely on
- Comfort leading technically ambiguous work and influencing direction across teams without needing direct authority to get things done
- A history of improving reliability through engineering and automation, not just putting out fires manually
- Strong communication skills and real experience mentoring engineers or shaping technical decision-making on your team
- Sound judgment about AI/LLM tooling: you know where it genuinely helps in operational workflows and where it adds noise instead of signal
- Experience with LLM-based systems, retrieval workflows, SaaS platform operations, or building tooling for support or developer teams
Benefits
- Equity
- Company bonus or sales commissions/bonuses
- 401(k) plan
- Medical, dental, and vision benefits
- Wellness stipends