[Remote] Sr Site Reliability Engineer

Remote | Full-time | Hiring now

Note: This is a remote position open to candidates in the USA. Commence is a company focused on data-centric transformation in healthcare, aiming to improve health outcomes through efficient processes. It is seeking a Senior Site Reliability Engineer to ensure the reliability and operational health of its healthcare data platform, collaborating with engineering teams and managing incident response.

Responsibilities

  • Design, implement, and own observability infrastructure including metrics, logging, tracing, and alerting across distributed systems
  • Define and enforce SLOs, SLIs, and error budgets in partnership with product and engineering teams
  • Lead incident response: triage, coordinate remediation, conduct blameless post-mortems, and drive systemic fixes
  • Build and maintain CI/CD pipelines that support rapid, safe delivery of changes to production
  • Collaborate with engineering teams on infrastructure changes, including reading, modifying, and contributing to existing infrastructure-as-code (Terraform or CloudFormation)
  • Design and operate highly available, fault-tolerant systems—including auto-scaling, failover, and disaster recovery strategies
  • Reduce operational toil through automation; eliminate manual processes before they become habits
  • Collaborate with software engineers to establish reliability-first design patterns and review architectures for operational risk
  • Manage Kubernetes or container orchestration environments at scale
  • Ensure systems meet compliance and security requirements, particularly those applicable to healthcare data (HIPAA, SOC 2)
  • Provide technical mentorship and guidance to engineers across the organization on reliability practices
  • Participate in on-call rotation with a commitment to continuously reducing the need for it

Skills

  • 7+ years of experience in SRE, platform engineering, or DevOps roles
  • Exceptional problem-solving under pressure—demonstrated track record of diagnosing complex, high-stakes system failures and building durable solutions
  • Deep hands-on experience with AWS services including EC2, EKS/ECS, Lambda, RDS, S3, CloudWatch, and related tooling
  • Familiarity with infrastructure-as-code (Terraform or CloudFormation)—able to contribute to existing configurations
  • Experience designing and operating distributed systems with strict availability and latency requirements
  • Proficiency in at least one scripting or systems language (Python, Go, Bash, or similar) for automation and tooling
  • Experience with container orchestration (Kubernetes, ECS) in production environments
  • Expertise in observability tooling (OpenSearch, Prometheus/Grafana, or equivalent)
  • Hands-on experience with CI/CD platforms (GitHub Actions, Jenkins, CircleCI, or similar)
  • Proven ability to define and operationalize SLOs and error budgets
  • Experience with relational and NoSQL databases—performance tuning, replication, and backup strategies
  • Strong working knowledge of networking fundamentals: DNS, load balancing, VPCs, TLS
  • Excellent communication skills—able to translate technical risk into business impact for non-engineering stakeholders
  • AWS Certifications (Solutions Architect, DevOps Engineer, or SysOps Administrator)
  • Experience in healthcare technology or other regulated industries (HIPAA, SOC 2, FedRAMP)
  • Familiarity with chaos engineering practices and tooling
  • Experience with data pipeline reliability (ETL/ELT workflows, streaming systems)
  • Exposure to AI/ML infrastructure and the reliability challenges unique to model serving
  • Familiarity with additional cloud platforms (Azure, Google Cloud)
  • Contributions to open-source reliability or infrastructure tooling

Company Overview

  • Commence delivers an AI-driven healthcare data platform and clinical expertise that support analytics, decisions, and workflow improvement. It is headquartered in Virginia Beach, Virginia, USA, with a workforce of 501-1000 employees. Its website is https://commence.ai.