[Remote] Cloud Operations Engineer

Remote, Full-time, Hiring now

Note: This is a remote position open to candidates in the USA. O'Reilly Media is dedicated to sharing the knowledge of innovators and helping professionals develop expertise. As a Cloud Operations Engineer, you will work on the systems and tooling that power the learning platform, focusing on infrastructure-as-code and Kubernetes maintenance while collaborating with product engineering teams.

Responsibilities

  • Maintain and update our Kubernetes cluster to ensure steady-state operations
  • Write or extend Terraform modules to provision and manage cloud infrastructure
  • Contribute features to the Python CLI tooling we use to manage infrastructure workflows
  • Design, build, and maintain cloud infrastructure using infrastructure-as-code (Terraform) on GCP
  • Manage and evolve our Kubernetes platform, including cluster operations, workload configuration, and service mesh (Istio)
  • Develop and improve internal tooling that abstracts cloud complexity and improves the developer experience
  • Collaborate with product engineering teams to understand service deployment needs and deliver infrastructure solutions
  • Monitor platform health using Datadog; proactively identify and resolve performance, availability, and security issues
  • Participate in on-call rotation and incident response; drive blameless post-mortems and eliminate recurring issues at their root cause
  • Define and track service-level indicators and objectives (SLIs/SLOs) for critical platform components
  • Implement and refine alerting, dashboards, and runbooks that reduce mean time to resolution
  • Embed security best practices into infrastructure workflows (DevSecOps) — not as an afterthought, but as a design principle
  • Help maintain cloud security posture, IAM hygiene, and policy guardrails across our cloud environment
  • Stay current with cloud security developments and proactively surface risks to the team
  • Execute and maintain our automated disaster recovery processes
  • Work closely with product engineering teams to understand their needs and remove infrastructure friction
  • Document systems, processes, and architectural decisions clearly so knowledge is shared, not siloed
  • Recommend improvements to tooling, architecture, and processes — and help drive them to completion
  • Keep current with the evolving cloud-native ecosystem and bring relevant knowledge back to the team

Skills

  • Bachelor's degree in Computer Science or a related field
  • In lieu of a degree, equivalent education and/or experience may be considered
  • 5+ years of experience working in cloud infrastructure, platform engineering, or a related discipline
  • Hands-on experience with Kubernetes in production environments (cluster management, workloads, networking)
  • Proficiency with infrastructure-as-code tools, particularly Terraform
  • Experience with at least one major cloud provider (GCP, AWS, or Azure)
  • Solid scripting and automation skills in Python, Bash, or a comparable language
  • Experience with modern observability platforms (Datadog, Grafana, or similar)
  • Strong understanding of Linux systems administration
  • Working knowledge of CI/CD concepts and tools (GitHub Actions, ArgoCD, Jenkins, or similar)
  • Excellent communication skills — you write clearly, ask good questions, and explain complex systems accessibly
  • AI-Augmented Development: Demonstrated ability to use AI-enabled development tools (e.g., Claude Code, Cursor) to streamline coding, debugging, and infrastructure-as-code authoring
  • Experience with service mesh technologies such as Istio or Linkerd
  • Familiarity with GitOps workflows and tools (ArgoCD, Flux)
  • Experience with DevSecOps practices and tooling (Snyk, Trivy, OPA, or similar)
  • Working knowledge of SQL databases (PostgreSQL or MySQL)
  • Familiarity with FinOps practices and cloud cost optimization
  • Experience building or consuming internal developer platforms (IDPs)
  • Configuration management experience (Ansible, Chef, or similar)
  • Relevant certifications (CKA, CKAD, AWS/GCP Professional, or similar)

Company Overview

  • Inspiring the future for more than 45 years. We share the knowledge and teach the skills people need to change their world. O'Reilly Media was founded in 1978 and is headquartered in Seattle, Washington, USA, with a workforce of 201-500 employees. Its website is http://dankaminsky.com.