[Remote] Site Reliability Engineer
Note: This is a remote position open to candidates in the USA. TalentDome Staffing is a high-growth, AI-driven narrative intelligence startup seeking a Senior Site Reliability Engineer (SRE) / Infrastructure Engineer. The role requires operational ownership of a production environment, with a focus on infrastructure orchestration, high-throughput scaling, and GPU application deployment to support massive data flows.
Responsibilities
- Infrastructure Orchestration: Maintain, optimize, and expand the core infrastructure, ensuring everything is cleanly declared via Terraform and managed across high-performance Kubernetes clusters
- High-Throughput Scaling: Design and manage environments capable of sustaining large-scale data ingestion, high-throughput pipelines, and massive search database operations
- GPU Application Deployment: Collaborate with the R&D team to successfully deploy, optimize, and manage highly specialized machine learning and AI applications running on GPUs
- System Optimization & Reliability: Partner closely with backend teams to heavily optimize production Java deployments and Python workflows, guaranteeing maximum uptime, high availability, and seamless scaling
- Technical Leadership: Serve as a foundational pillar for infrastructure architecture, establishing operational best practices without requiring hand-holding or micromanagement
Skills
- 8+ years of dedicated, hands-on experience with Kubernetes and Terraform
- Ideally 15+ years of total technical experience in infrastructure or site reliability engineering
- Deep architectural mastery of deployment systems, cluster orchestration, and high-availability scaling
- Proven cloud hosting experience, with strong proficiency in AWS
- Exposure to or experience with GCP is a significant advantage for supporting R&D workflows
- Concrete experience deploying and scaling application workflows that interface with GPUs and high-volume data ingestion layers
- Familiarity with or exposure to optimizing runtime environments for Java and Python applications is highly beneficial
- Exceptional self-direction and problem-solving capability
- Professional maturity to eventually step into a formal leadership role as the infrastructure team expands
Benefits
- True Operational Autonomy: The opportunity to architect and scale greenfield deployments for a rapidly expanding AI data platform.
- High-Caliber Environment: Collaborate directly with an elite team of backend engineers and machine learning R&D specialists.
- Flexible, Modern Workspace: Enjoy 100% remote working flexibility across the United States.
- Open to equity incentives.
Company Overview