[Remote] Senior Engineer - Platform Integration
Note: This is a remote position open to candidates in the USA. Core42 is a leader in AI-powered cloud and digital infrastructure, driving transformative technology solutions globally. The company is seeking a Senior Engineer - Platform Integration to design, build, and operate core GPUaaS control plane services while collaborating with infrastructure teams to manage GPU resources at scale.
Responsibilities
- Design, build, and operate core GPUaaS control plane services
- Develop backend APIs and microservices (Python, Go, or Node.js)
- Integrate deeply with Kubernetes APIs for provisioning, scheduling, and multitenancy
- Build and maintain authentication, authorization, and identity systems (OAuth2, SSO, RBAC, LDAP)
- Design and implement usage tracking and billing systems with strong correctness guarantees
- Design PostgreSQL schemas optimized for scale, auditing, and reliability
- Build CI/CD pipelines and deployment automation for platform services
- Collaborate with infrastructure teams to surface GPU and system telemetry
- Own systems in production including reliability, failure modes, and performance
Skills
- 4–7 years of software engineering experience in backend, platform, or infrastructure roles
- Strong backend engineering experience in Python (FastAPI), Go, or Node.js
- Hands-on experience with Kubernetes in production environments
- Experience building and operating REST and/or gRPC APIs
- Strong understanding of microservices architecture and cloud-native systems
- Experience with PostgreSQL schema design, performance, and migrations
- Familiarity with authentication/authorization systems (OAuth2, SAML, JWT, RBAC)
- Experience working on systems that require high reliability and correctness under failure conditions
- Ability to operate independently in ambiguous or greenfield environments
- Experience with GPU infrastructure, HPC environments, or AI/ML platforms
- Experience with Kubernetes controllers, operators, Helm, or cluster lifecycle tooling
- Exposure to Slurm or hybrid Kubernetes/HPC scheduling systems
- Experience with observability stacks (Prometheus, Grafana, OpenTelemetry)
- Experience building developer platforms or internal infrastructure tools
- Familiarity with MLOps tooling (Kubeflow, MLflow, PyTorch pipelines)
- Experience with GitOps workflows (ArgoCD, Flux, etc.)
- Experience working at cloud providers or infrastructure-heavy SaaS companies
- Exposure to distributed scheduling systems or resource orchestration platforms
- Experience with high-scale multi-tenant systems
Benefits
- Bonus and benefits in addition to base pay
- Reasonable accommodations for qualified individuals with disabilities throughout the application and employment process
Company Overview