Delivery — AI Cloud & GPU
Senior Engineer — GPU Infrastructure & MLOps
Build and operate the GPU infrastructure that runs AI at scale.
Remote / US or India | Remote | Full-Time | Req CDC-017
About the role
The Senior Engineer for GPU Infrastructure & MLOps deploys and operates GPU clusters, manages AI platform environments, and supports clients with infrastructure lifecycle management for AI workloads.
What you will do
- Deploy and configure GPU server clusters (NVIDIA DGX, HGX, and OEM GPU servers)
- Install and configure InfiniBand and Ethernet networking for GPU clusters
- Deploy and manage Kubernetes clusters with GPU operator, device plugins, and scheduling
- Configure and manage distributed AI training infrastructure (Slurm, Ray, or MPI)
- Support MLOps toolchain deployment: MLflow, Weights & Biases, NVIDIA NIM
- Monitor GPU cluster performance: utilization, thermal, memory, and network metrics
- Automate provisioning and lifecycle management using Ansible, Terraform, or equivalent
- Respond to infrastructure incidents in client GPU environments
What we need
- 5+ years in GPU infrastructure, HPC, or AI platform engineering
- Hands-on experience deploying NVIDIA GPU servers and InfiniBand networking
- Proficiency with Kubernetes (GPU operator, device plugins, scheduling)
- Proficiency in Linux system administration at scale
- Experience with infrastructure automation: Ansible, Terraform, or Pulumi
Nice to have
- NVIDIA DGX-certified administrator or equivalent
- Experience with NCCL collective operations and distributed training optimization
Apply
Apply for Senior Engineer — GPU Infrastructure & MLOps
Tell us about yourself and attach your resume. We review every application personally.