Delivery — AI Cloud & GPU

Senior Engineer — GPU Infrastructure & MLOps

Build and operate the GPU infrastructure that runs AI at scale.

Remote / US or India | Full-Time | Req CDC-017

About the role

The Senior Engineer for GPU Infrastructure & MLOps deploys and operates GPU clusters, manages AI platform environments, and supports clients with infrastructure lifecycle management for AI workloads.

What you will do

Deploy and configure GPU server clusters (NVIDIA DGX, HGX, and OEM GPU servers)
Install and configure InfiniBand and Ethernet networking for GPU clusters
Deploy and manage Kubernetes clusters with the GPU Operator, device plugins, and GPU-aware scheduling
Configure and manage distributed AI training infrastructure (Slurm, Ray, or MPI)
Support MLOps toolchain deployment: MLflow, Weights & Biases, NVIDIA NIM
Monitor GPU cluster performance: utilization, thermal, memory, and network metrics
Automate provisioning and lifecycle management using Ansible, Terraform, or equivalent
Respond to infrastructure incidents in client GPU environments

What we need

5+ years in GPU infrastructure, HPC, or AI platform engineering
Hands-on experience deploying NVIDIA GPU servers and InfiniBand networking
Proficiency with Kubernetes for GPU workloads (GPU Operator, device plugins, GPU-aware scheduling)
Proficiency in Linux system administration at scale
Experience with infrastructure automation: Ansible, Terraform, or Pulumi

Nice to have

NVIDIA DGX-certified administrator or equivalent
Experience with NCCL collective operations and distributed training optimization

Apply

Tell us about yourself and attach your resume. We review every application personally.