CloudData.Center
Delivery — AI Cloud & GPU

Senior Engineer — GPU Infrastructure & MLOps

Build and operate the GPU infrastructure that runs AI at scale.

Remote / US or India | Remote | Full-Time | Req CDC-017

About the role

The Senior Engineer for GPU Infrastructure & MLOps deploys and operates GPU clusters, manages AI platform environments, and supports clients with infrastructure lifecycle management for AI workloads.

What you will do

  • Deploy and configure GPU server clusters (NVIDIA DGX, HGX, and OEM GPU servers)
  • Install and configure InfiniBand and Ethernet networking for GPU clusters
  • Deploy and manage Kubernetes clusters with the GPU Operator, device plugins, and GPU-aware scheduling
  • Configure and manage distributed AI training infrastructure (Slurm, Ray, or MPI)
  • Support MLOps toolchain deployment: MLflow, Weights & Biases, NVIDIA NIM
  • Monitor GPU cluster performance: utilization, thermal, memory, and network metrics
  • Automate provisioning and lifecycle management using Ansible, Terraform, or equivalent
  • Respond to infrastructure incidents in client GPU environments

What we need

  • 5+ years in GPU infrastructure, HPC, or AI platform engineering
  • Hands-on experience deploying NVIDIA GPU servers and InfiniBand networking
  • Proficiency with Kubernetes (GPU operator, device plugins, scheduling)
  • Proficiency in Linux system administration at scale
  • Experience with infrastructure automation: Ansible, Terraform, or Pulumi

Nice to have

  • NVIDIA DGX-certified administrator or equivalent
  • Experience with NCCL collective operations and distributed training optimization

Apply

Apply for Senior Engineer — GPU Infrastructure & MLOps

Tell us about yourself and attach your resume. We review every application personally.

By applying, you consent to CloudData.Center storing your application details for recruiting purposes. We never share your information.