Senior Cloud Platform Engineer
Building GPU/AI infrastructure at scale · CNCF Golden Kubestronaut · Open Source · Researcher
Senior Cloud Platform Engineer at W.W. Grainger, Inc. with deep expertise in cloud-native GPU/AI infrastructure, Kubernetes ecosystems, and platform engineering. I build open-source tools for GPU workload autoscaling, observability, and topology-aware incident response.
Actively contributing to CNCF and ASWF projects with 26+ merged PRs across 15+ projects. Published researcher with 13+ peer-reviewed papers on AI/ML infrastructure, Kubernetes, and platform engineering.
Recognized as a CNCF Golden Kubestronaut, a designation for professionals who have earned every CNCF certification, including all five Kubernetes certifications. Community member of the Dragonfly project and active contributor to Volcano, KEDA, OpenTelemetry, and more.
Volcano GPU NUMA-aware scheduler (3-repo PR), KEDA GPU Scaler, Kube Topology Agent, Dragonfly Community Member, IEEE peer reviewer, HPSF & InfoQ speaker
OTel GPU Receiver, OpenTelemetry docs contributions, Kubernetes website docs PRs, published 6 peer-reviewed papers on AI/ML infrastructure
OpenColorIO release signing & Vulkan tests, OpenCue subscription recalculation, OpenImageIO bug fix, RAWtoACES docs, xSTUDIO links fix
Achieved all 16 CNCF certifications, including CKS, CKA, CKAD, KCNA, and KCSA, plus 11 Golden-tier certs. Published first peer-reviewed papers on Kubernetes and zero-trust infrastructure.
One of the professionals who have earned all CNCF Kubernetes and Cloud Native certifications, demonstrating comprehensive expertise across the entire cloud-native ecosystem. This highly selective designation is held by fewer than 400 practitioners globally.
Certified Kubernetes Security Specialist
Certified Kubernetes Administrator
Certified Kubernetes Application Developer
Kubernetes & Cloud Native Associate
Kubernetes & Cloud Native Security Associate
Prometheus Certified Associate
Certified GitOps Associate
Certified Cilium Associate
Certified Argo Project Associate
Istio Certified Associate
Kyverno Certified Associate
OpenTelemetry Certified Associate
Cloud Native Platform Associate
Cloud Native Platform Engineer
Certified Backstage Associate
Linux Foundation Certified SysAdmin
Elected via community governance vote — contributing to AI/ML model distribution, Helm charts, and dragonfly-injector
Active contributor across Volcano, Dragonfly, KEDA, Kubernetes, OpenTelemetry, and more
Contributing to Academy Software Foundation projects — OpenColorIO, OpenCue, OpenImageIO, RAWtoACES, xSTUDIO
Recognized by Oracle for strong technical expertise and community contribution in cloud infrastructure and Kubernetes
Active contributor to CNCF and ASWF projects, with 26+ PRs across 15+ repos
Cloud-native batch scheduling for AI/HPC
P2P file distribution & image acceleration
Production-grade container orchestration
Distributed transactional key-value database
Kubernetes event-driven autoscaling
Observability framework
Bare metal host provisioning for K8s
K8s-native packaging & resource management
Color management library
Cloud rendering management
Image processing library
RAW to ACES conversion
Playback & review application
13+ peer-reviewed research papers on Cloud-Native, Kubernetes, AI/ML Operations, and Platform Engineering
Open source tools for GPU autoscaling, observability, and topology-aware infrastructure
Independent repository developing an event-driven GPU autoscaler using KEDA’s External gRPC Scaler interface. Native NVML metrics, DaemonSet deployment, pre-built scaling profiles for vLLM, Triton, and training workloads. Not yet merged into the KEDA core repository.
Referenced in KEDA #7538
OpenTelemetry Collector receiver for NVIDIA GPU metrics. GPU utilization, memory, temperature via NVML. Standard OTLP export with built-in Prometheus exporter.
Kubernetes knowledge graph & automated root-cause analysis. Real-time resource topology, graph-based incident investigation, AlertManager webhook integration.
Kubernetes-native autoscaler for AI inference workloads. Custom scaling algorithms, GPU-focused policies, latency SLA enforcement, Prometheus metrics.
Comprehensive Kubernetes certification study guides covering all CNCF certifications. Interactive quizzes, flashcards, lab exercises, and PDF generation.
Industry publications, foundation blogs, and personal technical writing
How to restore the Golden Path for ML engineers by pushing GPU scaling complexity down the stack — edge-native NVML telemetry, KEDA External Scaler architecture, and eliminating the Prometheus latency trap.
The models are ready but the pipes aren’t — how CI/CD pipelines, GPU scheduling, model distribution, and governance are killing enterprise AI deployments.
Implementing zero-trust security on Oracle Kubernetes Engine with Terraform — IAM policies, network security groups, workload identity, and confidential computing.
Formal verification of ArgoCD manifests — resource invariants, temporal logic, and rollback safety for mission-critical deployments.
Dynamic resource allocation, in-place vertical scaling, and immutability improvements in Kubernetes v1.35 for AI/ML workloads and FinOps.
How AI agents, eBPF, and LLMs are transforming SRE from reactive incident management to autonomous self-healing infrastructure.
Why internal developer platforms need to prioritize the Java ecosystem — bridging enterprise reality with platform engineering ideals.
How ArgoCD v3 evolves from a sync tool to the backbone of modern platform engineering — multi-tenancy, scalability, and GitOps at enterprise scale.
Journey from Java developer to earning all CNCF certifications — what it takes to unlearn and re-learn in the cloud-native world.
How Dragonfly’s P2P architecture accelerates large AI model downloads — HuggingFace and ModelScope integration.
Challenges of scaling agentic AI in telecom infrastructure — cost implications and architectural considerations for autonomous networks.
Conference presentations on cloud-native infrastructure, GitOps, and HPC
Pavan Madduri, W.W. Grainger
Applying ArgoCD, Kubernetes, and GitOps workflows to HPC environments — bridging the gap between cloud-native DevOps and scientific computing.
Pavan Madduri, W.W. Grainger
CI/CD pipelines, testing strategies, and automation for scientific and research software development — making open source science reproducible and maintainable.
Pavan Madduri (Grainger), Rohit Dhawan (Amazon), Alina Astapovich (Storytel), Goutham Rao (NeuBird) · Moderated by Renato Losio (InfoQ)
How AI agents and generative models are being used for incident detection, root cause analysis, and automated remediation — reducing MTTR and operational load at scale.
Providing expert technical commentary on AI infrastructure, data center architecture, and cloud-native operations for industry publications including VKTR, ReadWrite, and Techopedia.
Providing architectural feedback and early platform contributions for enterprise AI agents (e.g., Future AGI). Coordinating technical documentation and letters of support with key open-source project maintainers across the CNCF and ASWF.
Serving as a technical peer reviewer and judge for international IEEE conferences and journals
Peer reviewing submissions on IoT architectures, edge computing, and distributed systems for one of IEEE’s highest-impact journals.
Reviewing research papers on AI systems, electrical engineering, and their intersection with cloud-native infrastructure.
Evaluating papers on cloud computing architectures, container orchestration, and scalable infrastructure design.
Reviewing research on network communications, distributed systems, and telecom infrastructure.
Evaluating submissions on computer networking, cloud infrastructure, and distributed computing systems.
Always open to connecting with fellow engineers in the cloud-native and AI/ML space