Pavan Madduri

Senior Cloud Platform Engineer

Building GPU/AI infrastructure at scale · CNCF Golden Kubestronaut · Open Source · Researcher

Golden Kubestronaut · CNCF Contributor · 13+ Publications

About Me

Senior Cloud Platform Engineer at W.W. Grainger, Inc. with deep expertise in cloud-native GPU/AI infrastructure, Kubernetes ecosystems, and platform engineering. I build open-source tools for GPU workload autoscaling, observability, and topology-aware incident response.

Actively contributing to CNCF and ASWF projects with 26+ merged PRs across 15+ projects. Published researcher with 13+ peer-reviewed papers on AI/ML infrastructure, Kubernetes, and platform engineering.

Recognized as a CNCF Golden Kubestronaut, a designation for professionals holding all five Kubernetes certifications plus the additional Golden-tier certifications. Community member of the Dragonfly project and active contributor to Volcano, KEDA, OpenTelemetry, and more.

26+
Open Source PRs
15+
Projects Contributed
13+
Publications
16
CNCF Certifications

Open Source Journey

2026 — Present

GPU NUMA Topology & AI Infrastructure

Volcano GPU NUMA-aware scheduler (3-repo PR), KEDA GPU Scaler, Kube Topology Agent, Dragonfly Community Member, IEEE peer reviewer, HPSF & InfoQ speaker

Volcano · KEDA · Dragonfly · IEEE · HPSF

2025

Cloud-Native Observability & Platform Engineering

OTel GPU Receiver, OpenTelemetry docs contributions, Kubernetes website docs PRs, published 6 peer-reviewed papers on AI/ML infrastructure

OpenTelemetry · Kubernetes · Research

2024

ASWF & Foundation Contributions

OpenColorIO release signing & Vulkan tests, OpenCue subscription recalculation, OpenImageIO bug fix, RAWtoACES docs, xSTUDIO links fix

OpenColorIO · OpenCue · OpenImageIO · RAWtoACES · xSTUDIO

2023

Golden Kubestronaut & Certification Journey

Earned all 16 CNCF certifications: the five Kubestronaut core certifications (CKS, CKA, CKAD, KCNA, KCSA) plus 11 Golden-tier certifications. Published first peer-reviewed papers on Kubernetes and zero-trust infrastructure.

CNCF · Kubestronaut · Certifications

Achievements

CNCF Golden Kubestronaut

One of the elite professionals who have earned all CNCF Kubernetes and cloud-native certifications, demonstrating comprehensive expertise across the cloud-native ecosystem. The designation is highly selective and held by fewer than 400 practitioners globally.

Kubestronaut Core Certifications

CKS

Certified Kubernetes Security Specialist

CKA

Certified Kubernetes Administrator

CKAD

Certified Kubernetes Application Developer

KCNA

Kubernetes & Cloud Native Associate

KCSA

Kubernetes & Cloud Native Security Associate

Golden Kubestronaut Certifications

PCA

Prometheus Certified Associate

CGOA

Certified GitOps Associate

CCA

Certified Cilium Associate

CAPA

Certified Argo Project Associate

ICA

Istio Certified Associate

KCA

Kyverno Certified Associate

OTCA

OpenTelemetry Certified Associate

CNPA

Cloud Native Platform Associate

CNPE

Cloud Native Platform Engineer

CBA

Certified Backstage Associate

LFCS

Linux Foundation Certified SysAdmin

Community Recognition

Dragonfly Community Member

Elected via community governance vote — contributing to AI/ML model distribution, Helm charts, and dragonfly-injector

CNCF Contributor

Active contributor across Volcano, Dragonfly, KEDA, Kubernetes, OpenTelemetry, and more

ASWF Contributor

Contributing to Academy Software Foundation projects — OpenColorIO, OpenCue, OpenImageIO, RAWtoACES, xSTUDIO

Oracle ACE Associate

Recognized by Oracle for strong technical expertise and community contribution in cloud infrastructure and Kubernetes

Original Open Source Contributions

Active contributor to CNCF & ASWF foundation projects — 26+ PRs across 15+ repos

CNCF (Cloud Native Computing Foundation)

Volcano

Cloud-native batch scheduling for AI/HPC

Dragonfly

P2P file distribution & image acceleration

Community Member

Kubernetes

Production-grade container orchestration

  • #53891 Document deployment.kubernetes.io/* annotations
  • #53892 kubectl apply view-last-applied docs

TiKV

Distributed transactional key-value database

  • #19225 Add AGENTS.md for AI agent guidance

KEDA

Kubernetes event-driven autoscaling

  • keda-docs#1658 Remove deprecated metricName from docs
  • #7538 GPU/AI inference scaler architectural analysis

OpenTelemetry

Observability framework

  • #8632 Add .NET troubleshooting page

Metal³

Bare metal host provisioning for K8s

  • #624 Fix redirect links in tryit.md

kpt

K8s-native packaging & resource management

  • #4278 Fix kpt fn doc for KRM functions

ASWF (Academy Software Foundation)

OpenColorIO

Color management library

  • #2229 Release signing workflow
  • #2230 Dependabot configuration
  • #2243 Vulkan unit test framework

OpenCue

Cloud rendering management

  • #2134 Scheduled subscription recalculation

OpenImageIO

Image processing library

  • #4976 Fix IBA::compare_Yee() channel access

RAWtoACES

RAW to ACES conversion

  • #222 Build developer documentation

xSTUDIO

Playback & review application

  • #186 Fix broken build guide links

Scholarly Articles & Publications

13+ peer-reviewed research papers on Cloud-Native, Kubernetes, AI/ML Operations, and Platform Engineering

AI & Agentic Systems

1. AI Security: Preemptive Cybersecurity — Using AI Agents for Proactive Threat Hunting in Cloud-Native Environments

2. Agentic AI Introduction: Model Context Protocol (MCP) — Bridging LLMs and Real-Time Kubernetes Observability

3. Scale & LLM-Ops: Architecting LLM-as-a-Service — Infrastructure for High-Concurrency Agentic Workloads

SRE & Self-Healing Infrastructure

4. Agentic SRE Teams: Human-Agent Collaboration — A New Operational Model for Autonomous Incident Response

5. Autonomous Remediation: Reinforcement Learning for Self-Healing Infrastructure and Human-Agent Collaboration

6. From PagerDuty to ‘Agentic Ops’: The Rise of Self-Healing Kubernetes

Platform Engineering & GitOps

7. Platform Engineering Foundations: The IDP — Reducing Cognitive Load for Java Developers

8. GitOps & Stability: Formal Verification of ArgoCD Manifests — Preventing Deployment Drift

9. Beyond Basic Sync: Why ArgoCD v3 is the Backbone of Modern Platform Engineering

Kubernetes & Cloud Infrastructure

10. The Efficiency Era: How Kubernetes v1.35 Finally Solves the “Restart” Headache

11. FinOps: Predictive Autoscaling Using Time-Series Analysis to Reduce Cloud Waste in EKS Clusters

12. Zero-Trust Infrastructure: Automated Identity Governance in Kubernetes — Framework for Zero-Trust Microservices

13. Multi-Cluster Orchestration: Cross-Cluster Service Meshes in High-Traffic Retail Environments

Featured Projects

Open source tools for GPU autoscaling, observability, and topology-aware infrastructure

KEDA GPU Scaler · Independent Repository

Independent repository developing an event-driven GPU autoscaler using KEDA’s External gRPC Scaler interface. Native NVML metrics, DaemonSet deployment, pre-built scaling profiles for vLLM, Triton, and training workloads. Not yet merged into the KEDA core repository.

Go · gRPC · NVML · Kubernetes · Helm

Referenced in KEDA #7538

OpenTelemetry GPU Receiver

OpenTelemetry Collector receiver for NVIDIA GPU metrics. GPU utilization, memory, temperature via NVML. Standard OTLP export with built-in Prometheus exporter.

Go · OpenTelemetry · NVML · Prometheus

Kube Topology Agent

Kubernetes knowledge graph & automated root-cause analysis. Real-time resource topology, graph-based incident investigation, AlertManager webhook integration.

Go · Kubernetes API · Knowledge Graph · Helm

KubeAI Autoscaler

Kubernetes-native autoscaler for AI inference workloads. Custom scaling algorithms, GPU-focused policies, latency SLA enforcement, Prometheus metrics.

Go · Kubernetes · CRD · Helm

Golden Kubestronaut Learning

Comprehensive Kubernetes certification study guides covering all CNCF certifications. Interactive quizzes, flashcards, lab exercises, and PDF generation.

MkDocs · Python · Kubernetes · Education

Ingress2Gateway

Convert Kubernetes Ingress resources to Gateway API. Supports ALB, GCE, Nginx annotations with automated migration and validation.

Python · Kubernetes · Gateway API · Helm

Technical Expertise

Container Orchestration & GitOps

Kubernetes · ArgoCD · Docker · Crossplane · Helm · Flux

Cloud Platforms

AWS · Azure · EKS · EC2 · S3 · IAM

Observability

Prometheus · Grafana · OpenTelemetry · Splunk · Datadog

Policy & Security

Kyverno · OPA · Zero-Trust · RBAC · Network Policies

CI/CD

GitHub Actions · Jenkins · Flux · UrbanCode Deploy

Languages & Tools

Go · Python · Rust · Terraform · gRPC · Bash

GPU / AI Infrastructure

NVIDIA NVML · CUDA · vLLM · Triton · KEDA · Volcano

Big Data

PrestoDB · Trino · Apache Superset · Alluxio · Jupyter

Technical Writing

Industry publications, foundation blogs, and personal technical writing

Conferences & Speaking

Conference presentations on cloud-native infrastructure, GitOps, and HPC

HPSF Conference 2026 · Productivity, Performance & the HPC Pipeline

GitOps for HPC: Bringing Cloud-Native DevOps Practices to High Performance Computing Environments

Pavan Madduri, W.W. Grainger

Applying ArgoCD, Kubernetes, and GitOps workflows to HPC environments — bridging the gap between cloud-native DevOps and scientific computing.

Chicago River Ballroom A-D · Intermediate

HPSF Conference 2026 · Building & Sustaining Community

DevOps for Scientific Software: Tools, Practices, and Automation Strategies

Pavan Madduri, W.W. Grainger

CI/CD pipelines, testing strategies, and automation for scientific and research software development — making open source science reproducible and maintainable.

Chicago River Ballroom A-D · Beginner

InfoQ Live · Apr 21, 2026 · Roundtable Panel

AI-Powered SRE for Autonomous Incident Response

Pavan Madduri (Grainger), Rohit Dhawan (Amazon), Alina Astapovich (Storytel), Goutham Rao (NeuBird) · Moderated by Renato Losio (InfoQ)

How AI agents and generative models are being used for incident detection, root cause analysis, and automated remediation — reducing MTTR and operational load at scale.

Online · Panel Discussion

Expert Commentary & Industry Impact

Media & Press Commentary

Providing expert technical commentary on AI infrastructure, data center architecture, and cloud-native operations for industry publications including VKTR, ReadWrite, and Techopedia.

Industry Advisory & Collaboration

Providing architectural feedback and early platform contributions for enterprise AI agents (e.g., Future AGI). Coordinating technical documentation and letters of support with key open-source project maintainers across CNCF and ASWF foundations.

Judging & Peer Review

Serving as a technical peer reviewer and judge for international IEEE conferences and journals

IEEE · Technical Peer Reviewer & Judge

IEEE Internet of Things Journal (IoT)

Peer reviewing submissions on IoT architectures, edge computing, and distributed systems for one of IEEE’s highest-impact journals.

IEEE AIEEE 2026 · Technical Peer Reviewer & Judge

IEEE International Conference on AI & Electrical Engineering (AIEEE 2026)

Reviewing research papers on AI systems, electrical engineering, and their intersection with cloud-native infrastructure.

IEEE CloudCOM 2026 · Technical Peer Reviewer & Judge

IEEE International Conference on Cloud Computing Technology & Science (CloudCOM 2026)

Evaluating papers on cloud computing architectures, container orchestration, and scalable infrastructure design.

IEEE COMM 2026 · Technical Peer Reviewer & Judge

IEEE International Conference on Communications (COMM 2026)

Reviewing research on network communications, distributed systems, and telecom infrastructure.

IEEE ICCCN 2026 · Technical Peer Reviewer & Judge

IEEE International Conference on Computer Communications & Networks (ICCCN 2026)

Evaluating submissions on computer networking, cloud infrastructure, and distributed computing systems.

Let's Connect

Always open to connecting with fellow engineers in the cloud-native and AI/ML space