Pavan Madduri - Cloud-Native & AI Infrastructure Engineer

About Me

Senior Cloud Platform Engineer at W.W. Grainger, Inc. with deep expertise in cloud-native GPU/AI infrastructure, Kubernetes ecosystems, and platform engineering. I build open-source tools for GPU workload autoscaling, observability, and topology-aware incident response.

Actively contributing to CNCF projects with 31+ merged PRs across 17+ projects. Published researcher with 21+ articles and papers on AI/ML infrastructure, Kubernetes, and platform engineering.

Recognized as a CNCF Golden Kubestronaut, holding all five core Kubernetes certifications plus the additional Golden-tier certifications listed under Achievements. Community member of the Dragonfly project and active contributor to Volcano, KEDA, OpenTelemetry, and more.

31+ Open Source PRs

17+ Projects Contributed

21+ Publications

16 CNCF Certifications

Open Source Journey

2026 — Present

GPU NUMA Topology & AI Infrastructure

Volcano GPU NUMA-aware scheduler (3-repo PR), KEDA GPU Scaler, Kube Topology Agent, Dragonfly Community Member, IEEE peer reviewer, HPSF & InfoQ speaker

Volcano · KEDA · Dragonfly · IEEE · HPSF

2025

Cloud-Native Observability & Platform Engineering

OTel GPU Receiver, OpenTelemetry docs contributions, Kubernetes website docs PRs, published 6 peer-reviewed papers on AI/ML infrastructure

OpenTelemetry · Kubernetes · Research

2024

GPU/AI Infrastructure Contributions

OpenColorIO release signing & Vulkan tests, OpenCue subscription recalculation, OpenImageIO bug fix, RAWtoACES docs, xSTUDIO links fix

OpenColorIO · OpenCue · OpenImageIO · RAWtoACES · xSTUDIO

2023

Golden Kubestronaut & Certification Journey

Achieved all 16 CNCF certifications including CKS, CKA, CKAD, KCNA, KCSA plus 11 Golden-tier certs. Published first peer-reviewed papers on Kubernetes and zero-trust infrastructure

CNCF · Kubestronaut · Certifications

Achievements

Kubestronaut Core Certifications

CKS

Certified Kubernetes Security Specialist

CKA

Certified Kubernetes Administrator

CKAD

Certified Kubernetes Application Developer

KCNA

Kubernetes & Cloud Native Associate

KCSA

Kubernetes & Cloud Native Security Associate

Golden Kubestronaut Certifications

PCA

Prometheus Certified Associate

CGOA

Certified GitOps Associate

CCA

Certified Cilium Associate

CAPA

Certified Argo Project Associate

ICA

Istio Certified Associate

KCA

Kyverno Certified Associate

OTCA

OpenTelemetry Certified Associate

CNPA

Cloud Native Platform Associate

CNPE

Cloud Native Platform Engineer

CBA

Certified Backstage Associate

LFCS

Linux Foundation Certified SysAdmin

Community Recognition

Dragonfly Community Member

Elected via community governance vote — contributing to AI/ML model distribution, Helm charts, and dragonfly-injector

CNCF Contributor

Active contributor across Volcano, Dragonfly, KEDA, Kubernetes, OpenTelemetry, and more

HAMi Contributor

Contributing to HAMi (Heterogeneous AI Computing Virtualization Middleware) — GPU sharing and virtualization for Kubernetes

Oracle ACE Associate

Recognized by Oracle for strong technical expertise and community contribution in cloud infrastructure and Kubernetes

Original Open Source Contributions

Active contributor to CNCF projects: 31+ PRs across 17+ projects

CNCF (Cloud Native Computing Foundation)

Volcano

Cloud-native batch scheduling for AI/HPC

#5095 GPU NUMA topology awareness in scheduler
apis#229 GPUInfo type to NumatopoSpec CRD
resource-exporter#12 GPU NUMA topology discovery via sysfs

Dragonfly

P2P file distribution & image acceleration

client#1665 Hugging Face backend with hf:// protocol
client#1673 ModelScope backend with modelscope:// protocol
d7y.io#386 hf:// protocol documentation
d7y.io#398 P2P AI model downloads blog post
helm-charts#455 Injector support to helm chart
helm-charts#480 Replace deprecated MySQL chart

Community Member

Kubernetes

Production-grade container orchestration

#53891 Document deployment.kubernetes.io/* annotations
#53892 kubectl apply view-last-applied docs

TiKV

Distributed transactional key-value database

#19225 Add AGENTS.md for AI agent guidance

KEDA

Kubernetes event-driven autoscaling

keda-docs#1658 Remove deprecated metricName from docs
keda-docs#1769 Fix datadog scaler typos across all versions
#7538 GPU/AI inference scaler architectural analysis

OpenTelemetry

Observability framework

#8632 Add .NET troubleshooting page

Metal³

Bare metal host provisioning for K8s

#624 Fix redirect links in tryit.md

kpt

K8s-native packaging & resource management

#4278 Fix kpt fn doc for KRM functions

HAMi

Heterogeneous AI Computing Virtualization Middleware

#1893 Add unit tests for nvinternal info, mig, and watch packages

traceAI

Open-source LLM observability SDK

#165 Fix exporter shutdown and thread safety in Python SDK
#166 Add Go SDK with OpenAI instrumentor

Scholarly Articles & Publications

21+ published articles and research papers on Cloud-Native, Kubernetes, AI/ML Operations, and Platform Engineering

Google Scholar · ResearchGate

AI & Agentic Systems

1. AI Security: Preemptive Cybersecurity — Using AI Agents for Proactive Threat Hunting in Cloud-Native Environments (ResearchGate)

2. Agentic AI Introduction: Model Context Protocol (MCP) — Bridging LLMs and Real-Time Kubernetes Observability (ResearchGate)

3. Scale & LLM-Ops: Architecting LLM-as-a-Service — Infrastructure for High-Concurrency Agentic Workloads (ResearchGate)

SRE & Self-Healing Infrastructure

4. Agentic SRE Teams: Human-Agent Collaboration — A New Operational Model for Autonomous Incident Response (ResearchGate)

5. Autonomous Remediation: Reinforcement Learning for Self-Healing Infrastructure and Human-Agent Collaboration (ResearchGate)

6. From PagerDuty to ‘Agentic Ops’: The Rise of Self-Healing Kubernetes (ResearchGate)

Platform Engineering & GitOps

7. Platform Engineering Foundations: The IDP — Reducing Cognitive Load for Java Developers (ResearchGate)

8. GitOps & Stability: Formal Verification of ArgoCD Manifests — Preventing Deployment Drift (ResearchGate)

9. Beyond Basic Sync: Why ArgoCD v3 is the Backbone of Modern Platform Engineering (ResearchGate)

Kubernetes & Cloud Infrastructure

10. The Efficiency Era: How Kubernetes v1.35 Finally Solves the “Restart” Headache (ResearchGate)

11. FinOps: Predictive Autoscaling Using Time-Series Analysis to Reduce Cloud Waste in EKS Clusters (ResearchGate)

12. Zero-Trust Infrastructure: Automated Identity Governance in Kubernetes — Framework for Zero-Trust Microservices (ResearchGate)

13. Multi-Cluster Orchestration: Cross-Cluster Service Meshes in High-Traffic Retail Environments (ResearchGate)

Featured Projects

Open source tools for GPU autoscaling, observability, and topology-aware infrastructure

KEDA GPU Scaler (Independent Repository)

Independent repository developing an event-driven GPU autoscaler using KEDA’s External gRPC Scaler interface. Native NVML metrics, DaemonSet deployment, pre-built scaling profiles for vLLM, Triton, and training workloads. Not yet merged into the KEDA core repository.

Go · gRPC · NVML · Kubernetes · Helm

Referenced in KEDA #7538
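
The scaling-profile idea can be illustrated with a minimal Python sketch. It assumes each profile carries a target average GPU utilization per replica and applies the average-value rule that the Kubernetes HPA uses for external metrics (replicas = ceil(total metric / target), clamped to the profile's bounds). The `ScalingProfile` class, the profile names, and every threshold below are hypothetical and are not taken from the keda-gpu-scaler repository, which is written in Go.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingProfile:
    """Hypothetical profile: a per-replica utilization target plus bounds."""
    name: str
    target_utilization: float  # percent GPU utilization per replica
    min_replicas: int
    max_replicas: int


# Illustrative values only; these are assumptions, not the project's defaults.
PROFILES = {
    "vllm": ScalingProfile("vllm", 70.0, 1, 8),
    "triton": ScalingProfile("triton", 60.0, 1, 16),
    "training": ScalingProfile("training", 90.0, 0, 4),
}


def desired_replicas(profile, gpu_utilizations):
    """Return the replica count that brings average utilization to the target.

    gpu_utilizations holds one utilization reading (percent) per running
    replica. With no readings, fall back to the profile's minimum.
    """
    if not gpu_utilizations:
        return profile.min_replicas
    total = sum(gpu_utilizations)
    wanted = math.ceil(total / profile.target_utilization)
    # Clamp to the profile's bounds so a noisy reading cannot over-scale.
    return max(profile.min_replicas, min(profile.max_replicas, wanted))
```

For example, two vLLM replicas at 95% and 90% utilization sum to 185, and 185 / 70 rounds up to three replicas. A real scaler would read these values from NVML on each node rather than take them as a list.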

OpenTelemetry GPU Receiver

OpenTelemetry Collector receiver for NVIDIA GPU metrics. GPU utilization, memory, temperature via NVML. Standard OTLP export with built-in Prometheus exporter.

Go · OpenTelemetry · NVML · Prometheus

Kube Topology Agent

Kubernetes knowledge graph & automated root-cause analysis. Real-time resource topology, graph-based incident investigation, AlertManager webhook integration.

Go · Kubernetes API · Knowledge Graph · Helm

KubeAI Autoscaler

Kubernetes-native autoscaler for AI inference workloads. Custom scaling algorithms, GPU-focused policies, latency SLA enforcement, Prometheus metrics.

Go · Kubernetes · CRD · Helm

Golden Kubestronaut Learning

Comprehensive Kubernetes certification study guides covering all CNCF certifications. Interactive quizzes, flashcards, lab exercises, and PDF generation.

MkDocs · Python · Kubernetes · Education

Ingress2Gateway

Convert Kubernetes Ingress resources to Gateway API. Supports ALB, GCE, Nginx annotations with automated migration and validation.

Python · Kubernetes · Gateway API · Helm
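
As a rough illustration of the core Ingress-to-Gateway mapping, the sketch below turns each rule of an Ingress manifest (as a plain dict) into a Gateway API HTTPRoute. The helper name `ingress_to_httproutes`, the route naming scheme, and the fallback for `ImplementationSpecific` paths are assumptions made for this example and do not describe the project's actual code. Controller annotations (ALB, GCE, Nginx), TLS, and named service ports are not handled here.

```python
# Ingress pathType -> HTTPRoute path match type.
# "ImplementationSpecific" has no direct equivalent; falling back to
# PathPrefix is an assumption made for this sketch.
PATH_TYPES = {"Prefix": "PathPrefix", "Exact": "Exact"}


def ingress_to_httproutes(ingress, gateway_name):
    """Build one HTTPRoute dict per Ingress rule (hypothetical helper)."""
    meta = ingress["metadata"]
    routes = []
    for index, rule in enumerate(ingress["spec"].get("rules", [])):
        route_rules = []
        for path in rule.get("http", {}).get("paths", []):
            service = path["backend"]["service"]
            route_rules.append({
                "matches": [{
                    "path": {
                        "type": PATH_TYPES.get(path.get("pathType"), "PathPrefix"),
                        "value": path.get("path", "/"),
                    }
                }],
                # Only numeric service ports are handled in this sketch.
                "backendRefs": [{
                    "name": service["name"],
                    "port": service["port"]["number"],
                }],
            })
        spec = {"parentRefs": [{"name": gateway_name}], "rules": route_rules}
        if rule.get("host"):
            # Keep one host per route so paths never leak across hosts.
            spec["hostnames"] = [rule["host"]]
        routes.append({
            "apiVersion": "gateway.networking.k8s.io/v1",
            "kind": "HTTPRoute",
            "metadata": {
                "name": f"{meta['name']}-{index}",
                "namespace": meta.get("namespace", "default"),
            },
            "spec": spec,
        })
    return routes
```

Emitting one route per Ingress rule is a deliberate choice in this sketch: merging several hosts into a single HTTPRoute would apply every path to every hostname, which changes the routing semantics.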

Technical Expertise

Container Orchestration & GitOps

Kubernetes · ArgoCD · Docker · Crossplane · Helm · Flux

Cloud Platforms

AWS · Azure · EKS · EC2 · S3 · IAM

Observability

Prometheus · Grafana · OpenTelemetry · Splunk · Datadog

Policy & Security

Kyverno · OPA · Zero-Trust · RBAC · Network Policies

CI/CD

GitHub Actions · Jenkins · Flux · UrbanCode Deploy

Languages & Tools

Go · Python · Rust · Terraform · gRPC · Bash

GPU / AI Infrastructure

NVIDIA NVML · CUDA · vLLM · Triton · KEDA · Volcano

Big Data

PrestoDB · Trino · Apache Superset · Alluxio · Jupyter

Technical Writing

Industry publications, foundation blogs, and personal technical writing

Industry Writing

VKTR

June 23, 2026

The Blast Radius of Agentic Ops: Why Autonomous AI Needs Zero-Trust Guardrails

Why autonomous AI agents need zero-trust guardrails at the infrastructure layer — moving beyond RBAC to real-time action evaluation with policy-as-code.

Cloud Native Now

June 9, 2026

Stop Wasting GPU Budget: Autoscaling AI Inference on Kubernetes with KEDA

How to eliminate GPU budget waste with KEDA external scalers — native NVML metrics, DaemonSet architecture, and scaling profiles for vLLM, Triton, and training workloads.

Cloud Native Now

May 22, 2026

Shattering the Kubernetes Registry Bottleneck: Scaling Enterprise CI/CD with P2P Mesh Architecture

How P2P mesh architecture eliminates registry bottlenecks in enterprise CI/CD — Dragonfly's distributed caching, bandwidth optimization, and multi-datacenter image distribution at scale.

Cloud Native Now

May 15, 2026

The Inference Bottleneck: Architecting Kubernetes Autoscaling for Production LLMs

Why standard HPA fails for LLM inference — token-aware autoscaling, KV cache pressure, GPU memory headroom, and building KEDA scalers for production serving.

VKTR

May 15, 2026

Agentic AIOps: Building the Guardrails for Autonomous Infrastructure

Why autonomous infrastructure needs execution guardrails — policy-as-code with Kyverno, blast-radius containment, and building trust boundaries for AI agents in production.

Cloud Native Now

May 13, 2026

Architecting Enterprise GitOps: Scaling Argo CD on OKE

Multi-cluster Argo CD on Oracle Kubernetes Engine — ApplicationSets, cluster secrets, RBAC delegation, and progressive delivery patterns for enterprise GitOps.

Cloud Native Now

May 13, 2026

Deploying Docker AI Agents on OCI and OKE

Running Docker-based AI agents on Oracle Cloud — Docker Model Runner, GPU shapes, container orchestration, and agent sandboxing on OKE.

Platform Engineering

May 1, 2026

Abstracting AI Infrastructure: Native GPU Scaling for Internal Developer Platforms

How to restore the Golden Path for ML engineers by pushing GPU scaling complexity down the stack — edge-native NVML telemetry, KEDA External Scaler architecture, and eliminating the Prometheus latency trap.

VKTR

April 28, 2026

Why Enterprise AI Fails: The 4 Infrastructure Bottlenecks Nobody Wants to Talk About

The models are ready but the pipes aren’t — how CI/CD pipelines, GPU scheduling, model distribution, and governance are killing enterprise AI deployments.

Cloud Native Now

March 26, 2026

Zero-Trust on OKE: How to Actually Secure Your Clusters With Terraform

Implementing zero-trust security on Oracle Kubernetes Engine with Terraform — IAM policies, network security groups, workload identity, and confidential computing.

Cloud Native Now

March 10, 2026

Beyond the Green Checkmark: Using Formal Verification to Stop ArgoCD Drift

Formal verification of ArgoCD manifests — resource invariants, temporal logic, and rollback safety for mission-critical deployments.

Cloud Native Now

March 3, 2026

The Efficiency Era: How Kubernetes v1.35 Finally Solves the “Restart” Headache

Dynamic resource allocation, in-place vertical scaling, and immutability improvements in Kubernetes v1.35 for AI/ML workloads and FinOps.

Cloud Native Now

February 27, 2026

From PagerDuty to ‘Agentic Ops’: The Rise of Self-Healing Kubernetes

How AI agents, eBPF, and LLMs are transforming SRE from reactive incident management to autonomous self-healing infrastructure.

Platform Engineering

April 2, 2026

The IDP Paradox: Why Your Internal Developer Platform Needs a “Java-First” Strategy

Why internal developer platforms need to prioritize the Java ecosystem — bridging enterprise reality with platform engineering ideals.

Platform Engineering

February 27, 2026

Beyond Basic Sync: Why ArgoCD v3 is the Backbone of Modern Platform Engineering

How ArgoCD v3 evolves from a sync tool to the backbone of modern platform engineering — multi-tenancy, scalability, and GitOps at enterprise scale.

Academic & Foundation Writing

CNCF Blog

April 20, 2026

From public static void main to Golden Kubestronaut: The Art of Unlearning

Journey from Java developer to earning all CNCF certifications — what it takes to unlearn and re-learn in the cloud-native world.

CNCF Blog

April 6, 2026

Peer-to-Peer Acceleration for AI Model Distribution with Dragonfly

How Dragonfly’s P2P architecture accelerates large AI model downloads, with Hugging Face and ModelScope integration.

IEEE ComSoc

March 30, 2026

The Financial Trap of Autonomous Networks: Scaling Agentic AI in the Telecom Core

Challenges of scaling agentic AI in telecom infrastructure — cost implications and architectural considerations for autonomous networks.

Personal Blogs

WordPress

DevOps best practices & K8s deep-dives

Medium

Technical tutorials & architecture guides

Dev.to

Developer-focused cloud-native content

Conferences & Speaking

Conference presentations on cloud-native infrastructure, GitOps, and HPC

HPSF Conference 2026 · Productivity, Performance & the HPC Pipeline

GitOps for HPC: Bringing Cloud-Native DevOps Practices to High Performance Computing Environments

Pavan Madduri, W.W. Grainger

Applying ArgoCD, Kubernetes, and GitOps workflows to HPC environments — bridging the gap between cloud-native DevOps and scientific computing.

Chicago River Ballroom A-D · Intermediate

Schedule · Slides · Video

HPSF Conference 2026 · Building & Sustaining Community

DevOps for Scientific Software: Tools, Practices, and Automation Strategies

Pavan Madduri, W.W. Grainger

CI/CD pipelines, testing strategies, and automation for scientific and research software development — making open source science reproducible and maintainable.

Chicago River Ballroom A-D · Beginner

Schedule · Slides · Video

InfoQ Live · April 21, 2026 · Roundtable Panel

AI-Powered SRE for Autonomous Incident Response

Pavan Madduri (Grainger), Rohit Dhawan (Amazon), Alina Astapovich (Storytel), Goutham Rao (NeuBird) · Moderated by Renato Losio (InfoQ)

How AI agents and generative models are being used for incident detection, root cause analysis, and automated remediation — reducing MTTR and operational load at scale.

Online Panel Discussion

Event Page

Media Mentions & Expert Commentary

Quoted as a subject-matter expert across 11+ publications on enterprise AI, GPU infrastructure, cloud security, and platform engineering

AI Business · Informa PLC

OpenAI vs. Anthropic vs. Google: But the Model Isn’t the Point

“The real dependency risk comes from the orchestration, workflow and data integration layers built around them… Relying on third-party orchestration is where real lock-ins happen.”

VKTR · Simpler Media Group

Enterprise AI Costs Climb as GPU Demand Outpaces Supply

“The architecture that works is a routing layer: simple tasks go to a lightweight SLM, complex reasoning escalates to the frontier model. You stop paying frontier prices for envelope-delivery workloads.”

Techopedia · 10M+ monthly visitors

AI Experts Call for a Reality Check on Allbirds’ Pivot

“GPU capacity is genuinely hard to get right now… You can’t buy that institutional knowledge with a convertible note and a rebrand.”

Reworked · Simpler Media Group

AI Agents and the Process Documentation Fallacy

“If an AI agent is trained purely by observing the official workflow in the ticketing platform, it’s learning a fantasy… You have to fence the AI in.”

InfoSec Relations · Cybersecurity

Agentic AI is Exposing the Accountability Gap in Cloud Security

“We enforce this with Policy-as-Code at the admission layer, so the agent’s available responses are constrained by the infrastructure itself, not by a governance doc that someone wrote once and nobody checks.”

Tech Round UK · International

Meta Acquires Moltbook: What Responsibility Do Regulators Have?

“We are building autonomous agents without implementing Zero Trust security… Regulators must urgently pivot to regulating Agentic Privileges.”

TLDR Newsletter · 3M+ subscribers

Featured Mention

CNCF GPU autoscaling blog featured to 3M+ subscribers — one of the largest daily tech newsletters globally.

Habr (VKTech) · 10M+ visitors

GPU Auto-Scaling on Kubernetes with KEDA

Russian-language adaptation of the CNCF blog post by VKTech (VK/Mail.ru Group), with 4,500+ views in the first 13 hours. International reach beyond English-speaking audiences.

Cloud Native Now · Techstrong Group

Stop Wasting GPU Budget: Autoscaling AI Inference with KEDA

Primary author — GPU autoscaling architecture, keda-gpu-scaler, and scale-to-zero for AI inference on Kubernetes.

Y Square Technology · Tech Analysis

AI Agent Documentation Reality Gap

Quoted on enterprise AI agent deployment challenges and the gap between documented processes and operational reality.

CNCF Official Recognition

CNCF LinkedIn · 500K+ followers

GPU Autoscaling with KEDA

“Pavan Madduri breaks down how to build a KEDA external scaler via a DaemonSet to query NVML over gRPC directly — cutting metric latency from 15–30s to 2–4s.”
204+ likes · 28 reposts · 3 comments

CNCF Twitter/X · @CloudNativeFdn

GPU Autoscaling with KEDA

“See how to build a KEDA external scaler via a DaemonSet to query NVML over gRPC directly, with scaling profiles for vLLM, Triton, and training workloads.”
2,122 views · 24 likes · 7 bookmarks

CNCF Bluesky · cncf.io

GPU Autoscaling with KEDA

Featured across all 3 CNCF social platforms — LinkedIn, Twitter/X, and Bluesky.

CNCF LinkedIn · 500K+ followers

Golden Kubestronaut Journey

“From public static void main to Golden Kubestronaut: The Art of Unlearning — Pavan Madduri shares his journey through all five Kubernetes certifications.”
26+ likes · 1 repost

Industry Advisory & Collaboration

Providing architectural feedback and early platform contributions for enterprise AI agents (e.g., Future AGI). Coordinating technical documentation and letters of support with key open-source project maintainers across CNCF projects.

Judging & Peer Review

Serving as a technical peer reviewer and judge for international IEEE conferences and journals

IEEE · Technical Peer Reviewer & Judge

IEEE Internet of Things Journal (IoT)

Peer reviewing submissions on IoT architectures, edge computing, and distributed systems for one of IEEE’s highest-impact journals.

IEEE AIEEE 2026 · Technical Peer Reviewer & Judge

IEEE International Conference on AI & Electrical Engineering (AIEEE 2026)

Reviewing research papers on AI systems, electrical engineering, and their intersection with cloud-native infrastructure.

IEEE CloudCOM 2026 · Technical Peer Reviewer & Judge

IEEE International Conference on Cloud Computing Technology & Science (CloudCOM 2026)

Evaluating papers on cloud computing architectures, container orchestration, and scalable infrastructure design.

IEEE COMM 2026 · Technical Peer Reviewer & Judge

IEEE International Conference on Communications (COMM 2026)

Reviewing research on network communications, distributed systems, and telecom infrastructure.

IEEE ICCCN 2026 · Technical Peer Reviewer & Judge

IEEE International Conference on Computer Communications & Networks (ICCCN 2026)

Evaluating submissions on computer networking, cloud infrastructure, and distributed computing systems.

Let's Connect

Always open to connecting with fellow engineers in the cloud-native and AI/ML space

LinkedIn · GitHub · Google Scholar · ResearchGate · Medium · Dev.to · Email