Pavan Madduri (pmady)

Senior Cloud Platform Engineer building GPU/AI infrastructure at scale.
CNCF Golden Kubestronaut. Oracle ACE Associate. Dragonfly Community Member.
31+ PRs across 17 open-source projects in CNCF, ASWF, and beyond.
If GPUs need scheduling, scaling, or observability on Kubernetes — that's what I build.

⚡ What I'm Building


🎮 GPU Autoscaling	KEDA External Scaler with native NVML metrics, DaemonSet architecture, scaling profiles for vLLM, Triton, and training workloads. Referenced in KEDA #7538 and published on CNCF Blog.
🔬 GPU NUMA Topology	Volcano scheduler plugin for NUMA-aware GPU placement — topology discovery via sysfs, CRD extensions, and cross-socket affinity optimization.
📡 GPU Observability	OpenTelemetry Collector receiver for GPU metrics (NVML-native) and Docker Desktop Extension for real-time GPU monitoring dashboards.
🧠 Topology-Aware AIOps	Knowledge graph of Kubernetes resources with graph-based root-cause traversal, AlertManager webhook integration, and blast-radius analysis.
☁️ Platform Engineering	Kubernetes, ArgoCD, Crossplane, Docker, KEDA — production platforms serving enterprise workloads at scale.
📝 Technical Writing	19 published articles across CNCF Blog, IEEE ComSoc, Platform Engineering, VKTR, Cloud Native Now, and Medium.

🏆 Certifications & Recognition

Golden Kubestronaut: includes all five Kubestronaut certifications (KCNA, CKA, CKAD, CKS, KCSA)

🚀 Featured Projects

🎮 KEDA GPU Scaler

KEDA External gRPC Scaler for GPU/AI workloads

🎮 Native NVML — Direct GPU metrics via go-nvml
🚀 Scaling Profiles — vLLM, Triton, training presets
📦 DaemonSet — Per-node GPU metric collection
🔄 Scale-to-Zero — GPU-aware idle detection
📈 Prometheus — Optional /metrics endpoint

Tech: Go · gRPC · NVIDIA NVML · Kubernetes · Helm

Referenced in KEDA #7538 | CNCF Blog

📡 OpenTelemetry GPU Receiver

OpenTelemetry Collector receiver for GPU metrics

🔋 NVIDIA NVML — GPU utilization, memory, temperature
📊 OTel Native — Standard OTLP export pipeline
🖥️ Multi-GPU — All devices on the node
📈 Prometheus — Built-in Prometheus exporter

Tech: Go · OpenTelemetry Collector SDK · NVML

🐳 Docker GPU Dashboard Extension

Real-time NVIDIA GPU metrics in Docker Desktop

📊 Live Dashboard — Utilization, memory, temperature, power
📈 History Charts — 2-minute rolling Recharts graphs
🚦 Alert Thresholds — Color-coded green/yellow/red
🎭 Mock Mode — Develop without GPU hardware

Tech: Go · React · Recharts · Docker Extension SDK · NVML

🧠 Kube Topology Agent

K8s knowledge graph & automated root-cause analysis

🗺️ Knowledge Graph — Real-time resource topology
🔍 Root-Cause Traversal — Graph-based incident investigation
🎮 GPU Aware — Training/inference/batch classification
🔔 AlertManager — Webhook integration for auto-investigation

Tech: Go · Kubernetes API · Gorilla Mux · Helm
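As an illustration of graph-based root-cause traversal, the sketch below walks a dependency graph breadth-first from the alerting resource and reports unhealthy resources that have no unhealthy dependency of their own. It assumes a simple in-memory model: the `Graph` type, its methods, and the resource names are hypothetical and are not taken from the Kube Topology Agent codebase.

```go
package main

import "fmt"

// Graph is a hypothetical in-memory topology: each resource maps to the
// resources it depends on, plus a set of resources currently unhealthy.
type Graph struct {
	deps      map[string][]string
	unhealthy map[string]bool
}

// NewGraph returns an empty graph.
func NewGraph() *Graph {
	return &Graph{deps: map[string][]string{}, unhealthy: map[string]bool{}}
}

// AddDependency records that resource "from" depends on resource "to".
func (g *Graph) AddDependency(from, to string) {
	g.deps[from] = append(g.deps[from], to)
}

// MarkUnhealthy flags a resource as failing.
func (g *Graph) MarkUnhealthy(id string) { g.unhealthy[id] = true }

// RootCauses walks dependencies breadth-first from the alerting resource and
// returns unhealthy resources that have no unhealthy direct dependency.
func (g *Graph) RootCauses(start string) []string {
	visited := map[string]bool{start: true}
	queue := []string{start}
	var causes []string
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		hasUnhealthyDep := false
		for _, d := range g.deps[cur] {
			if g.unhealthy[d] {
				hasUnhealthyDep = true
			}
			if !visited[d] {
				visited[d] = true
				queue = append(queue, d)
			}
		}
		if g.unhealthy[cur] && !hasUnhealthyDep {
			causes = append(causes, cur)
		}
	}
	return causes
}

func main() {
	g := NewGraph()
	// Illustrative edges only; a real topology would come from the Kubernetes API.
	g.AddDependency("deployment/vllm", "pod/vllm-0")
	g.AddDependency("pod/vllm-0", "node/gpu-node-1")
	g.AddDependency("pod/vllm-0", "configmap/vllm-config")
	g.MarkUnhealthy("deployment/vllm")
	g.MarkUnhealthy("pod/vllm-0")
	g.MarkUnhealthy("node/gpu-node-1")
	fmt.Println(g.RootCauses("deployment/vllm")) // prints [node/gpu-node-1]
}
```

Running the same traversal over reversed edges, starting from a failing node, would give a rough blast-radius estimate under the same assumptions.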

More projects: KubeAI Autoscaler · Ingress2Gateway · Golden Kubestronaut Learning · LLMOps

🌱 Open Source Contributions

31+ PRs across 17 projects in CNCF, ASWF, and open-source foundations.

CNCF (Cloud Native Computing Foundation)

Project	Description	Contributions
Dragonfly	P2P-based file distribution and image acceleration	client#1665 - Add Hugging Face backend support with hf:// protocol, client#1673 - Add ModelScope backend support with modelscope:// protocol, d7y.io#386 - Add hf:// protocol documentation, d7y.io#398 - Add P2P-accelerated AI model downloads blog post, helm-charts#455 - Add injector support to helm chart, helm-charts#480 - Replace deprecated bitnamilegacy/mysql with bitnami/mysql
Kubernetes	Production-Grade Container Orchestration	#53891 - Document deployment.kubernetes.io/* annotations, #53892 - Add kubectl apply view-last-applied documentation
TiKV	Distributed transactional key-value database	#19225 - Add AGENTS.md for AI agent guidance
Volcano	Cloud-native batch scheduling for AI/HPC	#5095 - GPU NUMA topology awareness in scheduler, apis#229 - Add GPUInfo type to NumatopoSpec CRD, resource-exporter#12 - GPU NUMA topology discovery via sysfs
HAMi	Heterogeneous AI Computing Virtualization Middleware	#1893 - Add unit tests for nvinternal info, mig, and watch packages
KEDA	Kubernetes Event-driven Autoscaling	keda-docs#1658 - Removing metricName from the kedadocs, keda-docs#1769 - Fix datadog scaler typos across all versions, #7538 - GPU/AI inference scaler architectural analysis
Metal³	Bare metal host provisioning for Kubernetes	#624 - Fix redirect links in tryit.md
OpenTelemetry	Observability framework	#8632 - Add .NET troubleshooting page
kpt	Kubernetes-native packaging and resource management	#4278 - Fix kpt fn doc command for KRM functions expecting input
traceAI	Open-source LLM observability SDK	#165 - Fix exporter shutdown and thread safety in Python SDK, #166 - Add Go SDK with OpenAI instrumentor

ASWF (Academy Software Foundation)

Project	Description	Contributions
OpenColorIO	Color management library	#2229 - Add release signing workflow, #2230 - Add Dependabot configuration, #2243 - Add Vulkan unit test framework
OpenCue	Cloud rendering management system	#2134 - Add scheduled subscription recalculation task
OpenImageIO	Image processing library	#4976 - Fix IBA::compare_Yee() channel access
RAWtoACES	RAW to ACES image conversion	#222 - Add build developer documentation
xSTUDIO	Playback and review application	#186 - Fix broken build guide links

📝 Publications

19 articles published across CNCF Blog, IEEE ComSoc, Platform Engineering, VKTR, Cloud Native Now, and Medium.

Title	Publication	Date
GPU Autoscaling on Kubernetes with KEDA: Building an External Scaler	CNCF Blog	May 2026
Shattering the Kubernetes Registry Bottleneck: Scaling Enterprise CI/CD with P2P Mesh Architecture	Cloud Native Now	May 2026
The Inference Bottleneck: Architecting Kubernetes Autoscaling for Production LLMs	Cloud Native Now	May 2026
Agentic AIOps: Building the Guardrails for Autonomous Infrastructure	VKTR	May 2026
Architecting Enterprise GitOps: Scaling Argo CD on OKE	Cloud Native Now	May 2026
Deploying Docker AI Agents on OCI and OKE	Cloud Native Now	May 2026
Abstracting AI Infrastructure: Native GPU Scaling for Internal Developer Platforms	Platform Engineering	May 2026
Why Enterprise AI Fails: The 4 Infrastructure Bottlenecks Nobody Wants to Talk About	VKTR	Apr 2026
From public static void main to Golden Kubestronaut: The Art of Unlearning	CNCF Blog	Apr 2026
Peer-to-Peer Acceleration for AI Model Distribution with Dragonfly	CNCF Blog	Apr 2026
The IDP Paradox: Why Your Internal Developer Platform Needs a "Java-First" Strategy	Platform Engineering	Apr 2026
The Financial Trap of Autonomous Networks: Scaling Agentic AI in the Telecom Core	IEEE ComSoc	Mar 2026
Zero-Trust on OKE: How to Actually Secure Your Clusters With Terraform	Cloud Native Now	Mar 2026
Beyond the Green Checkmark: Using Formal Verification to Stop ArgoCD Drift	Cloud Native Now	Mar 2026
The Efficiency Era: How Kubernetes v1.35 Finally Solves the "Restart" Headache	Cloud Native Now	Mar 2026
Beyond Basic Sync: Why ArgoCD v3 is the Backbone of Modern Platform Engineering	Platform Engineering	Feb 2026
From PagerDuty to 'Agentic Ops': The Rise of Self-Healing Kubernetes	Cloud Native Now	Feb 2026
I Replaced a $3/hr GPU Dev Workflow with Docker Model Runner	Medium	May 2026
GPU-Aware Autoscaling for Docker Containers	Medium	May 2026

📊 GitHub Stats

Stats updated on 2026-05-31 13:04 UTC

🤝 Let's Connect

Building GPU infrastructure for Kubernetes? Working on CNCF projects? Let's collaborate.
