This repository is the official community hub for Azure SRE Agent. Here you'll find:
- 🐛 Report Issues — File bugs, feature requests, and feedback via GitHub Issues
- 📚 Resources — Curated links to docs, videos, blogs, and community content for Azure SRE Agent
- 🧪 Labs — Hands-on labs and sample environments to deploy, break, and fix apps with Azure SRE Agent (see the
`labs/` folder)
| Resource | Link |
|---|---|
| Product Home Page | https://www.azure.com/sreagent |
| Portal (Create & Manage Agents) | https://aka.ms/sreagent |
| Documentation | https://aka.ms/sreagent/newdocs |
| Pricing & Billing | https://aka.ms/sreagent/pricing |
| All Blogs | https://aka.ms/sreagent/blog |
| YouTube Channel | https://aka.ms/sreagent/youtube |
| GitHub — Azure SRE Agent (Report Issues, Official Labs & Resources) | https://aka.ms/sreagent/github |
| Hands-on Lab | https://aka.ms/sreagent/lab |
| GitHub — Official Plugins | https://github.com/Azure/sre-agent-plugins |
| Tech Community Discussions | https://aka.ms/sreagent/discussions |
| Agentic DevOps Live | https://aka.ms/agenticdevopslive |
| X (Twitter) | https://x.com/azuresreagent |
The official Microsoft Azure product overview — a concise explainer of what Azure SRE Agent is, how it works, and the problems it solves. 🔗 https://www.youtube.com/watch?v=6vDrThUjDOc · 6,156 views · 158 likes
Satya Nadella highlights Azure SRE Agent as a key example of AI-driven operations transforming how engineering teams manage reliability at scale. 🔗 https://www.youtube.com/watch?v=3hPeKDtLvPg · 2,548 views · 26 likes
Scott Hanselman walks through Azure SRE Agent on Azure Friday, showing how it reduces operational toil and lets teams focus on innovation. 🔗 https://www.youtube.com/watch?v=5c9pl8_DI3w · 4,264 views · 75 likes
The GA launch video demonstrating Azure SRE Agent performing root cause analysis with full code context through deep GitHub integration. 🔗 https://www.youtube.com/watch?v=1vKoxPeep_M · 582 views · 25 likes
Deep-dive Build session covering end-to-end SRE Agent capabilities: automated investigation, remediation, proactive monitoring, and custom hooks. 🔗 https://www.youtube.com/watch?v=bK3SIQoE_Nc · 12,294 views · 129 likes
- Fix It Before They Feel It: Proactive .NET Reliability with Azure SRE Agent — dotnet · 1,466 views
- Azure SRE Agent - Incident Management with PagerDuty — Azure SRE Agent (official) · 547 views
- Azure SRE Agent - Your 24/7 Automated Response Team — Mariusz Ferdyn · 313 views
- Azure's New SRE Agent Is INSANE — Here's Why you Should Pay Attention — TechTalks with Gil · 249 views
- SRE Agent Series: What Is Azure SRE Agent and How to Create One Step by Step — JBSWiki · 204 views
- Azure SRE Agent Explained — Cloud Talk with Jonnychipz · 160 views
- SRE Agent Series: I Let an Azure SRE Agent Manage My Subscription — Here's What Happened — JBSWiki · 143 views
- Agentic DevOps: Azure SRE Agent with GitHub Copilot Coding Agent demo — Jorge Balderas · new
- Event-Driven IaC Operations: Terraform Drift Detection via HTTP Triggers — Vineela Suri · 10 min read. End-to-end pipeline: Terraform Cloud webhook triggers SRE Agent to classify drift as benign/risky/critical, correlate with incidents, and ship a fix — including a "DO NOT revert" recommendation that prevents turning a mitigated incident into an outage.
- Managing Multi-Tenant Azure Resources with SRE Agent and Lighthouse — Pranab Mandal · 6 min read. Step-by-step guide to configuring Azure Lighthouse delegation so a single SRE Agent can monitor and manage resources across multiple tenants — covering ARM templates, RBAC roles, and managed identity setup.
- New in Azure SRE Agent: Log Analytics and Application Insights Connectors — Dalibor Kovacevic · 3 min read. Native MCP-backed connectors for Log Analytics and App Insights — connect a workspace, auto-grant RBAC, and the agent queries ContainerLog, Syslog, exceptions, and traces directly during investigations.
- Azure Monitor in Azure SRE Agent: Autonomous Alert Investigation and Intelligent Merging — Vineela Suri · 9 min read. Full walkthrough of Azure Monitor integration: Incident Response Plans, alert merging (7 firings → 1 thread), auto-resolve trade-offs, and a live AKS + Redis scenario where the agent fixes a bad credential autonomously.
- 3 Ways to Get More from Azure SRE Agent — dchelupati · 4 min read. Practical cost and value tips: start narrow with incident routing, replace high-frequency polling with push/batch patterns, and keep scheduled task threads fresh with "new chat thread for each run."
- How We Build and Use Azure SRE Agent with Agentic Workflows — Shamir AbdulAziz · 6 min read. Customer Zero blog: how Microsoft embedded agents across the SDLC to build SRE Agent — 35K+ incidents handled, 50K+ developer hours saved, App Service time-to-mitigation down from 40.5 hours to 3 minutes.
- An Update to the Active Flow Billing Model — Mayunk Jain · 3 min read. Active flow billing moves from time-based to token-based usage, with per-model-provider AAU rates. Always-on pricing unchanged at 4 AAUs per agent-hour.
- Announcing General Availability for the Azure SRE Agent — Mayunk Jain · 4 min read. GA announcement: 1,300+ agents deployed internally at Microsoft, 35K+ incidents mitigated, 20K+ engineering hours saved. Covers deep context, built-in computation, memory and learning, and Ecolab customer story.
- What's New in Azure SRE Agent in the GA Release — dchelupati · 2 min read. Companion to the GA announcement: redesigned onboarding, deep context, code interpreter, memory, skills, subagents, Python tools, agent hooks, and MCP connectors.
- The Agent That Investigates Itself (SRE4SRE) — Sanchit Mehta · 11 min read. Deep technical post — the SRE Agent investigating its own KV cache regression, demonstrating how the team uses the product to maintain the product.
- Azure SRE Agent Now Builds Expertise Like Your Best Engineer (Deep Context) — dchelupati · 6 min read. How the agent operates with continuous access to source code, persistent memory across investigations, and background intelligence that runs when nobody is asking questions.
- What It Takes to Give SRE Agent a Useful Starting Point (Onboarding) — Dalibor Kovacevic · 10 min read. Designing the guided onboarding flow: connecting code, logs, incidents, Azure resources, and knowledge files so a new agent becomes useful on day one.
- Agent Hooks: Production-Grade Governance for Azure SRE Agent — Vineela Suri · 9 min read. Governance primitives for controlling agent behavior: stop hooks, PostToolUse hooks, and global hooks that enforce approval gates and safety boundaries.
- An AI-Led SDLC: Building an End-to-End Agentic Software Development Lifecycle with Azure and GitHub — owaino · 16 min read. Full agentic SDLC walkthrough: Spec-Kit → GitHub Coding Agent → Code Quality → CI/CD → SRE Agent — with the SRE Agent closing the loop by opening GitHub issues for the coding agent to fix.
- Context Engineering: Lessons from Building Azure SRE Agent — Sanchit Mehta · 8 min read. Engineering lessons: started with 100+ tools and 50+ specialized agents, ended with 5 core tools and generalist agents — why less is more in agent design.
| Repo | Stars | Description |
|---|---|---|
| microsoft/sre-agent | 83 | Official hands-on lab — sample environments, walkthroughs, and prompt guides |
| matthansen0/azure-sre-agent-sandbox | 52 | Fully automated sandbox deployment with AKS break-fix scenarios |
| paulasilvatech/Agentic-Ops-Dev | 23 | Agentic Operations & Observability Workshop |
Deploy an Azure SRE Agent connected to a sample application with a single azd up command. Watch it diagnose and remediate issues autonomously.
Learn more: What is Azure SRE Agent?
| Tool | macOS | Windows |
|---|---|---|
| Azure CLI 2.60+ | `brew install azure-cli` | `winget install Microsoft.AzureCLI` |
| Azure Developer CLI 1.9+ | `brew install azd` | `winget install Microsoft.Azd` |
| Git 2.x | `brew install git` | `winget install Git.Git` (includes Git Bash) |
| Python 3.10+ | `brew install python3` | `winget install Python.Python.3.12` |
Windows note: After installing Python, disable the Windows Store app aliases: Settings → Apps → Advanced app settings → App execution aliases → turn OFF `python.exe` and `python3.exe`.
- Active Azure subscription
- Owner role on the subscription (needed for RBAC role assignments)
- Register the resource provider:
az provider register -n Microsoft.App --wait
- GitHub account (for code search and issue triage scenarios — uses OAuth sign-in, or a fine-grained PAT scoped to your fork with
`Contents:Read`, `Issues:Read+Write`, `Metadata:Read` for least-privilege access)
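The version minimums in the tool table above (for example Azure CLI 2.60+ or Python 3.10+) can also be spot-checked by hand. The following is a minimal sketch of a numeric, dot-separated version comparison in POSIX shell. The helper name `version_ge` is hypothetical and is not taken from `scripts/prereqs.sh`; it also assumes purely numeric version components (a suffix such as `-beta` would make the comparison fail).

```shell
# version_ge ACTUAL MINIMUM
# Exit status 0 when ACTUAL >= MINIMUM, comparing dot-separated numeric parts.
# Missing components count as 0, so "3.10" is treated like "3.10.0".
version_ge() {
  have=$1
  want=$2
  while [ -n "$have" ] || [ -n "$want" ]; do
    h=${have%%.*}   # leading component of ACTUAL
    w=${want%%.*}   # leading component of MINIMUM
    h=${h:-0}
    w=${w:-0}
    [ "$h" -gt "$w" ] && return 0
    [ "$h" -lt "$w" ] && return 1
    # Components are equal: drop them and compare the next pair.
    case $have in *.*) have=${have#*.} ;; *) have= ;; esac
    case $want in *.*) want=${want#*.} ;; *) want= ;; esac
  done
  return 0
}
```

One possible use is `version_ge "$(az version --query '"azure-cli"' -o tsv)" 2.60 || echo "Azure CLI is older than 2.60"`. Comparing components numerically matters here because a plain string comparison would rank 3.9 above 3.10.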
Run the prereqs script to verify everything is installed:
# macOS/Linux
bash scripts/prereqs.sh
# Windows (Git Bash or CMD)
"C:\Program Files\Git\bin\bash.exe" scripts/prereqs.sh
# 1. Clone the repo
git clone https://github.com/dm-chelupati/sre-agent-lab.git
cd sre-agent-lab
git submodule update --init --recursive
# 2. Sign in to Azure
az login
azd auth login
# 3. Create environment and deploy
azd env new sre-lab
azd up
# Select your subscription and eastus2 as the region
REM 1. Clone the repo (in CMD or PowerShell)
git clone https://github.com/dm-chelupati/sre-agent-lab.git
cd sre-agent-lab
git submodule update --init --recursive
REM 2. Sign in to Azure
az login
azd auth login
REM 3. Create environment and deploy
azd env new sre-lab
azd up
REM If post-provision fails with 'bash not found' or 'Python not found':
set PATH=%PATH%;C:\Users\%USERNAME%\AppData\Local\Programs\Python\Python312
"C:\Program Files\Git\bin\bash.exe" scripts/post-provision.sh
Deployment takes ~8-12 minutes.
| Resource | Service | Purpose | Docs |
|---|---|---|---|
| SRE Agent | `Microsoft.App/agents` | AI agent for incident investigation | Overview |
| Grubify API | Azure Container Apps | Sample app to monitor | |
| Grubify Frontend | Azure Container Apps | Sample app UI | |
| Log Analytics | `Microsoft.OperationalInsights` | Log storage for KQL queries | Azure Observability |
| App Insights | `Microsoft.Insights` | Request tracing and exceptions | |
| Alert Rules | `Microsoft.Insights/metricAlerts` | HTTP 5xx and error log alerts | |
| Managed Identity | `Microsoft.ManagedIdentity` | Agent identity for Azure access | Permissions |
| Container Registry | `Microsoft.ContainerRegistry` | Grubify container images | |
| Role | Scope | Purpose |
|---|---|---|
| SRE Agent Administrator | Agent resource | User can manage agent via data plane APIs |
| Reader | Resource group | Agent can read all resources |
| Monitoring Reader | Resource group | Agent can read metrics and alerts |
| Log Analytics Reader | Log Analytics workspace | Agent can query logs via KQL |
See: Manage Permissions
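To confirm that the agent's managed identity actually holds the three agent-facing roles from the table above, one option is to compare the assigned role names against the expected list. This is a minimal sketch, not part of the lab scripts: the helper name `missing_roles` is hypothetical, and it only checks role names, not the scope each role was granted at.

```shell
# missing_roles
# Reads newline-separated role names on stdin and prints every expected
# agent role (from the table above) that is not in that list.
missing_roles() {
  assigned=$(cat)
  for role in "Reader" "Monitoring Reader" "Log Analytics Reader"; do
    # -F: literal match, -x: whole line, so "Reader" does not match
    # "Monitoring Reader".
    printf '%s\n' "$assigned" | grep -Fx "$role" >/dev/null || printf '%s\n' "$role"
  done
}
```

Assuming `PRINCIPAL_ID` holds the object ID of the agent's managed identity (a placeholder you would supply yourself), the list can be fed from `az role assignment list --assignee "$PRINCIPAL_ID" --all --query "[].roleDefinitionName" -o tsv | missing_roles`. Empty output means all three roles are present.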
| Component | Purpose | Docs |
|---|---|---|
| Knowledge Base | HTTP error runbook, app architecture, incident template | Memory & Knowledge |
| incident-handler subagent | Investigates alerts using logs, metrics, runbooks | Custom Agents |
| Response Plan | Routes HTTP 500 alerts to incident-handler | Response Plans |
| Azure Monitor | Incident platform — alerts flow to the agent | Incident Platforms |
| GitHub OAuth connector | Code search and issue management (optional) | Connectors |
| code-analyzer subagent | Source code root cause analysis | Custom Agents |
| issue-triager subagent | Automated issue triage from runbook | Custom Agents |
| Scheduled Task | Triage customer issues every 12 hours | Scheduled Tasks |
| Code Repo | Agent indexes the Grubify source code | Deep Context |

Note on GitHub tools: GitHub OAuth tools (code search, issue management) are built-in native tools, not MCP tools. Once the GitHub OAuth connector is set up, all agents, including subagents, get access to GitHub tools automatically through global settings. No explicit `mcp_tools` assignment is needed in subagent YAML. This is different from MCP connector tools (Datadog, Splunk, etc.), which require explicit `mcp_tools` assignment.
# Full re-run (rebuilds container images + re-uploads everything)
./scripts/post-provision.sh
# Skip container image builds (just update KB, subagents, response plan)
./scripts/post-provision.sh --retry
# Windows: run from CMD with Python in PATH
set PATH=%PATH%;C:\Users\%USERNAME%\AppData\Local\Programs\Python\Python312
"C:\Program Files\Git\bin\bash.exe" scripts/post-provision.sh --retry
If the script deploys images but the app still shows the default page:
for /f "tokens=*" %a in ('azd env get-value AZURE_CONTAINER_REGISTRY_NAME') do set ACR=%a
for /f "tokens=*" %a in ('azd env get-value CONTAINER_APP_NAME') do set APP=%a
for /f "tokens=*" %a in ('azd env get-value FRONTEND_APP_NAME') do set FE=%a
az containerapp update --name %APP% --resource-group rg-sre-lab --image %ACR%.azurecr.io/grubify-api:latest
az containerapp update --name %FE% --resource-group rg-sre-lab --image %ACR%.azurecr.io/grubify-frontend:latest
After deployment completes, open your agent at sre.azure.com and click Full setup. You should see green checkmarks on:
| Card | Expected Status |
|---|---|
| Code | ✅ 1 repository |
| Incidents | ✅ Connected to Azure Monitor |
| Azure resources | ✅ 1 resource group added |
| Knowledge files | ✅ 1 file |
Checkpoint: If any card is missing a checkmark, re-run the post-provision script:
bash scripts/post-provision.sh --retry
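If the re-run fails on a transient error, a short wait before trying again can help. The following is a minimal sketch of a generic retry wrapper in POSIX shell; the helper name `retry` and its argument order are hypothetical and do not come from the lab scripts.

```shell
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS runs have failed,
# sleeping DELAY_SECONDS between attempts.
retry() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  while :; do
    "$@" && return 0                       # success: stop retrying
    [ "$n" -ge "$attempts" ] && return 1   # out of attempts: give up
    n=$((n + 1))
    sleep "$delay"
  done
}
```

For example, `retry 3 30 bash scripts/post-provision.sh --retry` would run the script up to three times with a 30-second pause between attempts.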
Once verified, click "Done and go to agent" to open the agent chat and start the team onboarding conversation.
The agent opens a "Team onboarding" thread automatically. It will:
- Explore your connected context: it reads the code repository, Azure resources, and knowledge files you connected during setup
- Interview you about your team: it asks about your team structure, on-call rotation, services you own, and escalation paths
Since the agent already has context from setup, try asking it questions:
"What do you know about the Grubify app architecture?"
"Summarize the HTTP errors runbook"
"What Azure resources are in my resource group?"
The agent saves your team information to persistent memory and references it in every future investigation.
Tip: Ask "What should I do next?" for personalized recommendations based on what's connected.
Break the app and watch the agent investigate:
./scripts/break-app.sh # macOS/Linux
# Windows: "C:\Program Files\Git\bin\bash.exe" scripts/break-app.sh
Then open sre.azure.com → Incidents to watch the agent:
- Detect the Azure Monitor alert
- Query Log Analytics for error patterns
- Reference the HTTP errors runbook
- Apply remediation (restart/scale)
- Summarize with root cause and evidence
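To follow along with the "Query Log Analytics for error patterns" step, you can run a comparable query yourself. This is a rough sketch only: the helper name `error_query` is hypothetical, and the table and column names (`ContainerAppConsoleLogs_CL`, `ContainerAppName_s`, `Log_s`) are assumptions based on the usual Container Apps logging schema; your workspace may use different tables. It is not the query the agent itself runs.

```shell
# error_query APP_NAME [WINDOW]
# Prints a KQL query that buckets error-like console log lines for one
# container app into 5-minute bins. WINDOW defaults to 1h.
error_query() {
  app=$1
  window=${2:-1h}
  cat <<EOF
ContainerAppConsoleLogs_CL
| where TimeGenerated > ago($window)
| where ContainerAppName_s == "$app"
| where Log_s has_any ("error", "exception", "500")
| summarize count() by bin(TimeGenerated, 5m)
EOF
}
```

With the workspace GUID in a variable such as `WORKSPACE_ID` (a placeholder), the query could be run with `az monitor log-analytics query --workspace "$WORKSPACE_ID" --analytics-query "$(error_query "$(azd env get-value CONTAINER_APP_NAME)" 30m)"`.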
Ask the agent to search source code for root causes:
- File:line references to problematic code
- Correlation of production errors to code changes
- Suggested fixes with before/after examples
Create sample support issues and let the agent triage them:
./scripts/create-sample-issues.sh <owner/repo>
The agent classifies issues (Documentation, Bug, Feature Request), applies labels, and posts triage comments following the runbook.
After initial setup, add GitHub by signing in via the OAuth URL:
./scripts/setup-github.sh # macOS/Linux
# Windows: "C:\Program Files\Git\bin\bash.exe" scripts/setup-github.sh
Security tip: The OAuth flow requests broad repo access. For least-privilege access, use a fine-grained PAT scoped only to your Grubify fork, with permissions `Contents:Read`, `Issues:Read+Write`, `Metadata:Read`.
export GITHUB_PAT=github_pat_xxxx
./scripts/setup-github.sh
azd down --purge

| Issue | Fix |
|---|---|
| `'bash' is not recognized` (Windows) | Run via: `"C:\Program Files\Git\bin\bash.exe" scripts/post-provision.sh` |
| `Python was not found` (Windows) | Install: `winget install Python.Python.3.12`, then disable App execution aliases |
| `curl: error encountered when reading a file` | Python isn't in the Git Bash PATH: `export PATH="$PATH:/c/Users/$USER/AppData/Local/Programs/Python/Python312"` |
| `roleAssignments/write` denied | Need Owner role on subscription. Check: `az role assignment list --assignee $(az ad signed-in-user show --query id -o tsv)` |
| `Microsoft.App` not registered | Run: `az provider register -n Microsoft.App --wait` |
| Grubify shows default page after deploy | Run manual deploy commands (see Post-Deployment section above) |
| Post-provision 405 on response plan | Wait 30s and run: ./scripts/post-provision.sh --retry |
| Agent can't create issues on forked repo | Forks have Issues disabled by default. Enable: repo Settings → Features → Issues ✅, or run gh api -X PATCH repos/OWNER/REPO -f has_issues=true |
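For the "Grubify shows default page after deploy" row, the manual deploy commands in this README are written for Windows CMD. A rough macOS/Linux equivalent is sketched below, using the same `azd` environment keys, image names, and `rg-sre-lab` resource group as the CMD version. The function names `image_ref` and `redeploy_images` are hypothetical and are not part of the lab scripts.

```shell
# image_ref REGISTRY_NAME IMAGE_NAME
# Prints the full image reference, e.g. myacr.azurecr.io/grubify-api:latest
image_ref() {
  printf '%s.azurecr.io/%s:latest\n' "$1" "$2"
}

# redeploy_images [RESOURCE_GROUP]
# Points both container apps at the latest images in the lab registry.
# RESOURCE_GROUP defaults to rg-sre-lab, as in the CMD commands.
redeploy_images() {
  rg=${1:-rg-sre-lab}
  acr=$(azd env get-value AZURE_CONTAINER_REGISTRY_NAME) || return 1
  app=$(azd env get-value CONTAINER_APP_NAME) || return 1
  fe=$(azd env get-value FRONTEND_APP_NAME) || return 1
  az containerapp update --name "$app" --resource-group "$rg" \
    --image "$(image_ref "$acr" grubify-api)" &&
  az containerapp update --name "$fe" --resource-group "$rg" \
    --image "$(image_ref "$acr" grubify-frontend)"
}
```

After defining both functions in a shell at the repository root, calling `redeploy_images` with no arguments mirrors the CMD commands. Pass a resource group name as the first argument if yours differs.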
SRE Agent is available in: eastus2, swedencentral, australiaeast
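Because only these three regions are listed, it can be worth validating the region before deploying. The following is a small sketch; the helper name `is_supported_region` is hypothetical, and the region list is copied from the line above, so it would need updating if availability changes.

```shell
# is_supported_region REGION
# Exit status 0 when REGION is one of the regions listed in this README.
is_supported_region() {
  case $1 in
    eastus2 | swedencentral | australiaeast) return 0 ;;
    *) return 1 ;;
  esac
}
```

A possible use before `azd up` is `is_supported_region "$region" && azd env set AZURE_LOCATION "$region"`. This assumes the lab template reads the standard `AZURE_LOCATION` environment value, which the README does not confirm.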
- Azure SRE Agent Documentation
- Getting Started Guide
- Connectors
- Custom Agents
- Incident Response
- Azure Observability
MIT