Azure SRE Agent — Resources

This repository is the official community hub for Azure SRE Agent. Here you'll find:

  • 🐛 Report Issues — File bugs, feature requests, and feedback via GitHub Issues
  • 📚 Resources — Curated links to docs, videos, blogs, and community content for Azure SRE Agent
  • 🧪 Labs — Hands-on labs and sample environments to deploy, break, and fix apps with Azure SRE Agent (see the labs/ folder)

Quick Links

| Resource | Link |
| --- | --- |
| Product Home Page | https://www.azure.com/sreagent |
| Portal (Create & Manage Agents) | https://aka.ms/sreagent |
| Documentation | https://aka.ms/sreagent/newdocs |
| Pricing & Billing | https://aka.ms/sreagent/pricing |
| All Blogs | https://aka.ms/sreagent/blog |
| YouTube Channel | https://aka.ms/sreagent/youtube |
| GitHub: Azure SRE Agent (Report Issues, Official Labs & Resources) | https://aka.ms/sreagent/github |
| Hands-on Lab | https://aka.ms/sreagent/lab |
| GitHub: Official Plugins | https://github.com/Azure/sre-agent-plugins |
| Tech Community Discussions | https://aka.ms/sreagent/discussions |
| Agentic DevOps Live | https://aka.ms/agenticdevopslive |
| X (Twitter) | https://x.com/azuresreagent |

Featured Videos

What is Azure SRE Agent — Official Overview

The official Microsoft Azure product overview — a concise explainer of what Azure SRE Agent is, how it works, and the problems it solves. 🔗 https://www.youtube.com/watch?v=6vDrThUjDOc · 6,156 views · 158 likes

Microsoft AI SRE Agent: Fixing Bugs While You Sleep

Satya Nadella highlights Azure SRE Agent as a key example of AI-driven operations transforming how engineering teams manage reliability at scale. 🔗 https://www.youtube.com/watch?v=3hPeKDtLvPg · 2,548 views · 26 likes

Azure SRE Agent: Less Toil, More Uptime, Maximum Innovation — Azure Friday

Scott Hanselman walks through Azure SRE Agent on Azure Friday, showing how it reduces operational toil and lets teams focus on innovation. 🔗 https://www.youtube.com/watch?v=5c9pl8_DI3w · 4,264 views · 75 likes

Root Cause Analysis with Code Context: Azure SRE Agent + GitHub Integration — GA Launch

The GA launch video demonstrating Azure SRE Agent performing root cause analysis with full code context through deep GitHub integration. 🔗 https://www.youtube.com/watch?v=1vKoxPeep_M · 582 views · 25 likes

Use Azure SRE Agent to Automate Tasks and Increase Site Reliability (DEM550) — Build

Deep-dive Build session covering end-to-end SRE Agent capabilities: automated investigation, remediation, proactive monitoring, and custom hooks. 🔗 https://www.youtube.com/watch?v=bK3SIQoE_Nc · 12,294 views · 129 likes


More Videos


Blogs

Post-GA (April 2026)

GA Launch (March 2026)

Pre-GA (December 2025)


GitHub Repos

| Repo | Stars | Description |
| --- | --- | --- |
| microsoft/sre-agent | 83 | Official hands-on lab: sample environments, walkthroughs, and prompt guides |
| matthansen0/azure-sre-agent-sandbox | 52 | Fully automated sandbox deployment with AKS break-fix scenarios |
| paulasilvatech/Agentic-Ops-Dev | 23 | Agentic Operations & Observability Workshop |

Azure SRE Agent Hands-On Lab

Deploy an Azure SRE Agent connected to a sample application with a single azd up command. Watch it diagnose and remediate issues autonomously.

Learn more: What is Azure SRE Agent?

Architecture

[Figure: Lab Architecture]

Prerequisites

Required Tools

| Tool | macOS | Windows |
| --- | --- | --- |
| Azure CLI 2.60+ | brew install azure-cli | winget install Microsoft.AzureCLI |
| Azure Developer CLI 1.9+ | brew install azd | winget install Microsoft.Azd |
| Git 2.x | brew install git | winget install Git.Git (includes Git Bash) |
| Python 3.10+ | brew install python3 | winget install Python.Python.3.12 |

Windows note: After installing Python, disable the Windows Store app aliases: Settings → Apps → Advanced app settings → App execution aliases → turn OFF python.exe and python3.exe
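
The minimum versions in the table can be compared numerically with a small helper. This is a minimal sketch, not the logic of scripts/prereqs.sh: `version_ge` is a hypothetical name, and it assumes versions made only of dot-separated numbers.

```bash
# Hypothetical helper (not taken from scripts/prereqs.sh).
# Succeeds when dotted version $1 is greater than or equal to dotted version $2.
# Assumes every field is a plain integer (for example 2.60.0, not 2.60.0-beta).
version_ge() {
  a="$1"; b="$2"
  while [ -n "$a" ] || [ -n "$b" ]; do
    # Take the leading field of each version; a missing field counts as 0.
    x="${a%%.*}"; y="${b%%.*}"
    [ -n "$x" ] || x=0
    [ -n "$y" ] || y=0
    if [ "$x" -gt "$y" ]; then return 0; fi
    if [ "$x" -lt "$y" ]; then return 1; fi
    # Drop the field just compared; an exhausted version becomes empty.
    case "$a" in *.*) a="${a#*.}" ;; *) a="" ;; esac
    case "$b" in *.*) b="${b#*.}" ;; *) b="" ;; esac
  done
  return 0
}

# Example (python3 --version prints "Python 3.12.1" or similar):
#   version_ge "$(python3 --version | awk '{print $2}')" 3.10 || echo "Python too old"
```

Comparing field by field matters because a plain string comparison would rank 3.9 above 3.10.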

Azure Requirements

  • Active Azure subscription
  • Owner role on the subscription (needed for RBAC role assignments)
  • Register the resource provider:
    az provider register -n Microsoft.App --wait
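
To confirm that the registration has finished (for example, if the command was run without --wait), a check along these lines may help. `provider_ready` is a hypothetical helper name; it relies only on `az provider show` and its `registrationState` property.

```bash
# Hypothetical helper: succeeds only when the resource provider named in $1
# reports the state "Registered". Returns 2 if the az call itself fails.
provider_ready() {
  state="$(az provider show -n "$1" --query registrationState -o tsv)" || return 2
  [ "$state" = "Registered" ]
}

# Example:
#   provider_ready Microsoft.App || echo "Microsoft.App is not registered yet"
```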

Optional

  • GitHub account, used for the code search and issue triage scenarios. Sign in with OAuth, or for least-privilege access use a fine-grained PAT scoped to your fork with Contents:Read, Issues:Read+Write, Metadata:Read.

Quick Start

Check prerequisites

Run the prereqs script to verify everything is installed:

# macOS/Linux
bash scripts/prereqs.sh

# Windows (Git Bash or CMD)
"C:\Program Files\Git\bin\bash.exe" scripts/prereqs.sh

macOS / Linux

# 1. Clone the repo
git clone https://github.com/dm-chelupati/sre-agent-lab.git
cd sre-agent-lab
git submodule update --init --recursive

# 2. Sign in to Azure
az login
azd auth login

# 3. Create environment and deploy
azd env new sre-lab
azd up
# Select your subscription and eastus2 as the region

Windows

REM 1. Clone the repo (in CMD or PowerShell)
git clone https://github.com/dm-chelupati/sre-agent-lab.git
cd sre-agent-lab
git submodule update --init --recursive

REM 2. Sign in to Azure
az login
azd auth login

REM 3. Create environment and deploy
azd env new sre-lab
azd up

REM If post-provision fails with 'bash not found' or 'Python not found':
set PATH=%PATH%;C:\Users\%USERNAME%\AppData\Local\Programs\Python\Python312
"C:\Program Files\Git\bin\bash.exe" scripts/post-provision.sh

Deployment takes ~8-12 minutes.

What Gets Deployed

Azure Infrastructure (via Bicep)

| Resource | Service | Purpose | Docs |
| --- | --- | --- | --- |
| SRE Agent | Microsoft.App/agents | AI agent for incident investigation | Overview |
| Grubify API | Azure Container Apps | Sample app to monitor | |
| Grubify Frontend | Azure Container Apps | Sample app UI | |
| Log Analytics | Microsoft.OperationalInsights | Log storage for KQL queries | Azure Observability |
| App Insights | Microsoft.Insights | Request tracing and exceptions | |
| Alert Rules | Microsoft.Insights/metricAlerts | HTTP 5xx and error log alerts | |
| Managed Identity | Microsoft.ManagedIdentity | Agent identity for Azure access | Permissions |
| Container Registry | Microsoft.ContainerRegistry | Grubify container images | |

RBAC Roles Assigned

| Role | Scope | Purpose |
| --- | --- | --- |
| SRE Agent Administrator | Agent resource | User can manage agent via data plane APIs |
| Reader | Resource group | Agent can read all resources |
| Monitoring Reader | Resource group | Agent can read metrics and alerts |
| Log Analytics Reader | Log Analytics workspace | Agent can query logs via KQL |

See: Manage Permissions

SRE Agent Configuration (via post-provision script)

| Component | Purpose | Docs |
| --- | --- | --- |
| Knowledge Base | HTTP error runbook, app architecture, incident template | Memory & Knowledge |
| incident-handler subagent | Investigates alerts using logs, metrics, runbooks | Custom Agents |
| Response Plan | Routes HTTP 500 alerts to incident-handler | Response Plans |
| Azure Monitor | Incident platform: alerts flow to the agent | Incident Platforms |
| GitHub OAuth connector | Code search and issue management (optional) | Connectors |
| code-analyzer subagent | Source code root cause analysis | Custom Agents |
| issue-triager subagent | Automated issue triage from runbook | Custom Agents |
| Scheduled Task | Triage customer issues every 12 hours | Scheduled Tasks |
| Code Repo | Agent indexes the Grubify source code | Deep Context |

Note on GitHub tools: GitHub OAuth tools (code search, issue management) are built-in native tools, not MCP tools. Once the GitHub OAuth connector is set up, all agents, including subagents, get access to GitHub tools automatically through global settings. No explicit mcp_tools assignment is needed in subagent YAML. This is different from MCP connector tools (Datadog, Splunk, etc.), which require explicit mcp_tools assignment.

Post-Deployment

Re-run the setup script

# Full re-run (rebuilds container images + re-uploads everything)
./scripts/post-provision.sh

# Skip container image builds (just update KB, subagents, response plan)
./scripts/post-provision.sh --retry

# Windows: run from CMD with Python in PATH
set PATH=%PATH%;C:\Users\%USERNAME%\AppData\Local\Programs\Python\Python312
"C:\Program Files\Git\bin\bash.exe" scripts/post-provision.sh --retry

Manual container deploy (Windows fallback)

If the script deploys images but the app still shows the default page:

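REM Run these interactively in CMD; inside a .bat file, write %%a instead of %a.
REM The resource group below is rg-sre-lab; if yours is named differently, substitute it.
for /f "tokens=*" %a in ('azd env get-value AZURE_CONTAINER_REGISTRY_NAME') do set ACR=%a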
for /f "tokens=*" %a in ('azd env get-value CONTAINER_APP_NAME') do set APP=%a
for /f "tokens=*" %a in ('azd env get-value FRONTEND_APP_NAME') do set FE=%a
az containerapp update --name %APP% --resource-group rg-sre-lab --image %ACR%.azurecr.io/grubify-api:latest
az containerapp update --name %FE% --resource-group rg-sre-lab --image %ACR%.azurecr.io/grubify-frontend:latest
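
If the same fallback is needed from macOS, Linux, or Git Bash, the commands might be wrapped as shown here. `redeploy_images` is a hypothetical wrapper; the azd keys, image names, and az flags are copied from the CMD commands above, and the resource group is passed in as an argument.

```bash
# Hypothetical wrapper around the manual deploy commands.
# $1 is the resource group (rg-sre-lab in the CMD example above).
redeploy_images() {
  rg="$1"
  # Read the deployed resource names from the current azd environment.
  acr="$(azd env get-value AZURE_CONTAINER_REGISTRY_NAME)" || return 1
  api="$(azd env get-value CONTAINER_APP_NAME)" || return 1
  fe="$(azd env get-value FRONTEND_APP_NAME)" || return 1
  # Point both container apps at the images in the lab registry.
  az containerapp update --name "$api" --resource-group "$rg" \
    --image "$acr.azurecr.io/grubify-api:latest" &&
  az containerapp update --name "$fe" --resource-group "$rg" \
    --image "$acr.azurecr.io/grubify-frontend:latest"
}

# Example:
#   redeploy_images rg-sre-lab
```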

Verify Setup

After deployment completes, open your agent at sre.azure.com and click Full setup. You should see green checkmarks on:

| Card | Expected Status |
| --- | --- |
| Code | ✅ 1 repository |
| Incidents | ✅ Connected to Azure Monitor |
| Azure resources | ✅ 1 resource group added |
| Knowledge files | ✅ 1 file |

Checkpoint: If any card is missing a checkmark, re-run the post-provision script: bash scripts/post-provision.sh --retry

Once verified, click "Done and go to agent" to open the agent chat and start the team onboarding conversation.

Team Onboarding

The agent opens a "Team onboarding" thread automatically. It will:

  1. Explore your connected context by reading the code repository, Azure resources, and knowledge files you connected during setup
  2. Interview you about your team, asking about your team structure, on-call rotation, services you own, and escalation paths

Since the agent already has context from setup, try asking it questions:

"What do you know about the Grubify app architecture?"

"Summarize the HTTP errors runbook"

"What Azure resources are in my resource group?"

The agent saves your team information to persistent memory and references it in every future investigation.

Tip: Ask "What should I do next?" for personalized recommendations based on what's connected.

Lab Scenarios

Scenario 1: IT Operations (No GitHub required)

Break the app and watch the agent investigate:

./scripts/break-app.sh     # macOS/Linux
# Windows: "C:\Program Files\Git\bin\bash.exe" scripts/break-app.sh

Then open sre.azure.com → Incidents to watch the agent:

  1. Detect the Azure Monitor alert
  2. Query Log Analytics for error patterns
  3. Reference the HTTP errors runbook
  4. Apply remediation (restart/scale)
  5. Summarize with root cause and evidence
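
Before switching to the Incidents view, it can be useful to confirm that the app is actually returning server errors, since the alert rules fire on HTTP 5xx. The sketch below is an assumption-laden illustration: `count_5xx` is a hypothetical helper and `API_URL` is a placeholder for your Grubify API endpoint, which this README does not specify.

```bash
# Hypothetical helper: reads one HTTP status code per line on stdin and
# prints how many of them are in the 5xx range.
count_5xx() {
  awk '/^5[0-9][0-9]$/ { n++ } END { print n + 0 }'
}

# Example (API_URL is a placeholder, not a value from this lab):
#   for i in 1 2 3 4 5; do
#     curl -s -o /dev/null -w '%{http_code}\n' "$API_URL"
#   done | count_5xx
```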

Scenario 2: Developer (Requires GitHub)

Ask the agent to search the source code for the root cause of an incident. Its analysis should include:

  • File:line references to problematic code
  • Correlation of production errors to code changes
  • Suggested fixes with before/after examples

Scenario 3: Workflow Automation (Requires GitHub)

Create sample support issues and let the agent triage them:

./scripts/create-sample-issues.sh <owner/repo>

The agent classifies issues (Documentation, Bug, Feature Request), applies labels, and posts triage comments following the runbook.

Adding GitHub Later

After initial setup, add GitHub by signing in via the OAuth URL:

./scripts/setup-github.sh   # macOS/Linux
# Windows: "C:\Program Files\Git\bin\bash.exe" scripts/setup-github.sh

Security tip: The OAuth flow requests broad repo access. For least-privilege, use a fine-grained PAT scoped to your grubify fork only with permissions: Contents:Read, Issues:Read+Write, Metadata:Read.

export GITHUB_PAT=github_pat_xxxx
./scripts/setup-github.sh
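
Fine-grained tokens start with `github_pat_`, while classic tokens start with `ghp_`, so a quick prefix check can catch an accidentally exported classic PAT. `is_fine_grained_pat` is a hypothetical helper, and the check looks only at the prefix, not at the token's scopes or validity.

```bash
# Hypothetical helper: succeeds when $1 looks like a fine-grained PAT,
# judged only by its github_pat_ prefix.
is_fine_grained_pat() {
  case "$1" in
    github_pat_?*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example:
#   is_fine_grained_pat "$GITHUB_PAT" || echo "GITHUB_PAT is not a fine-grained token"
```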

Cleanup

azd down --purge
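
To double-check that teardown removed the lab resource group, something like the following could be used. `lab_cleaned_up` is a hypothetical helper built on `az group exists`, which prints true or false; rg-sre-lab is the group name used elsewhere in this README and may differ in your environment.

```bash
# Hypothetical helper: succeeds when the resource group named in $1 no longer
# exists. Returns 2 if the az call itself fails.
lab_cleaned_up() {
  exists="$(az group exists --name "$1")" || return 2
  [ "$exists" = "false" ]
}

# Example:
#   lab_cleaned_up rg-sre-lab && echo "Lab resources removed"
```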

Troubleshooting

| Issue | Fix |
| --- | --- |
| 'bash' is not recognized (Windows) | Run via: "C:\Program Files\Git\bin\bash.exe" scripts/post-provision.sh |
| Python was not found (Windows) | Install: winget install Python.Python.3.12, then disable the App execution aliases |
| curl: error encountered when reading a file | Python isn't in the Git Bash PATH. Run: export PATH="$PATH:/c/Users/$USER/AppData/Local/Programs/Python/Python312" |
| roleAssignments/write denied | You need the Owner role on the subscription. Check: az role assignment list --assignee $(az ad signed-in-user show --query id -o tsv) |
| Microsoft.App not registered | Run: az provider register -n Microsoft.App --wait |
| Grubify shows default page after deploy | Run the manual deploy commands (see the Post-Deployment section above) |
| Post-provision 405 on response plan | Wait 30s and run: ./scripts/post-provision.sh --retry |
| Agent can't create issues on forked repo | Forks have Issues disabled by default. Enable: repo Settings → Features → Issues ✅, or run gh api -X PATCH repos/OWNER/REPO -F has_issues=true (-F sends a typed boolean; -f would send the string "true") |
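
The "wait, then retry" fix for the 405 error can be wrapped in a small loop. This is a sketch, not part of the lab scripts: `retry` is a hypothetical helper, and the attempt count in the example is an arbitrary choice.

```bash
# Hypothetical helper: runs a command up to $1 times, sleeping $2 seconds
# between failed attempts. Returns 0 on the first success, 1 if all fail.
retry() {
  attempts="$1"; delay="$2"; shift 2
  n=1
  while [ "$n" -le "$attempts" ]; do
    if "$@"; then return 0; fi
    # Sleep only when another attempt is still to come.
    if [ "$n" -lt "$attempts" ]; then sleep "$delay"; fi
    n=$((n + 1))
  done
  return 1
}

# Example, using the 30-second wait suggested in the table:
#   retry 3 30 ./scripts/post-provision.sh --retry
```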

Regions

SRE Agent is available in: eastus2, swedencentral, australiaeast
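
A region can be checked against this list before running azd up. `supported_region` is a hypothetical helper that hard-codes the three regions named above, so it needs updating if the list changes.

```bash
# Hypothetical helper: succeeds when $1 is one of the regions listed above.
supported_region() {
  case "$1" in
    eastus2|swedencentral|australiaeast) return 0 ;;
    *) return 1 ;;
  esac
}

# Example (AZURE_LOCATION is the azd environment key for the chosen region):
#   supported_region "$(azd env get-value AZURE_LOCATION)" || echo "Unsupported region"
```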

Links

License

MIT

About

Azure SRE Agent is an AI-powered reliability assistant that helps teams diagnose and resolve production issues, reduce operational toil, and lower mean time to resolution.
