resolved production monitor and stabilization problem#29
Conversation
📝 WalkthroughWalkthroughThe PR implements production monitoring infrastructure for the PaperScout scheduler. It introduces a failure category taxonomy, extends the health endpoint with computed metrics, adds message queue backpressure tracking, implements scheduler staleness detection with ops alerts, and applies structured error categorization across polling and probing code paths. ChangesProduction Monitoring Setup
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
tests/test_health.py (1)
106-114: ⚡ Quick winAdd a test for failing
extra_fields_fnfallback behavior.Please add a case where
extra_fields_fnraises, then assert/healthstill returns 200 with base fields. This is a key resilience branch introduced in this PR.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_health.py` around lines 106 - 114, Add a new test (or extend test_health_extra_fields_merged) that registers an extra_fields_fn which raises an exception and then calls the /health endpoint to verify resilience: assert the HTTP status is 200 and only the base health fields (e.g. "version" and "last_successful_poll") are present and correctly valued; reference the existing test helper/fixture used by tests (health_url_with_extras or create a similar fixture that wires a failing extra_fields_fn) and ensure the failing extra_fields_fn does not propagate, so /health returns base fields and a 200 response.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/paperscout/monitor.py`:
- Around line 115-117: The stale-alert gating currently blocks alerts when
self._last_successful_poll is None and also calls ops_alert_fn without
protection; update the gating so alerts are allowed if there has never been a
successful poll OR the last_successful_poll is older than the stale threshold
(i.e., treat None as "stale"), and wrap the ops_alert_fn invocation in a
try/except inside run_forever() to catch/log any exception and prevent crashing
while still updating/using self._last_ops_alert appropriately; reference
self._last_successful_poll, self._last_ops_alert, ops_alert_fn, and
run_forever() when making the changes.
---
Nitpick comments:
In `@tests/test_health.py`:
- Around line 106-114: Add a new test (or extend
test_health_extra_fields_merged) that registers an extra_fields_fn which raises
an exception and then calls the /health endpoint to verify resilience: assert
the HTTP status is 200 and only the base health fields (e.g. "version" and
"last_successful_poll") are present and correctly valued; reference the existing
test helper/fixture used by tests (health_url_with_extras or create a similar
fixture that wires a failing extra_fields_fn) and ensure the failing
extra_fields_fn does not propagate, so /health returns base fields and a 200
response.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: a063c352-50de-4861-81b6-e0bee7bc4368
📒 Files selected for processing (10)
src/paperscout/__main__.pysrc/paperscout/config.pysrc/paperscout/errors.pysrc/paperscout/health.pysrc/paperscout/monitor.pysrc/paperscout/scout.pysrc/paperscout/sources.pytests/test_health.pytests/test_monitor.pytests/test_sources.py
| self._last_successful_poll: float | None = None | ||
| self._last_probe_stats: dict[str, int] = {} | ||
| self._last_ops_alert: float | None = None |
There was a problem hiding this comment.
Fix stale-alert gating and guard ops_alert_fn failures.
On Line 323, requiring self._last_successful_poll is not None suppresses alerts when the scheduler has never had a successful poll. Also, on Line 330, an exception from ops_alert_fn can crash run_forever().
💡 Proposed fix
@@
self._last_successful_poll: float | None = None
self._last_probe_stats: dict[str, int] = {}
self._last_ops_alert: float | None = None
+ self._started_at: float = time.time()
@@
- if self.ops_alert_fn and self._last_successful_poll is not None:
- stale = time.time() - self._last_successful_poll
+ if self.ops_alert_fn:
+ last_ok = self._last_successful_poll or self._started_at
+ stale = time.time() - last_ok
alert_threshold = 2 * interval
now_m = time.monotonic()
if stale > alert_threshold and (
self._last_ops_alert is None or (now_m - self._last_ops_alert) > interval
):
- self.ops_alert_fn(
- f"No successful poll in {stale / 60:.0f}min "
- f"(threshold={2 * self.cfg.poll_interval_minutes}min)"
- )
- self._last_ops_alert = now_m
+ try:
+ self.ops_alert_fn(
+ f"No successful poll in {stale / 60:.0f}min "
+ f"(threshold={2 * self.cfg.poll_interval_minutes}min)"
+ )
+ self._last_ops_alert = now_m
+ except Exception:
+ log.exception("OPS-ALERT-ERROR")Also applies to: 323-334
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/paperscout/monitor.py` around lines 115 - 117, The stale-alert gating
currently blocks alerts when self._last_successful_poll is None and also calls
ops_alert_fn without protection; update the gating so alerts are allowed if
there has never been a successful poll OR the last_successful_poll is older than
the stale threshold (i.e., treat None as "stale"), and wrap the ops_alert_fn
invocation in a try/except inside run_forever() to catch/log any exception and
prevent crashing while still updating/using self._last_ops_alert appropriately;
reference self._last_successful_poll, self._last_ops_alert, ops_alert_fn, and
run_forever() when making the changes.
Summary
New / updated files
Tests
Ops / env
Set OPS_ALERT_CHANNEL to a Slack channel ID so stale-poll alerts are sent there (empty = disabled).
Related issues
close #28
Summary by CodeRabbit
New Features
Tests