
Incident Report: Alert Delay on February 28, 2026

On February 28, 2026, a subset of Cronping users experienced delayed alert notifications of up to 47 minutes. Here's exactly what happened, why, and what we changed.

Marco


On February 28, 2026, between 17:42 UTC and 18:29 UTC, a subset of Cronping users experienced delayed alert notifications. Heartbeats that transitioned to a "down" state were not notified for up to 47 minutes after the expected notification time. Heartbeats continued to be monitored correctly throughout the incident — pings were received and processed normally. Only outbound notifications (email, Slack, webhook) were affected.

This post is our full account of what happened.


Who Was Affected

Approximately 23% of active organizations were affected. The impact was concentrated on users whose jobs failed between 17:42 UTC and 18:29 UTC. Users without any heartbeats transitioning to "down" during this window were not affected.

Integrations that process notifications through our background queue (email, Slack, Discord, webhook, PagerDuty) were delayed. Telegram notifications were unaffected due to processing through a separate worker.


Timeline

All times in UTC.

| Time | Event |
| --- | --- |
| 17:38 | Primary background job worker redeployed as part of a routine database index migration |
| 17:42 | Worker deployment completes; internal monitoring shows queue depth beginning to grow |
| 17:51 | First user-facing report received via support email: "We got a delayed alert — our heartbeat went down at 17:43 but we only got the Slack notification now" |
| 17:58 | On-call engineer begins investigation; queue depth at ~340 pending jobs |
| 18:05 | Root cause identified: a missing database index, dropped by the migration, was forcing the alert query into a sequential scan on a table with 2.1M rows |
| 18:11 | Hotfix index deployed to the production database |
| 18:14 | Queue begins draining normally; alert processing latency drops from ~40s/alert to <1s |
| 18:29 | Queue depth returns to normal; all delayed notifications delivered |
| 18:33 | Incident resolved; postmortem process begins |

Total duration: 47 minutes


Root Cause

Earlier in the week we wrote a schema migration designed to improve query performance on our alert processing table: it added a composite index on (organization_id, status, next_ping_at) to speed up the query that finds overdue heartbeats. The migration was deployed to production at 17:38 UTC on February 28.

The migration had two steps:

  1. Add the new composite index
  2. Drop an old single-column index on status that was no longer needed
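In Postgres terms, the migration amounted to roughly the following. The index names here are illustrative; the table and column names come from the alert query shown further down.

```sql
-- Step 1: add the composite index intended to speed up overdue-heartbeat lookups
CREATE INDEX CONCURRENTLY heartbeats_org_status_next_ping_idx
    ON heartbeats (organization_id, status, next_ping_at);

-- Step 2: drop the old single-column index (the step that caused the incident)
DROP INDEX heartbeats_status_idx;
```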

Step 2 was the problem.

The alert processing query that scans for overdue heartbeats was:

SELECT * FROM heartbeats
WHERE status = 'up'
  AND next_ping_at < NOW()
  AND notifications_enabled = true
ORDER BY next_ping_at ASC
LIMIT 100;

This query was written assuming the status index would be used. After we dropped that index, the planner had no usable alternative: the new composite index leads with organization_id, which does not appear in this query's WHERE clause, so the planner fell back to a sequential scan. On a table with 2.1 million rows, each scan took approximately 4.2 seconds instead of ~8 milliseconds. With the alert check running every second, the worker quickly fell behind.

The migration was tested on a staging database with only ~12,000 rows. The sequential scan at that scale completes in 80ms — slow but not alarming. We didn't catch the behavioral difference at production scale.
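The check that would have caught this is cheap: run the alert query under EXPLAIN ANALYZE against a production-scale table before and after the migration, and diff the plans. A `Seq Scan on heartbeats` node where an `Index Scan` used to be is the red flag:

```sql
-- Run before and after applying the migration on a production-scale copy.
-- Look for "Seq Scan on heartbeats" replacing "Index Scan" in the output.
EXPLAIN ANALYZE
SELECT * FROM heartbeats
WHERE status = 'up'
  AND next_ping_at < NOW()
  AND notifications_enabled = true
ORDER BY next_ping_at ASC
LIMIT 100;
```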


What We Fixed

Immediate (deployed during incident):

  • Added a targeted index on (status, next_ping_at) covering the exact query pattern used by the alert worker
  • Added EXPLAIN ANALYZE output to our staging migration tests in CI
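The hotfix index, sketched in Postgres (the index name is illustrative):

```sql
-- Matches the alert query's predicate and sort order directly.
-- CONCURRENTLY avoids taking a write lock on the table mid-incident.
CREATE INDEX CONCURRENTLY heartbeats_status_next_ping_idx
    ON heartbeats (status, next_ping_at);
```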

Short-term (deployed within 48 hours):

  • Added a query performance test to our CI pipeline: migrations that drop indexes must demonstrate that affected queries maintain execution time within 2x of pre-migration baseline
  • Created an automated alert on our internal monitoring that fires if the alert processing queue depth exceeds 50 jobs for more than 60 seconds
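Assuming a jobs table along the lines of `notification_jobs` with a `status` column (both names here are hypothetical), the queue-depth check behind that internal alert reduces to a one-line query:

```sql
-- The internal alert fires if this count stays above 50
-- for more than 60 seconds.
SELECT count(*) AS pending
FROM notification_jobs
WHERE status = 'pending';
```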

Medium-term (deploying this week):

  • Migrating our alert worker to emit structured timing metrics per query so we have visibility into query execution time at the application level, not just database level
  • Adding end-to-end alert latency monitoring: we'll have synthetic heartbeats that fail on purpose, and if we don't receive the alert within 2 minutes, our infrastructure alerting fires

What We Didn't Do Well

We tested migrations on staging with unrealistic data volumes. Our staging database is a subset of production with about 0.5% of the rows. This is normal for development speed, but it means we can't rely on staging to catch query plan regressions. We knew this was a risk and hadn't yet acted on it.

The monitoring gap was too wide. We had internal queue depth monitoring, but the alert threshold was set at 500 jobs (a leftover from early tuning). The incident had 340 pending jobs at its peak — technically under threshold. We should have been alerted at ~50.

The first user report came at 17:51, nine minutes into the incident. Our own infrastructure monitoring didn't catch it for another seven minutes. Users noticed before our systems did.


Metrics

| Metric | Value |
| --- | --- |
| Duration | 47 minutes |
| Organizations affected | ~23% |
| Maximum notification delay | 47 minutes |
| Heartbeats incorrectly paused or altered | 0 |
| Data loss | None |
| Pings dropped or lost | 0 |

The monitoring engine (ping reception, state tracking, flip history) was fully operational throughout. The issue was isolated to outbound notifications.


Changes to Our Reliability Practices

We're updating our incident playbook with two items:

  1. Migrations that drop indexes require performance validation against production row counts (or an anonymized production-scale dataset). We're building a sanitized data export pipeline for our staging environment this month.

  2. End-to-end canary monitoring for our own product. We run a few dozen heartbeats internally for tracking infrastructure jobs. We're adding synthetic "failure" heartbeats — scheduled to fail deliberately — so that we get alerted if alert delivery breaks. It's using Cronping to monitor Cronping. There's something right about that.


A Note on Transparency

I considered writing a softer version of this post — vague about the root cause, light on the numbers, heavy on "we take reliability seriously." I've read enough of those posts to know how they read.

The actual service we provide is "tell me when my cron jobs fail." If we have an incident that delays those notifications, the least we can do is tell you exactly what happened and what we changed. Anything less would be inconsistent with what we're for.

If you were affected and want to talk through the incident or discuss compensation, email [email protected]. We'll make it right.


This report will be updated if any new findings emerge from the postmortem process. Last updated: March 5, 2026.