Cron Job Monitoring: The Complete Guide
Every software team has a story. Maybe it's the ETL pipeline that hadn't run in two weeks before a client meeting exposed the stale data. Or the backup job that silently failed for three months until somebody actually tried to restore. Or the cleanup job that crashed, filled up the disk, and took down production at 3am.
All of these stories share a common thread: nobody knew the job had stopped running.
This guide is about changing that. We'll cover why cron jobs fail, what kinds of monitoring exist, how dead-man's switch monitoring works, and how to set it up for your stack.
What Is a Cron Job?
A cron job is any task that runs automatically on a defined schedule. The name comes from the Unix cron daemon, but today "cron job" is used loosely to mean any scheduled, automated task:
- A crontab entry on a Linux server
- A CronJob resource in Kubernetes
- A Lambda function triggered by EventBridge
- A Vercel cron route called by their scheduler
- A Bull/BullMQ repeatable job
- A GitHub Actions scheduled workflow
- An @Scheduled method in a Spring Boot service
The implementation varies, but the fundamental problem is the same: you need it to run, on time, every time, and you need to know when it doesn't.
Why Cron Jobs Are Risky
Cron jobs are often written quickly, tested minimally, and forgotten immediately. That's partly because they "just work" for a long time — until they don't.
Common Failure Modes
1. The job never starts
The scheduler died, the container restarted, the server ran out of memory, or the deployment accidentally removed the cron definition. The job simply doesn't run, and nothing tells you.
2. The job crashes early
An exception on line 3 before any meaningful work is done. The process exits with code 1. Logs exist, but nobody's watching them.
3. The job hangs
A database query locks up. An external HTTP call never times out. The job is technically "running" but doing nothing, and it'll block the next scheduled run too.
4. The job succeeds but does nothing
No exception, exits 0, but a missing WHERE clause means the query returned 0 rows and nothing was processed. Looks healthy in metrics, isn't.
5. The job runs, but too slowly
A job that normally takes 5 minutes suddenly takes 4 hours. Not technically failed, but something's wrong (slow query, bigger dataset, external degradation).
6. The job runs on the wrong schedule
Server timezone changed after a migration. Daylight saving time caused a double-run or a skip. The schedule expression was wrong from the beginning but nobody noticed.
7. The job runs, but too many times
Horizontal scaling without distributed locking. Two instances of your app, both running the same scheduler. Now that "once per day" job runs twice.
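Some of these modes can be blunted in the job itself. For the hang case (#3), a hard timeout around the job body guarantees the process gives up instead of silently blocking the next scheduled run. A minimal sketch in Python — `run_with_timeout` and the lambda payloads are illustrative, not a prescribed API:

```python
import concurrent.futures

def run_with_timeout(fn, seconds):
    """Run fn in a worker thread; raise TimeoutError if it exceeds `seconds`."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn)
    try:
        return future.result(timeout=seconds)
    finally:
        # Don't block waiting on a hung worker thread.
        executor.shutdown(wait=False)
```

Note the caveat: on timeout the worker thread may still be stuck, so the right follow-up is to alert and exit hard (`os._exit`), not to retry in the same process.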
Types of Cron Job Monitoring
There are fundamentally three approaches, and they complement each other.
1. Log-Based Monitoring
You watch the output of the job — stdout, stderr, log files, or structured logs in a system like Datadog, Loki, or CloudWatch.
What it catches: Exceptions, errors, slow runs, unexpected output.
What it misses: The job not running at all. If the job doesn't run, there's nothing to log. Log-based monitoring has no awareness of expected runs that didn't happen.
Best for: Debugging failures you already know happened.
2. Active Heartbeat Monitoring
You periodically check if something is running. A monitoring service pings your app or checks a health endpoint on a schedule.
What it catches: Is the service alive right now?
What it misses: A job that ran at 2am but hasn't checked in since. The service might be "up" while 10 cron jobs are silently broken.
Best for: API uptime and service health, not scheduled jobs.
3. Dead-Man's Switch Monitoring (Passive/Push)
The job itself sends a "check-in" signal after each successful run. A monitoring service watches for these signals. If one doesn't arrive when expected, an alert fires.
This is a dead-man's switch — a device that requires active operation to prevent it from triggering. If you stop operating (i.e., the job stops checking in), the switch fires.
What it catches: Any failure mode that prevents the job from completing — including "job never started", "job crashed", "job hung", and "job ran on wrong schedule."
What it misses: Not much. It's the most complete form of cron job monitoring.
Best for: Everything. This should be your default.
How Dead-Man's Switch Monitoring Works
The mechanics are simple:
- You register a heartbeat in the monitoring service with your cron expression
- Your job sends an HTTP request (a "ping") when it successfully completes
- The monitoring service calculates when the next ping is due based on the cron expression + a grace period
- If no ping arrives by the deadline, an alert is sent
Heartbeat config:
schedule: "0 2 * * *" # 2am daily
grace: 10 minutes # Allow 10 min variance
Timeline:
02:00 → Job starts
02:03 → Job completes → sends ping
02:03 → Cronping receives ping ✅
Next day:
02:00 → Job should run...
02:10 → No ping received → ALERT ❌
The job doesn't need constant supervision. It just needs to report in once per schedule cycle.
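The monitor-side deadline check is tiny. For a simple interval schedule it reduces to a comparison (a cron expression additionally needs a parser to compute the next expected run — this sketch assumes a fixed interval):

```python
from datetime import datetime, timedelta

def is_overdue(last_ping, interval, grace, now):
    """True if no ping arrived by last_ping + interval + grace."""
    deadline = last_ping + interval + grace
    return now > deadline
```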
The Start Signal
For jobs that run longer than a few seconds, you should send a /start ping at the beginning of the job, not just a success ping at the end.
Why? Because without a start signal, Cronping doesn't know if the job is running slowly or if it never started. With a start signal, it knows: "the job started at 2:03am, I'll wait for it to finish."
Without /start:
02:00 → Expected start
02:10 → Grace period expires → ALERT
02:47 → Job finishes (it was just slow) → Ping received (too late, alert already sent)
With /start:
02:00 → Expected start
02:03 → /start received → grace clock starts fresh
02:47 → Ping received ✅ (no false alert)
For jobs under 30 seconds, start signals are optional. For anything longer, they're worth it.
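One way to get the start/success/fail pattern on every job without repeating the boilerplate is a small context manager. A sketch using only the standard library — the URL is the example key from this guide, and the injectable `pinger` argument exists purely so the wrapper is testable:

```python
import contextlib
import urllib.request

PING_URL = "https://ping.cronping.com/abc123xyz"

def ping(path=""):
    """Best-effort ping; monitoring must never crash the job."""
    try:
        urllib.request.urlopen(f"{PING_URL}{path}", timeout=10)
    except Exception:
        pass

@contextlib.contextmanager
def monitored(pinger=ping):
    """Send /start on entry, a success ping on clean exit, /fail on exception."""
    pinger("/start")
    try:
        yield
        pinger("")
    except Exception:
        pinger("/fail")
        raise
```

Then any job body becomes `with monitored(): run_my_job()`, and the exception still propagates so the exit code stays non-zero.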
Setting Up Monitoring: Step by Step
Step 1: Create a Heartbeat
In Cronping, create a new heartbeat with:
- Name: Something descriptive (db-backup, invoice-processor, cleanup-old-logs)
- Schedule type: Cron expression or simple interval
- Schedule: e.g., 0 2 * * *
- Grace period: How much leeway to allow (5–30 minutes for most jobs)
You'll get a unique ping URL like:
https://ping.cronping.com/abc123xyz
Step 2: Add Monitoring to Your Job
The minimal implementation is a single HTTP GET at the end of a successful run:
# Shell script
#!/bin/bash
/opt/scripts/run-backup.sh && curl -fsS https://ping.cronping.com/abc123xyz
For better coverage, add start and fail signals:
#!/bin/bash
PING="https://ping.cronping.com/abc123xyz"

curl -fsS "${PING}/start"

if /opt/scripts/run-backup.sh; then
  curl -fsS "${PING}"
else
  curl -fsS "${PING}/fail"
  exit 1
fi
In Python:
import requests
import sys

PING_URL = "https://ping.cronping.com/abc123xyz"

def ping(path=""):
    try:
        requests.get(f"{PING_URL}{path}", timeout=10)
    except Exception:
        pass  # Never let monitoring crash the job

ping("/start")
try:
    run_my_job()
    ping()
except Exception as e:
    ping("/fail")
    print(f"Job failed: {e}", file=sys.stderr)
    sys.exit(1)
In Node.js (see our dedicated Node.js guide for more details):
const PING_URL = "https://ping.cronping.com/abc123xyz";

const ping = async (path = "") => fetch(`${PING_URL}${path}`).catch(() => {});

await ping("/start");
try {
  await runMyJob();
  await ping();
} catch (err) {
  await ping("/fail");
  throw err;
}
Step 3: Test It
Trigger the job manually and verify the heartbeat switches to "Up" in the Cronping dashboard. Then simulate a failure and make sure you get alerted.
The most common bug here: the ping URL is wrong (typo in the key) and the job appears to work but nothing is actually being received. Always verify the heartbeat status after the first run.
Step 4: Configure Alerts
Set up alerting to wherever your team actually responds:
- Email — good for non-urgent jobs, easy to miss
- Slack/Discord — good for team visibility
- PagerDuty/OpsGenie — for critical jobs requiring immediate response
- Webhook — for custom integrations (ticketing systems, etc.)
Match the alert channel to the severity of the job. A weekly newsletter job failing doesn't warrant a 3am page. A payment processor failing absolutely does.
What to Monitor: Prioritizing Your Jobs
Not every cron job needs the same level of urgency. Here's a rough framework:
Tier 1: Business Critical (PagerDuty/immediate alert)
- Payment processing, subscription renewals
- Data ingestion pipelines that customers depend on
- Backup jobs (you can't restore what wasn't backed up)
- Fraud detection, compliance reporting
Tier 2: Important (Slack/email, 1-hour response)
- Email delivery workers
- Daily/weekly report generation
- Analytics aggregation
- Cache warming jobs
Tier 3: Operational (Email digest, next business day)
- Log rotation and cleanup
- Archival jobs
- Non-critical data syncs
- Development/staging environment jobs
If you're starting from scratch, focus on Tier 1 first. Three well-monitored critical jobs are worth more than 30 poorly monitored important ones.
Advanced Patterns
Monitoring a Job That Runs Multiple Times
If your job runs every 5 minutes, you'll get a ping every 5 minutes. This is fine — Cronping handles high-frequency schedules well. Each ping resets the timer.
Monitoring Distributed Jobs (Running on Multiple Instances)
If you run the same job across multiple instances (like a K8s deployment without singleton scheduling), every instance will ping. Cronping will receive multiple pings per interval, which is fine — as long as at least one arrives, the heartbeat is considered healthy.
If you need only one instance to run the job, solve that at the scheduler level (leader election, node-cron with Redis locking, etc.). Monitoring is downstream of scheduling, not a replacement for it.
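For completeness, the single-host version of that lock is only a few lines with flock; multi-host setups need the Redis-lock or leader-election approaches mentioned above. A Unix-only sketch — the lock file path is arbitrary:

```python
import fcntl

def try_acquire_lock(path):
    """Return a handle holding an exclusive lock on path, or None if taken.

    Keep the handle open for the job's lifetime; closing it releases the lock.
    """
    handle = open(path, "w")
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return handle
    except OSError:
        handle.close()
        return None
```

If `try_acquire_lock` returns None, another instance is already running this cycle: exit quietly and skip the ping, so only the instance doing the work reports in.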
Long-Running Jobs
For jobs that take hours (data migrations, full database dumps):
- Set a generous grace period (longer than the longest expected runtime)
- Always use the /start signal
- Consider sending periodic /log pings during the job to show progress
# For very long jobs, send progress updates
for batch in process_in_batches(data):
    process_batch(batch)
    # Every 100 batches, send a log ping so Cronping knows we're still alive
    if batch.number % 100 == 0:
        requests.post(f"{PING_URL}/log", data=f"Processed {batch.number} batches", timeout=5)
Exit Code Monitoring
Instead of manually calling /fail, you can pass the exit code directly:
/opt/scripts/my-job.sh
EXIT_CODE=$?
curl -fsS "https://ping.cronping.com/abc123xyz/${EXIT_CODE}"
Exit code 0 = success, anything else = failure. Clean and idiomatic.
The Metrics Worth Tracking
Once your jobs are monitored, you have visibility into:
Uptime — What percentage of expected runs succeeded? 99% uptime on a daily job means roughly 3.6 failed runs per year.
Duration — How long does each run take? A graph over 90 days makes it obvious when a job starts degrading (growing runtime usually means a database query isn't scaling with data growth).
Last ping time — When did it last run? If "last ping: 3 days ago" and it's supposed to run daily, something's wrong.
Flip history — How often does it go up/down/up? Frequent flapping might indicate resource contention or an unreliable dependency.
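The uptime arithmetic above generalizes to any schedule: expected failures per year is just (1 − uptime) × runs per year. A one-liner for back-of-envelope checks:

```python
def expected_failures_per_year(uptime, runs_per_year=365):
    """E.g., 99% uptime on a daily job is about 3.65 failed runs per year."""
    return (1 - uptime) * runs_per_year
```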
Making the Case to Your Team
If you need to convince your team to invest time in this:
- Without monitoring, the time-to-detect for a failed cron job is typically days or weeks
- Most production incidents have a "this should have been caught by automated monitoring" root cause
- Implementation cost: 5–15 minutes per job, one HTTP call
- Tools like Cronping have free plans — the cost barrier is essentially zero
The question isn't "can we afford to monitor cron jobs." It's "what was the last silent failure that cost us, and what would monitoring have done to that timeline."
Summary
- Cron jobs fail silently. This is the default. You have to opt into visibility.
- Log monitoring and health checks don't detect "job didn't run." Dead-man's switch monitoring does.
- The implementation is minimal: an HTTP call at the end of each job.
- Prioritize by business impact. Not every job needs a 3am page.
- The start signal is underrated — use it for jobs longer than a minute.
Cron job monitoring isn't glamorous work. But it's the kind of infrastructure investment that prevents the 3am call that ruins everyone's week. Worth the 15 minutes.