# Why Nobody Knows When Your Cron Job Stops Running
Picture this: it's a Monday morning. A client emails asking why their weekly report hasn't arrived for the past three weeks. You log in, check the cron logs (or rather, realize there are no cron logs), and spend two hours tracing the problem to a deployment from 23 days ago that silently removed the crontab entry.
The report had been failing silently the entire time. No alert. No email. No error. Nothing.
This isn't a freak accident. It's the default behavior of every cron system ever built.
## The Fundamental Problem
Cron was designed to run commands on a schedule. That's it. It was never designed to tell you when those commands fail to run.
The architecture is one-way: cron fires the job, the job runs (or doesn't), and cron moves on. There is no built-in health check, no "it didn't run" alert, no way to know if the last execution was successful beyond whatever the job itself wrote to a log.
This creates a class of failure that's uniquely dangerous: the job that appears healthy because it produces no output at all.
## Why Standard "Solutions" Don't Work
### `MAILTO` in crontab

Most cron introductions mention `MAILTO`:

```
[email protected]
* * * * * /usr/bin/python3 /scripts/process.py
```
The theory: cron emails you stdout/stderr on failure. The reality:

- It only fires if the job produces output. A job that exits with code 1 without printing anything generates no email.
- It requires a working mail daemon on the server (sendmail, Postfix, etc.). Most modern cloud VMs don't have one.
- Even if mail is configured, nobody reads that address. It becomes a graveyard of ignored cron output.
- It tells you nothing if the job simply never starts.
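The first point is easy to verify for yourself. Cron only mails output, so a job that fails without writing anything gives `MAILTO` nothing to send, while a job that writes to stderr at least leaves a trace. A minimal sketch (plain shell, no cron required):

```shell
# A job that fails with no output: MAILTO has nothing to send.
sh -c 'exit 1'
echo "silent failure, exit code: $?"

# A job that fails loudly: cron would mail this stderr line instead.
sh -c 'echo "backup: disk full" >&2; exit 1' 2>&1 || true
```

The silent variant is the dangerous one: a non-zero exit code alone never reaches your inbox.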
### Checking logs manually

```shell
grep "ERROR" /var/log/myapp/cron.log
```
This only works if your job writes structured, consistent logs. It doesn't help with:
- Jobs that hang, producing no output while blocking the next run
- Jobs that the scheduler never fires
- Jobs that exit with code 0 after doing nothing useful
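The last case is the sneakiest. Here is a sketch of a hypothetical job that exits cleanly while accomplishing nothing; there is no `ERROR` line for any grep or log alert to match:

```shell
# Hypothetical job: "process" an input that turns out to be empty.
records=$(grep -c . /dev/null || true)   # counts matching lines: 0
echo "processed ${records} records"
exit 0                                   # exit 0: every dashboard shows green
```

From cron's perspective, and from any log-based alert's, this run was a success.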
### Log aggregation and alerting
Setting up Datadog or CloudWatch to alert on ERROR patterns in logs is closer to right, but still reactive. And it still misses the "job never ran" case entirely.
## The Failure Mode Nobody Talks About: The Missing Execution
There are two types of cron job failures:
| Type | What happens | Caught by logs? |
|---|---|---|
| Crash | Job starts, throws exception, exits non-zero | Sometimes |
| Missing execution | Job never runs at all | Never |
Missing executions happen when:
- The server rebooted and cron wasn't restarted
- A deployment wiped the crontab
- The container cluster scaled down the node running the scheduler
- A lock file from a previous hung job is blocking the new run
- The cloud scheduler (EventBridge, Vercel Cron, etc.) silently stopped triggering
None of these produce a log entry. There's nothing to alert on. The silence is the signal. But only if you're listening for it.
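The stale-lock case is easy to reproduce locally. With a non-blocking `flock`, a held lock makes the new run exit immediately, with no error and nothing logged. A sketch (lock path hypothetical; `echo` stands in for the real work):

```shell
LOCK=/tmp/nightly-job.lock   # hypothetical lock file

# Simulate a hung previous run that still holds the lock.
( exec 9>"$LOCK"; flock 9; sleep 2 ) &
sleep 0.5

# The "new" run: flock -n fails, and the job silently never happens.
exec 8>"$LOCK"
if flock -n 8; then
    echo "job ran"
else
    echo "skipped: previous run still holds the lock"
fi
```

If the previous run hung rather than slept, that "skipped" branch repeats forever, and nothing ever hits a log.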
## The Pattern That Actually Works: Dead Man's Switch
A dead man's switch inverts the monitoring model.
Instead of watching for an error event, you watch for the absence of a success signal. If the signal doesn't arrive within the expected window, you get alerted.
The implementation is a single `curl` at the end of your job:

```shell
# Original job
0 2 * * * /opt/scripts/run-backup.sh

# With dead man's switch monitoring
0 2 * * * /opt/scripts/run-backup.sh && curl -fsS https://ping.cronping.com/abc123xyz
```
The `&&` ensures the ping only fires on success. If the script fails, or never runs at all, Cronping notices the missing ping and sends you an alert.
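The choice of `&&` over `;` matters. A quick sketch of the difference, with `echo` standing in for the real `curl` ping:

```shell
# With ';' the ping fires even when the job fails, hiding the failure:
sh -c 'exit 1' ; echo "ping sent anyway (failure masked)"

# With '&&' a failed job skips the ping, and the missing
# heartbeat is what raises the alert:
sh -c 'exit 1' && echo "ping sent" || echo "no ping sent"
```

Chaining with `;` quietly reintroduces the original problem: the monitor reports success no matter what the job did.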
That's the entire pattern. The rest is just making it robust.
## Making It Robust

### Handle failures explicitly
```shell
#!/bin/bash
PING_URL="https://ping.cronping.com/abc123xyz"

curl -fsS "${PING_URL}/start" > /dev/null  # signal job started

# Your actual work
/opt/scripts/run-backup.sh
EXIT_CODE=$?

if [ $EXIT_CODE -eq 0 ]; then
    curl -fsS "${PING_URL}" > /dev/null        # success
else
    curl -fsS "${PING_URL}/fail" > /dev/null   # explicit failure
    exit $EXIT_CODE
fi
```
The `/start` ping lets Cronping track duration. The `/fail` endpoint triggers an immediate alert without waiting for the grace period.
### Python example
```python
import requests

PING_URL = "https://ping.cronping.com/abc123xyz"

def main():
    try:
        requests.get(f"{PING_URL}/start", timeout=5)
    except Exception:
        pass  # monitoring failure should never block the job

    try:
        run_your_job()
        requests.get(PING_URL, timeout=5)
    except Exception as e:
        try:
            requests.get(f"{PING_URL}/fail", params={"msg": str(e)}, timeout=5)
        except Exception:
            pass
        raise

def run_your_job():
    # your actual logic here
    pass

if __name__ == "__main__":
    main()
```
### Node.js example
```javascript
const PING_URL = "https://ping.cronping.com/abc123xyz";

async function pingCronping(path = "", params = {}) {
  try {
    const url = new URL(`${PING_URL}${path}`);
    for (const [k, v] of Object.entries(params)) url.searchParams.set(k, v);
    await fetch(url.toString(), { signal: AbortSignal.timeout(5000) });
  } catch {
    // never let monitoring failures affect the job
  }
}

async function runYourJob() {
  // your actual logic here
}

async function main() {
  await pingCronping("/start");
  try {
    await runYourJob();
    await pingCronping();
  } catch (err) {
    await pingCronping("/fail", { msg: err.message });
    process.exit(1);
  }
}

main();
```
## What Grace Period to Set
The grace period is how long Cronping waits after the expected execution time before alerting. Set it too tight and you'll get false alarms; too loose, and slow failures slip through.

Rule of thumb: set the grace period to 20–30% of the expected job duration, with a floor of a few minutes. For jobs that run every few minutes, keep the grace period shorter than the schedule interval itself, or a missed run won't be flagged until the next one has already fired.
| Job schedule | Typical duration | Suggested grace period |
|---|---|---|
| Every 5 minutes | < 1 min | 3 minutes |
| Hourly | 5–10 min | 15 minutes |
| Daily at 2am | 30 min | 10 minutes |
| Weekly | 1–2 hours | 30 minutes |
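The rule of thumb reduces to a few lines of arithmetic. A hypothetical helper (the 30% factor and 5-minute floor come from the guidance above, not from any Cronping API):

```shell
# Suggest a grace period in minutes from the expected job duration.
suggest_grace() {
    local duration_min=$1
    local grace=$(( duration_min * 30 / 100 ))   # 30% slack
    [ "$grace" -lt 5 ] && grace=5                # floor for infrequent jobs
    echo "$grace"
}

suggest_grace 30    # daily 30-minute backup  -> 9
suggest_grace 120   # weekly 2-hour job      -> 36
```

For very frequent jobs, skip the floor and cap the result below the schedule interval instead.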
## The Checklist
Before you close this tab, audit your most critical cron jobs against this:
- Does it send a ping on success?
- Does it send `/fail` on error (not just rely on the timeout)?
- Is the grace period set appropriately?
- Are alerts going to a channel someone actually reads (Slack, PagerDuty)?
- Is the job listed somewhere so you know it exists?
The last point is underrated. Cron jobs accumulate over years. A shared inventory (even a simple Notion table) of "what jobs run where, and what ping key they use" is worth maintaining.
## Getting Started
Cronping gives you a ping URL in under a minute:
- Sign up at cronping.com
- Create a new heartbeat monitor
- Set the schedule (using cron expression or human-readable interval)
- Add the `curl` call to your job
- Set your alert channels (Slack, Email, Discord, PagerDuty, webhook)
The free plan covers 5 monitors, which is enough to protect your most critical jobs today.
The backup that silently stopped running? That's a solved problem.