
Why Nobody Knows When Your Cron Job Stops Running

Cron jobs fail silently by design. Here's why the default tooling lies to you, and the one pattern that actually works.

Marco


Picture this: it's a Monday morning. A client emails asking why their weekly report hasn't arrived for the past three weeks. You log in, check the cron logs (or rather, realize there are no cron logs), and spend two hours tracing the problem to a deployment from 23 days ago that silently removed the crontab entry.

The report had been failing silently the entire time. No alert. No email. No error. Nothing.

This isn't a freak accident. It's the default behavior of every cron system ever built.


The Fundamental Problem

Cron was designed to run commands on a schedule. That's it. It was never designed to tell you when those commands fail to run.

The architecture is one-way: cron fires the job, the job runs (or doesn't), and cron moves on. There is no built-in health check, no "it didn't run" alert, no way to know if the last execution was successful beyond whatever the job itself wrote to a log.
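You can verify this yourself by checking cron's own trail in the system log (the path varies by distro; RHEL-family systems use /var/log/cron instead of syslog):

```shell
# Cron logs to syslog on most Debian/Ubuntu systems:
#
#   grep CRON /var/log/syslog
#
# A typical entry records only that a job was *started*:
#
#   Feb  3 02:00:01 host CRON[1234]: (root) CMD (/opt/scripts/run-backup.sh)
#
# Nothing here says whether the job finished, failed, or hung.
```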

This creates a class of failure that's uniquely dangerous: the job that appears healthy because it produces no output at all.


Why Standard "Solutions" Don't Work

MAILTO in crontab

Most cron introductions mention MAILTO:

[email protected]
* * * * * /usr/bin/python3 /scripts/process.py

The theory: cron emails you whatever the job writes to stdout or stderr. The reality:

  1. Mail is tied to output, not exit status. A job that exits non-zero without printing anything generates no email at all.
  2. It requires a working mail daemon on the server (sendmail, Postfix, etc.). Most modern cloud VMs don't ship one.
  3. Even if mail is configured, nobody reads that address. It becomes a graveyard of ignored cron output.
  4. It tells you nothing if the job simply never starts.

Checking logs manually

grep "ERROR" /var/log/myapp/cron.log

This only works if your job writes structured, consistent logs. It doesn't help with:

  • Jobs that hang, producing no output while blocking the next run
  • Jobs that the scheduler never fires
  • Jobs that exit with code 0 after doing nothing useful
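That last bullet deserves special care: any monitoring built on success signals only helps if the job is honest about success. A minimal sketch of validating the output before reporting it, assuming a job that produces a file (the helper name and the empty-file check are illustrative, not any particular API):

```shell
#!/bin/sh
# Guard against "exited 0 after doing nothing useful": check the
# artifact before reporting success.

check_artifact() {
    # Succeed only if the file exists and is non-empty (-s).
    [ -s "$1" ]
}

OUT=$(mktemp)        # stand-in for the real output file
: > "$OUT"           # the "work" ran, but produced an empty file

if check_artifact "$OUT"; then
    STATUS=ok
else
    STATUS=failed    # in a real job: exit 1 here, before any success signal
fi
echo "$STATUS"
```

In a real script, the `failed` branch exits non-zero so the failure propagates to whatever is watching the exit code.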

Log aggregation and alerting

Setting up Datadog or CloudWatch to alert on ERROR patterns in logs is closer to right, but still reactive. And it still misses the "job never ran" case entirely.


The Failure Mode Nobody Talks About: The Missing Execution

There are two types of cron job failures:

Type              | What happens                                 | Caught by logs?
Crash             | Job starts, throws exception, exits non-zero | Sometimes
Missing execution | Job never runs at all                        | Never

Missing executions happen when:

  • The server rebooted and cron wasn't restarted
  • A deployment wiped the crontab
  • The container cluster scaled down the node running the scheduler
  • A lock file from a previous hung job is blocking the new run
  • The cloud scheduler (EventBridge, Vercel Cron, etc.) silently stopped triggering

None of these produce a log entry. There's nothing to alert on. The silence is the signal. But only if you're listening for it.
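One of these causes, the stale lock file, at least has a well-known mitigation: flock(1) from util-linux holds a kernel lock that is released automatically when the process dies, so a hung or killed run can never leave a stale lock behind. A sketch of the crontab entry (paths are illustrative):

```shell
# flock's lock dies with the process, unlike a hand-rolled lock file:
#
#   0 2 * * * flock -n /var/lock/backup.lock /opt/scripts/run-backup.sh
#
# -n fails immediately instead of queueing behind a stuck run.
```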


The Pattern That Actually Works: Dead Man's Switch

A dead man's switch inverts the monitoring model.

Instead of watching for an error event, you watch for the absence of a success signal. If the signal doesn't arrive within the expected window, you get alerted.

The implementation is a single curl at the end of your job:

# Original job
0 2 * * * /opt/scripts/run-backup.sh

# With dead man's switch monitoring
0 2 * * * /opt/scripts/run-backup.sh && curl -fsS https://ping.cronping.com/abc123xyz

The && ensures the ping only fires on success. If the script fails, or never runs, Cronping notices the missing ping and sends you an alert.

That's the entire pattern. The rest is just making it robust.
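One small hardening worth applying to the one-liner right away: give curl a timeout and a couple of retries, so a slow or unreachable monitoring endpoint can never hang the cron slot. These are standard curl flags; the URL is the same placeholder as above:

```shell
# -m caps total request time; --retry retries transient network failures.
#
#   0 2 * * * /opt/scripts/run-backup.sh && curl -fsS -m 10 --retry 3 https://ping.cronping.com/abc123xyz
```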


Making It Robust

Handle failures explicitly

#!/bin/bash
PING_URL="https://ping.cronping.com/abc123xyz"

curl -fsS "${PING_URL}/start" > /dev/null  # signal job started

# Your actual work
/opt/scripts/run-backup.sh
EXIT_CODE=$?

if [ $EXIT_CODE -eq 0 ]; then
  curl -fsS "${PING_URL}" > /dev/null        # success
else
  curl -fsS "${PING_URL}/fail" > /dev/null   # explicit failure
  exit $EXIT_CODE
fi

The /start ping lets Cronping track duration. The /fail endpoint triggers an immediate alert without waiting for the grace period.

Python example

import requests
import sys

PING_URL = "https://ping.cronping.com/abc123xyz"

def main():
    try:
        requests.get(f"{PING_URL}/start", timeout=5)
    except Exception:
        pass  # monitoring failure should never block the job

    try:
        run_your_job()
        requests.get(PING_URL, timeout=5)
    except Exception as e:
        try:
            requests.get(f"{PING_URL}/fail", params={"msg": str(e)}, timeout=5)
        except Exception:
            pass
        raise

def run_your_job():
    # your actual logic here
    pass

if __name__ == "__main__":
    main()

Node.js example

const PING_URL = "https://ping.cronping.com/abc123xyz";

async function pingCronping(path = "", params = {}) {
  try {
    const url = new URL(`${PING_URL}${path}`);
    for (const [k, v] of Object.entries(params)) url.searchParams.set(k, v);
    await fetch(url.toString(), { signal: AbortSignal.timeout(5000) });
  } catch {
    // never let monitoring failures affect the job
  }
}

async function main() {
  await pingCronping("/start");

  try {
    await runYourJob();
    await pingCronping();
  } catch (err) {
    await pingCronping("/fail", { msg: err.message });
    process.exit(1);
  }
}

main();

What Grace Period to Set

The grace period is how long Cronping waits after the expected execution time before alerting. Set it too tight and you'll get false alarms. Too loose and slow failures slip through.

Rule of thumb: set the grace period to 20–30% of the expected job duration, with a floor of a few minutes to absorb scheduler jitter, and keep it shorter than the schedule interval for frequent jobs.

Job schedule    | Typical duration | Suggested grace period
Every 5 minutes | < 1 min          | 3 minutes
Hourly          | 5–10 min         | 15 minutes
Daily at 2am    | 30 min           | 10 minutes
Weekly          | 1–2 hours        | 30 minutes

The Checklist

Before you close this tab, audit your most critical cron jobs against this:

  • Does it send a ping on success?
  • Does it send /fail on error (not just rely on the timeout)?
  • Is the grace period set appropriately?
  • Are alerts going to a channel someone actually reads (Slack, PagerDuty)?
  • Is the job listed somewhere so you know it exists?

The last point is underrated. Cron jobs accumulate over years. A shared inventory (even a simple Notion table) of "what jobs run where, and what ping key they use" is worth maintaining.


Getting Started

Cronping gives you a ping URL in under a minute:

  1. Sign up at cronping.com
  2. Create a new heartbeat monitor
  3. Set the schedule (using a cron expression or a human-readable interval)
  4. Add the curl call to your job
  5. Set your alert channels (Slack, Email, Discord, PagerDuty, webhook)

The free plan covers 5 monitors, which is enough to protect your most critical jobs today.

The weekly report that silently stopped arriving? That's a solved problem.