Cron Job Monitoring: The Complete Guide
Every software team has a story. Maybe it's the ETL pipeline that hadn't run in two weeks before a client meeting exposed the stale data. Or the backup job that silently failed for three months until somebody actually tried to restore. Or the cleanup job that crashed, filled up the disk, and took down production at 3am.
All of these stories share a common thread: nobody knew the job had stopped running.
This guide is about changing that. We'll cover why cron jobs fail, what kinds of monitoring exist, how dead-man's switch monitoring works, and how to set it up for your stack.
What Is a Cron Job?
A cron job is any task that runs automatically on a defined schedule. The name comes from the Unix cron daemon, but today "cron job" is used loosely to mean any scheduled, automated task:
- A crontab entry on a Linux server
- A CronJob resource in Kubernetes
- A Lambda function triggered by EventBridge
- A Vercel cron route called by their scheduler
- A Bull/BullMQ repeatable job
- A GitHub Actions scheduled workflow
- An @Scheduled method in a Spring Boot service
The implementation varies, but the fundamental problem is the same: you need it to run, on time, every time, and you need to know when it doesn't.
Why Cron Jobs Are Risky
Cron jobs are often written quickly, tested minimally, and forgotten immediately. That's partly because they "just work" for a long time — until they don't.
Common Failure Modes
1. The job never starts
The scheduler died, the container restarted, the server ran out of memory, or the deployment accidentally removed the cron definition. The job simply doesn't run, and nothing tells you.
2. The job crashes early
An exception on line 3 before any meaningful work is done. The process exits with code 1. Logs exist, but nobody's watching them.
3. The job hangs
A database query locks up. An external HTTP call never times out. The job is technically "running" but doing nothing, and it'll block the next scheduled run too.
4. The job succeeds but does nothing
No exception, exits 0, but a missing WHERE clause means the query returned 0 rows and nothing was processed. Looks healthy in metrics, isn't.
5. The job runs, but too slowly
A job that normally takes 5 minutes suddenly takes 4 hours. Not technically failed, but something's wrong (slow query, bigger dataset, external degradation).
6. The job runs on the wrong schedule
Server timezone changed after a migration. Daylight saving time caused a double-run or a skip. The schedule expression was wrong from the beginning but nobody noticed.
7. The job runs, but too many times
Horizontal scaling without distributed locking. Two instances of your app, both running the same scheduler. Now that "once per day" job runs twice.
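Some of these modes can be blunted in the job itself. For the hang case (#3), a hard timeout around the job body guarantees the process gives up instead of silently blocking the next scheduled run. A minimal sketch in Python — `run_with_timeout` and the lambda payloads are illustrative, not a prescribed API:

```python
import concurrent.futures

def run_with_timeout(fn, seconds):
    """Run fn in a worker thread; raise TimeoutError if it exceeds `seconds`."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn)
    try:
        return future.result(timeout=seconds)
    finally:
        # Don't block waiting on a hung worker thread.
        executor.shutdown(wait=False)
```

Note the caveat: on timeout the worker thread may still be stuck, so the right follow-up is to alert and exit hard (`os._exit`), not to retry in the same process.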
Types of Cron Job Monitoring
There are fundamentally three approaches, and they complement each other.
1. Log-Based Monitoring
You watch the output of the job — stdout, stderr, log files, or structured logs in a system like Datadog, Loki, or CloudWatch.
What it catches: Exceptions, errors, slow runs, unexpected output.
What it misses: The job not running at all. If the job doesn't run, there's nothing to log. Log-based monitoring has no awareness of expected runs that didn't happen.
Best for: Debugging failures you already know happened.
2. Active Heartbeat Monitoring
You periodically check if something is running. A monitoring service pings your app or checks a health endpoint on a schedule.
What it catches: Is the service alive right now?
What it misses: A job that ran at 2am but hasn't checked in since. The service might be "up" while 10 cron jobs are silently broken.
Best for: API uptime and service health, not scheduled jobs.
3. Dead-Man's Switch Monitoring (Passive/Push)
The job itself sends a "check-in" signal after each successful run. A monitoring service watches for these signals. If one doesn't arrive when expected, an alert fires.
This is a dead-man's switch — a device that requires active operation to prevent it from triggering. If you stop operating (i.e., the job stops checking in), the switch fires.
What it catches: Any failure mode that prevents the job from completing — including "job never started", "job crashed", "job hung", and "job ran on wrong schedule."
What it misses: Not much. It's the most complete form of cron job monitoring.
Best for: Everything. This should be your default.
How Dead-Man's Switch Monitoring Works
The mechanics are simple:
- You register a heartbeat in the monitoring service with your cron expression
- Your job sends an HTTP request (a "ping") when it successfully completes
- The monitoring service calculates when the next ping is due based on the cron expression + a grace period
- If no ping arrives by the deadline, an alert is sent
Heartbeat config:
schedule: "0 2 * * *" # 2am daily
grace: 10 minutes # Allow 10 min variance
Timeline:
02:00 → Job starts
02:03 → Job completes → sends ping
02:03 → Cronping receives ping ✅
Next day:
02:00 → Job should run...
02:10 → No ping received → ALERT ❌
The job doesn't need constant supervision. It just needs to report in once per schedule cycle.
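The monitor-side deadline check is tiny. For a simple interval schedule it reduces to a comparison (a cron expression additionally needs a parser to compute the next expected run — this sketch assumes a fixed interval):

```python
from datetime import datetime, timedelta

def is_overdue(last_ping, interval, grace, now):
    """True if no ping arrived by last_ping + interval + grace."""
    deadline = last_ping + interval + grace
    return now > deadline
```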
The Start Signal
For jobs that run longer than a few seconds, you should send a /start ping at the beginning of the job, not just a success ping at the end.
Why? Because without a start signal, Cronping doesn't know if the job is running slowly or if it never started. With a start signal, it knows: "the job started at 2:03am, I'll wait for it to finish."
Without /start:
02:00 → Expected start
02:10 → Grace period expires → ALERT
02:47 → Job finishes (it was just slow) → Ping received (too late, alert already sent)
With /start:
02:00 → Expected start
02:03 → /start received → grace clock starts fresh
02:47 → Ping received ✅ (no false alert)
For jobs under 30 seconds, start signals are optional. For anything longer, they're worth it.
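One way to get the start/success/fail pattern on every job without repeating the boilerplate is a small context manager. A sketch using only the standard library — the URL is the example key from this guide, and the injectable `pinger` argument exists purely so the wrapper is testable:

```python
import contextlib
import urllib.request

PING_URL = "https://ping.cronping.com/abc123xyz"

def ping(path=""):
    """Best-effort ping; monitoring must never crash the job."""
    try:
        urllib.request.urlopen(f"{PING_URL}{path}", timeout=10)
    except Exception:
        pass

@contextlib.contextmanager
def monitored(pinger=ping):
    """Send /start on entry, a success ping on clean exit, /fail on exception."""
    pinger("/start")
    try:
        yield
        pinger("")
    except Exception:
        pinger("/fail")
        raise
```

Then any job body becomes `with monitored(): run_my_job()`, and the exception still propagates so the exit code stays non-zero.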
Setting Up Monitoring: Step by Step
Step 1: Create a Heartbeat
In Cronping, create a new heartbeat with:
- Name: Something descriptive (db-backup, invoice-processor, cleanup-old-logs)
- Schedule type: Cron expression or simple interval
- Schedule: e.g., 0 2 * * *
- Grace period: How much leeway to allow (5–30 minutes for most jobs)
You'll get a unique ping URL like:
https://ping.cronping.com/abc123xyz
Step 2: Add Monitoring to Your Job
The minimal implementation is a single HTTP GET at the end of a successful run:
# Shell script
#!/bin/bash
/opt/scripts/run-backup.sh && curl -fsS https://ping.cronping.com/abc123xyz
For better coverage, add start and fail signals:
#!/bin/bash
PING="https://ping.cronping.com/abc123xyz"

curl -fsS "${PING}/start"

if /opt/scripts/run-backup.sh; then
  curl -fsS "${PING}"
else
  curl -fsS "${PING}/fail"
  exit 1
fi
In Python:
import requests
import sys

PING_URL = "https://ping.cronping.com/abc123xyz"

def ping(path=""):
    try:
        requests.get(f"{PING_URL}{path}", timeout=10)
    except Exception:
        pass  # Never let monitoring crash the job

ping("/start")
try:
    run_my_job()
    ping()
except Exception as e:
    ping("/fail")
    print(f"Job failed: {e}", file=sys.stderr)
    sys.exit(1)
In Node.js (see our dedicated Node.js guide for more details):
const PING_URL = "https://ping.cronping.com/abc123xyz";

const ping = async (path = "") => fetch(`${PING_URL}${path}`).catch(() => {});

await ping("/start");
try {
  await runMyJob();
  await ping();
} catch (err) {
  await ping("/fail");
  throw err;
}
Step 3: Test It
Trigger the job manually and verify the heartbeat switches to "Up" in the Cronping dashboard. Then simulate a failure and make sure you get alerted.
The most common bug here: the ping URL is wrong (typo in the key) and the job appears to work but nothing is actually being received. Always verify the heartbeat status after the first run.
Step 4: Configure Alerts
Set up alerting to wherever your team actually responds:
- Email — good for non-urgent jobs, easy to miss
- Slack/Discord — good for team visibility
- PagerDuty/OpsGenie — for critical jobs requiring immediate response
- Webhook — for custom integrations (ticketing systems, etc.)
Match the alert channel to the severity of the job. A weekly newsletter job failing doesn't warrant a 3am page. A payment processor failing absolutely does.
What to Monitor: Prioritizing Your Jobs
Not every cron job needs the same level of urgency. Here's a rough framework:
Tier 1: Business Critical (PagerDuty/immediate alert)
- Payment processing, subscription renewals
- Data ingestion pipelines that customers depend on
- Backup jobs (you can't restore what wasn't backed up)
- Fraud detection, compliance reporting
Tier 2: Important (Slack/email, 1-hour response)
- Email delivery workers
- Daily/weekly report generation
- Analytics aggregation
- Cache warming jobs
Tier 3: Operational (Email digest, next business day)
- Log rotation and cleanup
- Archival jobs
- Non-critical data syncs
- Development/staging environment jobs
If you're starting from scratch, focus on Tier 1 first. Three well-monitored critical jobs are worth more than 30 poorly monitored important ones.
Advanced Patterns
Monitoring a Job That Runs Multiple Times
If your job runs every 5 minutes, you'll get a ping every 5 minutes. This is fine — Cronping handles high-frequency schedules well. Each ping resets the timer.
Monitoring Distributed Jobs (Running on Multiple Instances)
If you run the same job across multiple instances (like a K8s deployment without singleton scheduling), every instance will ping. Cronping will receive multiple pings per interval, which is fine — as long as at least one arrives, the heartbeat is considered healthy.
If you need only one instance to run the job, solve that at the scheduler level (leader election, node-cron with Redis locking, etc.). Monitoring is downstream of scheduling, not a replacement for it.
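For completeness, the single-host version of that lock is only a few lines with flock; multi-host setups need the Redis-lock or leader-election approaches mentioned above. A Unix-only sketch — the lock file path is arbitrary:

```python
import fcntl

def try_acquire_lock(path):
    """Return a handle holding an exclusive lock on path, or None if taken.

    Keep the handle open for the job's lifetime; closing it releases the lock.
    """
    handle = open(path, "w")
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return handle
    except OSError:
        handle.close()
        return None
```

If `try_acquire_lock` returns None, another instance is already running this cycle: exit quietly and skip the ping, so only the instance doing the work reports in.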
Long-Running Jobs
For jobs that take hours (data migrations, full database dumps):
- Set a generous grace period (longer than the longest expected runtime)
- Always use the /start signal
- Consider sending periodic /log pings during the job to show progress
# For very long jobs, send progress updates
for batch in process_in_batches(data):
    process_batch(batch)
    # Every 100 batches, send a log ping so Cronping knows we're still alive
    if batch.number % 100 == 0:
        requests.post(f"{PING_URL}/log", data=f"Processed {batch.number} batches", timeout=5)
Exit Code Monitoring
Instead of manually calling /fail, you can pass the exit code directly:
/opt/scripts/my-job.sh
EXIT_CODE=$?
curl -fsS "https://ping.cronping.com/abc123xyz/${EXIT_CODE}"
Exit code 0 = success, anything else = failure. Clean and idiomatic.
The Metrics Worth Tracking
Once your jobs are monitored, you have visibility into:
Uptime — What percentage of expected runs succeeded? 99% uptime on a daily job means roughly 3.6 failed runs per year.
Duration — How long does each run take? A graph over 90 days makes it obvious when a job starts degrading (growing runtime usually means a database query isn't scaling with data growth).
Last ping time — When did it last run? If "last ping: 3 days ago" and it's supposed to run daily, something's wrong.
Flip history — How often does it go up/down/up? Frequent flapping might indicate resource contention or an unreliable dependency.
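The uptime arithmetic above generalizes to any schedule: expected failures per year is just (1 − uptime) × runs per year. A one-liner for back-of-envelope checks:

```python
def expected_failures_per_year(uptime, runs_per_year=365):
    """E.g., 99% uptime on a daily job is about 3.65 failed runs per year."""
    return (1 - uptime) * runs_per_year
```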
Making the Case to Your Team
If you need to convince your team to invest time in this:
- Without monitoring, the time-to-detect for a failed cron job is typically days or weeks
- Most production incidents have a "this should have been caught by automated monitoring" root cause
- Implementation cost: 5–15 minutes per job, one HTTP call
- Tools like Cronping have free plans — the cost barrier is essentially zero
The question isn't "can we afford to monitor cron jobs." It's "what was the last silent failure that cost us, and what would monitoring have done to that timeline."
Summary
- Cron jobs fail silently. This is the default. You have to opt into visibility.
- Log monitoring and health checks don't detect "job didn't run." Dead-man's switch monitoring does.
- The implementation is minimal: an HTTP call at the end of each job.
- Prioritize by business impact. Not every job needs a 3am page.
- The start signal is underrated — use it for jobs longer than a minute.
Cron job monitoring isn't glamorous work. But it's the kind of infrastructure investment that prevents the 3am call that ruins everyone's week. Worth the 15 minutes.