Health Monitor

Distributed uptime monitor for our public API servers. Any number of servers can run the daemon; each independently polls the endpoints and alerts on failures.

Repo: https://github.com/LarryAnglin/health-monitor (private)
Language: Python 3 daemon, managed by systemd
Monitored endpoints: Node API (https://api.theaccessible.org/health), Lambda API (https://api-pdf.theaccessible.org/health)
Alert channels: Email (Resend) + Telegram
Log sink: Supabase healthcheck_events table in the accessible-pdf-converter project

Architecture

  ┌──────────────┐       ┌──────────────┐        ┌─────────────────┐
  │  Server N    │──┐    │  Cloudflare  │        │   Supabase      │
  │  (daemon)    │  │    │  R2 bucket:  │        │   edge function │
  └──────────────┘  │    │  accessible- │        │  healthcheck-   │
  ┌──────────────┐  │    │  status      │        │  log            │
  │  Server 2    │──┼───►│  config.json │        │                 │
  └──────────────┘  │    └──────────────┘        │  ↓ inserts into │
  ┌──────────────┐  │                            │  healthcheck_   │
  │  Server 1    │──┘    ┌──────────────┐        │  events table   │
  │              │──────►│ Telegram +   │        └─────────────────┘
  │              │──────►│ Resend email │                ▲
  └──────┬───────┘       └──────────────┘                │
         │                                               │
         └───────────────────────────────────────────────┘
                          POST events every cycle

Each daemon:

On startup, fetches config.json from R2. Caches it locally at /var/lib/healthcheck/config.json so transient R2 errors don’t take it down.
Every 60 seconds (configurable), hits every endpoint with a 10s timeout and verifies both HTTP status and a JSON field (e.g. status == "ok").
Writes every result to the Supabase edge function (one row per endpoint per cycle per server) for trend analysis.
Tracks consecutive failures per endpoint in /var/lib/healthcheck/state.json.
After 3 consecutive failures, sends one email + one Telegram message. Re-alerts every 60 minutes while still down. Sends a recovery notification on the first success after an alert.
Re-fetches config from R2 every hour so changes propagate across the fleet without a redeploy.
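
The alerting rules above (alert after N consecutive failures, re-alert on an interval, notify once on recovery) can be sketched as a small state machine. This is a minimal illustration, not the daemon's actual code: the names EndpointState and decide are hypothetical, and only the field names consecutive_failures and last_alert_at are taken from state.json as described in this document.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EndpointState:
    # Field names mirror state.json; the surrounding structure is an assumption.
    consecutive_failures: int = 0
    last_alert_at: Optional[float] = None  # epoch seconds; None = no active alert


def decide(state: EndpointState, ok: bool, now: float, threshold: int = 3,
           realert_minutes: int = 60, notify_on_recovery: bool = True) -> Optional[str]:
    """Update state in place and return 'alert', 'recovery', or None."""
    if ok:
        was_alerting = state.last_alert_at is not None
        state.consecutive_failures = 0
        state.last_alert_at = None
        # Recovery is only announced if an alert had actually been sent.
        return "recovery" if (was_alerting and notify_on_recovery) else None

    state.consecutive_failures += 1
    if state.consecutive_failures < threshold:
        return None
    # First alert, or the re-alert window has elapsed since the last one.
    if state.last_alert_at is None or now - state.last_alert_at >= realert_minutes * 60:
        state.last_alert_at = now
        return "alert"
    return None
```

Passing the clock in as now keeps the function pure, which makes the re-alert window easy to unit-test without sleeping.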

Configuration

Single JSON file in R2: s3://accessible-status/config.json.

{
  "checkIntervalSeconds": 60,
  "configRefreshSeconds": 3600,
  "timeoutSeconds": 10,
  "alerting": {
    "consecutiveFailuresBeforeAlert": 3,
    "reAlertIntervalMinutes": 60,
    "notifyOnRecovery": true,
    "email": { "enabled": true, "from": "noreply@tagzen.ai", "to": ["larry@anglin.com"] },
    "telegram": { "enabled": true }
  },
  "endpoints": [
    {
      "name": "node-api",
      "url": "https://api.theaccessible.org/health",
      "expect": { "httpStatus": 200, "jsonPath": "status", "equals": "ok" }
    }
  ]
}
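
For illustration, here is one way the expect block might be evaluated against a response. It assumes jsonPath is a dot-separated path into a JSON object (the example config only shows a single top-level key, so nested paths are a guess), and the function names lookup and meets_expectation are hypothetical.

```python
from typing import Any

_MISSING = object()  # sentinel so a missing key never equals a configured value


def lookup(doc: Any, path: str) -> Any:
    """Follow a dot-separated path through nested dicts; _MISSING if absent."""
    current = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return _MISSING
        current = current[key]
    return current


def meets_expectation(http_status: int, body: Any, expect: dict) -> bool:
    """Check the HTTP status first, then the optional JSON field."""
    if http_status != expect.get("httpStatus", 200):
        return False
    if "jsonPath" in expect:
        value = lookup(body, expect["jsonPath"])
        return value is not _MISSING and value == expect.get("equals")
    return True
```

With the node-api entry above, a 200 response whose body is {"status": "ok"} passes, while a 200 with {"status": "degraded"} or a non-JSON body counts as a failure.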

To update the config fleet-wide:

aws s3 --endpoint-url https://c6cce84d1636ec85ec946a19edef0103.r2.cloudflarestorage.com \
  cp config.json s3://accessible-status/config.json

Every server picks up the change within an hour (the configRefreshSeconds interval). To force an immediate refresh on a specific server, run sudo systemctl restart healthcheck.
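
The refresh-with-local-cache behaviour can be sketched as follows. This is an assumption-laden illustration: fetch_remote is a hypothetical callable standing in for whatever performs the R2 download, and whether the real daemon falls back to the cache at startup or only during the hourly refresh is not specified in this document.

```python
import json
from pathlib import Path
from typing import Callable


def load_config(fetch_remote: Callable[[], str], cache_path: Path) -> dict:
    """Return the remote config, falling back to the cached copy on failure."""
    try:
        raw = fetch_remote()
        config = json.loads(raw)  # reject unparseable uploads before caching
    except Exception:
        # Remote fetch or parse failed: reuse the last good copy.
        # Raises FileNotFoundError if nothing has ever been cached.
        return json.loads(cache_path.read_text())
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(raw)
    return config
```

Parsing before writing means a malformed config.json uploaded to R2 never overwrites the last good local copy.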

Secrets

The env file lives at /etc/healthcheck/healthcheck.env on each server, owned root:healthcheck with mode 0640. All servers share the same values; the only per-server value is the hostname, which the daemon auto-detects from the system hostname.

Variable	Purpose
`R2_ACCESS_KEY_ID` / `R2_SECRET_ACCESS_KEY`	Read-only R2 token (`healthcheck-config-read`) scoped to `accessible-status`
`R2_ENDPOINT_URL` / `R2_BUCKET` / `R2_CONFIG_KEY`	R2 location (`config.json` at bucket root)
`RESEND_API_KEY`	Resend key for email alerts
`TELEGRAM_BOT_TOKEN` / `TELEGRAM_CHAT_ID`	Telegram bot + destination chat
`SUPABASE_LOG_URL`	`https://vuvwmfxssjosfphzpzim.supabase.co/functions/v1/healthcheck-log`
`SUPABASE_LOG_TOKEN`	Shared bearer token (`HEALTHCHECK_LOG_TOKEN` secret in the edge function)

See docs/admin/secrets-inventory-and-rotation.md for rotation procedures.
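
A minimal sketch of how the daemon might validate these variables at startup. It assumes every variable in the table is required and reuses the "missing required env var" message from the Troubleshooting table; load_settings and REQUIRED_VARS are hypothetical names.

```python
import os
from typing import Mapping

# Variable names are taken from the table above; treating all of them as
# mandatory is an assumption.
REQUIRED_VARS = (
    "R2_ACCESS_KEY_ID", "R2_SECRET_ACCESS_KEY",
    "R2_ENDPOINT_URL", "R2_BUCKET", "R2_CONFIG_KEY",
    "RESEND_API_KEY",
    "TELEGRAM_BOT_TOKEN", "TELEGRAM_CHAT_ID",
    "SUPABASE_LOG_URL", "SUPABASE_LOG_TOKEN",
)


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Return the required settings, or fail fast listing what is missing."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing required env var: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Listing every missing name in one error avoids a fix-one-restart-repeat loop when the env file is empty.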

Adding a new server

On a fresh Debian/Ubuntu box with Python 3:

sudo apt-get update && sudo apt-get install -y python3-venv python3-pip git
git clone https://github.com/LarryAnglin/health-monitor.git /opt/healthcheck-src
cd /opt/healthcheck-src
sudo ./install.sh

The installer creates a healthcheck system user, a venv at /opt/healthcheck/venv, and the state dir at /var/lib/healthcheck. It also copies the .env.example template to /etc/healthcheck/healthcheck.env; fill that file in with the values from 1Password. Then:

sudo systemctl enable --now healthcheck
sudo journalctl -u healthcheck -f

Within ~60 seconds you should see {"level":"info","msg":"config refreshed",...} followed by per-endpoint check logs.

Updating the daemon code

cd /opt/healthcheck-src
sudo git pull
sudo /opt/healthcheck/venv/bin/pip install -r requirements.txt
sudo systemctl restart healthcheck

Troubleshooting

Symptom	Cause	Fix
`missing required env var`	`/etc/healthcheck/healthcheck.env` empty or not loaded	`systemctl cat healthcheck` to confirm `EnvironmentFile=`; re-fill the file
`initial config fetch failed ... NoSuchKey`	`R2_CONFIG_KEY` doesn’t match the actual key in R2	`aws s3 --endpoint-url "$R2_ENDPOINT_URL" ls s3://accessible-status/ --recursive` (with `R2_ENDPOINT_URL` set in your shell) to find it; fix env var
`initial config fetch failed ... 403`	R2 token revoked or wrong scope	Re-issue `healthcheck-config-read` in the Cloudflare dashboard
`supabase log failed status=401` body `unauthorized`	`SUPABASE_LOG_TOKEN` doesn’t match the edge-function secret	Reset both sides to the same value (see rotation doc)
`supabase log failed status=401` body `Missing authorization header`	Edge function deployed without `--no-verify-jwt`	Redeploy: `supabase functions deploy healthcheck-log --no-verify-jwt`
`/opt/healthcheck/venv/bin/pip: No such file or directory`	`python3-venv` package missing	`apt-get install -y python3-venv python3-pip`, delete the broken venv, re-run install
No alerts firing but endpoint is clearly down	Fewer than `consecutiveFailuresBeforeAlert` failures, or alert sent and within re-alert window	Check `/var/lib/healthcheck/state.json` — `consecutive_failures` and `last_alert_at` show current state
Fleet-wide alert storm (N servers × 1 alert each)	Endpoint actually down, every server reports	Expected; comparing which servers alerted shows which ones can and cannot reach the endpoint. Dedupe not implemented.

Querying the log data

-- Uptime % per endpoint in the last 24h, by reporter
select
  hostname,
  endpoint_name,
  count(*) filter (where status = 'ok') * 100.0 / count(*) as uptime_pct,
  count(*) as samples
from healthcheck_events
where created_at > now() - interval '24 hours'
group by 1, 2
order by hostname, endpoint_name;

-- Recent failures across the fleet
select created_at, hostname, endpoint_name, http_status, error
from healthcheck_events
where status = 'fail'
order by created_at desc
limit 50;
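
For context, here is a sketch of how a daemon might produce one row with the columns these queries read. The request body shape is an assumption (the edge function's actual contract is not documented here), created_at is assumed to be filled in by the database, and build_event and post_event are hypothetical names.

```python
import json
import socket
import urllib.request
from typing import Optional


def build_event(endpoint_name: str, ok: bool, http_status: Optional[int],
                error: Optional[str] = None, hostname: Optional[str] = None) -> dict:
    """Build one event using the column names seen in the queries above."""
    return {
        "hostname": hostname or socket.gethostname(),
        "endpoint_name": endpoint_name,
        "status": "ok" if ok else "fail",
        "http_status": http_status,  # None when the request never completed
        "error": error,
    }


def post_event(url: str, token: str, event: dict, timeout: float = 10.0) -> int:
    """POST the event with the shared bearer token; return the HTTP status."""
    request = urllib.request.Request(
        url,
        data=json.dumps(event).encode("utf-8"),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

In this sketch url and token would come from SUPABASE_LOG_URL and SUPABASE_LOG_TOKEN.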

Known limitations

Alert dedupe across the fleet isn’t implemented: N servers watching the same endpoint produce N alerts when it goes down. Acceptable for now because the pattern of alerts shows which servers have a working network path to the target and which do not.
Config has no per-endpoint alert routing. All alerts go to the configured Resend recipients and the single Telegram chat. If we need different channels for different severities, extend the alerting block to be configurable per endpoint.
No retention policy. healthcheck_events grows without bound: with two endpoints checked every 60 seconds, each server writes 2 rows/min, or 2,880 rows/day. Add a Supabase scheduled function to prune rows older than 90 days if table size becomes a concern.
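
The growth estimate can be reproduced for other fleet sizes with a few lines of arithmetic. The three-server figure in the comments is only an example, not the actual fleet size.

```python
def rows_per_day(endpoints: int, check_interval_seconds: int, servers: int = 1) -> int:
    """Rows written per day: one row per endpoint per cycle per server."""
    checks_per_day = 86_400 // check_interval_seconds
    return endpoints * checks_per_day * servers


# Two endpoints at the default 60-second interval, one server:
print(rows_per_day(2, 60))                    # 2880
# Hypothetical three-server fleet over a 90-day retention window:
print(rows_per_day(2, 60, servers=3) * 90)    # 777600
```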