Health Monitor
Distributed uptime monitor for our public API servers. Any number of servers can run the daemon; each independently polls the endpoints and alerts on failures.
- Repo: https://github.com/LarryAnglin/health-monitor (private)
- Language: Python 3 daemon, managed by systemd
- Monitored endpoints: Node API (https://api.theaccessible.org/health), Lambda API (https://api-pdf.theaccessible.org/health)
- Alert channels: Email (Resend) + Telegram
- Log sink: Supabase healthcheck_events table in theaccessible-pdf-converter project
Architecture

    +------------+         +----------------------+
    | Server N   |---+     | Cloudflare R2        |
    | (daemon)   |   |     | bucket:              |
    +------------+   |     | accessible-status    |
    +------------+   +---->| config.json          |
    | Server 2   |---+     +----------------------+
    +------------+   |     +----------------------+
    +------------+   +---->| Telegram +           |
    | Server 1   |---+     | Resend email         |
    +------------+   |     +----------------------+
                     |     +----------------------+
                     +---->| Supabase             |
       POST events         | edge function        |
       every cycle         | healthcheck-log      |
                           | inserts into         |
                           | healthcheck_events   |
                           | table                |
                           +----------------------+

Each daemon:
- On startup, fetches config.json from R2. Caches it locally at /var/lib/healthcheck/config.json so transient R2 errors don't take it down.
- Every 60 seconds (configurable), hits every endpoint with a 10s timeout and verifies both HTTP status and a JSON field (e.g. status == "ok").
- Writes every result to the Supabase edge function (one row per endpoint per cycle per server) for trend analysis.
- Tracks consecutive failures per endpoint in /var/lib/healthcheck/state.json.
- After 3 consecutive failures, sends one email + one Telegram message. Re-alerts every 60 minutes while still down. Sends a recovery notification on the first success after an alert.
- Re-fetches config from R2 every hour so changes propagate across the fleet without a redeploy.
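
The alerting rules in the list above can be read as a small per-endpoint state machine. The following is a minimal sketch of that logic, not the daemon's actual code: the field names consecutive_failures and last_alert_at match what state.json is described as holding, while EndpointState, next_action, and the parameter names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EndpointState:
    # Field names follow state.json; the class itself is an illustration.
    consecutive_failures: int = 0
    last_alert_at: Optional[float] = None  # epoch seconds; None means no open alert


def next_action(state, ok, now, threshold=3, realert_minutes=60):
    """Update state in place and return 'alert', 'recovery', or None."""
    if ok:
        had_alert = state.last_alert_at is not None
        state.consecutive_failures = 0
        state.last_alert_at = None
        # Recovery is only announced if an alert was actually sent.
        return "recovery" if had_alert else None

    state.consecutive_failures += 1
    if state.consecutive_failures < threshold:
        return None  # not enough consecutive failures yet

    # First alert, or the re-alert window has elapsed while still down.
    if state.last_alert_at is None or now - state.last_alert_at >= realert_minutes * 60:
        state.last_alert_at = now
        return "alert"
    return None
```

With the defaults (threshold 3, re-alert every 60 minutes), three failed checks in a row yield one "alert", further failures stay silent until an hour has passed, and the first success afterwards yields one "recovery".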
Configuration
Single JSON file in R2: s3://accessible-status/config.json.

    {
      "checkIntervalSeconds": 60,
      "configRefreshSeconds": 3600,
      "timeoutSeconds": 10,
      "alerting": {
        "consecutiveFailuresBeforeAlert": 3,
        "reAlertIntervalMinutes": 60,
        "notifyOnRecovery": true,
        "email": {
          "enabled": true,
          "from": "noreply@tagzen.ai",
          "to": ["larry@anglin.com"]
        },
        "telegram": { "enabled": true }
      },
      "endpoints": [
        {
          "name": "node-api",
          "url": "https://api.theaccessible.org/health",
          "expect": { "httpStatus": 200, "jsonPath": "status", "equals": "ok" }
        }
      ]
    }

To update the config fleet-wide:

    aws s3 --endpoint-url https://c6cce84d1636ec85ec946a19edef0103.r2.cloudflarestorage.com \
      cp config.json s3://accessible-status/config.json

Every server picks up changes within an hour. Force an immediate refresh on a specific server with systemctl restart healthcheck.
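
To make the expect block in each endpoint entry concrete, here is a rough sketch of how a response might be checked against it. It assumes jsonPath is a dot-separated key path (the config above only shows the single key status) and that httpStatus defaults to 200 when omitted; both are assumptions, and matches_expectation is a hypothetical name rather than the daemon's real function.

```python
import json


def matches_expectation(http_status, body, expect):
    """Return True when a response satisfies an endpoint's expect block."""
    # Assumption: a missing httpStatus means "expect 200".
    if http_status != expect.get("httpStatus", 200):
        return False

    path = expect.get("jsonPath")
    if path is None:
        return True  # only the status code was requested

    try:
        value = json.loads(body)
    except (TypeError, ValueError):
        return False  # body is missing or not valid JSON

    # Assumption: jsonPath is a dotted key path such as "status" or "db.state".
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return False
        value = value[key]
    return value == expect.get("equals")
```

For the node-api entry above, a 200 response with body {"status": "ok"} passes, while a 200 with any other status value, a non-JSON body, or a non-200 code counts as a failure.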
Secrets
Per-server env file lives at /etc/healthcheck/healthcheck.env, owned
root:healthcheck with mode 0640. All servers share the same values;
nothing is per-server except the hostname (auto-detected from hostname).
| Variable | Purpose |
|---|---|
| R2_ACCESS_KEY_ID / R2_SECRET_ACCESS_KEY | Read-only R2 token (healthcheck-config-read) scoped to accessible-status |
| R2_ENDPOINT_URL / R2_BUCKET / R2_CONFIG_KEY | R2 location (config.json at bucket root) |
| RESEND_API_KEY | Resend key for email alerts |
| TELEGRAM_BOT_TOKEN / TELEGRAM_CHAT_ID | Telegram bot + destination chat |
| SUPABASE_LOG_URL | https://vuvwmfxssjosfphzpzim.supabase.co/functions/v1/healthcheck-log |
| SUPABASE_LOG_TOKEN | Shared bearer token (HEALTHCHECK_LOG_TOKEN secret in the edge function) |
See docs/admin/secrets-inventory-and-rotation.md for rotation procedures.
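
As a quick way to check an env file before starting the service, the sketch below lists which of the variables from the table are unset or blank. It assumes every variable in the table is required, which may not hold if a channel is disabled in config.json; REQUIRED_VARS and missing_env_vars are hypothetical names, not part of the daemon.

```python
import os

# Assumption: all variables from the Secrets table are mandatory.
REQUIRED_VARS = (
    "R2_ACCESS_KEY_ID",
    "R2_SECRET_ACCESS_KEY",
    "R2_ENDPOINT_URL",
    "R2_BUCKET",
    "R2_CONFIG_KEY",
    "RESEND_API_KEY",
    "TELEGRAM_BOT_TOKEN",
    "TELEGRAM_CHAT_ID",
    "SUPABASE_LOG_URL",
    "SUPABASE_LOG_TOKEN",
)


def missing_env_vars(environ=None):
    """Return the required variable names that are unset or blank."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        raise SystemExit("missing required env var: " + ", ".join(missing))
    print("all required env vars present")
```

Run it in a shell where the env file has been loaded; an empty result means every variable in the table has a non-blank value.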
Adding a new server
On a fresh Debian/Ubuntu box with Python 3:

    sudo apt-get update && sudo apt-get install -y python3-venv python3-pip git
    git clone https://github.com/LarryAnglin/health-monitor.git /opt/healthcheck-src
    cd /opt/healthcheck-src
    sudo ./install.sh

The installer creates a healthcheck system user, a venv at
/opt/healthcheck/venv, the state dir at /var/lib/healthcheck, and drops a
.env.example template at /etc/healthcheck/healthcheck.env that you fill in
with the values from 1Password. Then:

    sudo systemctl enable --now healthcheck
    sudo journalctl -u healthcheck -f

Within ~60 seconds you should see {"level":"info","msg":"config refreshed",...}
followed by per-endpoint check logs.
Updating the daemon code

    cd /opt/healthcheck-src
    sudo git pull
    sudo /opt/healthcheck/venv/bin/pip install -r requirements.txt
    sudo systemctl restart healthcheck

Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| missing required env var | /etc/healthcheck/healthcheck.env empty or not loaded | systemctl cat healthcheck to confirm EnvironmentFile=; re-fill the file |
| initial config fetch failed ... NoSuchKey | R2_CONFIG_KEY doesn't match the actual key in R2 | aws s3 ls s3://accessible-status/ --recursive to find it; fix env var |
| initial config fetch failed ... 403 | R2 token revoked or wrong scope | Re-issue healthcheck-config-read in the Cloudflare dashboard |
| supabase log failed status=401 body unauthorized | SUPABASE_LOG_TOKEN doesn't match the edge-function secret | Reset both sides to the same value (see rotation doc) |
| supabase log failed status=401 body Missing authorization header | Edge function deployed without --no-verify-jwt | Redeploy: supabase functions deploy healthcheck-log --no-verify-jwt |
| /opt/healthcheck/venv/bin/pip: No such file or directory | python3-venv package missing | apt-get install -y python3-venv python3-pip, delete the broken venv, re-run install |
| No alerts firing but endpoint is clearly down | Fewer than consecutiveFailuresBeforeAlert failures, or alert sent and within re-alert window | Check /var/lib/healthcheck/state.json: consecutive_failures and last_alert_at show current state |
| Fleet-wide alert storm (N servers × 1 alert each) | Endpoint actually down, every server reports | Expected; tells you which servers can reach the endpoint. Dedupe not implemented. |
Querying the log data

    -- Uptime % per endpoint in the last 24h, by reporter
    select
      hostname,
      endpoint_name,
      count(*) filter (where status = 'ok') * 100.0 / count(*) as uptime_pct,
      count(*) as samples
    from healthcheck_events
    where created_at > now() - interval '24 hours'
    group by 1, 2
    order by hostname, endpoint_name;

    -- Recent failures across the fleet
    select created_at, hostname, endpoint_name, http_status, error
    from healthcheck_events
    where status = 'fail'
    order by created_at desc
    limit 50;

Known limitations
- Alert dedupe across the fleet isn't implemented: N servers watching the same endpoint produce N alerts when it goes down. Acceptable for now because it signals which servers have network paths to the target.
- Config has no per-endpoint alert routing. All alerts go to the single configured Resend recipient + Telegram chat. If we need different channels for different severities, extend alerting to be per-endpoint.
- No historical retention policy. healthcheck_events grows unbounded: with the two monitored endpoints and the default 60-second interval, every server writes 2 rows/min = ~2880 rows/day. Add a Supabase scheduled function to prune rows older than 90 days if disk usage becomes a concern.
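
The growth figure above follows from one row per endpoint per cycle per server. A small sketch of that arithmetic, useful for estimating table size under a given retention window (the function names and the three-server fleet in the usage note are illustrative assumptions, not measured values):

```python
def rows_per_day(endpoints, check_interval_seconds=60, servers=1):
    """Rows written to healthcheck_events per day: one per endpoint per cycle per server."""
    cycles_per_day = 86_400 // check_interval_seconds
    return endpoints * cycles_per_day * servers


def rows_retained(endpoints, retention_days=90, check_interval_seconds=60, servers=1):
    """Approximate steady-state row count if rows older than retention_days are pruned."""
    return rows_per_day(endpoints, check_interval_seconds, servers) * retention_days


if __name__ == "__main__":
    print(rows_per_day(endpoints=2))               # 2880 rows/day for one server
    print(rows_retained(endpoints=2, servers=3))   # 777600 rows for a hypothetical 3-server fleet
```

At the defaults, a 90-day retention window holds roughly 259,200 rows per server.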