Health Monitor

Distributed uptime monitor for our public API servers. Any number of servers can run the daemon; each independently polls the endpoints and alerts on failures.

  • Repo: https://github.com/LarryAnglin/health-monitor (private)
  • Language: Python 3 daemon, managed by systemd
  • Monitored endpoints: Node API (https://api.theaccessible.org/health), Lambda API (https://api-pdf.theaccessible.org/health)
  • Alert channels: Email (Resend) + Telegram
  • Log sink: Supabase healthcheck_events table in the accessible-pdf-converter project

Architecture

┌──────────────┐      ┌──────────────┐    ┌─────────────────┐
│ Server N     │──┐   │ Cloudflare   │    │ Supabase        │
│ (daemon)     │  │   │ R2 bucket:   │    │ edge function   │
└──────────────┘  │   │ accessible-  │    │ healthcheck-    │
┌──────────────┐  │   │ status       │    │ log             │
│ Server 2     │──┼──►│ config.json  │    │                 │
└──────────────┘  │   └──────────────┘    │ ↓ inserts into  │
┌──────────────┐  │                       │ healthcheck_    │
│ Server 1     │──┘   ┌──────────────┐    │ events table    │
│              │─────►│ Telegram +   │    └─────────────────┘
│              │─────►│ Resend email │             ▲
└──────┬───────┘      └──────────────┘             │
       │                                           │
       └───────────────────────────────────────────┘
                 POST events every cycle

Each daemon:

  1. On startup, fetches config.json from R2. Caches it locally at /var/lib/healthcheck/config.json so transient R2 errors don’t take it down.
  2. Every 60 seconds (configurable), hits every endpoint with a 10s timeout and verifies both HTTP status and a JSON field (e.g. status == "ok").
  3. Writes every result to the Supabase edge function (one row per endpoint per cycle per server) for trend analysis.
  4. Tracks consecutive failures per endpoint in /var/lib/healthcheck/state.json.
  5. After 3 consecutive failures, sends one email + one Telegram message. Re-alerts every 60 minutes while still down. Sends a recovery notification on the first success after an alert.
  6. Re-fetches config from R2 every hour so changes propagate across the fleet without a redeploy.
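The alerting rules in steps 4 and 5 can be sketched as a small state machine. This is an illustrative sketch, not the daemon's actual code: `EndpointState` and `record_result` are hypothetical names, and only the field names `consecutive_failures` and `last_alert_at` are taken from state.json as described in this document.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EndpointState:
    """Per-endpoint state (hypothetical shape mirroring state.json)."""
    consecutive_failures: int = 0
    last_alert_at: Optional[float] = None  # epoch seconds; None = no active alert


def record_result(
    state: EndpointState,
    ok: bool,
    now: float,
    threshold: int = 3,            # consecutiveFailuresBeforeAlert
    realert_minutes: int = 60,     # reAlertIntervalMinutes
    notify_on_recovery: bool = True,
) -> List[str]:
    """Update state in place and return the notifications to send."""
    if ok:
        actions: List[str] = []
        # Only announce recovery if an alert was actually sent earlier.
        if state.last_alert_at is not None and notify_on_recovery:
            actions.append("recovery")
        state.consecutive_failures = 0
        state.last_alert_at = None
        return actions

    state.consecutive_failures += 1
    if state.consecutive_failures < threshold:
        return []

    # First alert, or the re-alert window has elapsed since the last one.
    if state.last_alert_at is None or now - state.last_alert_at >= realert_minutes * 60:
        state.last_alert_at = now
        return ["alert"]
    return []
```

Returning a list of actions instead of sending notifications directly keeps the decision logic separate from the email and Telegram clients, so it can be tested without network access.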

Configuration

Single JSON file in R2: s3://accessible-status/config.json.

{
"checkIntervalSeconds": 60,
"configRefreshSeconds": 3600,
"timeoutSeconds": 10,
"alerting": {
"consecutiveFailuresBeforeAlert": 3,
"reAlertIntervalMinutes": 60,
"notifyOnRecovery": true,
"email": { "enabled": true, "from": "noreply@tagzen.ai", "to": ["larry@anglin.com"] },
"telegram": { "enabled": true }
},
"endpoints": [
{
"name": "node-api",
"url": "https://api.theaccessible.org/health",
"expect": { "httpStatus": 200, "jsonPath": "status", "equals": "ok" }
}
]
}
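To illustrate how an `expect` block might be evaluated against a response, here is a minimal sketch. It assumes that `jsonPath` is a dot-separated key path and that a missing `httpStatus` or `jsonPath` key means "skip that check"; neither assumption is confirmed by the daemon's code, and `matches_expectation` is a hypothetical name.

```python
import json
from typing import Any, Mapping


def matches_expectation(expect: Mapping[str, Any], http_status: int, body: str) -> bool:
    """Return True if the response satisfies the endpoint's `expect` block."""
    # Assumption: an absent httpStatus key disables the status check.
    if "httpStatus" in expect and http_status != expect["httpStatus"]:
        return False

    path = expect.get("jsonPath")
    if not path:
        return True  # nothing further to verify

    try:
        value = json.loads(body)
    except ValueError:
        return False  # body is not valid JSON

    # Assumption: jsonPath is a dotted path such as "status" or "db.status".
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return False
        value = value[key]
    return value == expect.get("equals")


expect = {"httpStatus": 200, "jsonPath": "status", "equals": "ok"}
print(matches_expectation(expect, 200, '{"status": "ok"}'))        # True
print(matches_expectation(expect, 200, '{"status": "degraded"}'))  # False
```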

To update the config fleet-wide:

aws s3 --endpoint-url https://c6cce84d1636ec85ec946a19edef0103.r2.cloudflarestorage.com \
cp config.json s3://accessible-status/config.json

Every server picks up changes within an hour. To force an immediate refresh on a specific server, run systemctl restart healthcheck.
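The fetch-then-fall-back-to-cache behaviour could look roughly like the sketch below. `load_config` and its `fetch` parameter are hypothetical: `fetch` stands in for whatever R2 client call the daemon really uses, and the atomic-rename write is an assumption about how the cache might be kept consistent, not a description of the actual implementation.

```python
import json
import os
import tempfile
from typing import Any, Callable, Dict


def load_config(fetch: Callable[[], bytes], cache_path: str) -> Dict[str, Any]:
    """Fetch the config, falling back to the local cache if the fetch fails."""
    try:
        raw = fetch()
        config = json.loads(raw)
    except Exception:
        # Transient R2 error or invalid JSON: reuse the last good copy.
        # If no cache exists yet, the resulting error propagates to the caller.
        with open(cache_path, "rb") as fh:
            return json.load(fh)

    # Write to a temp file, then rename, so a crash never leaves a partial cache.
    directory = os.path.dirname(cache_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as fh:
        fh.write(raw)
    os.replace(tmp_path, cache_path)
    return config
```

With this shape, a failed hourly refresh leaves the daemon running on the previous config, and only a first-ever startup without R2 access fails outright.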

Secrets

Per-server env file lives at /etc/healthcheck/healthcheck.env, owned root:healthcheck with mode 0640. All servers share the same values; the only per-server value is the hostname, which is auto-detected from hostname.

| Variable | Purpose |
| --- | --- |
| R2_ACCESS_KEY_ID / R2_SECRET_ACCESS_KEY | Read-only R2 token (healthcheck-config-read) scoped to accessible-status |
| R2_ENDPOINT_URL / R2_BUCKET / R2_CONFIG_KEY | R2 location (config.json at bucket root) |
| RESEND_API_KEY | Resend key for email alerts |
| TELEGRAM_BOT_TOKEN / TELEGRAM_CHAT_ID | Telegram bot + destination chat |
| SUPABASE_LOG_URL | https://vuvwmfxssjosfphzpzim.supabase.co/functions/v1/healthcheck-log |
| SUPABASE_LOG_TOKEN | Shared bearer token (HEALTHCHECK_LOG_TOKEN secret in the edge function) |

See docs/admin/secrets-inventory-and-rotation.md for rotation procedures.
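A startup check for these variables might look like the following sketch. It assumes every variable in the table is mandatory, which may not hold if a channel is disabled in config.json; `REQUIRED_VARS` and `missing_vars` are hypothetical names, not the daemon's own.

```python
from typing import List, Mapping

# Assumption: all variables from the table above are required.
REQUIRED_VARS = (
    "R2_ACCESS_KEY_ID",
    "R2_SECRET_ACCESS_KEY",
    "R2_ENDPOINT_URL",
    "R2_BUCKET",
    "R2_CONFIG_KEY",
    "RESEND_API_KEY",
    "TELEGRAM_BOT_TOKEN",
    "TELEGRAM_CHAT_ID",
    "SUPABASE_LOG_URL",
    "SUPABASE_LOG_TOKEN",
)


def missing_vars(env: Mapping[str, str]) -> List[str]:
    """Return the required variable names that are unset or blank."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
```

Calling `missing_vars(os.environ)` at startup and exiting with the list of names makes an empty or unloaded env file obvious in the journal on the first run.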

Adding a new server

On a fresh Debian/Ubuntu box with Python 3:

sudo apt-get update && sudo apt-get install -y python3-venv python3-pip git
git clone https://github.com/LarryAnglin/health-monitor.git /opt/healthcheck-src
cd /opt/healthcheck-src
sudo ./install.sh

The installer creates a healthcheck system user, a venv at /opt/healthcheck/venv, the state dir at /var/lib/healthcheck, and drops a .env.example template at /etc/healthcheck/healthcheck.env that you fill in with the values from 1Password. Then:

sudo systemctl enable --now healthcheck
sudo journalctl -u healthcheck -f

Within ~60 seconds you should see {"level":"info","msg":"config refreshed",...} followed by per-endpoint check logs.

Updating the daemon code

cd /opt/healthcheck-src
sudo git pull
sudo /opt/healthcheck/venv/bin/pip install -r requirements.txt
sudo systemctl restart healthcheck

Troubleshooting

| Symptom | Cause | Fix |
| --- | --- | --- |
| missing required env var | /etc/healthcheck/healthcheck.env empty or not loaded | systemctl cat healthcheck to confirm EnvironmentFile=; re-fill the file |
| initial config fetch failed ... NoSuchKey | R2_CONFIG_KEY doesn’t match the actual key in R2 | aws s3 ls s3://accessible-status/ --recursive (with the R2 --endpoint-url) to find it; fix env var |
| initial config fetch failed ... 403 | R2 token revoked or wrong scope | Re-issue healthcheck-config-read in the Cloudflare dashboard |
| supabase log failed status=401 body unauthorized | SUPABASE_LOG_TOKEN doesn’t match the edge-function secret | Reset both sides to the same value (see rotation doc) |
| supabase log failed status=401 body Missing authorization header | Edge function deployed without --no-verify-jwt | Redeploy: supabase functions deploy healthcheck-log --no-verify-jwt |
| /opt/healthcheck/venv/bin/pip: No such file or directory | python3-venv package missing | apt-get install -y python3-venv python3-pip, delete the broken venv, re-run install |
| No alerts firing but endpoint is clearly down | Fewer than consecutiveFailuresBeforeAlert failures, or alert sent and within re-alert window | Check /var/lib/healthcheck/state.json; consecutive_failures and last_alert_at show current state |
| Fleet-wide alert storm (N servers × 1 alert each) | Endpoint actually down, every server reports | Expected; the pattern of alerts shows which servers can and cannot reach the endpoint. Dedupe not implemented. |

Querying the log data

-- Uptime % per endpoint in the last 24h, by reporter
select
hostname,
endpoint_name,
count(*) filter (where status = 'ok') * 100.0 / count(*) as uptime_pct,
count(*) as samples
from healthcheck_events
where created_at > now() - interval '24 hours'
group by 1, 2
order by hostname, endpoint_name;

-- Recent failures across the fleet
select created_at, hostname, endpoint_name, http_status, error
from healthcheck_events
where status = 'fail'
order by created_at desc
limit 50;

Known limitations

  • Alert dedupe across the fleet isn’t implemented: N servers watching the same endpoint produce N alerts when it goes down. Acceptable for now because it signals which servers have network paths to the target.
  • Config has no per-endpoint alert routing. All alerts go to the single configured Resend recipient + Telegram chat. If we need different channels for different severities, extend alerting to be per-endpoint.
  • No historical retention policy. healthcheck_events grows unbounded: with two monitored endpoints on a 60-second interval, each server writes 2 rows/min, or 2,880 rows/day. Add a Supabase scheduled function to prune rows older than 90 days if disk usage becomes a concern.
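The growth figure follows directly from the config values, as the short calculation below shows. `rows_per_day` is a hypothetical helper for illustration, and the three-server fleet in the second call is an example size, not the actual deployment.

```python
def rows_per_day(endpoints: int, check_interval_seconds: int, servers: int = 1) -> int:
    """Rows written to healthcheck_events per day (one row per endpoint per cycle per server)."""
    cycles_per_day = 86_400 // check_interval_seconds
    return endpoints * cycles_per_day * servers


# Two endpoints checked every 60 seconds by one server.
print(rows_per_day(2, 60))  # 2880

# Example only: a three-server fleet over a 90-day retention window.
print(rows_per_day(2, 60, servers=3) * 90)  # 777600
```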