Cloud Run Failover
How It Works
The CF Worker (api-pdf.theaccessible.org) proxies heavy processing requests (conversions, exports, remediation) to backend Node.js servers. Backends are configured as an ordered, comma-separated list in NODE_API_URLS. Each backend gets its own circuit breaker. The Worker routes to the first healthy backend.
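The ordered-list routing described above can be sketched roughly as follows. This is a minimal illustration, not the code in node-proxy.ts: the helper names `parseBackends` and `pickBackend` are hypothetical, and the health predicate is passed in by the caller so the example stays self-contained.

```typescript
// Hypothetical helpers; the real logic lives in workers/api/src/services/node-proxy.ts.

/** Split the comma-separated NODE_API_URLS value into an ordered backend list. */
function parseBackends(nodeApiUrls: string): string[] {
  return nodeApiUrls
    .split(",")
    .map((url) => url.trim()) // tolerate spaces around the commas
    .filter((url) => url.length > 0);
}

/**
 * Return the first backend the caller-supplied predicate reports as healthy,
 * or null when every backend is down (the case where the Worker answers 503).
 */
function pickBackend(
  backends: string[],
  isHealthy: (url: string) => boolean,
): string | null {
  for (const url of backends) {
    if (isHealthy(url)) return url;
  }
  return null;
}

const backends = parseBackends(
  "https://api.theaccessible.org, https://pdf-api-763162717299.us-central1.run.app",
);

// Pretend the primary is down: selection falls through to Cloud Run.
const chosen = pickBackend(
  backends,
  (url) => url !== "https://api.theaccessible.org",
);
console.log(chosen);
```

Because the list is ordered, putting the self-hosted URL first is what makes it the primary; Cloud Run is only reached when every earlier entry is reported unhealthy.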
    User Request
      │
      ▼
    Cloudflare Worker (api-pdf.theaccessible.org)
      │
      ├── 1. https://api.theaccessible.org (self-hosted, primary)
      │      Circuit breaker: 2 failures → open for 30s
      │
      ├── 2. https://pdf-api-763162717299.us-central1.run.app (Cloud Run, failover)
      │      Circuit breaker: same logic, independent state
      │
      └── 3. All backends down → 503 "PDF processing server temporarily unavailable"

Proxied Routes
Only POST requests matching these patterns are proxied:
- /api/convert/:fileId (PDF conversion)
- /api/convert/:fileId/preview (Preview conversion)
- /api/remediate/html (HTML remediation)
- /api/remediate/batch (Batch remediation)
- /api/remediate/url (URL remediation)
- /api/gateway/convert (Full pipeline conversion)
All other requests are handled directly by the CF Worker.
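One way to express the route filter is a method check plus a list of path patterns. The sketch below is an assumption about how such a matcher could look, not the project's middleware: `shouldProxy` and `PROXIED_PATTERNS` are hypothetical names, and `:fileId` is treated as a single path segment.

```typescript
// Patterns mirror the proxied routes listed above; ":fileId" is one path segment.
const PROXIED_PATTERNS: RegExp[] = [
  /^\/api\/convert\/[^/]+$/,
  /^\/api\/convert\/[^/]+\/preview$/,
  /^\/api\/remediate\/html$/,
  /^\/api\/remediate\/batch$/,
  /^\/api\/remediate\/url$/,
  /^\/api\/gateway\/convert$/,
];

/** True when the request should be forwarded to a Node backend. */
function shouldProxy(method: string, pathname: string): boolean {
  if (method.toUpperCase() !== "POST") return false; // only POST is proxied
  return PROXIED_PATTERNS.some((pattern) => pattern.test(pathname));
}

console.log(shouldProxy("POST", "/api/convert/abc123"));         // true
console.log(shouldProxy("POST", "/api/convert/abc123/preview")); // true
console.log(shouldProxy("GET", "/api/convert/abc123"));          // false
console.log(shouldProxy("POST", "/api/files"));                  // false
```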
Circuit Breaker
Each backend has independent circuit breaker state:
| Setting | Value |
|---|---|
| Health check interval | 30s (cached between checks) |
| Health check timeout | 3s |
| Failure threshold | 2 consecutive failures to open |
| Cooldown | 30s before retrying an open circuit |
| Proxy timeout | 5 minutes (for long conversions) |
The health check hits GET /health on each backend.
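The threshold and cooldown settings can be illustrated with a small state machine. This is a sketch under stated assumptions rather than the actual implementation: the class and function names are hypothetical, timestamps are passed in explicitly so the behavior is easy to follow, and health-check caching is left out.

```typescript
// Values from the settings table; all names here are hypothetical.
const FAILURE_THRESHOLD = 2;    // consecutive failures before the circuit opens
const COOLDOWN_MS = 30 * 1000;  // how long an open circuit rejects traffic

class CircuitBreaker {
  private consecutiveFailures = 0;
  private openedAt: number | null = null;

  /** True when the backend may be tried at time `now` (milliseconds). */
  canAttempt(now: number): boolean {
    if (this.openedAt === null) return true;
    return now - this.openedAt >= COOLDOWN_MS; // cooldown elapsed: allow a retry
  }

  recordSuccess(): void {
    this.consecutiveFailures = 0;
    this.openedAt = null;
  }

  recordFailure(now: number): void {
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures >= FAILURE_THRESHOLD) {
      this.openedAt = now; // open (or re-open) the circuit
    }
  }
}

// One breaker per backend URL keeps the states independent.
const breakers = new Map<string, CircuitBreaker>();
function breakerFor(url: string): CircuitBreaker {
  let breaker = breakers.get(url);
  if (!breaker) {
    breaker = new CircuitBreaker();
    breakers.set(url, breaker);
  }
  return breaker;
}

const primary = breakerFor("https://api.theaccessible.org");
primary.recordFailure(0);
console.log(primary.canAttempt(1000));  // true: one failure is below the threshold
primary.recordFailure(1000);
console.log(primary.canAttempt(2000));  // false: circuit opened at t=1s
console.log(primary.canAttempt(31000)); // true: the 30s cooldown has elapsed
```

Keying the breakers by URL is what lets the primary's circuit open without affecting the Cloud Run entry.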
Authentication
Cloud Run is publicly accessible (IAM allows allUsers) but protected by a shared secret:
- The CF Worker sends the X-Proxy-Secret header on all proxied requests
- The Node server middleware (server.ts) validates this header when the PROXY_SHARED_SECRET env var is set
- Health checks (/health, /) are exempt so Cloud Run can probe the container
- Local servers don't set PROXY_SHARED_SECRET, so the check is skipped
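The gate rules above reduce to a small decision function. The following is an illustrative sketch, not the middleware in server.ts: `checkProxySecret` is a hypothetical name, and the 401 status is taken from the failover test steps later in this document.

```typescript
// Paths exempt from the check so Cloud Run can probe the container.
const EXEMPT_PATHS = new Set<string>(["/health", "/"]);

/**
 * Decide the outcome of the shared-secret gate.
 * Returns 200 to let the request through and 401 to reject it.
 * `configuredSecret` stands in for the PROXY_SHARED_SECRET env var,
 * `headerValue` for the incoming X-Proxy-Secret header.
 */
function checkProxySecret(
  pathname: string,
  headerValue: string | undefined,
  configuredSecret: string | undefined,
): 200 | 401 {
  if (!configuredSecret) return 200;          // local servers: check is skipped
  if (EXEMPT_PATHS.has(pathname)) return 200; // health probes are exempt
  return headerValue === configuredSecret ? 200 : 401;
}

console.log(checkProxySecret("/api/files", undefined, "s3cret")); // 401
console.log(checkProxySecret("/api/files", "s3cret", "s3cret"));  // 200
console.log(checkProxySecret("/health", undefined, "s3cret"));    // 200
console.log(checkProxySecret("/api/files", undefined, undefined)); // 200
```

A production gate would probably compare the secret in constant time; plain equality is used here only to keep the sketch short.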
Cold Start
Cloud Run is configured with min-instances=0 (scales to zero when idle).
- Cold start latency: 5-15 seconds
- Impact: On the first failover request, the circuit breaker health check (3s timeout) may fail against a cold Cloud Run instance. The circuit then opens for 30s. On the next attempt after the cooldown, Cloud Run is warm and responds.
- To eliminate cold starts: Set --min-instances 1 (~$15/mo)
Statelessness
Both backends share the same external services, so no data sync is needed:
- R2 bucket: accessible-pdf-files (via S3-compatible API)
- KV namespaces: KV_SESSIONS and KV_RATE_LIMIT (via CF REST API)
- Supabase: Same project (vuvwmfxssjosfphzpzim)
- AI APIs: Same keys (Anthropic, Gemini, Mathpix)
Key Files
| File | Role |
|---|---|
| workers/api/src/services/node-proxy.ts | Multi-backend proxy with per-URL circuit breakers |
| workers/api/src/middleware/node-proxy.ts | Hono middleware: reads NODE_API_URLS, passes it to the proxy |
| workers/api/src/server.ts | Node server: shared secret gate middleware |
| workers/api/src/types/env.ts | NODE_API_URLS, PROXY_SHARED_SECRET type defs |
| workers/api/wrangler.toml | Production NODE_API_URLS config |
Infrastructure
GCP Project
- Project ID: pdf-theaccessible-org
- Region: us-central1
- Artifact Registry: us-central1-docker.pkg.dev/pdf-theaccessible-org/pdf-api/node-server
Cloud Run Service
- Service: pdf-api
- URL: https://pdf-api-763162717299.us-central1.run.app
- CPU: 2 vCPUs
- Memory: 2 GiB
- Min instances: 0 (scales to zero)
- Max instances: 3
- Timeout: 300s (5 min)
- Port: 8790
Secrets (Google Secret Manager)
All sensitive values are stored in Secret Manager and bound to Cloud Run:
R2_ACCESS_KEY_ID, R2_SECRET_ACCESS_KEY, CF_API_TOKEN, SUPABASE_SERVICE_ROLE_KEY, SUPABASE_JWT_SECRET, ANTHROPIC_API_KEY, GEMINI_API_KEY, MATHPIX_APP_ID, MATHPIX_APP_KEY, RESEND_API_KEY, PROXY_SHARED_SECRET
Non-sensitive env vars are set directly: ENVIRONMENT, FRONTEND_URL, R2_ACCOUNT_ID, R2_BUCKET_NAME, CF_ACCOUNT_ID, KV_SESSIONS_NAMESPACE_ID, SUPABASE_URL, ALERT_EMAIL
CF Worker Secrets
- NODE_API_URLS: comma-separated backend list
- PROXY_SHARED_SECRET: shared secret for Cloud Run auth
Cost
| Scenario | Cost |
|---|---|
| Idle (no failover events) | $0/mo |
| 1-hour outage, moderate load | ~$0.05-0.50 |
| 1-day outage, moderate load | ~$1-5 |
| Always-warm (min-instances=1) | ~$15/mo |
Deployment
When the Node server code changes, rebuild and redeploy Cloud Run:
    # On the Node server (10.1.1.4)
    ssh -i ~/.ssh/nightly-audit larry@10.1.1.4
    cd ~/accessible-pdf-converter
    git pull

    # Build and push
    docker build -t us-central1-docker.pkg.dev/pdf-theaccessible-org/pdf-api/node-server:latest \
      -f workers/api/Dockerfile .
    docker push us-central1-docker.pkg.dev/pdf-theaccessible-org/pdf-api/node-server:latest

    # Redeploy Cloud Run
    gcloud run deploy pdf-api \
      --image us-central1-docker.pkg.dev/pdf-theaccessible-org/pdf-api/node-server:latest \
      --region us-central1

Updating Secrets
    # Update a secret value
    echo -n "NEW_VALUE" | gcloud secrets versions add SECRET_NAME --data-file=-

    # Cloud Run picks up `:latest` on next deploy
    gcloud run deploy pdf-api \
      --image us-central1-docker.pkg.dev/pdf-theaccessible-org/pdf-api/node-server:latest \
      --region us-central1

Testing the Failover
1. Verify Cloud Run is healthy
    curl -s https://pdf-api-763162717299.us-central1.run.app/health
    # Should return: {"success":true,"data":{"status":"healthy",...}}

2. Verify shared secret blocks unauthenticated requests

    curl -s -o /dev/null -w "%{http_code}" https://pdf-api-763162717299.us-central1.run.app/api/files
    # Should return: 401

3. Verify normal traffic goes to local servers
With local servers running, submit a conversion via the UI or API. Check local server logs:
    ssh -i ~/.ssh/nightly-audit larry@10.1.1.4
    docker compose logs -f api-node-1 api-node-2

You should see the request logged on the local server, not Cloud Run.
4. Simulate an outage
Stop the local servers:
    ssh -i ~/.ssh/nightly-audit larry@10.1.1.4
    cd ~/accessible-pdf-converter
    docker compose stop api-node-1 api-node-2

5. Trigger failover
Submit a conversion. The CF Worker will:
- Health-check api.theaccessible.org → fails (servers stopped)
- Record failure, check again → second failure opens circuit breaker
- Try next URL: pdf-api-763162717299.us-central1.run.app → healthy
- Proxy the request to Cloud Run
Check Cloud Run logs:
    gcloud run services logs read pdf-api --region us-central1 --limit=20

6. Restore local servers
    ssh -i ~/.ssh/nightly-audit larry@10.1.1.4
    cd ~/accessible-pdf-converter
    docker compose up -d api-node-1 api-node-2

After 30s (circuit breaker cooldown), the next request will health-check the local server again. Once it is healthy, traffic returns to the local servers.
7. Verify recovery
Submit another conversion. Check local server logs to confirm traffic is back on the primary.