S3 Integrations — Operational Runbook
Operational reference for the S3 bucket integration feature (PR #688, phases 1–6,
shipped 2026-05-18). Pair with the customer-facing guide at docs/user/s3-integrations.md
and the original roadmap in docs/admin/s3-customer-storage-implementation-plan.md
(historical — does not reflect what shipped).
Architecture at a glance
- Poll mode: customer bucket ─list/get─▶ s3-poller (Lambda, every 2 min) ─▶ s3_ingest_queue (SQS) ─▶ s3-ingest (Lambda)
- Event mode: customer bucket ─event─▶ EventBridge / R2 forwarder ─▶ s3-ingest (Lambda)
- Both modes: s3-ingest (Lambda) ─▶ pipeline_queue (SQS) ─▶ accessibility pipeline ─▶ s3-integration-writeback ─▶ customer bucket / output_prefix/
Components
| Layer | Path | Notes |
|---|---|---|
| Schema | supabase/migrations/20260517_111…20260518_116 | Six migrations total; 116 is the SECURITY DEFINER usage rollup. |
| DB tables | s3_integrations, s3_processed_objects | Per-user via auth.users(id); RLS enforced. |
| Credential sealing | workers/api/src/services/integration-creds.ts | KMS envelope encryption, alias alias/accessible-pdf-<env>-integrations (env-templated via envName() in infra/cdk/lib/env-config.ts — e.g. alias/accessible-pdf-production-integrations). Plaintext never touches Postgres. |
| API routes | workers/api/src/routes/s3-integrations.ts | CRUD + :id/usage-summary (calls s3_integration_usage_summary RPC). |
| Client factory | workers/api/src/services/s3-client-factory.ts | Builds an AWS SDK client per provider (AssumeRole / access key / R2 / B2 / generic). |
| Polling Lambda | s3-poller (infra/cdk/lib/stacks/integrations-stack.ts:67) | EventBridge schedule, rate(2 minutes). Lists active poll-mode integrations and fans them out into s3_ingest_queue. |
| Ingest Lambda | s3-ingest (SqsEventSource) | Consumes s3_ingest_queue, hands jobs to pipeline_queue. |
| Writeback | s3-integration-writeback.ts | After pipeline completion, PUTs the remediated PDF + HTML under output_prefix/. |
| R2 event receiver | workers/r2-event-receiver/ | Worker template generated for customers; verifies EVENT_SOURCE_SECRET. |
| UI | apps/web/src/app/account/integrations/ + apps/web/src/components/IntegrationsSection.tsx | Surfaced in Settings → Integrations. |
| Marketing | apps/home/src/app/integrations/s3/ | theaccessible.org/integrations/s3 |
Secrets and KMS
- KMS CMK alias: alias/accessible-pdf-<env>-integrations (us-east-1 in prod). Env-templated via envName() in infra/cdk/lib/env-config.ts; prod resolves to alias/accessible-pdf-production-integrations. Used for envelope encryption of S3-integration access keys.
- Access keys (AWS user-key, R2 token, B2 application key) are sealed on insert via integration-creds.sealCreds(). Postgres stores only the ciphertext (bytea) and kms_key_version (int).
- Event source secret is stored as a plaintext UUID in s3_integrations.event_source_secret (migration 20260518_115). The r2-event-receiver does a constant-time string compare against this value. Sealing this field is tracked under “Known unfinished work” below; until then, treat the column as sensitive (RLS-scoped per user, never logged). Customers see the value once at issuance and after each rotation; they must re-set it in their R2 forwarder Worker (wrangler secret put, then redeploy) or in their AWS EventBridge target.
- kms_key_version lets a future re-encrypt batch upgrade rows after a CMK rotation without unsealing every blob on first read.
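The constant-time compare mentioned for the event source secret can be sketched as follows. This is an illustration of the technique only, not the receiver's actual code: the function name constantTimeEqual is hypothetical, and JavaScript engines give no hard timing guarantees, so a Node runtime would normally prefer crypto.timingSafeEqual on equal-length buffers.

```typescript
// Hypothetical helper (not taken from workers/r2-event-receiver/): compares two
// secrets without returning at the first differing character.
function constantTimeEqual(expected: string, provided: string): boolean {
  // An empty stored secret must never match anything.
  if (expected.length === 0) return false;

  // Fold the length difference into the accumulator so unequal lengths fail
  // without a separate early return.
  let diff = expected.length ^ provided.length;
  const n = Math.max(expected.length, provided.length);

  for (let i = 0; i < n; i++) {
    const a = i < expected.length ? expected.charCodeAt(i) : 0;
    const b = i < provided.length ? provided.charCodeAt(i) : 0;
    diff |= a ^ b; // stays 0 only if every position matches
  }
  return diff === 0;
}
```

The point of the accumulator is that the loop always runs to the end, so response time does not reveal how many leading characters of a guessed secret were correct.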
Deployment
The integration feature spans three deployables — all three must roll together for a schema change to take effect cleanly:
- Schema: supabase db push against the relevant project. Migrations are forward-only; the rollback strategy is a new migration, not db reset.
- API Worker: wrangler deploy from workers/api/. Owns CRUD, sealing/unsealing, and the usage-summary RPC caller.
- Lambdas (poller + ingest): cd infra/cdk && npx cdk deploy IntegrationsStack. Bundles workers/api/dist-lambda/s3-poller and workers/api/dist-lambda/s3-ingest from the API build output. The CDK stack also re-creates the EventBridge schedule (2 min) and SQS queue bindings.
The CDK stack assumes the QueueStack (which owns s3_ingest_queue and pipeline_queue) is already deployed. If you redeploy QueueStack with new queue ARNs, redeploy IntegrationsStack right after to update the IAM grants — otherwise the Lambdas will get AccessDenied on SendMessage.
Monitoring
| Signal | Where | Trigger |
|---|---|---|
| s3-poller Lambda errors | CloudWatch → /aws/lambda/<env>-s3-poller | >3 errors in 5 min |
| s3-ingest Lambda errors | CloudWatch → /aws/lambda/<env>-s3-ingest | >3 errors in 5 min |
| s3_ingest_queue depth | CloudWatch SQS metric | > 1000 messages for 10 min |
| s3_ingest_queue DLQ depth | CloudWatch SQS metric | any messages → page |
| Per-integration error rate | Supabase app_logs filtered on service=s3-poller or s3-ingest | n/a — diagnostic only |
| RPC failures on usage-summary | Loki / Grafana, label route=GET /api/integrations/s3/:id/usage-summary | >5 in 5 min |
The 30-day customer-facing usage tile reads s3_integration_usage_summary — if it starts returning 500s in bulk, it’s almost always one of: KMS access lost on the Worker, RPC permission revoked, or a migration applied to prod with a schema change to s3_processed_objects that doesn’t match the function signature.
Common customer-reported failures
“Test connection fails immediately”
- Check app_logs for the user’s userId + service=s3-integrations and look at the testConnection entry.
- Most common causes, in order:
  - AWS AssumeRole: customer’s trust policy still references the OLD principal ARN (we moved accounts at some point; confirm against the current one in integration-creds.ts).
  - AWS AssumeRole: missing or wrong sts:ExternalId condition.
  - Access key: key disabled / rotated on the customer side without telling us.
  - R2: token created before the bucket existed (Cloudflare bug); regenerate.
- If logs show the call succeeded but the dashboard still shows red, that’s a UI bug. File an issue; don’t try to patch around it.
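The two AssumeRole causes above can be checked mechanically once the customer shares their trust policy JSON. The sketch below assumes the standard IAM trust policy document shape. The helper name findTrustPolicyIssues, the example ARN, and the external ID are hypothetical; the real principal ARN has to be looked up as described above. Wildcard actions and principals are not handled.

```typescript
// Only the fields this check reads from an IAM trust policy.
type TrustStatement = {
  Effect: "Allow" | "Deny";
  Principal?: { AWS?: string | string[] };
  Action: string | string[];
  Condition?: { StringEquals?: Record<string, string | string[]> };
};
type TrustPolicy = { Version?: string; Statement: TrustStatement[] };

// IAM allows a single string or an array in these positions; normalize to an array.
function asList(v: string | string[] | undefined): string[] {
  if (v === undefined) return [];
  return Array.isArray(v) ? v : [v];
}

// Hypothetical helper: returns a list of problems, empty when the policy looks right.
function findTrustPolicyIssues(
  policy: TrustPolicy,
  expectedPrincipalArn: string,
  expectedExternalId: string
): string[] {
  const assumeRole = policy.Statement.filter(
    (s) => s.Effect === "Allow" && asList(s.Action).indexOf("sts:AssumeRole") !== -1
  );
  if (assumeRole.length === 0) return ["no Allow statement for sts:AssumeRole"];

  const forPrincipal = assumeRole.filter(
    (s) => asList(s.Principal?.AWS).indexOf(expectedPrincipalArn) !== -1
  );
  if (forPrincipal.length === 0) {
    return ["trust policy does not name the expected principal ARN"];
  }

  const externalIdOk = forPrincipal.some(
    (s) =>
      asList(s.Condition?.StringEquals?.["sts:ExternalId"]).indexOf(expectedExternalId) !== -1
  );
  return externalIdOk ? [] : ["sts:ExternalId condition is missing or does not match"];
}
```

A policy that still names a previous account's principal yields the principal message, which corresponds to the first cause in the list.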
“Polling never picks up a file”
- Check s3-poller CloudWatch logs for the integration’s UUID at the next poll tick (≤ 2 min from now).
- Look for the line listed N objects under prefix=<prefix>. If N=0 but the customer swears the file is there:
  - Prefix is case-sensitive. Confirm exact match including trailing slash.
  - Confirm the customer’s IAM grants s3:ListBucket on the bucket (not just s3:GetObject on objects, a common misconfiguration that passes Test connection but breaks listing).
- If N > 0 but no jobs land in s3_ingest_queue:
  - File size > 200 MB (S3_POLL_MAX_OBJECT_BYTES in packages/shared/src/s3-integrations.ts). The poller increments PollResult.skippedOversize but does not emit a per-object log line today; you can only see the aggregate count in the poller’s tick-summary log. Confirm by checking the file’s Content-Length in S3 directly.
  - File extension is not .pdf (we filter at poller).
- If integration is in event mode, the poller intentionally skips it. Confirm detection_mode in the DB.
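The per-object checks in this section can be summarized as a small decision function, which helps when reasoning about why a specific key was skipped. This is a sketch under stated assumptions, not the poller's code: the names below are hypothetical, the 200 MB limit is modeled as 200 × 1024 × 1024 bytes (the runbook does not say whether S3_POLL_MAX_OBJECT_BYTES is decimal or binary), and the extension check is assumed to be case-insensitive.

```typescript
// Assumed value; check S3_POLL_MAX_OBJECT_BYTES for the real limit.
const ASSUMED_MAX_OBJECT_BYTES = 200 * 1024 * 1024;

type ListedObject = { key: string; size: number };
type PollDecision = "enqueue" | "skip_prefix" | "skip_not_pdf" | "skip_oversize";

// Hypothetical helper mirroring the troubleshooting order above.
function classifyListedObject(obj: ListedObject, prefix: string): PollDecision {
  // Prefix comparison is literal and case-sensitive, trailing slash included.
  if (obj.key.slice(0, prefix.length) !== prefix) return "skip_prefix";
  // Assumption: ".PDF" is accepted as well as ".pdf"; the real filter may differ.
  if (obj.key.slice(-4).toLowerCase() !== ".pdf") return "skip_not_pdf";
  // Oversize objects are the case that is only counted today, not logged per object.
  if (obj.size > ASSUMED_MAX_OBJECT_BYTES) return "skip_oversize";
  return "enqueue";
}
```

For example, a key of Inbox/report.pdf checked against the prefix inbox/ is classified as skip_prefix, which is the case-sensitivity trap called out above.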
“Event mode produces no jobs”
- Check app_logs filtered on service=s3-ingest and look at eventRejected entries.
- Common rejection reasons:
  - invalid_secret: customer didn’t re-set EVENT_SOURCE_SECRET after rotating it in the dashboard. Tell them to run wrangler secret put EVENT_SOURCE_SECRET and redeploy.
  - bucket_mismatch: event came from a bucket name that doesn’t match the integration row.
  - paused: integration was paused; events are rejected on purpose.
  - mode_mismatch: integration is in poll mode but an event arrived. Either flip to event mode or stop the customer’s event source.
- R2-specific: if no events arrive at all (not just rejected, literally nothing), the customer’s forwarder Worker is probably misconfigured. Check that the Cloudflare Queue name matches across (a) the queue itself, (b) the R2 event notification target, and (c) the Worker’s queues.consumers.queue binding.
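The four rejection reasons can be read as a sequence of checks against the integration row. The sketch below is illustrative only: the type and function names are hypothetical, the detection_mode values "poll" and "event" are assumed, and the order in which the real ingest Lambda evaluates the checks is not documented here.

```typescript
// Subset of an s3_integrations row; field values are assumptions where noted.
type IntegrationRow = {
  bucket: string;
  status: "active" | "paused";
  detection_mode: "poll" | "event"; // assumed enum values
  event_source_secret: string;
};
type InboundEvent = { bucket: string; secret: string };
type RejectReason = "invalid_secret" | "bucket_mismatch" | "paused" | "mode_mismatch";

// Hypothetical helper: returns the rejection reason, or null if the event is accepted.
function rejectionReason(row: IntegrationRow, inbound: InboundEvent): RejectReason | null {
  // Plain equality keeps the sketch short; production code should compare in constant time.
  if (inbound.secret !== row.event_source_secret) return "invalid_secret";
  if (inbound.bucket !== row.bucket) return "bucket_mismatch";
  if (row.status === "paused") return "paused";
  if (row.detection_mode !== "event") return "mode_mismatch";
  return null;
}
```

Checking the secret first is a design choice in this sketch: an unauthenticated caller then learns nothing about bucket names or integration state from the rejection reason.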
“Remediated files never appear in my bucket”
- Confirm output prefix in the DB matches what the customer expects.
- Confirm the role/key has s3:PutObject on <bucket>/<output_prefix>/*; s3:GetObject alone is not enough.
- If output prefix is empty AND input prefix is empty, our writeback refuses (loop protection). The customer must set at least an output_prefix.
- Check app_logs for writeback_failed with the integration id. Often surfaces an unfriendly AccessDenied we should translate in the UI but don’t yet.
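Two of the checks above are easy to get wrong by hand: the exact resource the customer must grant s3:PutObject on, and the loop-protection rule. The following sketch shows both. The function names are hypothetical, the ARN format applies to AWS S3 only (R2 and B2 scope permissions differently), and trimming slashes from the prefix is an assumption about how output_prefix is stored.

```typescript
// Hypothetical helper: the S3 resource ARN that needs s3:PutObject for writeback.
function writebackResourceArn(bucket: string, outputPrefix: string): string {
  // Assumption: leading/trailing slashes on the stored prefix are not significant.
  const trimmed = outputPrefix.replace(/^\/+|\/+$/g, "");
  return trimmed === ""
    ? "arn:aws:s3:::" + bucket + "/*"
    : "arn:aws:s3:::" + bucket + "/" + trimmed + "/*";
}

// Loop protection as stated above: with both prefixes empty, written files would
// land where the integration watches for input, so writeback refuses.
function writebackRefused(inputPrefix: string, outputPrefix: string): boolean {
  return inputPrefix === "" && outputPrefix === "";
}
```

Comparing the ARN this produces against the Resource in the customer's policy is usually quicker than reading the AccessDenied entry in app_logs.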
Credential rotation
Customer-initiated key rotation (R2 / B2 / access-key AWS)
From the integration detail page, customer hits Rotate credentials → enters new key + secret. The API:
1. Seals the new pair via KMS.
2. Atomically updates access_key_id_ciphertext, secret_access_key_ciphertext, and kms_key_version (within a transaction so a partial write can’t leave a mismatched pair).
3. Touches updated_at.
The old plaintext is gone after step 1. There is no audit row that contains either old or new key material. The customer is responsible for revoking the old key on their side.
Event source secret rotation
Same UI button as above. After rotation, every event that still carries the old secret is rejected as invalid_secret until the customer re-sets it on their event source. This is intentional: the rejection log becomes the audit trail showing that an old secret is still in use somewhere in their infrastructure.
KMS CMK rotation
The CMK uses automatic annual key rotation (key material rotated, key ARN and alias stable). Application code doesn’t need to do anything for routine rotation: KMS retains the earlier key material, so kms:Decrypt continues to work on ciphertext sealed under any prior version.
If we need to manually rotate (compromise event):
1. Create new CMK with a new alias (e.g. alias/accessible-integrations-v2).
2. Update the workers/api env var to point at the new alias.
3. Deploy API Worker. Do not deploy Lambdas yet. New writes use the new key; old reads still use the old.
4. Run the re-encrypt batch (workers/api/src/scripts/reseal-integration-creds.ts; TODO, not yet written. The planned approach is unseal-with-old + reseal-with-new + update kms_key_version).
5. Deploy Lambdas after the batch completes; at that point all rows are on the new key.
6. Schedule deletion of the old CMK (≥ 7 days, AWS-enforced minimum).
Until the re-encrypt batch exists, manual rotation is a one-way trip with downtime. Don’t do it without coordinating with engineering.
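Since the reseal script does not exist yet, the following is only a sketch of the planned per-row step (unseal-with-old, reseal-with-new, update kms_key_version). Everything here is hypothetical: the type and function names, the use of strings where the real columns are bytea, and the synchronous seal/unseal callbacks standing in for what would be asynchronous KMS calls.

```typescript
// Simplified row; the real ciphertext columns are bytea, modeled as strings here.
type SealedRow = {
  id: string;
  access_key_id_ciphertext: string;
  secret_access_key_ciphertext: string;
  kms_key_version: number;
};

// Stand-ins for the sealing helpers; real versions would call KMS and be async.
type Unseal = (ciphertext: string, keyVersion: number) => string;
type Seal = (plaintext: string, keyVersion: number) => string;

// Hypothetical per-row reseal: returns the row unchanged if it is already on the
// target key version, otherwise a new row sealed under the target version.
function resealRow(row: SealedRow, targetVersion: number, unseal: Unseal, seal: Seal): SealedRow {
  if (row.kms_key_version === targetVersion) return row;

  const accessKeyId = unseal(row.access_key_id_ciphertext, row.kms_key_version);
  const secretAccessKey = unseal(row.secret_access_key_ciphertext, row.kms_key_version);

  // Both ciphertexts and the version change together, mirroring the atomic
  // update used for customer-initiated rotation.
  return {
    ...row,
    access_key_id_ciphertext: seal(accessKeyId, targetVersion),
    secret_access_key_ciphertext: seal(secretAccessKey, targetVersion),
    kms_key_version: targetVersion,
  };
}
```

Skipping rows already on the target version makes the batch safe to re-run after a partial failure, which matters because the old CMK cannot be scheduled for deletion until every row has moved.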
Pause / resume
- PATCH /api/integrations/s3/:id with { status: 'paused' } stops polling and causes the ingest Lambda to reject inbound events with reason paused.
- Resume sets it back to active. No state needs to be cleared on the customer’s side.
- Use this proactively when:
  - A single integration is failing in a tight loop and filling logs.
  - Customer requests a maintenance pause.
  - We’re rolling out a schema change that touches s3_integrations and want zero in-flight work.
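For scripting a pause or resume, the request described above can be built as follows. This is a minimal sketch: the helper name buildStatusRequest is hypothetical, and authentication headers and the API base URL are deliberately left out because they are not covered in this runbook.

```typescript
type StatusRequest = { method: "PATCH"; path: string; body: string };

// Hypothetical helper: builds the pause/resume request for one integration.
function buildStatusRequest(integrationId: string, status: "active" | "paused"): StatusRequest {
  return {
    method: "PATCH",
    // Encode the id defensively; a UUID passes through unchanged.
    path: "/api/integrations/s3/" + encodeURIComponent(integrationId),
    body: JSON.stringify({ status: status }),
  };
}
```

The same helper covers resume by passing "active", which matches the note that no state needs to be cleared on the customer's side.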
Failure-mode escalation
| Symptom | First responder | Page eng if |
|---|---|---|
| Single customer reports broken integration | Support | Multiple customers, OR Lambda CloudWatch alarms also firing |
| s3_ingest_queue DLQ has messages | Eng (oncall) | Always; the DLQ is the contract that something needs a human |
| Per-integration error rate > 50% over 1 hour | Eng (oncall) | Always — usually means a provider-side change (R2 API breaking, AWS STS regional outage) |
| KMS Decrypt failures in API logs | Eng (oncall) | Immediately — credentials are unrecoverable if the CMK is gone |
| All usage-summary requests 500 | Eng | Within hours; customer-visible regression but doesn’t block ingest |
Known unfinished work (as of 2026-05-25)
- Event source secret sealing: s3_integrations.event_source_secret is currently a plaintext UUID column. Sealing it with the same KMS envelope used for access keys is straightforward (the seal/unseal helpers already exist) but has not been scoped or scheduled. Until then, the column relies on Postgres RLS for confidentiality.
- Per-object oversize log line: the poller silently increments PollResult.skippedOversize for >200 MB files but doesn’t emit a per-object warn. Adding one is cheap and turns the “Polling never picks up a file” troubleshooting step into a one-grep diagnosis.
- Per-folder defaults: promised in the docs (“pick a quality tier, language, and notification target per watched folder”) but not implemented. The UI accepts one set of defaults per integration, not per prefix.
- Cost-tracking join: the Phase 6 commit explicitly noted a type mismatch between cost_ledger.file_id (text) and s3_processed_objects.job_id (uuid). The customer-facing 30-day rollup intentionally omits dollar figures until that join is validated.
- Reseal batch script: see “KMS CMK rotation” above. Manual rotation has no clean path until this lands.
- R2 forwarder Worker auto-deploy: customers currently have to copy/paste the Worker source. A future Cloudflare API integration could deploy on their behalf.
Related references
- User docs: docs/user/s3-integrations.md
- Marketing: apps/home/src/app/integrations/s3/ → https://theaccessible.org/integrations/s3
- Schema: migrations 111 through 116 (six total) in supabase/migrations/
- Original roadmap: docs/admin/s3-customer-storage-implementation-plan.md (pre-build, kept for context only)
- PR thread: #688 (phases 1–6, merged 2026-04 through 2026-05-18)