
S3 Integrations — Operational Runbook

Operational reference for the S3 bucket integration feature (PR #688, phases 1–6, shipped 2026-05-18). Pair with the customer-facing guide at docs/user/s3-integrations.md and the original roadmap in docs/admin/s3-customer-storage-implementation-plan.md (historical — does not reflect what shipped).

Architecture at a glance

poll mode:   customer bucket ─list/get─▶ s3-poller (Lambda, every 2 min) ─▶ s3_ingest_queue (SQS) ─▶ s3-ingest (Lambda)
event mode:  customer bucket ─event─▶ EventBridge / R2 forwarder ─▶ s3-ingest (Lambda)
both modes:  s3-ingest (Lambda) ─▶ pipeline_queue (SQS) ─▶ accessibility pipeline ─▶ s3-integration-writeback ─▶ customer bucket / output_prefix/

Components

| Layer | Path | Notes |
| --- | --- | --- |
| Schema | supabase/migrations/20260517_111 through 20260518_116 | Six migrations total; 116 is the SECURITY DEFINER usage rollup. |
| DB tables | s3_integrations, s3_processed_objects | Per-user via auth.users(id); RLS enforced. |
| Credential sealing | workers/api/src/services/integration-creds.ts | KMS envelope encryption, alias alias/accessible-pdf-<env>-integrations (env-templated via envName() in infra/cdk/lib/env-config.ts, e.g. alias/accessible-pdf-production-integrations). Plaintext never touches Postgres. |
| API routes | workers/api/src/routes/s3-integrations.ts | CRUD + :id/usage-summary (calls s3_integration_usage_summary RPC). |
| Client factory | workers/api/src/services/s3-client-factory.ts | Builds an AWS SDK client per provider (AssumeRole / access key / R2 / B2 / generic). |
| Polling Lambda | s3-poller (infra/cdk/lib/stacks/integrations-stack.ts:67) | EventBridge rate(2 min). Lists active poll-mode integrations, fans into s3_ingest_queue. |
| Ingest Lambda | s3-ingest (SqsEventSource) | Consumes s3_ingest_queue, hands jobs to pipeline_queue. |
| Writeback | s3-integration-writeback.ts | After pipeline completion, PUTs the remediated PDF + HTML under output_prefix/. |
| R2 event receiver | workers/r2-event-receiver/ | Worker template generated for customers; verifies EVENT_SOURCE_SECRET. |
| UI | apps/web/src/app/account/integrations/ + apps/web/src/components/IntegrationsSection.tsx | Surfaced in Settings → Integrations. |
| Marketing | apps/home/src/app/integrations/s3/ | theaccessible.org/integrations/s3 |
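
As a rough illustration of what "a client per provider" involves, the sketch below maps a provider choice to the region and endpoint an S3-compatible client would need. It is not the contents of s3-client-factory.ts: the ProviderConfig shape, the provider labels, and the clientSettings name are hypothetical, and the forcePathStyle choice for generic endpoints is an assumption. The R2 and B2 endpoint patterns are the ones those providers publish for their S3-compatible APIs.

```typescript
// Hypothetical config shape; the real factory's types are not shown in this runbook.
type ProviderConfig =
  | { provider: "aws_assume_role"; region: string; roleArn: string; externalId: string }
  | { provider: "aws_access_key"; region: string }
  | { provider: "r2"; accountId: string }
  | { provider: "b2"; region: string }
  | { provider: "generic"; endpoint: string; region: string };

interface ClientSettings {
  region: string;
  endpoint?: string;
  forcePathStyle: boolean;
}

export function clientSettings(cfg: ProviderConfig): ClientSettings {
  switch (cfg.provider) {
    case "aws_assume_role":
    case "aws_access_key":
      // Native AWS: the SDK resolves the endpoint from the region.
      return { region: cfg.region, forcePathStyle: false };
    case "r2":
      // R2 uses the literal region "auto" and a per-account endpoint.
      return {
        region: "auto",
        endpoint: `https://${cfg.accountId}.r2.cloudflarestorage.com`,
        forcePathStyle: false,
      };
    case "b2":
      return {
        region: cfg.region,
        endpoint: `https://s3.${cfg.region}.backblazeb2.com`,
        forcePathStyle: false,
      };
    case "generic":
      // Assumption: self-hosted S3-compatible stores usually want path-style URLs.
      return { region: cfg.region, endpoint: cfg.endpoint, forcePathStyle: true };
  }
}
```

The returned object lines up with the region, endpoint, and forcePathStyle options of the AWS SDK v3 S3 client; credentials would be attached separately after unsealing.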

Secrets and KMS

  • KMS CMK alias: alias/accessible-pdf-<env>-integrations (us-east-1 in prod). Env-templated via envName() in infra/cdk/lib/env-config.ts — prod resolves to alias/accessible-pdf-production-integrations. Used for envelope encryption of S3-integration access keys.
  • Access keys (AWS user-key, R2 token, B2 application key) are sealed on insert via integration-creds.sealCreds(). Postgres stores only the ciphertext (bytea) and kms_key_version (int).
  • Event source secret is stored as a plaintext UUID in s3_integrations.event_source_secret (migration 20260518_115). The r2-event-receiver does a constant-time string compare against this value. Sealing this field is tracked under “Known unfinished work” below — until then, treat the column as sensitive (RLS-scoped per user, never logged). Customers see the value once at issuance and after each rotation; they must re-set it in their R2 forwarder Worker or AWS EventBridge target with wrangler secret put / re-deploy.
  • kms_key_version lets a future re-encrypt batch upgrade rows after a CMK rotation without unsealing every blob on first read.
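
For the constant-time compare of the event source secret, a minimal sketch follows. It assumes a Node-compatible crypto module (in a Cloudflare Worker that means the nodejs_compat flag); secretsMatch is a hypothetical name, not the receiver's actual code.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// Hypothetical helper. Hashing both inputs first yields equal-length
// buffers, which timingSafeEqual requires, and avoids leaking the stored
// secret's length through an early length check.
export function secretsMatch(presented: string, stored: string): boolean {
  const a = createHash("sha256").update(presented, "utf8").digest();
  const b = createHash("sha256").update(stored, "utf8").digest();
  return timingSafeEqual(a, b);
}
```

A plain === on the two strings can return as soon as the first differing character is found, which is the timing signal this pattern is meant to remove.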

Deployment

The integration feature spans three deployables — all three must roll together for a schema change to take effect cleanly:

  1. Schema: supabase db push against the relevant project. Migrations are forward-only; rollback strategy is a new migration, not db reset.
  2. API Worker: wrangler deploy from workers/api/. Owns CRUD, sealing/unsealing, and the usage-summary RPC caller.
  3. Lambdas (poller + ingest): cd infra/cdk && npx cdk deploy IntegrationsStack. Bundles workers/api/dist-lambda/s3-poller and workers/api/dist-lambda/s3-ingest from the API build output. The CDK stack also re-creates the EventBridge schedule (2 min) and SQS queue bindings.

The CDK stack assumes the QueueStack (which owns s3_ingest_queue and pipeline_queue) is already deployed. If you redeploy QueueStack with new queue ARNs, redeploy IntegrationsStack right after to update the IAM grants — otherwise the Lambdas will get AccessDenied on SendMessage.

Monitoring

| Signal | Where | Trigger |
| --- | --- | --- |
| s3-poller Lambda errors | CloudWatch → /aws/lambda/<env>-s3-poller | >3 errors in 5 min |
| s3-ingest Lambda errors | CloudWatch → /aws/lambda/<env>-s3-ingest | >3 errors in 5 min |
| s3_ingest_queue depth | CloudWatch SQS metric | >1000 messages for 10 min |
| s3_ingest_queue DLQ depth | CloudWatch SQS metric | any messages → page |
| Per-integration error rate | Supabase app_logs filtered on service=s3-poller or s3-ingest | n/a (diagnostic only) |
| RPC failures on usage-summary | Loki / Grafana, label route=GET /api/integrations/s3/:id/usage-summary | >5 in 5 min |

The 30-day customer-facing usage tile reads s3_integration_usage_summary — if it starts returning 500s in bulk, it’s almost always one of: KMS access lost on the Worker, RPC permission revoked, or a migration applied to prod with a schema change to s3_processed_objects that doesn’t match the function signature.

Common customer-reported failures

“Test connection fails immediately”

  1. Check app_logs for the user’s userId + service=s3-integrations, look at the testConnection entry.
  2. Most common causes, in order:
    • AWS AssumeRole: customer’s trust policy still references the OLD principal ARN (we moved accounts at some point — confirm against the current one in integration-creds.ts).
    • AWS AssumeRole: missing or wrong sts:ExternalId condition.
    • Access key: key disabled / rotated on the customer side without telling us.
    • R2: token created before the bucket existed — Cloudflare bug, regenerate.
  3. If logs show the call succeeded but the dashboard still shows red, that’s a UI bug — file an issue, don’t try to patch around it.

“Polling never picks up a file”

  1. Check s3-poller CloudWatch logs for the integration’s UUID at the next poll tick (≤ 2 min from now).
  2. Look for the line listed N objects under prefix=<prefix>. If N=0 but the customer swears the file is there:
    • Prefix is case-sensitive. Confirm exact match including trailing slash.
    • Confirm the customer’s IAM grants s3:ListBucket on the bucket (not just s3:GetObject on objects — a common misconfiguration that passes Test connection but breaks listing).
  3. If N > 0 but no jobs land in s3_ingest_queue:
    • File size > 200 MB (S3_POLL_MAX_OBJECT_BYTES in packages/shared/src/s3-integrations.ts). The poller increments PollResult.skippedOversize but does not emit a per-object log line today — you can only see the aggregate count in the poller’s tick-summary log. Confirm by checking the file’s Content-Length in S3 directly.
    • File extension is not .pdf (we filter at poller).
  4. If integration is in event mode, the poller intentionally skips it. Confirm detection_mode in the DB.
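
The per-object checks described in the steps above can be sketched as a single classification function. This is an illustration, not the poller's code: classifyObject and the PollDecision labels are hypothetical, the 200 MB limit is assumed to mean 200 × 1024 × 1024 bytes (the real S3_POLL_MAX_OBJECT_BYTES may be defined differently), and the case-insensitive extension check is also an assumption.

```typescript
// Assumption: binary megabytes. The real constant lives in
// packages/shared/src/s3-integrations.ts.
const S3_POLL_MAX_OBJECT_BYTES = 200 * 1024 * 1024;

type PollDecision = "enqueue" | "skip_prefix" | "skip_not_pdf" | "skip_oversize";

interface ListedObject {
  key: string;
  size: number; // bytes, as reported by the bucket listing
}

export function classifyObject(obj: ListedObject, prefix: string): PollDecision {
  // Object keys are case-sensitive, so "Inbox/" and "inbox/" are different prefixes.
  if (!obj.key.startsWith(prefix)) return "skip_prefix";
  // Assumption: the extension filter ignores case.
  if (!obj.key.toLowerCase().endsWith(".pdf")) return "skip_not_pdf";
  // Strictly greater than the limit; these objects only show up today
  // in the aggregate skippedOversize count.
  if (obj.size > S3_POLL_MAX_OBJECT_BYTES) return "skip_oversize";
  return "enqueue";
}
```

Running a customer's reported key and size through a function like this is a quick way to tell a prefix mismatch from an oversize or extension skip before digging into logs.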

“Event mode produces no jobs”

  1. Check app_logs filtered on service=s3-ingest, look at eventRejected entries.
  2. Common rejection reasons:
    • invalid_secret — customer didn’t re-set EVENT_SOURCE_SECRET after rotating it in the dashboard. Tell them to wrangler secret put EVENT_SOURCE_SECRET and redeploy.
    • bucket_mismatch — event came from a bucket name that doesn’t match the integration row.
    • paused — integration was paused; events are rejected on purpose.
    • mode_mismatch — integration is in poll mode but an event arrived. Either flip to event mode or stop the customer’s event source.
  3. R2-specific: if no events arrive at all (not just rejected — literally nothing), the customer’s forwarder Worker is probably misconfigured. Check that the Cloudflare Queue name matches across (a) the queue itself, (b) the R2 event notification target, and (c) the Worker’s queues.consumers.queue binding.
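
The four rejection reasons listed above can be read as a short chain of checks. The sketch below is a hypothetical reconstruction for triage purposes: the order of checks, the bucket field name, and the "poll" / "event" values for detection_mode are assumptions, and the real code should compare secrets in constant time rather than with !==.

```typescript
type RejectReason = "invalid_secret" | "bucket_mismatch" | "paused" | "mode_mismatch";

// Subset of the s3_integrations row relevant to event handling.
// The bucket field name is an assumption.
interface IntegrationRow {
  bucket: string;
  status: "active" | "paused";
  detection_mode: "poll" | "event";
  event_source_secret: string;
}

interface InboundEvent {
  bucket: string;
  secret: string;
}

// Returns null when the event should be accepted.
export function rejectReason(row: IntegrationRow, ev: InboundEvent): RejectReason | null {
  // Plain comparison for brevity only; production code should be constant-time.
  if (ev.secret !== row.event_source_secret) return "invalid_secret";
  if (ev.bucket !== row.bucket) return "bucket_mismatch";
  if (row.status === "paused") return "paused";
  if (row.detection_mode !== "event") return "mode_mismatch";
  return null;
}
```

Under this ordering, a customer who both rotated the secret and paused the integration would only ever see invalid_secret in the logs, which is worth keeping in mind when a single reason dominates.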

“Remediated files never appear in my bucket”

  1. Confirm output prefix in the DB matches what the customer expects.
  2. Confirm the role/key has s3:PutObject on <bucket>/<output_prefix>/*. s3:GetObject alone is not enough.
  3. If output prefix is empty AND input prefix is empty, our writeback refuses (loop protection). The customer must set at least an output_prefix.
  4. Check app_logs for writeback_failed with the integration id. Often surfaces an unfriendly AccessDenied we should translate in the UI but don’t yet.
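
The loop-protection rule in step 3 can be sketched as follows. The refusal condition comes from this runbook; the output key layout (output prefix plus the source file name) and the writebackKey name are assumptions for illustration.

```typescript
// Hypothetical helper: computes where a remediated file would be written,
// refusing the one configuration that would feed outputs back into ingest.
export function writebackKey(inputPrefix: string, outputPrefix: string, sourceKey: string): string {
  if (inputPrefix === "" && outputPrefix === "") {
    throw new Error("writeback refused: empty input and output prefixes would re-ingest outputs");
  }
  // Keep only the file name from the source key (assumed layout).
  const fileName = sourceKey.slice(sourceKey.lastIndexOf("/") + 1);
  // Tolerate a trailing slash on the configured prefix.
  const normalized = outputPrefix.replace(/\/+$/, "");
  return normalized === "" ? fileName : `${normalized}/${fileName}`;
}
```

With an empty input prefix the poller watches the whole bucket, so anything written back without an output prefix would be picked up again on the next tick. That is the loop the refusal prevents.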

Credential rotation

Customer-initiated key rotation (R2 / B2 / access-key AWS)

From the integration detail page, customer hits Rotate credentials → enters new key + secret. The API:

  1. Seals the new pair via KMS.
  2. Atomically updates access_key_id_ciphertext, secret_access_key_ciphertext, and kms_key_version (within a transaction so a partial write can’t leave a mismatched pair).
  3. Touches updated_at.

The old ciphertext is overwritten in step 2, so the old key material cannot be recovered from our side afterward. There is no audit row that contains either old or new key material. The customer is responsible for revoking the old key on their side.
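
To make the sealing step concrete, here is a minimal envelope-encryption sketch. It is not the code in integration-creds.ts: DataKeySource is a hypothetical stand-in for the KMS GenerateDataKey and Decrypt calls, the blob layout is an assumption, and sealPair is an illustrative name. The point it shows is that both ciphertexts and kms_key_version are produced as one patch so they can be written in a single transaction.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Hypothetical stand-in for KMS GenerateDataKey / Decrypt.
export interface DataKeySource {
  generateDataKey(): { plaintextKey: Buffer; wrappedKey: Buffer };
  unwrap(wrappedKey: Buffer): Buffer;
}

// Assumed blob layout:
// [2-byte wrappedKey length][wrappedKey][12-byte IV][16-byte GCM tag][ciphertext]
export function seal(plaintext: string, kms: DataKeySource): Buffer {
  const { plaintextKey, wrappedKey } = kms.generateDataKey();
  const iv = randomBytes(12);
  const cipher = createCipheriv("aes-256-gcm", plaintextKey, iv);
  const body = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  const tag = cipher.getAuthTag();
  plaintextKey.fill(0); // drop the data key from memory once used
  const len = Buffer.alloc(2);
  len.writeUInt16BE(wrappedKey.length);
  return Buffer.concat([len, wrappedKey, iv, tag, body]);
}

export function unseal(blob: Buffer, kms: DataKeySource): string {
  const keyLen = blob.readUInt16BE(0);
  const wrappedKey = blob.subarray(2, 2 + keyLen);
  const iv = blob.subarray(2 + keyLen, 14 + keyLen);
  const tag = blob.subarray(14 + keyLen, 30 + keyLen);
  const body = blob.subarray(30 + keyLen);
  const decipher = createDecipheriv("aes-256-gcm", kms.unwrap(wrappedKey), iv);
  decipher.setAuthTag(tag);
  return Buffer.concat([decipher.update(body), decipher.final()]).toString("utf8");
}

// One patch object, so the pair and its key version are updated together.
export function sealPair(
  accessKeyId: string,
  secretAccessKey: string,
  kms: DataKeySource,
  kmsKeyVersion: number,
) {
  return {
    access_key_id_ciphertext: seal(accessKeyId, kms),
    secret_access_key_ciphertext: seal(secretAccessKey, kms),
    kms_key_version: kmsKeyVersion,
  };
}
```

Only the wrapped data key and the AES-GCM output are stored; recovering the plaintext requires a KMS call to unwrap the data key, which is why losing access to the CMK makes the credentials unrecoverable.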

Event source secret rotation

Same UI button as above. After rotation, every event that presents the old secret is rejected as invalid_secret until the customer re-sets it on their event source. This is intentional: the rejection log becomes the audit trail showing that an old secret is still in use somewhere in their infrastructure.

KMS CMK rotation

The CMK has automatic annual key rotation enabled (key material rotates, key ID and ARN stay stable). Application code doesn’t need to do anything for routine rotation: KMS retains the prior key material, so kms:Decrypt continues to work on ciphertext sealed under any earlier version.

If we need to manually rotate (compromise event):

  1. Create a new CMK with a new alias (e.g. alias/accessible-pdf-<env>-integrations-v2).
  2. Update workers/api env var to point at the new alias.
  3. Deploy API Worker. Do not deploy Lambdas yet. New writes use the new key; old reads still use the old.
  4. Run the re-encrypt batch (workers/api/src/scripts/reseal-integration-creds.ts — TODO, not yet written; the planned approach is unseal-with-old + reseal-with-new + update kms_key_version).
  5. Deploy Lambdas after the batch completes — at that point all rows are on the new key.
  6. Schedule deletion of the old CMK (≥ 7 days, AWS-enforced minimum).

Until the re-encrypt batch exists, manual rotation is a one-way trip with downtime. Don’t do it without coordinating with engineering.
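
The planned unseal-with-old, reseal-with-new, bump-version approach could look roughly like the sketch below. This is not the reseal script (which does not exist yet); Sealer, CredRow, and resealRows are hypothetical names, and the sketch leaves out database access, batching, and error handling.

```typescript
// Hypothetical row shape using the column names from this runbook.
interface CredRow {
  id: string;
  access_key_id_ciphertext: Uint8Array;
  secret_access_key_ciphertext: Uint8Array;
  kms_key_version: number;
}

// Hypothetical seam: unsealOld uses the old CMK, sealNew uses the new one.
interface Sealer {
  unsealOld(blob: Uint8Array): string;
  sealNew(plaintext: string): Uint8Array;
}

// Returns replacement rows for everything still below targetVersion.
// Rows already at the target are skipped, so a re-run after a partial
// failure does not reseal them twice.
export function resealRows(rows: CredRow[], targetVersion: number, sealer: Sealer): CredRow[] {
  return rows
    .filter((r) => r.kms_key_version < targetVersion)
    .map((r) => ({
      id: r.id,
      access_key_id_ciphertext: sealer.sealNew(sealer.unsealOld(r.access_key_id_ciphertext)),
      secret_access_key_ciphertext: sealer.sealNew(sealer.unsealOld(r.secret_access_key_ciphertext)),
      kms_key_version: targetVersion,
    }));
}
```

Each returned row would still need to be written back in its own transaction, for the same mismatched-pair reason that applies to customer-initiated rotation.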

Pause / resume

  • PATCH /api/integrations/s3/:id with { status: 'paused' } stops polling and causes the ingest Lambda to reject inbound events with reason paused.
  • Resume sets it back to active. No state needs to be cleared on the customer’s side.
  • Use this proactively when:
    • A single integration is failing in a tight loop and filling logs.
    • Customer requests a maintenance pause.
    • We’re rolling out a schema change that touches s3_integrations and want zero in-flight work.
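
For scripting a pause or resume, the request can be built as in the sketch below. The route and body come from this runbook; the base URL, the bearer-token auth header, and the buildStatusPatch name are assumptions, so check the API's actual auth scheme before using it.

```typescript
type IntegrationStatus = "active" | "paused";

// Hypothetical helper: returns the URL and fetch options for a status change.
export function buildStatusPatch(
  apiBase: string,
  integrationId: string,
  status: IntegrationStatus,
  bearerToken: string, // assumption: the API accepts a bearer token
) {
  return {
    url: `${apiBase.replace(/\/+$/, "")}/api/integrations/s3/${encodeURIComponent(integrationId)}`,
    init: {
      method: "PATCH",
      headers: {
        "content-type": "application/json",
        authorization: `Bearer ${bearerToken}`,
      },
      body: JSON.stringify({ status }),
    },
  };
}

// Usage (not executed here):
//   const { url, init } = buildStatusPatch(base, id, "paused", token);
//   const res = await fetch(url, init);
```

Resuming is the same call with "active" as the status.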

Failure-mode escalation

| Symptom | First responder | Page eng if |
| --- | --- | --- |
| Single customer reports broken integration | Support | Multiple customers, OR Lambda CloudWatch alarms also firing |
| s3_ingest_queue DLQ has messages | Eng (oncall) | Always. DLQ is the contract that something needs a human |
| Per-integration error rate > 50% over 1 hour | Eng (oncall) | Always. Usually means a provider-side change (R2 API breaking, AWS STS regional outage) |
| KMS Decrypt failures in API logs | Eng (oncall) | Immediately. Credentials are unrecoverable if the CMK is gone |
| All usage-summary requests 500 | Eng | Within hours. Customer-visible regression but doesn’t block ingest |

Known unfinished work (as of 2026-05-25)

  • Event source secret sealing: s3_integrations.event_source_secret is currently a plaintext UUID column. Sealing it with the same KMS envelope used for access keys is straightforward (the seal/unseal helpers already exist) but has not been scoped. Until then, the column relies on Postgres RLS for confidentiality.
  • Per-object oversize log line — the poller silently increments PollResult.skippedOversize for >200 MB files but doesn’t emit a per-object warn. Adding one is cheap and turns the “Polling never picks up a file” troubleshooting step into a one-grep diagnosis.
  • Per-folder defaults — promised in the docs (“pick a quality tier, language, and notification target per watched folder”) but not implemented. The UI accepts one set of defaults per integration, not per prefix.
  • Cost-tracking join — Phase 6 commit explicitly noted cost_ledger.file_id (text) vs s3_processed_objects.job_id (uuid) mismatch. The customer-facing 30-day rollup intentionally omits dollar figures until that join is validated.
  • Reseal batch script — see “KMS CMK rotation” above. Manual rotation has no clean path until this lands.
  • R2 forwarder Worker auto-deploy — customers currently have to copy/paste the Worker source. A future Cloudflare API integration could deploy on their behalf.

Related

  • User docs: docs/user/s3-integrations.md
  • Marketing: apps/home/src/app/integrations/s3/ (https://theaccessible.org/integrations/s3)
  • Schema: migrations 111, 112, 114, 115, 116 in supabase/migrations/
  • Original roadmap: docs/admin/s3-customer-storage-implementation-plan.md (pre-build, kept for context only)
  • PR thread: #688 (phases 1–6, merged 2026-04 through 2026-05-18)