Consolidate to a Single Production AWS Stack
Problem
We have staging-named CDK stacks (AccessiblePdfStaging-*) running production workloads. The staging Lambda, SQS, DynamoDB, and email stacks are all configured with manual overrides to point at production resources (production SQS queue, production DynamoDB table, production ECR image). Every CDK redeploy risks reverting these manual env var changes back to staging defaults.
Current state of manual overrides
| Resource | CDK default (staging) | Manual override (production) |
|---|---|---|
| Lambda SQS_QUEUE_URL | accessible-pdf-staging-pipeline | accessible-pdf-production-pipeline |
| Lambda IAM (inline policy) | staging queue only | added ProductionSqsAccess |
| Email Lambda DYNAMODB_TABLE | accessible-pdf-staging-data | accessible-pdf-production-data |
| Email Lambda SQS_QUEUE_URL | staging pipeline | production pipeline |
| Email Lambda FROM_EMAIL | noreply@staging.theaccessible.org | noreply@pdf.theaccessible.org |
| Email Lambda IAM | staging SQS + DynamoDB only | added ProductionSqsAccess + ProductionDynamoAccess |
| SES receipt rule | convert@staging.theaccessible.org | convert@pdf.theaccessible.org (in helpdesk rule set) |
| EC2 workers | Pull from accessible-pdf-production-worker ECR | Correct (production compute stack) |
| Staging Compute stack | Was running | Deleted |
| Staging Monitoring stack | Was running | Deleted |
Risk: Running cdk deploy --all will recreate staging compute/monitoring stacks, reset Lambda env vars to staging defaults, and break the production workflow.
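The drift described above can be made visible before any deploy by comparing the values CDK would set with the values that are live. A minimal sketch, assuming the live environment variables have already been fetched into a plain object (for example from the Lambda configuration). The findDrift helper is hypothetical and not part of the repo, and the sample values are copied from the table, with queue names standing in for full queue URLs:

```typescript
// Hypothetical helper (not in the codebase): lists every key whose live
// value differs from the value a CDK deploy would write.
type EnvVars = Record<string, string>;

interface Drift {
  key: string;
  cdkValue: string | undefined;
  liveValue: string | undefined;
}

function findDrift(cdk: EnvVars, live: EnvVars): Drift[] {
  const allKeys = Object.keys(cdk).concat(Object.keys(live));
  // De-duplicate and sort so the report is stable between runs.
  const uniqueKeys = allKeys.filter((key, i) => allKeys.indexOf(key) === i).sort();
  return uniqueKeys
    .filter((key) => cdk[key] !== live[key])
    .map((key) => ({ key: key, cdkValue: cdk[key], liveValue: live[key] }));
}

// Sample values from the override table (names stand in for full URLs).
const cdkDefaults: EnvVars = {
  SQS_QUEUE_URL: "accessible-pdf-staging-pipeline",
  DYNAMODB_TABLE: "accessible-pdf-staging-data",
  FROM_EMAIL: "noreply@staging.theaccessible.org",
};
const liveValues: EnvVars = {
  SQS_QUEUE_URL: "accessible-pdf-production-pipeline",
  DYNAMODB_TABLE: "accessible-pdf-production-data",
  FROM_EMAIL: "noreply@pdf.theaccessible.org",
};

const drift = findDrift(cdkDefaults, liveValues);
for (const d of drift) {
  console.log(`${d.key}: CDK would set ${d.cdkValue}, live is ${d.liveValue}`);
}
```

An empty result after the consolidation would indicate that no manual overrides remain.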
Goal
Deploy a single production CDK stack set that:
- Uses production resource names and configuration
- Eliminates all manual env var overrides
- Is safe to deploy with cdk deploy --all without breaking anything
- Keeps the staging environment definition for future use but does not deploy it by default
Plan
Phase 1: Update CDK to deploy production by default
File: infra/cdk/bin/app.ts
Currently the CDK app deploys staging stacks. Change it to deploy production stacks by default, controlled by an environment variable.
    // Current: hardcoded staging
    const config = getEnvConfig('staging');

    // Change to: default production, override with CDK_ENV
    const config = getEnvConfig(process.env.CDK_ENV || 'production');

This means cdk deploy deploys production. To deploy staging in the future: CDK_ENV=staging cdk deploy.
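Since a mistyped CDK_ENV would otherwise be passed straight to getEnvConfig, it may be worth validating the value first. A minimal sketch under that assumption: EnvName and resolveEnvName are hypothetical names, and the real getEnvConfig in env-config.ts may already reject unknown values, in which case this is redundant.

```typescript
// Sketch only: EnvName and resolveEnvName are illustrative names,
// not existing code in infra/cdk.
type EnvName = "staging" | "production";

function resolveEnvName(env: Record<string, string | undefined>): EnvName {
  const raw = env["CDK_ENV"];
  // Unset or empty falls back to production, matching `CDK_ENV || 'production'`.
  if (raw === undefined || raw === "") return "production";
  if (raw === "staging" || raw === "production") return raw;
  throw new Error(`Unknown CDK_ENV "${raw}"; expected "staging" or "production"`);
}

// In bin/app.ts this would be called as resolveEnvName(process.env).
console.log(resolveEnvName({}));                     // prints "production"
console.log(resolveEnvName({ CDK_ENV: "staging" })); // prints "staging"
```

Failing loudly on an unknown value keeps a typo from silently deploying the wrong environment.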
Phase 2: Update env-config.ts production values
File: infra/cdk/lib/env-config.ts
Verify the production config matches what's actually running. Current production config looks correct:

    production: {
      environment: 'production',
      maxWorkerInstances: 4,
      enablePitr: true,
      alertEmail: 'larry@anglin.com',
      nodeEnv: 'production',
      frontendUrl: 'https://pdf.theaccessible.org',
      fromEmail: 'noreply@pdf.theaccessible.org',
      emailRecipient: 'convert@pdf.theaccessible.org',
      s3BucketName: 'accessible-pdf-files',
    }

No changes needed: these values are already correct for production.
Phase 3: Fix api-stack.ts SQS queue reference
File: infra/cdk/lib/stacks/api-stack.ts
The Lambda env var SQS_QUEUE_URL comes from props.queue.queueUrl. When deploying as production, this will automatically be accessible-pdf-production-pipeline. No code change is needed; deploying as production fixes this.
However, the CloudWatch Logs IAM permission that we added manually must also be present in the stack code. It was added to the code earlier today; verify it's there before deploying.
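To illustrate why switching the environment is enough: if resource names are derived from the environment name, every name changes together. A minimal sketch assuming the naming pattern visible in the override table (accessible-pdf-<env>-pipeline, accessible-pdf-<env>-data). The resourceNames and mentionsStaging helpers are hypothetical; the real stacks pass props.queue.queueUrl rather than building strings.

```typescript
// Illustrative only: the real stacks wire resources through CDK props.
type EnvName = "staging" | "production";

interface ResourceNames {
  queueName: string;
  tableName: string;
}

function resourceNames(env: EnvName): ResourceNames {
  return {
    queueName: `accessible-pdf-${env}-pipeline`,
    tableName: `accessible-pdf-${env}-data`,
  };
}

// Guard for a production deploy: no derived name should mention staging.
function mentionsStaging(names: ResourceNames): boolean {
  return [names.queueName, names.tableName].some(
    (name) => name.indexOf("staging") !== -1
  );
}

const prod = resourceNames("production");
console.log(prod.queueName); // prints "accessible-pdf-production-pipeline"
console.log(prod.tableName); // prints "accessible-pdf-production-data"
console.log(mentionsStaging(prod)); // prints false
```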
Phase 4: Fix email-stack.ts
File: infra/cdk/lib/stacks/email-stack.ts
The email stack gets FROM_EMAIL, FRONTEND_URL, SQS_QUEUE_URL, and DYNAMODB_TABLE from CDK config/props. When deployed as production, all values will be correct automatically.
One change needed: the SES receipt rule is created by CDK in its own rule set (accessible-pdf-production-email), but the active rule set is helpdesk. CDK can't control which rule set is active (that's an account-level setting). Two options:
Option A (recommended): Remove the SES receipt rule from CDK. Manage it manually in the helpdesk rule set (where it already lives). Add a comment in email-stack.ts explaining this.
Option B: Have CDK create the rule but document that it must be manually copied to the helpdesk rule set.
Phase 5: Delete staging stacks
Delete all remaining staging CloudFormation stacks, but only after the production stacks are deployed and verified (step 6 in the execution order below). The staging compute and monitoring stacks are already deleted. Remaining:

    # Order matters: delete dependent stacks first; stacks that others depend on go last
    aws cloudformation delete-stack --stack-name AccessiblePdfStaging-Email --region us-east-1
    aws cloudformation wait stack-delete-complete --stack-name AccessiblePdfStaging-Email --region us-east-1
    aws cloudformation delete-stack --stack-name AccessiblePdfStaging-Api --region us-east-1
    aws cloudformation wait stack-delete-complete --stack-name AccessiblePdfStaging-Api --region us-east-1
    aws cloudformation delete-stack --stack-name AccessiblePdfStaging-Queue --region us-east-1
    aws cloudformation wait stack-delete-complete --stack-name AccessiblePdfStaging-Queue --region us-east-1
    aws cloudformation delete-stack --stack-name AccessiblePdfStaging-Storage --region us-east-1
    aws cloudformation wait stack-delete-complete --stack-name AccessiblePdfStaging-Storage --region us-east-1
    aws cloudformation delete-stack --stack-name AccessiblePdfStaging-Network --region us-east-1

Before deleting:
- Verify the production stacks exist and are healthy: aws cloudformation list-stacks --stack-status-filter CREATE_COMPLETE UPDATE_COMPLETE --region us-east-1
- Verify the production Lambda is serving traffic: curl https://api-pdf.theaccessible.org/health
- The staging DynamoDB table (accessible-pdf-staging-data) may have data from earlier test runs; check whether anything needs to be preserved
After deleting:
- Purge the staging SQS queue if it still exists
- Delete the staging ECR repo images if the stack deletion fails on ECR (same issue we hit before)
- Remove the accessible-pdf-staging-email SES rule set (orphaned, not active)
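The deletion order used in this phase can be derived from the stack dependencies instead of being maintained by hand. A minimal sketch using a depth-first topological sort. The dependency edges below are assumptions for illustration only, so check the real cross-stack references before relying on them, and note that deletionOrder is a hypothetical helper:

```typescript
// Assumed edges (stack -> stacks it depends on). Illustrative, not verified
// against the actual CDK app.
const dependsOn: Record<string, string[]> = {
  Network: [],
  Storage: ["Network"],
  Queue: [],
  Api: ["Network", "Storage", "Queue"],
  Email: ["Storage", "Queue"],
};

// Returns stacks so that every stack appears before the stacks it depends on.
function deletionOrder(graph: Record<string, string[]>): string[] {
  const state: Record<string, "visiting" | "done"> = {};
  const creationOrder: string[] = []; // dependencies before dependents

  function visit(stack: string): void {
    if (state[stack] === "done") return;
    if (state[stack] === "visiting") {
      throw new Error(`Dependency cycle at ${stack}`);
    }
    state[stack] = "visiting";
    for (const dep of graph[stack] || []) visit(dep);
    state[stack] = "done";
    creationOrder.push(stack);
  }

  for (const stack of Object.keys(graph)) visit(stack);
  return creationOrder.reverse(); // deletion is creation order reversed
}

const order = deletionOrder(dependsOn);
for (const name of order) {
  console.log(
    `aws cloudformation delete-stack --stack-name AccessiblePdfStaging-${name} --region us-east-1`
  );
}
```

With these assumed edges the result is Email, Api, Queue, Storage, Network, which matches the order of the commands in this phase.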
Phase 6: Remove inline IAM policies
The manual IAM policies we added (ProductionSqsAccess, ProductionDynamoAccess) will become redundant once the production stacks own the correct resources. After deploying production stacks:
    # These were added as workarounds; the production CDK stack grants the correct permissions
    ROLE=$(aws lambda get-function --function-name accessible-pdf-production-api --query 'Configuration.Role' --output text --region us-east-1 | sed 's/.*\///')
    aws iam delete-role-policy --role-name "$ROLE" --policy-name ProductionSqsAccess

    ROLE=$(aws lambda get-function --function-name accessible-pdf-production-email-intake --query 'Configuration.Role' --output text --region us-east-1 | sed 's/.*\///')
    aws iam delete-role-policy --role-name "$ROLE" --policy-name ProductionSqsAccess
    aws iam delete-role-policy --role-name "$ROLE" --policy-name ProductionDynamoAccess

Phase 7: Deploy production stacks
    cd infra/cdk
    cdk deploy --all --require-approval never

This creates AccessiblePdfProd-* stacks (Network, Storage, Queue, Api, Email). The production compute and monitoring stacks already exist.
After deploy:
- Verify health: curl https://api-pdf.theaccessible.org/health should show platform: aws
- Update Cloudflare LB origin if the API Gateway URL changes
- Update the SES helpdesk rule set if the email Lambda ARN changes
- Rebuild and push the worker Docker image to the production ECR repo if needed
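The health check in the list above can be scripted so the platform field is tested instead of read by eye. A minimal sketch, assuming the health endpoint returns a JSON object with a string platform field. Only that field is mentioned in this plan, so the rest of the response shape is unknown, and servedByAws is a hypothetical helper:

```typescript
// Assumption: the health body is JSON like {"platform":"aws", ...}.
// Returns false for non-JSON bodies or any other platform value.
function servedByAws(body: string): boolean {
  try {
    const parsed: unknown = JSON.parse(body);
    return (
      typeof parsed === "object" &&
      parsed !== null &&
      (parsed as { platform?: unknown }).platform === "aws"
    );
  } catch (err) {
    return false; // body was not valid JSON
  }
}

console.log(servedByAws('{"platform":"aws"}'));        // prints true
console.log(servedByAws('{"platform":"cloudflare"}')); // prints false
console.log(servedByAws("<html>502</html>"));          // prints false
```

The body would come from the same curl call used in the check, for example captured with curl -s.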
Phase 8: Update build/deploy scripts
File: workers/api/package.json
Add deploy scripts that make production the default:
    {
      "deploy:lambda": "npm run build:lambda && cd ../../infra/cdk && npx cdk deploy AccessiblePdfProd-Api --require-approval never",
      "deploy:email": "cd ../email-intake && npm run build && cd ../../infra/cdk && npx cdk deploy AccessiblePdfProd-Email --require-approval never",
      "deploy:all": "npm run build:lambda && cd ../email-intake && npm run build && cd ../../infra/cdk && npx cdk deploy --all --require-approval never"
    }

Phase 9: Update Cloudflare Load Balancer
If the API Gateway URL changes (new production stack = new API Gateway), update the aws-primary pool origin in the Cloudflare Load Balancer.
To check, look at the cdk deploy output, which shows the new ApiUrl, and compare it with the origin currently configured in the pool.
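The comparison can also be scripted. A minimal sketch, assuming the outputs were saved with cdk deploy --outputs-file, which writes a JSON object keyed by stack name, and assuming the output key is literally ApiUrl (the real key may carry a generated suffix). The hostnames below are placeholders, and hostOf and originNeedsUpdate are hypothetical helpers:

```typescript
// Assumed file shape: { "<StackName>": { "<OutputKey>": "<value>" } }.
type CdkOutputs = Record<string, Record<string, string>>;

// Extracts the host from a URL or bare hostname without relying on the URL class.
function hostOf(url: string): string {
  const noScheme = url.replace(/^[a-z]+:\/\//i, "");
  const slash = noScheme.indexOf("/");
  return (slash === -1 ? noScheme : noScheme.slice(0, slash)).toLowerCase();
}

function originNeedsUpdate(
  outputs: CdkOutputs,
  stack: string,
  currentOrigin: string
): boolean {
  const stackOutputs = outputs[stack];
  const apiUrl = stackOutputs ? stackOutputs["ApiUrl"] : undefined;
  if (!apiUrl) throw new Error(`No ApiUrl output found for ${stack}`);
  return hostOf(apiUrl) !== hostOf(currentOrigin);
}

// Placeholder hostnames, not real endpoints.
const sampleOutputs: CdkOutputs = {
  "AccessiblePdfProd-Api": {
    ApiUrl: "https://new-id.execute-api.us-east-1.amazonaws.com/prod/",
  },
};
console.log(
  originNeedsUpdate(
    sampleOutputs,
    "AccessiblePdfProd-Api",
    "old-id.execute-api.us-east-1.amazonaws.com"
  )
); // prints true
```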
Execution order
| Step | Action | Risk | Rollback |
|---|---|---|---|
| 1 | Update bin/app.ts to default to production | None (code change only) | Revert file |
| 2 | Deploy production stacks with cdk deploy --all | Medium: new API Gateway URL may differ | Use old staging stacks until LB updated |
| 3 | Update CF LB origin if API Gateway URL changed | Low | Revert LB config |
| 4 | Update SES helpdesk rule with new email Lambda ARN | Low | Update ARN back |
| 5 | Verify everything works end-to-end | n/a | n/a |
| 6 | Delete staging stacks | Low (staging is unused) | Can't easily undo, but not needed |
| 7 | Clean up inline IAM policies | Low | Re-add if needed |
| 8 | Commit deploy scripts | None | n/a |
What NOT to change
- Production compute stack (AccessiblePdfProd-Compute): already running, workers healthy
- Production monitoring stack (AccessiblePdfProd-Monitoring): already running
- Cloudflare Load Balancer: only update origin if API Gateway URL changes
- SES helpdesk rule set: keep as active set, just update Lambda ARN if it changes
- SSM parameters: shared across staging/production, no changes needed
Effort estimate
| Task | Effort |
|---|---|
| Update CDK code (app.ts, email-stack.ts) | 30 min |
| Deploy production stacks | 15 min |
| Update LB + SES rule | 15 min |
| Delete staging stacks | 15 min |
| Verify end-to-end | 30 min |
| Update deploy scripts + commit | 15 min |
| Total | ~2 hours |