Worker ASG Failover β Operational Runbook
The EC2 spot fleet (AccessiblePdfProd-Compute-WorkerAsgASGC59A845D-UUF3it2ENmxm) is
the automatic failover for the self-hosted Node worker on 10.1.1.4. Steady state is
0 instances; the fleet only launches when 10.1.1.4 falls behind on SQS.
How scaling works
Defined in infra/cdk/lib/stacks/compute-stack.ts β scales on the pipeline queueβs
ApproximateAgeOfOldestMessage (Maximum, 60s period):
| Oldest-message age | Action |
|---|---|
| β€ 60s | scale down by maxWorkerInstances |
| β₯ 300s (5 min) | +1 |
| β₯ 900s (15 min) | +2 |
| β₯ 1800s (30 min) | +maxWorkerInstances |
Cooldown: 300s. Min=0, Max=config.maxWorkerInstances (currently 4).
When 10.1.1.4 is healthy it drains the queue fast enough that age stays near zero, so no EC2 launches. If 10.1.1.4 goes offline, age climbs past 5 min and the fleet spins up. When 10.1.1.4 recovers, both drain the backlog, age drops back under 60s, and EC2 terminates.
Post-deploy steps after the queue-age scaling switch
The previous queue-depth scaling policy required manually suspending the ASG
Launch and AlarmNotification processes (done 2026-04-10) to keep the fleet
from spinning up on every queued message. After deploying the queue-age policy,
those suspensions must be lifted or the new failover wonβt work.
-
Deploy the CDK change:
Terminal window cd infra/cdk && npx cdk deploy AccessiblePdfProd-Compute -
Resume the suspended ASG processes:
Terminal window AWS_REGION=us-east-1 aws autoscaling resume-processes \--auto-scaling-group-name AccessiblePdfProd-Compute-WorkerAsgASGC59A845D-UUF3it2ENmxm \--scaling-processes Launch AlarmNotification -
Reset desired capacity to 0 (itβs currently stuck at 1 from the suspension period β without
Launchenabled the ASG could never satisfy desired=1):Terminal window AWS_REGION=us-east-1 aws autoscaling set-desired-capacity \--auto-scaling-group-name AccessiblePdfProd-Compute-WorkerAsgASGC59A845D-UUF3it2ENmxm \--desired-capacity 0 -
Verify:
Terminal window AWS_REGION=us-east-1 aws autoscaling describe-auto-scaling-groups \--auto-scaling-group-names AccessiblePdfProd-Compute-WorkerAsgASGC59A845D-UUF3it2ENmxm \--query 'AutoScalingGroups[0].{Desired:DesiredCapacity,Suspended:SuspendedProcesses,Instances:Instances[*].InstanceId}'Expected:
Desired: 0,Suspended: [],Instances: [].
Manual override (still available)
reference_asg_scaling.md documents the manual scale-up/down commands. With
queue-age scaling in place those should rarely be needed, but they remain useful
for forced scale-up during incident response.
Failover smoke test
To verify failover end-to-end (do this in a quiet window):
- Stop the SQS consumer on 10.1.1.4 (do NOT stop the API containers β only the batch consumer) so messages start aging.
- Enqueue a test conversion or wait for organic queue activity.
- Watch CloudWatch metric
AWS/SQS / ApproximateAgeOfOldestMessageon the pipeline queue. At ~5 min, theScaleOnQueueAgeupper alarm should fire and one instance launch. - Restart the consumer on 10.1.1.4. Backlog drains, age drops under 60s, EC2 terminates after the next cooldown.