Worker ASG Failover — Operational Runbook

The EC2 spot fleet (AccessiblePdfProd-Compute-WorkerAsgASGC59A845D-UUF3it2ENmxm) is the automatic failover for the self-hosted Node worker on 10.1.1.4. Steady state is 0 instances; the fleet only launches when 10.1.1.4 falls behind on SQS.

How scaling works

Scaling is defined in infra/cdk/lib/stacks/compute-stack.ts. The ASG scales on the pipeline queue’s ApproximateAgeOfOldestMessage metric (Maximum statistic, 60s period):

Oldest-message age	Action
≤ 60s	scale down by `maxWorkerInstances`
≥ 300s (5 min)	+1
≥ 900s (15 min)	+2
≥ 1800s (30 min)	+`maxWorkerInstances`

Cooldown: 300s. Min=0, Max=config.maxWorkerInstances (currently 4).
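
As a rough illustration of the thresholds above, the sketch below maps an oldest-message age to the capacity change from the table. The function name scaling_adjustment is hypothetical, the default of 4 mirrors the current maxWorkerInstances, and treating the band between 60s and 300s as "no change" is an assumption based on the table having no row for it. The real policy is evaluated by CloudWatch alarms, not by a script.

```shell
# Hypothetical helper, for illustration only: the authoritative definition
# lives in infra/cdk/lib/stacks/compute-stack.ts.
# Usage: scaling_adjustment <age_seconds> [max_worker_instances]
scaling_adjustment() {
  age="$1"
  max="${2:-4}"   # assumed default: current maxWorkerInstances
  if [ "$age" -ge 1800 ]; then
    echo "+${max}"
  elif [ "$age" -ge 900 ]; then
    echo "+2"
  elif [ "$age" -ge 300 ]; then
    echo "+1"
  elif [ "$age" -le 60 ]; then
    echo "-${max}"
  else
    echo "0"      # assumption: no action between the lower and upper thresholds
  fi
}

scaling_adjustment 45     # prints -4
scaling_adjustment 120    # prints 0
scaling_adjustment 300    # prints +1
scaling_adjustment 900    # prints +2
scaling_adjustment 1800   # prints +4
```

Whatever the adjustment, the ASG keeps the resulting capacity within Min=0 and Max, so scaling down by `maxWorkerInstances` simply returns the fleet to 0.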

When 10.1.1.4 is healthy, it drains the queue fast enough that the oldest-message age stays near zero, so no EC2 instances launch. If 10.1.1.4 goes offline, the age climbs past 5 min and the fleet spins up. When 10.1.1.4 recovers, it and the fleet drain the backlog together, the age drops back under 60s, and the EC2 instances terminate.

Post-deploy steps after the queue-age scaling switch

The previous queue-depth scaling policy required manually suspending the ASG Launch and AlarmNotification processes (done 2026-04-10) to keep the fleet from spinning up on every queued message. After deploying the queue-age policy, those suspensions must be lifted or the new failover won’t work.

Deploy the CDK change:

cd infra/cdk && npx cdk deploy AccessiblePdfProd-Compute

Resume the suspended ASG processes:

AWS_REGION=us-east-1 aws autoscaling resume-processes \
  --auto-scaling-group-name AccessiblePdfProd-Compute-WorkerAsgASGC59A845D-UUF3it2ENmxm \
  --scaling-processes Launch AlarmNotification

Reset desired capacity to 0. It is currently stuck at 1 from the suspension period, because with Launch suspended the ASG could never satisfy desired=1:

AWS_REGION=us-east-1 aws autoscaling set-desired-capacity \
  --auto-scaling-group-name AccessiblePdfProd-Compute-WorkerAsgASGC59A845D-UUF3it2ENmxm \
  --desired-capacity 0

Verify:

AWS_REGION=us-east-1 aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names AccessiblePdfProd-Compute-WorkerAsgASGC59A845D-UUF3it2ENmxm \
  --query 'AutoScalingGroups[0].{Desired:DesiredCapacity,Suspended:SuspendedProcesses,Instances:Instances[*].InstanceId}'

Expected: Desired: 0, Suspended: [], Instances: [].
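
For a scripted version of this check, a sketch along the following lines might work. The helper names asg_state_ok and verify_asg are hypothetical. The sketch assumes the AWS CLI is installed and configured, and it uses the JMESPath length() function to reduce the ASG state to three numbers (desired capacity, suspended-process count, instance count).

```shell
# Hypothetical helper: succeeds only when all three values are 0.
# Usage: asg_state_ok <desired> <suspended_count> <instance_count>
asg_state_ok() {
  [ "$1" = "0" ] && [ "$2" = "0" ] && [ "$3" = "0" ]
}

# Hypothetical helper: fetch the ASG state and compare it with the expected values.
# Usage: verify_asg <auto-scaling-group-name>
verify_asg() {
  state=$(aws autoscaling describe-auto-scaling-groups \
    --auto-scaling-group-names "$1" \
    --query 'AutoScalingGroups[0].[DesiredCapacity,length(SuspendedProcesses),length(Instances)]' \
    --output text) || return 1
  # Text output is one tab-separated line; split it into $1 $2 $3.
  set -- $state
  if asg_state_ok "$1" "$2" "$3"; then
    echo "OK: desired=0, no suspended processes, no instances"
  else
    echo "NOT READY: desired=$1 suspended=$2 instances=$3"
    return 1
  fi
}

# Example invocation:
#   AWS_REGION=us-east-1 verify_asg AccessiblePdfProd-Compute-WorkerAsgASGC59A845D-UUF3it2ENmxm
```

A non-zero exit status means at least one of the three values differs from the expected state, which makes the check usable in a deploy script.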

Manual override (still available)

reference_asg_scaling.md documents the manual scale-up/down commands. With queue-age scaling in place those should rarely be needed, but they remain useful for forced scale-up during incident response.

Failover smoke test

To verify failover end-to-end (do this in a quiet window):

Stop the SQS consumer on 10.1.1.4 (do NOT stop the API containers — only the batch consumer) so messages start aging.
Enqueue a test conversion or wait for organic queue activity.
Watch the CloudWatch metric ApproximateAgeOfOldestMessage (namespace AWS/SQS) for the pipeline queue. At ~5 min, the ScaleOnQueueAge upper alarm should fire and one instance should launch.
Restart the consumer on 10.1.1.4. The backlog drains, the age drops under 60s, and the EC2 instance terminates after the next cooldown.
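
To follow the metric from a terminal during the test instead of the CloudWatch console, something like the sketch below could be used. The helper name oldest_message_age and the PIPELINE_QUEUE_NAME variable are hypothetical; the actual queue name is not stated in this runbook and must be supplied. The sketch assumes a date command that supports +%s and relies on the AWS CLI accepting Unix epoch timestamps.

```shell
# Hypothetical helper: print the most recent Maximum datapoint (in seconds)
# of ApproximateAgeOfOldestMessage over the last 10 minutes.
# Prints "None" if CloudWatch has no datapoints in that window.
# Usage: oldest_message_age <queue-name>
oldest_message_age() {
  queue_name="$1"
  now=$(date -u +%s)
  start=$((now - 600))
  aws cloudwatch get-metric-statistics \
    --namespace AWS/SQS \
    --metric-name ApproximateAgeOfOldestMessage \
    --dimensions Name=QueueName,Value="$queue_name" \
    --start-time "$start" \
    --end-time "$now" \
    --period 60 \
    --statistics Maximum \
    --query 'sort_by(Datapoints, &Timestamp)[-1].Maximum' \
    --output text
}

# Example: poll once a minute (PIPELINE_QUEUE_NAME is a placeholder):
#   while sleep 60; do AWS_REGION=us-east-1 oldest_message_age "$PIPELINE_QUEUE_NAME"; done
```

A value that climbs past 300 should coincide with the first instance launching, and a value back under 60 with the scale-down.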