Skip to content

Worker ASG Failover β€” Operational Runbook

The EC2 spot fleet (AccessiblePdfProd-Compute-WorkerAsgASGC59A845D-UUF3it2ENmxm) is the automatic failover for the self-hosted Node worker on 10.1.1.4. Steady state is 0 instances; the fleet only launches when 10.1.1.4 falls behind on SQS.

How scaling works

Defined in infra/cdk/lib/stacks/compute-stack.ts β€” scales on the pipeline queue’s ApproximateAgeOfOldestMessage (Maximum, 60s period):

Oldest-message ageAction
≀ 60sscale down by maxWorkerInstances
β‰₯ 300s (5 min)+1
β‰₯ 900s (15 min)+2
β‰₯ 1800s (30 min)+maxWorkerInstances

Cooldown: 300s. Min=0, Max=config.maxWorkerInstances (currently 4).

When 10.1.1.4 is healthy it drains the queue fast enough that age stays near zero, so no EC2 launches. If 10.1.1.4 goes offline, age climbs past 5 min and the fleet spins up. When 10.1.1.4 recovers, both drain the backlog, age drops back under 60s, and EC2 terminates.

Post-deploy steps after the queue-age scaling switch

The previous queue-depth scaling policy required manually suspending the ASG Launch and AlarmNotification processes (done 2026-04-10) to keep the fleet from spinning up on every queued message. After deploying the queue-age policy, those suspensions must be lifted or the new failover won’t work.

  1. Deploy the CDK change:

    Terminal window
    cd infra/cdk && npx cdk deploy AccessiblePdfProd-Compute
  2. Resume the suspended ASG processes:

    Terminal window
    AWS_REGION=us-east-1 aws autoscaling resume-processes \
    --auto-scaling-group-name AccessiblePdfProd-Compute-WorkerAsgASGC59A845D-UUF3it2ENmxm \
    --scaling-processes Launch AlarmNotification
  3. Reset desired capacity to 0 (it’s currently stuck at 1 from the suspension period β€” without Launch enabled the ASG could never satisfy desired=1):

    Terminal window
    AWS_REGION=us-east-1 aws autoscaling set-desired-capacity \
    --auto-scaling-group-name AccessiblePdfProd-Compute-WorkerAsgASGC59A845D-UUF3it2ENmxm \
    --desired-capacity 0
  4. Verify:

    Terminal window
    AWS_REGION=us-east-1 aws autoscaling describe-auto-scaling-groups \
    --auto-scaling-group-names AccessiblePdfProd-Compute-WorkerAsgASGC59A845D-UUF3it2ENmxm \
    --query 'AutoScalingGroups[0].{Desired:DesiredCapacity,Suspended:SuspendedProcesses,Instances:Instances[*].InstanceId}'

    Expected: Desired: 0, Suspended: [], Instances: [].

Manual override (still available)

reference_asg_scaling.md documents the manual scale-up/down commands. With queue-age scaling in place those should rarely be needed, but they remain useful for forced scale-up during incident response.

Failover smoke test

To verify failover end-to-end (do this in a quiet window):

  1. Stop the SQS consumer on 10.1.1.4 (do NOT stop the API containers β€” only the batch consumer) so messages start aging.
  2. Enqueue a test conversion or wait for organic queue activity.
  3. Watch CloudWatch metric AWS/SQS / ApproximateAgeOfOldestMessage on the pipeline queue. At ~5 min, the ScaleOnQueueAge upper alarm should fire and one instance launch.
  4. Restart the consumer on 10.1.1.4. Backlog drains, age drops under 60s, EC2 terminates after the next cooldown.