Scaling and Cost Analysis: Smart Cascade

What Drives Cost

The smart cascade converter processes each page through tiered models, escalating only when quality is insufficient. Cost is dominated by LLM API charges; compute infrastructure is negligible.

Cost breakdown from a real 6-page chemistry document

| Component | Cost | % of API total |
| --- | --- | --- |
| Smart Cascade (API calls) | $0.1313 | 98.1% |
| Visual layout scoring (Gemini Flash) | $0.0026 | 1.9% |
| EC2 spot compute | ~$0.005 | n/a |
| S3/DynamoDB/SQS | ~$0.0001 | n/a |
| Total (API charges) | $0.1339 | 100% |

API calls account for >98% of cost. Infrastructure is effectively free.

Per-tier API costs

The cascade tries cheap models first and only escalates when quality is below threshold. Each escalation multiplies the per-page cost:

| Tier | Model | Cost/page | When used |
| --- | --- | --- | --- |
| 0 (text) | gemini-3-flash-preview | ~$0.001 | Pure text pages |
| 0 (vision) | gemini-3-flash-preview | ~$0.005 | Complex pages, first attempt |
| 1 (vision) | gemini-2.5-pro | ~$0.02 | Flash quality insufficient |
| 2 (agentic) | claude-sonnet-4 | ~$0.15 | Pro quality insufficient, iterates with screenshots |

A page that passes at tier 0 costs $0.001-0.005. A page that escalates all the way to agentic vision costs ~$0.15 plus the failed attempts at lower tiers.
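The escalation cost model can be sketched as a small helper. This is illustrative only: the tier list and function name are assumptions, with per-page prices taken from the table above (the vision path; the ~$0.001 pure-text path is omitted).

```typescript
// Illustrative cost accounting for the vision cascade: a page billed
// at tier N also pays for every failed attempt at lower tiers.
type Tier = { name: string; costPerPage: number };

const TIERS: Tier[] = [
  { name: "flash-vision", costPerPage: 0.005 },   // tier 0 (vision)
  { name: "pro-vision", costPerPage: 0.02 },      // tier 1
  { name: "agentic-vision", costPerPage: 0.15 },  // tier 2
];

// Total spend for a page that finally passes at `passTier` (0-indexed).
function pageCost(passTier: number): number {
  return TIERS.slice(0, passTier + 1).reduce((sum, t) => sum + t.costPerPage, 0);
}
```

A page passing at tier 0 costs $0.005; one that climbs all the way to tier 2 costs $0.005 + $0.02 + $0.15 = $0.175, which is where the "$0.15 plus the failed attempts" figure comes from.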

What causes escalation

Pages escalate when they score below the quality threshold (default: 80/100) on structural quality, or below the visual layout threshold (default: 75/100) on visual fidelity. Common escalation triggers:

  • Complex tables with merged cells or nested headers
  • Pages mixing text, images, and equations
  • Dense multi-column layouts
  • Diagrams with embedded text labels

The 6-page chemistry PDF had 4 escalations out of 6 pages, typical for documents with tables and chemical structures.
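The two-threshold escalation rule can be sketched as follows; the interface and function names are hypothetical, with the default thresholds (80 structural, 75 visual) taken from this section.

```typescript
// A page escalates if EITHER score falls below its threshold.
interface PageScores {
  structural: number; // 0-100, structural quality score
  visual: number;     // 0-100, visual layout fidelity score
}

const STRUCTURAL_THRESHOLD = 80; // default quality threshold
const VISUAL_THRESHOLD = 75;     // default visual layout threshold

function shouldEscalate(s: PageScores): boolean {
  return s.structural < STRUCTURAL_THRESHOLD || s.visual < VISUAL_THRESHOLD;
}
```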

Parallel Processing on AWS

Current architecture

SQS Queue --> EC2 Spot Fleet (ASG, 0-4 instances)
Each instance: Docker (Node.js + Puppeteer + Chrome)
Each instance: processes 1 file at a time
Within each file: maxPagesParallel = 5

The smart cascade already processes pages in parallel batches within a single worker. The maxPagesParallel config (default: 5) controls how many pages run concurrently on one instance.
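A minimal sketch of that batched page parallelism, assuming a chunked Promise.all approach (the actual worker may use a pool rather than fixed chunks):

```typescript
// Run at most `batchSize` conversions concurrently, preserving order.
// Mirrors maxPagesParallel (default: 5) from the worker config.
async function mapInBatches<T, R>(
  items: T[],
  batchSize: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    results.push(...(await Promise.all(batch.map(fn))));
  }
  return results;
}
```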

Scaling for throughput

For batch processing hundreds or thousands of files nightly, the bottleneck is wall clock time per file, not cost:

| Files | Sequential (1 worker) | 4 workers | 20 workers | 50 workers |
| --- | --- | --- | --- | --- |
| 100 | ~11 hours | ~2.7 hours | ~33 min | ~13 min |
| 500 | ~54 hours | ~13.5 hours | ~2.7 hours | ~65 min |
| 1,000 | ~4.5 days | ~27 hours | ~5.4 hours | ~2.2 hours |

These estimates assume ~6.5 min average per file. Actual time varies with document complexity: simple text-only files finish in under a minute, while each agentic-vision escalation can add up to ~5 minutes per page on complex documents.
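The table's arithmetic reduces to a one-line helper (this assumes workers stay busy, i.e. perfect load balancing across files):

```typescript
// Wall clock minutes to drain a batch: (files x avg min/file) / workers.
function wallClockMinutes(files: number, workers: number, minPerFile = 6.5): number {
  return (files * minPerFile) / workers;
}
```

For example, 1,000 files on 20 workers gives 325 minutes, ~5.4 hours, matching the table.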

Option A: Scale workers, keep current page parallelism (simplest)

Increase the ASG max from 4 to 20-50 instances. Each worker pulls one SQS message (one file), processes it with maxPagesParallel: 5, writes results, pulls the next message. No code changes required, just CDK config:

Queue depth 0 --> 0 instances (scale to zero)
Queue depth 1+ --> 2 instances
Queue depth 50+ --> 10 instances
Queue depth 200+ --> 20 instances
Queue depth 500+ --> 50 instances

This handles 1,000 files in ~5.4 hours with 20 workers, or ~2.2 hours with 50.
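The queue-depth steps above, expressed as a pure function (in practice this mapping would live in the ASG's step-scaling configuration; the thresholds are the ones listed):

```typescript
// Map SQS queue depth to desired worker count, scaling to zero when idle.
function desiredWorkers(queueDepth: number): number {
  if (queueDepth >= 500) return 50;
  if (queueDepth >= 200) return 20;
  if (queueDepth >= 50) return 10;
  if (queueDepth >= 1) return 2;
  return 0; // empty queue: no instances running
}
```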

Option B: Fan out pages across workers (fastest, more complex)

Split a single file's pages across multiple workers so wall clock time per file equals the time for the slowest single page (~5 min worst case). This requires:

  1. A coordinator Lambda that splits the PDF and enqueues one SQS message per page
  2. Workers process individual pages and write results to DynamoDB
  3. A completion Lambda assembles pages when all are done

This is only worth the complexity if per-file latency matters (e.g., real-time user-facing conversion). For nightly batch processing, Option A is sufficient.
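Step 3's "assemble when all pages are done" check can be sketched with a per-file completion counter. In production this would be an atomic DynamoDB update; an in-memory Map stands in here, and all names are illustrative:

```typescript
// Tracks completed pages per file. The worker that records the final
// page gets `true` back and triggers the assembly step.
const completedPages = new Map<string, number>();

function recordPageDone(fileId: string, totalPages: number): boolean {
  const done = (completedPages.get(fileId) ?? 0) + 1;
  completedPages.set(fileId, done);
  return done === totalPages;
}
```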

Spot instance economics

EC2 spot pricing for the instance types we use:

| Instance | On-demand | Spot (typical) | Savings |
| --- | --- | --- | --- |
| c6g.large (2 vCPU, 4 GB) | $0.068/hr | $0.020/hr | 71% |
| c6a.large (2 vCPU, 4 GB) | $0.077/hr | $0.023/hr | 70% |
| m6g.large (2 vCPU, 8 GB) | $0.077/hr | $0.023/hr | 70% |

At $0.02/hr spot, running 20 instances for 3 hours to process 500 files costs $1.20 in compute. The API charges for those same 500 files (at ~$0.13/file) would be ~$65. Compute is <2% of the total.
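That arithmetic as a small helper, using the $0.02/hr spot rate and ~$0.13/file API charge from this section:

```typescript
// Compute-vs-API cost split for one batch run.
function batchCosts(files: number, workers: number, hours: number) {
  const compute = workers * hours * 0.02; // spot instance-hours
  const api = files * 0.13;               // per-file API charges
  return { compute, api, computeShare: compute / (compute + api) };
}
```

batchCosts(500, 20, 3) gives $1.20 compute against $65 in API charges, a compute share of ~1.8%.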

Off-hours batch scheduling

Spot instance availability and pricing are best during off-peak hours (nights, weekends). Use EventBridge Scheduler or a cron-triggered Lambda to:

  1. Enqueue all pending files to SQS at e.g. 11 PM
  2. ASG scales up automatically based on queue depth
  3. Workers drain the queue, typically finishing by early morning
  4. ASG scales back to zero when queue is empty

This naturally takes advantage of lower spot prices and higher availability during off-peak windows.
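A hedged CDK sketch of step 1, assuming an enqueue Lambda exists; the construct names, asset path, and runtime are all placeholders, not the project's actual stack:

```typescript
import { Stack, StackProps, Duration } from "aws-cdk-lib";
import * as events from "aws-cdk-lib/aws-events";
import * as targets from "aws-cdk-lib/aws-events-targets";
import * as lambda from "aws-cdk-lib/aws-lambda";
import { Construct } from "constructs";

// Fires the enqueue Lambda nightly at 11 PM UTC; the ASG's
// queue-depth scaling handles everything after that.
export class NightlyBatchStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    const enqueueFn = new lambda.Function(this, "EnqueuePendingFiles", {
      runtime: lambda.Runtime.NODEJS_20_X,
      handler: "index.handler",
      code: lambda.Code.fromAsset("lambda/enqueue"), // placeholder path
      timeout: Duration.minutes(5),
    });

    new events.Rule(this, "NightlyBatchKickoff", {
      schedule: events.Schedule.cron({ minute: "0", hour: "23" }),
      targets: [new targets.LambdaFunction(enqueueFn)],
    });
  }
}
```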

Cost Mitigation Strategies

1. Minimize escalations (highest impact)

Every escalation multiplies cost. A page that passes at tier 0 costs $0.005; one that escalates to agentic vision costs $0.17+. Strategies:

  • Tune the quality threshold. The default (80) works well for most documents. Lowering it to 70 would reduce escalations but accept lower quality output. Monitor the benchmark results to find the right balance.

  • Improve tier 0 prompts. Better prompts for gemini-flash can reduce the number of pages that fail quality checks. The conversion prompt in smart-cascade-converter.ts is the single highest-leverage optimization target.

  • Pre-classification is already active. The smart cascade runs detectComplexity on every page before conversion, routing pure-text pages to the cheap text-to-HTML path (~$0.001/page) and only sending complex pages (images, tables, equations) into the vision cascade. This is already saving significant cost; no changes needed here.

2. Cache conversions (medium impact)

The batch worker already caches conversion results in S3 keyed by file hash + backend. If the same PDF is submitted twice, the cached result is returned without re-running the cascade. This is automatic and free.

For incremental updates (e.g., a document that changed one page), a future optimization could cache per-page results and only reconvert changed pages.
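The hash-keyed cache lookup can be sketched as follows; the key format shown is an assumption for illustration, not the worker's actual S3 layout:

```typescript
import { createHash } from "node:crypto";

// Deterministic cache key: same file bytes + same backend -> same key,
// so a resubmitted PDF hits the cached result instead of re-converting.
function cacheKey(fileBytes: Buffer, backend: string): string {
  const hash = createHash("sha256").update(fileBytes).digest("hex");
  return `conversions/${hash}/${backend}.json`;
}
```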

3. Reduce visual layout scoring calls (low-medium impact)

Visual layout scoring adds ~$0.0004/page (Gemini Flash vision call). At scale:

| Volume | VL scoring cost |
| --- | --- |
| 1,000 files (5K pages) | ~$2.00 |
| 10,000 files (50K pages) | ~$20.00 |

This is small relative to conversion costs, but can be optimized:

  • Only score pages that passed structural quality (skip scoring for pages that will escalate anyway based on structural score alone)
  • Disable visual layout scoring for text-only pages (they rarely have layout issues)
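Both skip rules combine into a single predicate (types and names are illustrative; the 80 threshold is the structural default from this doc):

```typescript
interface PageInfo {
  isTextOnly: boolean;
  structuralScore: number; // 0-100
}

const STRUCTURAL_THRESHOLD = 80;

// Run the Gemini Flash visual-layout call only when it can change the
// outcome: not for text-only pages, and not for pages that will
// escalate anyway on structural score alone.
function shouldRunVisualScoring(page: PageInfo): boolean {
  if (page.isTextOnly) return false;
  return page.structuralScore >= STRUCTURAL_THRESHOLD;
}
```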

4. Use Gemini iterative refinement before Claude escalation (already implemented)

The smart cascade now tries iterative refinement with the same Gemini model before escalating to the next tier. This catches cases where a single Gemini pass produces mediocre results but a second pass (informed by a screenshot comparison) fixes the issues, avoiding the 7.5x cost jump to Claude agentic vision.

5. Negotiate volume pricing

At scale, API costs become significant:

| Monthly volume | Estimated API cost |
| --- | --- |
| 1,000 files | ~$130 |
| 5,000 files | ~$650 |
| 10,000 files | ~$1,300 |
| 50,000 files | ~$6,500 |

Both Anthropic and Google offer volume discounts and committed-use pricing. At 10K+ files/month, it's worth negotiating directly. Anthropic's batch API also offers a 50% discount for non-real-time workloads; the nightly batch processing pattern is a natural fit.

6. Model cost reductions over time

LLM API prices have dropped consistently. Gemini Flash pricing has already fallen from the 1.5 era to the 3.0 preview. As newer models release:

  • Cheaper models may achieve the same quality at lower tiers
  • Fewer escalations as base model quality improves
  • The tiered cascade architecture automatically benefits; just update the model names and pricing in the tier configuration

Summary

| Factor | Impact | Action |
| --- | --- | --- |
| API calls (LLM) | 98% of cost | Minimize escalations, negotiate volume pricing |
| EC2 spot compute | <2% of cost | Scale freely; it's nearly free |
| Escalation rate | 3-7x cost multiplier per escalation | Tune prompts and thresholds |
| Batch API discount | 50% reduction | Use Anthropic batch API for nightly runs |
| Parallel scaling | Time reduction only, same cost | Scale ASG to meet nightly time window |