Scaling and Cost Analysis: Smart Cascade
What Drives Cost
The smart cascade converter processes each page through tiered models, escalating only when quality is insufficient. Cost is dominated by LLM API charges; compute infrastructure is negligible.
Cost breakdown from a real 6-page chemistry document
| Component | Cost | % of Total |
|---|---|---|
| Smart Cascade (API calls) | $0.1313 | 98.1% |
| Visual Layout Scoring (Gemini Flash) | $0.0026 | 1.9% |
| EC2 spot compute | ~$0.005 | <0.1% |
| S3/DynamoDB/SQS | ~$0.0001 | <0.1% |
| Total | $0.1339 | 100% |
API calls account for >98% of cost. Infrastructure is effectively free.
Per-tier API costs
The cascade tries cheap models first and only escalates when quality is below threshold. Each escalation multiplies the per-page cost:
| Tier | Model | Cost/page | When used |
|---|---|---|---|
| 0 (text) | gemini-3-flash-preview | ~$0.001 | Pure text pages |
| 0 (vision) | gemini-3-flash-preview | ~$0.005 | Complex pages, first attempt |
| 1 (vision) | gemini-2.5-pro | ~$0.02 | Flash quality insufficient |
| 2 (agentic) | claude-sonnet-4 | ~$0.15 | Pro quality insufficient, iterates with screenshots |
A page that passes at tier 0 costs $0.001-0.005. A page that escalates all the way to agentic vision costs ~$0.15 plus the failed attempts at lower tiers.
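The cumulative cost of an escalation path can be sketched as follows. The tier prices come from the table above; the function name and structure are illustrative, not the converter's actual code:

```typescript
// Per-page API cost at each tier, taken from the table above (USD).
// Tier 0 here is the vision path; the pure-text path (~$0.001)
// bypasses the cascade entirely.
const TIER_COST = [0.005, 0.02, 0.15];

// A page that settles at `finalTier` also pays for every failed
// attempt at the tiers below it.
function pageCost(finalTier: number): number {
  return TIER_COST.slice(0, finalTier + 1).reduce((sum, c) => sum + c, 0);
}
```

`pageCost(2)` comes to ~$0.175, which is the "$0.15 plus the failed attempts at lower tiers" figure.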
What causes escalation
Pages escalate when they score below the quality threshold (default: 80/100) on structural quality, or below the visual layout threshold (default: 75/100) on visual fidelity. Common escalation triggers:
- Complex tables with merged cells or nested headers
- Pages mixing text, images, and equations
- Dense multi-column layouts
- Diagrams with embedded text labels
The 6-page chemistry PDF had 4 escalations out of 6 pages, which is typical for documents with tables and chemical structures.
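The escalation rule described above (structural threshold 80, visual threshold 75) can be sketched as a simple predicate. Names and shapes here are illustrative, not the converter's actual API:

```typescript
interface PageScores {
  structural: number; // 0-100, structural quality score
  visual: number;     // 0-100, visual layout fidelity score
}

const STRUCTURAL_THRESHOLD = 80; // default quality threshold
const VISUAL_THRESHOLD = 75;     // default visual layout threshold

// A page escalates to the next tier if either score falls below
// its threshold.
function shouldEscalate(s: PageScores): boolean {
  return s.structural < STRUCTURAL_THRESHOLD || s.visual < VISUAL_THRESHOLD;
}
```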
Parallel Processing on AWS
Current architecture
```
SQS Queue --> EC2 Spot Fleet (ASG, 0-4 instances)
  Each instance: Docker (Node.js + Puppeteer + Chrome)
  Each instance: processes 1 file at a time
  Within each file: maxPagesParallel = 5
```

The smart cascade already processes pages in parallel batches within a single worker. The maxPagesParallel config (default: 5) controls how many pages run concurrently on one instance.
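The batching behavior can be sketched as below. This is a minimal illustration of batch-of-5 parallelism using `Promise.all`, not the converter's actual implementation:

```typescript
// Process a file's pages in parallel batches of `maxPagesParallel`
// (default 5), mirroring how one worker handles a single file.
async function convertPages<T>(
  pages: number[],
  convert: (page: number) => Promise<T>,
  maxPagesParallel = 5,
): Promise<T[]> {
  const results: T[] = [];
  for (let i = 0; i < pages.length; i += maxPagesParallel) {
    const batch = pages.slice(i, i + maxPagesParallel);
    // All pages in a batch run concurrently; batches run sequentially.
    results.push(...(await Promise.all(batch.map(convert))));
  }
  return results;
}
```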
Scaling for throughput
For batch processing hundreds or thousands of files nightly, the bottleneck is wall clock time per file, not cost:
| Files | Sequential (1 worker) | 4 workers | 20 workers | 50 workers |
|---|---|---|---|---|
| 100 | ~11 hours | ~2.7 hours | 33 min | 13 min |
| 500 | ~54 hours | ~13.5 hours | 2.7 hours | 65 min |
| 1,000 | ~4.5 days | ~27 hours | 5.4 hours | 2.2 hours |
These estimates assume ~6.5 min average per file. Actual time varies widely with document complexity: simple text-only files finish in under a minute, while complex documents with multiple agentic-vision escalations take far longer than the average.
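Since files are independent, drain time scales inversely with worker count. The table's numbers follow from this one-line estimate (assuming the ~6.5 min average):

```typescript
// Wall-clock estimate (hours) for draining a batch of files across
// N workers, at an average of `minPerFile` minutes per file.
function batchHours(files: number, workers: number, minPerFile = 6.5): number {
  return (files * minPerFile) / workers / 60;
}
```

For example, `batchHours(1000, 20)` gives ~5.4 hours, matching the table.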
Recommended scaling approach
Option A: Scale workers, keep current page parallelism (simplest)
Increase the ASG max from 4 to 20-50 instances. Each worker pulls one SQS
message (one file), processes it with maxPagesParallel: 5, writes results,
pulls the next message. No code changes required; just CDK config:
```
Queue depth 0    --> 0 instances (scale to zero)
Queue depth 1+   --> 2 instances
Queue depth 50+  --> 10 instances
Queue depth 200+ --> 20 instances
Queue depth 500+ --> 50 instances
```

This handles 1,000 files in ~5.4 hours with 20 workers, or ~2.2 hours with 50, per the throughput table above.
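The step-scaling policy maps queue depth to a desired instance count. A sketch with the thresholds above (the real policy would live in CDK/ASG configuration):

```typescript
// Map SQS queue depth to a desired ASG instance count.
// Thresholds mirror the scaling sketch above and are illustrative.
function desiredInstances(queueDepth: number): number {
  if (queueDepth >= 500) return 50;
  if (queueDepth >= 200) return 20;
  if (queueDepth >= 50) return 10;
  if (queueDepth >= 1) return 2;
  return 0; // scale to zero when the queue is empty
}
```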
Option B: Fan out pages across workers (fastest, more complex)
Split a single file's pages across multiple workers so the wall-clock time per file equals the time for the slowest single page (~5 min worst case). This requires:
- A coordinator Lambda that splits the PDF and enqueues one SQS message per page
- Workers process individual pages and write results to DynamoDB
- A completion Lambda assembles pages when all are done
This is only worth the complexity if per-file latency matters (e.g., real-time user-facing conversion). For nightly batch processing, Option A is sufficient.
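The completion step in Option B reduces to a simple check: assemble once every page has a result. In practice the results would come from a DynamoDB query keyed by file ID; here they are passed in directly to keep the sketch self-contained:

```typescript
interface PageResult {
  pageNumber: number;
  html: string;
}

// The completion Lambda assembles the document only when all pages
// have been written; otherwise it returns null and waits.
function assembleIfComplete(
  totalPages: number,
  results: PageResult[],
): string | null {
  if (results.length < totalPages) return null; // still waiting on workers
  return results
    .sort((a, b) => a.pageNumber - b.pageNumber)
    .map((r) => r.html)
    .join("\n");
}
```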
Spot instance economics
EC2 spot pricing for the instance types we use:
| Instance | On-demand | Spot (typical) | Savings |
|---|---|---|---|
| c6g.large (2 vCPU, 4GB) | $0.068/hr | $0.020/hr | 71% |
| c6a.large (2 vCPU, 4GB) | $0.077/hr | $0.023/hr | 70% |
| m6g.large (2 vCPU, 8GB) | $0.077/hr | $0.023/hr | 70% |
At $0.02/hr spot, running 20 instances for 3 hours to process 500 files costs $1.20 in compute. The API charges for those same 500 files (at ~$0.13/file) would be ~$65. Compute is <2% of the total.
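The "<2%" claim follows directly from the rates above; a quick check (rates from the tables above, function illustrative):

```typescript
// Compute cost as a fraction of total cost for a batch run.
function computeShare(
  workers: number,
  hours: number,
  files: number,
  spotPerHr = 0.02, // typical spot rate
  apiPerFile = 0.13, // observed API cost per file
): number {
  const compute = workers * hours * spotPerHr; // $1.20 for 20 x 3h
  const api = files * apiPerFile;              // ~$65 for 500 files
  return compute / (compute + api);
}
```

`computeShare(20, 3, 500)` comes out to roughly 1.8%.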
Off-hours batch scheduling
Spot instance availability and pricing are best during off-peak hours (nights, weekends). Use EventBridge Scheduler or a cron-triggered Lambda to:
- Enqueue all pending files to SQS at e.g. 11 PM
- ASG scales up automatically based on queue depth
- Workers drain the queue, typically finishing by early morning
- ASG scales back to zero when queue is empty
This naturally takes advantage of lower spot prices and higher availability during off-peak windows.
Cost Mitigation Strategies
1. Minimize escalations (highest impact)
Every escalation multiplies cost. A page that passes at tier 0 costs $0.005; one that escalates to agentic vision costs $0.17+. Strategies:
- Tune the quality threshold. The default (80) works well for most documents. Lowering it to 70 would reduce escalations but accept lower-quality output. Monitor the benchmark results to find the right balance.
- Improve tier 0 prompts. Better prompts for gemini-flash can reduce the number of pages that fail quality checks. The conversion prompt in `smart-cascade-converter.ts` is the single highest-leverage optimization target.
- Pre-classification is already active. The smart cascade runs `detectComplexity` on every page before conversion, routing pure-text pages to the cheap text-to-HTML path (~$0.001/page) and only sending complex pages (images, tables, equations) into the vision cascade. This is already saving significant cost; no changes needed here.
2. Cache conversions (medium impact)
The batch worker already caches conversion results in S3 keyed by file hash + backend. If the same PDF is submitted twice, the cached result is returned without re-running the cascade. This is automatic and free.
For incremental updates (e.g., a document that changed one page), a future optimization could cache per-page results and only reconvert changed pages.
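The hash-plus-backend cache key described above can be sketched as follows. The exact key format in the batch worker may differ; this just shows the idea:

```typescript
import { createHash } from "node:crypto";

// S3 cache key for a conversion result: content hash + backend name.
// Same bytes + same backend always hit the same cached object.
function cacheKey(fileBytes: Buffer, backend: string): string {
  const hash = createHash("sha256").update(fileBytes).digest("hex");
  return `conversions/${hash}/${backend}.html`;
}
```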
3. Reduce visual layout scoring calls (low-medium impact)
Visual layout scoring adds ~$0.0004/page (Gemini Flash vision call). At scale:
| Volume | VL scoring cost |
|---|---|
| 1,000 files (5K pages) | ~$2.00 |
| 10,000 files (50K pages) | ~$20.00 |
This is small relative to conversion costs, but can be optimized:
- Only score pages that passed structural quality (skip scoring for pages that will escalate anyway based on structural score alone)
- Disable visual layout scoring for text-only pages (they rarely have layout issues)
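Both optimizations reduce to a cheap pre-check before the Gemini Flash vision call. A sketch (names illustrative):

```typescript
// Decide whether a page warrants a visual layout scoring call.
// Skip text-only pages, and skip pages that will escalate anyway
// on structural score alone.
function needsVisualScoring(
  page: { textOnly: boolean; structuralScore: number },
  structuralThreshold = 80,
): boolean {
  if (page.textOnly) return false; // text pages rarely have layout issues
  if (page.structuralScore < structuralThreshold) return false; // escalates regardless
  return true;
}
```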
4. Use Gemini iterative refinement before Claude escalation (already implemented)
The smart cascade now tries iterative refinement with the same Gemini model before escalating to the next tier. This catches cases where a single Gemini pass produces mediocre results but a second pass (informed by a screenshot comparison) fixes the issues, avoiding the 7.5x cost jump to Claude agentic vision.
5. Negotiate volume pricing
At scale, API costs become significant:
| Monthly volume | Estimated API cost |
|---|---|
| 1,000 files | ~$130 |
| 5,000 files | ~$650 |
| 10,000 files | ~$1,300 |
| 50,000 files | ~$6,500 |
Both Anthropic and Google offer volume discounts and committed-use pricing. At 10K+ files/month, it's worth negotiating directly. Anthropic's batch API also offers a 50% discount for non-real-time workloads; the nightly batch processing pattern is a natural fit.
6. Model cost reductions over time
LLM API prices have dropped consistently. Gemini Flash pricing has already fallen from the 1.5 era to the 3.0 preview. As newer models release:
- Cheaper models may achieve the same quality at lower tiers
- Fewer escalations as base model quality improves
- The tiered cascade architecture automatically benefits; just update the model names and pricing in the tier configuration
Summary
| Factor | Impact | Action |
|---|---|---|
| API calls (LLM) | 98% of cost | Minimize escalations, negotiate volume pricing |
| EC2 spot compute | <2% of cost | Scale freely; it's nearly free |
| Escalation rate | 4-7.5x cost multiplier per escalation step | Tune prompts and thresholds |
| Batch API discount | 50% reduction | Use Anthropic batch API for nightly runs |
| Parallel scaling | Time reduction only, same cost | Scale ASG to meet nightly time window |