Conversion Pipeline: Upload to Final HTML
This document describes the complete flow from when a file is uploaded until the final accessible HTML is saved, including how the system decides which processing path to use for each file or page, and every AI service involved.
Table of Contents
- High-Level Architecture
- Upload and Ingestion
- Pre-Conversion Checks
- Content Classification
- Routing Decision Tree
- Processing Pipelines
- Post-Processing Pipeline
- AI Services Reference
- Key Files Reference
1. High-Level Architecture
The system runs on two platforms:
- Cloudflare Worker (workers/api/): handles auth, credential validation, preflight analysis, routing decisions, and lightweight conversions (Mathpix, Marker). Has a 10-minute timeout.
- Node.js Server (larry@10.1.1.4): handles Puppeteer rendering, the chunk scheduler, SSE streaming, axe-core audits, and all AI-heavy conversions. Two instances (api-node-1, api-node-2) behind Traefik. No time limit.
The CF Worker proxies heavy-computation requests to Node.js via middleware/node-proxy.ts when browser-based operations (screenshot rendering, axe audit) are needed.
Storage:
| System | Purpose |
|---|---|
| Cloudflare R2 / S3 | PDF originals, chunk HTML fragments, final HTML output |
| Supabase PostgreSQL | files (file metadata), large_conversion_jobs, chunk_jobs, profiles, credits, cost ledger |
| Cloudflare KV (KV_SESSIONS) | Session caching, rate limits, file settings, share tokens, tenant config |
| Supabase Auth | Session tokens and user authentication |
2. Upload and Ingestion
Source: workers/api/src/routes/files.ts
Entry Points
| Endpoint | Purpose |
|---|---|
| POST /api/files/upload | Allocates a file ID and metadata record |
| PUT /api/files/:fileId/upload-data | Receives the actual file bytes |
| POST /api/files/from-url | Fetches a remote PDF by URL (SSRF-protected, PDF only) |
Accepted File Types
- PDF: application/pdf
- Images: image/png, image/jpeg, image/webp, image/gif, image/tiff
- DOCX: application/vnd.openxmlformats-officedocument.wordprocessingml.document
- Size limits: 10 MB standard, 200 MB for async large-PDF pipeline
What Happens
- POST /api/files/upload: validates MIME type, generates a UUID fileId, creates an UploadedFile metadata record with status: 'uploading', upserts it to the Supabase files table.
- PUT /api/files/:fileId/upload-data: reads raw bytes, writes to R2 at users/{userId}/uploads/{fileId}/original/{filename}, sets status: 'uploaded'.
- POST /api/files/from-url: validates the URL with validateFetchUrl() (SSRF protection), fetches with a 30-second timeout, validates that Content-Type is PDF, stores identically to direct upload.
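The first two steps can be sketched as below. This is a minimal illustration, not the code in files.ts: the helper names isAcceptedMimeType and buildOriginalKey are hypothetical, while the MIME list and the R2 key layout are taken from the text above.

```typescript
// Hypothetical helpers; the names are illustrative and not taken from files.ts.
const ACCEPTED_MIME_TYPES = new Set<string>([
  "application/pdf",
  "image/png",
  "image/jpeg",
  "image/webp",
  "image/gif",
  "image/tiff",
  "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
]);

function isAcceptedMimeType(mimeType: string): boolean {
  // Drop parameters such as "; charset=binary" and normalise case before matching.
  const base = (mimeType.split(";")[0] ?? "").trim().toLowerCase();
  return ACCEPTED_MIME_TYPES.has(base);
}

function buildOriginalKey(userId: string, fileId: string, filename: string): string {
  // Key layout as documented: users/{userId}/uploads/{fileId}/original/{filename}
  return `users/${userId}/uploads/${fileId}/original/${filename}`;
}
```

A real handler would also sanitise the filename before using it in a storage key; that step is omitted here.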
3. Pre-Conversion Checks
Source: workers/api/src/routes/convert.ts (lines 238-444)
Entry point: POST /api/convert/:fileId
Before any conversion begins, the system runs these checks in order:
| Check | What It Does | Failure Response |
|---|---|---|
| Credential validation | Verifies required API keys exist for the requested parser | HTTP 500 |
| Anthropic key pre-flight | Makes a lightweight count_tokens API call to verify the key is actually valid (not just present) | HTTP 401 with clear message |
| PDF page counting | Loads PDF from R2, counts pages to estimate credits needed | None |
| Credit check | Verifies user has enough credits (1 per page) via checkCredits() | HTTP 402 |
| Spend limit check | Enforces daily/monthly page limits via checkSpendLimits() | HTTP 429 |
| Dollar budget check | Estimates cost at ~$0.01/page and checks against dollar budget via checkDollarBudget() | HTTP 429 |
| PDF pre-flight | Inspects PDF structure for blockers via runPreflight() | HTTP 422 with blocker list |
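The "first failure wins" ordering of this table can be sketched as follows. It is a simplified, synchronous illustration: the PreCheckContext shape and runPreChecks are hypothetical, and the real checks (checkCredits(), checkSpendLimits(), checkDollarBudget(), runPreflight()) are assumed to have already produced the values in the context object. Status codes and the ~$0.01/page estimate come from the table.

```typescript
// Hypothetical context: values the real async checks are assumed to have gathered.
interface PreCheckContext {
  hasRequiredKeys: boolean;
  anthropicKeyValid: boolean;
  pageCount: number;
  creditsAvailable: number;
  pagesLeftInSpendLimit: number;
  dollarBudgetRemaining: number;
  preflightBlockers: string[];
}

interface CheckFailure {
  status: number;
  reason: string;
}

type Check = (ctx: PreCheckContext) => CheckFailure | null;

const ESTIMATED_COST_PER_PAGE_USD = 0.01;

// Order mirrors the table; each check returns null on success.
const checks: Check[] = [
  (c) => (c.hasRequiredKeys ? null : { status: 500, reason: "missing API credentials" }),
  (c) => (c.anthropicKeyValid ? null : { status: 401, reason: "invalid Anthropic key" }),
  (c) => (c.creditsAvailable >= c.pageCount ? null : { status: 402, reason: "insufficient credits" }),
  (c) => (c.pagesLeftInSpendLimit >= c.pageCount ? null : { status: 429, reason: "spend limit reached" }),
  (c) =>
    c.pageCount * ESTIMATED_COST_PER_PAGE_USD <= c.dollarBudgetRemaining
      ? null
      : { status: 429, reason: "dollar budget exceeded" },
  (c) =>
    c.preflightBlockers.length === 0
      ? null
      : { status: 422, reason: `blocked: ${c.preflightBlockers.join(", ")}` },
];

function runPreChecks(ctx: PreCheckContext): CheckFailure | null {
  for (const check of checks) {
    const failure = check(ctx);
    if (failure) return failure; // stop at the first failing check
  }
  return null;
}
```

Because the loop short-circuits, a file that is both short on credits and blocked by pre-flight reports HTTP 402, not 422.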
Pre-flight Blockers
runPreflight() (services/pdf-preflight.ts) catches:
- Encrypted/password-protected PDFs
- Image-only pages (>50% of pages are pure images; not remediable)
- Embedded audio/video
- Corrupt/unparseable files
- Embedded JavaScript
4. Content Classification
4a. Document-Level Fast Scan
Source: services/pdf-complexity-detector.ts → detectComplexContent()
Before entering the chunked pipeline, the system performs a zero-cost structural scan of the PDF using the raw operator stream (via unpdf/pdfjs). No AI calls are made.
What it detects:
| Feature | How It's Detected |
|---|---|
| Images | paintImageXObject, paintInlineImageXObject, paintImageMaskXObject PDF operators |
| Tables/figures | ≥10 path-draw operations (rectangle, constructPath, moveTo, lineTo) combined with path-paint operations |
| Math fonts | Font names containing: cmsy, cmmi, cmex, symbol, mathematicalpi, stix, cambria math, asana math, xits math, latin modern math |
| Math (image-mask heuristic) | ≥15 paintImageMaskXObject ops/page + ≥4 distinct rendered text heights (indicates sub/superscripts) |
Derived flags:
- isPureText: no images, no tables, no math fonts
- hasMathFonts: math fonts detected on any page
- isComplex: images, tables, or math present
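A minimal sketch of the font-name match and the derived flags, assuming a hypothetical PageScan shape in which image and table detection has already run. The marker list is copied from the table; the image-mask math heuristic is left out for brevity, so this is narrower than what detectComplexContent() is described as doing.

```typescript
// Substrings from the "Math fonts" row; matched case-insensitively.
const MATH_FONT_MARKERS = [
  "cmsy", "cmmi", "cmex", "symbol", "mathematicalpi", "stix",
  "cambria math", "asana math", "xits math", "latin modern math",
];

// Hypothetical per-page scan result; field names are illustrative.
interface PageScan {
  fontNames: string[];
  hasImages: boolean;
  hasTables: boolean;
}

interface DocumentFlags {
  isPureText: boolean;
  hasMathFonts: boolean;
  isComplex: boolean;
}

function isMathFont(fontName: string): boolean {
  const name = fontName.toLowerCase();
  return MATH_FONT_MARKERS.some((marker) => name.includes(marker));
}

function deriveFlags(pages: PageScan[]): DocumentFlags {
  const hasImages = pages.some((p) => p.hasImages);
  const hasTables = pages.some((p) => p.hasTables);
  const hasMathFonts = pages.some((p) => p.fontNames.some((f) => isMathFont(f)));
  const isComplex = hasImages || hasTables || hasMathFonts;
  // With only these three signals, "pure text" is simply "not complex".
  return { isPureText: !isComplex, hasMathFonts, isComplex };
}
```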
4b. Per-Page Classification
Source: services/pdf-complexity-detector.ts → detectComplexContentPerPage()
When the smart cascade or inline path needs per-page routing, each page is classified individually:
| Content Type | Criteria | Recommended Backend |
|---|---|---|
| text | No images, no tables, no math fonts | marker |
| math | Math fonts present, no images | marker+temml |
| image | Images present, no math, no tables | gemini-flash |
| table | Tables/figures present, no images, no math | marker |
| mixed | Multiple features (e.g., images + math, images + tables) | claude-vision |
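The table can be read as a small decision function. The sketch below evaluates the rows top to bottom; the PageFeatures shape and the function name classifyPage are hypothetical. One reading is an assumption: a page with math fonts and tables but no images is treated as math, because the math row does not exclude tables.

```typescript
type ContentType = "text" | "math" | "image" | "table" | "mixed";
type Backend = "marker" | "marker+temml" | "gemini-flash" | "claude-vision";

// Hypothetical per-page feature flags.
interface PageFeatures {
  hasImages: boolean;
  hasTables: boolean;
  hasMathFonts: boolean;
}

function classifyPage(f: PageFeatures): ContentType {
  if (!f.hasImages && !f.hasTables && !f.hasMathFonts) return "text";
  if (f.hasMathFonts && !f.hasImages) return "math"; // assumption: math + tables counts as math
  if (f.hasImages && !f.hasMathFonts && !f.hasTables) return "image";
  if (f.hasTables && !f.hasImages && !f.hasMathFonts) return "table";
  return "mixed"; // images combined with math and/or tables
}

// Recommended backend per content type, as listed in the table.
const RECOMMENDED_BACKEND: Record<ContentType, Backend> = {
  text: "marker",
  math: "marker+temml",
  image: "gemini-flash",
  table: "marker",
  mixed: "claude-vision",
};
```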
5. Routing Decision Tree
Primary Decision Flow (Auto Mode, PDF with Anthropic Key)
POST /api/convert/:fileId (parser: 'auto')
- fileType === 'docx' → mammoth.js local conversion → post-processing
- fileType === 'image' → image passthrough + AI alt text → post-processing
- fileType === 'pdf' + ANTHROPIC_API_KEY available
  - runPreflight(): blockers found? → HTTP 422 PREFLIGHT_BLOCKED
  - detectComplexContent() (fast document-level scan)
  - Math fonts detected + Mathpix credentials available → ROUTE 1: Mathpix Pipeline
  - isPureText + Marker key available + !highFidelity → ROUTE 2: Marker Fast Path
  - Otherwise (complex content, or no fast-path match) → ROUTE 3: Async Chunked Pipeline (primary production path)
Route 1: Mathpix Pipeline
Trigger: PDF contains math fonts AND Mathpix API credentials are configured.
Why: Mathpix natively understands LaTeX/MathML notation and produces correct mathematical markup that vision models often get wrong.
Route 2: Marker Fast Path
Trigger: PDF is pure text (no images, tables, or math) AND Marker API key available AND highFidelity is not requested.
Why: For text-only PDFs, Marker's OCR engine is faster and cheaper than vision models, and produces accurate text extraction.
Route 3: Async Chunked Pipeline
Trigger: Everything else: complex PDFs, mixed content, high-fidelity requests.
Why: Vision models can handle any content type. Chunking enables parallel processing of large documents.
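The three triggers above reduce to an ordered check. This is a sketch under the stated triggers only; the RoutingInput shape and chooseRoute are hypothetical names, and the DOCX/image branches and the pre-flight step are omitted.

```typescript
type Route = "mathpix" | "marker-fast-path" | "async-chunked";

// Hypothetical input: flags from the document-level scan plus configured credentials.
interface RoutingInput {
  hasMathFonts: boolean;
  isPureText: boolean;
  hasMathpixCredentials: boolean;
  hasMarkerKey: boolean;
  highFidelity: boolean;
}

function chooseRoute(input: RoutingInput): Route {
  // Route 1: math fonts take priority when Mathpix is configured.
  if (input.hasMathFonts && input.hasMathpixCredentials) return "mathpix";
  // Route 2: pure text, Marker available, and high fidelity not requested.
  if (input.isPureText && input.hasMarkerKey && !input.highFidelity) return "marker-fast-path";
  // Route 3: everything else.
  return "async-chunked";
}
```

Note that a math document without Mathpix credentials falls through to Route 3, since isPureText is false whenever math fonts are present.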
Fallback Decision Flow (No Anthropic Key)
- !isComplex + Marker key → Marker
- isComplex + !highFidelity → Smart Cascade
- isComplex + highFidelity → error (needs Anthropic key)
- no keys → error
Inline Path (Small Documents, Explicit Parser Selection)
parser === 'cascade'
- Smart Cascade: Marker → Mathpix → Agentic Vision (tiered per page)
parser === 'auto' (inline, not async)
- budget tier → Smart Cascade (budget mode, Marker only)
- !isComplex + Marker → Marker
- isComplex + !highFidelity → Smart Cascade
- isComplex + highFidelity + >10 pages → Chunked Agentic Vision
- isComplex + highFidelity + ≤10 pages → Agentic Vision (whole document)
- !isComplex + highFidelity → Agentic Vision
- !isComplex + no Marker → Claude single-pass
- no Anthropic → Mathpix fallback → Marker fallback
6. Processing Pipelines
6.1 Mathpix Pipeline
Source: services/mathpix-pdf.ts, routes/convert.ts (lines 1904-2025)
Best for: PDFs with mathematical equations, scientific notation, LaTeX content.
Flow:
- Split PDF into individual pages.
- Submit each page to https://api.mathpix.com/v3/pdf (concurrency limit: 3).
- Poll GET /v3/pdf/{pdfId} every 3 seconds, up to 5 minutes.
- Download HTML + images via /v3/pdf/{pdfId}.html.zip.
- Embed extracted images as data URIs in page HTML.
- Wrap each page in <section class="pdf-page" role="region">.
- Continue to post-processing.
AI involved: Mathpix proprietary ML models (math recognition, OCR). Cost: ~$0.005/page.
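The "poll every 3 seconds, up to 5 minutes" step of this flow can be sketched generically. The helper pollUntilDone, its options, and the three status strings are hypothetical and do not reflect the actual Mathpix response schema; the caller supplies getStatus, which would wrap the GET /v3/pdf/{pdfId} request.

```typescript
// Simplified status set, assumed for illustration only.
type JobStatus = "processing" | "completed" | "error";

interface PollOptions {
  intervalMs?: number;
  timeoutMs?: number;
  sleep?: (ms: number) => Promise<void>; // injectable so tests need not wait
}

async function pollUntilDone(
  getStatus: () => Promise<JobStatus>,
  options: PollOptions = {},
): Promise<number> {
  const intervalMs = options.intervalMs ?? 3_000; // every 3 seconds
  const timeoutMs = options.timeoutMs ?? 5 * 60_000; // up to 5 minutes
  const sleep =
    options.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));
  const maxAttempts = Math.floor(timeoutMs / intervalMs);

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const status = await getStatus();
    if (status === "completed") return attempt; // number of polls it took
    if (status === "error") throw new Error("conversion failed");
    await sleep(intervalMs);
  }
  throw new Error(`timed out after ${timeoutMs} ms`);
}
```

With the defaults this allows at most 100 status requests per job. Bounding by attempt count rather than wall-clock time keeps the sketch simple but ignores the latency of each request.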
6.2 Marker Fast Path
Source: services/marker-converter.ts
Best for: Text-only PDFs, simple tables without images.
Flow:
- POST https://api.datalab.to/api/v1/marker with output_format: html, paginate_output: true.
- Poll GET /api/v1/marker/{request_id} every 3 seconds, up to 5 minutes.
- Receive HTML + extracted images (base64 from response).
- Fallback: if HTML not provided, convert Markdown output to basic HTML.
- Continue to post-processing.
AI involved: Datalab Surya OCR (deep learning model). No Claude/Gemini calls. Cost: ~$0.006/page.
6.3 Async Chunked Pipeline (Primary Production Path)
Source: services/chunk-boundary-detector.ts, scheduler/chunk-scheduler.ts, services/chunk-processor.ts, services/chunk-assembler.ts
Best for: Complex PDFs of any size; this is the default for production conversions.
Step 1: Chunk Boundary Detection
Source: services/chunk-boundary-detector.ts → detectChunkBoundaries()
- Read PDF outline (bookmarks) up to 2 levels deep; bookmark pages become natural break points.
- Fallback: per-page heading detection, which checks the first text item on each page against regex patterns (numbered sections, "Chapter"/"Part"/"Section" keywords, Roman numerals, ALL-CAPS lines).
- Greedy chunking: target 20 pages per chunk (TARGET_CHUNK_SIZE_PAGES), snap to nearest natural break within lookahead, hard-split at 30 pages (MAX_CHUNK_SIZE_PAGES).
- Create one chunk_jobs row per boundary in Supabase, plus a large_conversion_jobs parent record.
- Return immediately to client: { jobId, asyncMode: true, totalChunks }.
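One way to read the greedy chunking rule is sketched below. It is not the code in chunk-boundary-detector.ts. The lookahead width is not given in the text, so LOOKAHEAD_PAGES = 10 is an assumption, as are the function name detectChunks and the convention that breakPages lists the 1-based pages on which a new section starts.

```typescript
const TARGET_CHUNK_SIZE_PAGES = 20;
const MAX_CHUNK_SIZE_PAGES = 30;
const LOOKAHEAD_PAGES = 10; // assumption: the real lookahead width is not documented here

interface ChunkBoundary {
  startPage: number; // 1-based, inclusive
  endPage: number; // 1-based, inclusive
}

function detectChunks(totalPages: number, breakPages: number[]): ChunkBoundary[] {
  const chunks: ChunkBoundary[] = [];
  let start = 1;
  while (start <= totalPages) {
    const remaining = totalPages - start + 1;
    if (remaining <= TARGET_CHUNK_SIZE_PAGES) {
      chunks.push({ startPage: start, endPage: totalPages }); // final chunk
      break;
    }
    const ideal = start + TARGET_CHUNK_SIZE_PAGES; // ideal first page of the next chunk
    const hardLimit = start + MAX_CHUNK_SIZE_PAGES; // hard split if no break is usable
    let nextStart = hardLimit;
    let bestDistance = Infinity;
    for (const b of breakPages) {
      const distance = Math.abs(b - ideal);
      if (b <= totalPages && b <= hardLimit && distance <= LOOKAHEAD_PAGES && distance < bestDistance) {
        nextStart = b; // snap to the natural break nearest the target
        bestDistance = distance;
      }
    }
    const end = Math.min(nextStart - 1, totalPages);
    chunks.push({ startPage: start, endPage: end });
    start = end + 1;
  }
  return chunks;
}
```

For example, a 50-page PDF with section starts on pages 18 and 41 yields chunks 1-17, 18-40, and 41-50, while a 70-page PDF with no detectable breaks is hard-split into 30, 30, and 10 pages.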
Step 2: Chunk Scheduling
Source: scheduler/chunk-scheduler.ts → ChunkScheduler
Runs continuously on the Node.js server. Every 3 seconds:
- Query Supabase for pending chunks.
- Claim up to 8 chunks (MAX_CONCURRENT_CHUNKS) using optimistic locking: UPDATE chunk_jobs SET status='processing' WHERE id=? AND status='pending'. If another node already claimed it, the update affects 0 rows, so the chunk is skipped.
- Process each claimed chunk via processChunk().
- Every 5 cycles: reclaim stale chunks (processing > 15 minutes).
- Every 10 cycles: recover orphaned jobs (all chunks done but counter mismatch).
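The optimistic-locking claim can be illustrated with an in-memory stand-in for the chunk_jobs table. Everything here is a toy model: InMemoryChunkTable and claimBatch are hypothetical, and claim() only mimics the "rows affected" result of the conditional UPDATE described above.

```typescript
const MAX_CONCURRENT_CHUNKS = 8;

type ChunkStatus = "pending" | "processing" | "done";

interface ChunkJob {
  id: string;
  status: ChunkStatus;
  claimedBy?: string;
}

// Toy stand-in for the chunk_jobs table.
class InMemoryChunkTable {
  private rows = new Map<string, ChunkJob>();

  insert(id: string): void {
    this.rows.set(id, { id, status: "pending" });
  }

  pendingIds(): string[] {
    return Array.from(this.rows.values())
      .filter((row) => row.status === "pending")
      .map((row) => row.id);
  }

  // Mimics: UPDATE chunk_jobs SET status='processing' WHERE id=? AND status='pending'
  // Returns the number of affected rows (0 or 1).
  claim(id: string, node: string): number {
    const row = this.rows.get(id);
    if (!row || row.status !== "pending") return 0;
    row.status = "processing";
    row.claimedBy = node;
    return 1;
  }
}

function claimBatch(table: InMemoryChunkTable, node: string, candidates: string[]): string[] {
  const claimed: string[] = [];
  for (const id of candidates) {
    if (claimed.length >= MAX_CONCURRENT_CHUNKS) break;
    if (table.claim(id, node) === 1) claimed.push(id); // 0 rows means another node won
  }
  return claimed;
}
```

If two nodes read the same pending list before either claims, the status condition in the update ensures each chunk is processed by exactly one of them; the loser sees 0 affected rows and moves on.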
Step 3: Per-Chunk Processing
Source: services/chunk-processor.ts → processChunk()
- Extract the chunk's page range from the PDF via extractPageRange().
- Run convertWithChunkedAgenticVision(): Gemini Flash as primary, Claude Sonnet as fallback.
- Maximum 4 iterations per page (maxIterationsPerPage).
- Provide precedingContextHtml (last 2,500 chars from previous chunk) for structural continuity across chunk boundaries.
- Wrap result in <section class="pdf-chunk" data-chunk-index="N" data-start-page="X" data-end-page="Y">.
- Store chunk HTML to R2 at users/{userId}/jobs/{jobId}/chunks/{chunkIndex}.html.
- Store contextTail back to chunk_jobs.context_tail for the next chunk.
- Atomically increment large_conversion_jobs.done_chunks via Supabase RPC.
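The context tail, the wrapper element, and the storage key from the steps above can be sketched as three small helpers. The function names are hypothetical; the 2,500-character limit, the section attributes, and the key layout come from the text.

```typescript
const CONTEXT_TAIL_CHARS = 2500;

// Last N characters of the previous chunk's HTML, passed on as precedingContextHtml.
function contextTail(html: string, maxChars: number = CONTEXT_TAIL_CHARS): string {
  return html.length <= maxChars ? html : html.slice(-maxChars);
}

// Wraps converted chunk HTML in the documented section element.
function wrapChunk(html: string, chunkIndex: number, startPage: number, endPage: number): string {
  return (
    `<section class="pdf-chunk" data-chunk-index="${chunkIndex}" ` +
    `data-start-page="${startPage}" data-end-page="${endPage}">${html}</section>`
  );
}

// R2 key layout: users/{userId}/jobs/{jobId}/chunks/{chunkIndex}.html
function chunkKey(userId: string, jobId: string, chunkIndex: number): string {
  return `users/${userId}/jobs/${jobId}/chunks/${chunkIndex}.html`;
}
```

A plain character slice can start in the middle of a tag. That is acceptable if the tail is only used as prompt context, which is how the text describes it, but it would not be safe to parse the tail as HTML.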
Per-page model routing within a chunk:
| Page Content Type | Primary Model | Escalation (if score stalls) |
|---|---|---|
| text or table | Claude Haiku | Claude Sonnet |
| image, math, or mixed | Claude Sonnet | None |
Step 4: Assembly
Source: services/chunk-assembler.ts → assembleChunks()
Triggered when done_chunks reaches total_chunks:
- Load all chunk HTMLs from R2 in parallel.
- Concatenate and repair malformed HTML (auto-close unclosed tags at chunk boundaries).
- Extract embedded PDF images and insert them into the <img> tags emitted by the vision model.
- Run the full post-processing pipeline.
- Store final HTML to R2.
- Deduct credits, send completion email and web push notification.
Progress streaming: Clients subscribe to GET /api/convert/:fileId/stream (SSE) for real-time progress, chunk, complete, and error events.
6.4 Agentic Vision Converter (Core AI Engine)
Source: services/agentic-vision-converter.ts
This is the central AI engine used by the chunked pipeline, smart cascade, and high-fidelity modes. It implements an iterative visual feedback loop.
How It Works
- Initial pass: Send the PDF (base64) to the vision model with a detailed prompt specifying semantic HTML rules, MathML requirements, heading hierarchy, and figure/image treatment. Receive raw HTML.
- Screenshot refinement loop (up to maxIterations):
  - Render current HTML in Puppeteer at 1280×1600 viewport.
  - Take a full-page PNG screenshot.
  - If layout scorer is configured (Gemini), score the screenshot against the original PDF.
    - Score ≥ 90 (layoutScoreThreshold) → stop early, quality is sufficient.
    - Score delta < 3 for 2+ consecutive passes → stalling → escalate to more expensive model if fallbackStrategy is configured.
  - Send original PDF + screenshot + current HTML to the model with a refinement prompt ("fix visual differences").
  - If model responds NO_CHANGES_NEEDED → stop (converged).
  - Update HTML with refined version.
- Return final HTML + token usage + models used.
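The stop and escalate rules of the refinement loop can be isolated into a pure decision function over the score history. This is a sketch: nextAction and its option names other than layoutScoreThreshold are hypothetical, and "score delta" is interpreted as signed improvement between consecutive passes, so a drop in score also counts as stalling.

```typescript
type LoopAction = "stop" | "escalate" | "continue";

interface LoopOptions {
  layoutScoreThreshold?: number; // 90 in the text
  stallDelta?: number; // improvement below this counts as a stalled pass
  stallPasses?: number; // consecutive stalled passes before escalating
  hasFallback: boolean; // whether a fallbackStrategy is configured
}

function nextAction(scores: number[], options: LoopOptions): LoopAction {
  const threshold = options.layoutScoreThreshold ?? 90;
  const stallDelta = options.stallDelta ?? 3;
  const stallPasses = options.stallPasses ?? 2;

  const latest = scores[scores.length - 1];
  if (latest === undefined) return "continue"; // nothing scored yet
  if (latest >= threshold) return "stop"; // quality is sufficient

  // Count trailing passes whose improvement over the previous pass is below stallDelta.
  let stalled = 0;
  for (let i = scores.length - 1; i >= 1; i--) {
    const improvement = scores[i]! - scores[i - 1]!;
    if (improvement >= stallDelta) break;
    stalled++;
  }
  if (stalled >= stallPasses && options.hasFallback) return "escalate";
  return "continue";
}
```

The NO_CHANGES_NEEDED convergence check and the maxIterations cap sit outside this function; they would be handled by the surrounding loop.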
Model Strategies
| Strategy | SDK | Notes |
|---|---|---|
| ClaudeVisionStrategy | Anthropic SDK | Sends PDF as document block with cache_control: ephemeral for prompt caching |
| GeminiVisionStrategy | Google Generative AI SDK | Sends PDF as inlineData with mimeType: application/pdf |
6.5 Smart Cascade Converter
Source: services/smart-cascade-converter.ts
Best for: When cost optimization matters. Uses cheap tools first and only escalates to expensive vision models when quality is insufficient.
Per-Page Tiered Escalation
| Page Type | Tier 1 (cheapest) | Tier 2 (if quality < 80) | Tier 3 (if still < 80) |
|---|---|---|---|
| text or table | Marker API | Gemini Flash vision | Claude Sonnet agentic |
| math | Marker + temml | Mathpix per-page image API | Gemini Flash vision |
| image | Gemini Flash vision | Claude Sonnet agentic | None |
| mixed | Gemini Flash vision | Claude Sonnet agentic | None |
Quality Scoring
Each tier's output is scored before deciding whether to escalate:
- WCAG validation violations: up to -40 penalty
- Semantic HTML ratio: up to +30 bonus (measures proportion of semantic elements vs raw <div>/<span>)
- Structure bonuses: lang attribute, <title>, valid heading hierarchy
- Threshold: Score must reach 80 (qualityThreshold) to accept a tier's result
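The way this threshold drives per-page escalation can be sketched as a loop over tiers. The scoring function is passed in because the exact weights are not given in the text. Tier, TierResult, and convertWithEscalation are hypothetical names, the converters are synchronous for brevity, and returning the best-scoring attempt when no tier reaches the threshold is an assumption.

```typescript
interface Tier {
  name: string;
  convert: (page: number) => string; // returns HTML for the page
}

interface TierResult {
  tier: string;
  html: string;
  score: number;
  accepted: boolean;
}

function convertWithEscalation(
  page: number,
  tiers: Tier[],
  scoreHtml: (html: string) => number,
  qualityThreshold: number = 80,
): TierResult {
  let best: TierResult | undefined;
  for (const tier of tiers) {
    const html = tier.convert(page);
    const score = scoreHtml(html);
    if (score >= qualityThreshold) {
      // Good enough: later, more expensive tiers are never called.
      return { tier: tier.name, html, score, accepted: true };
    }
    if (!best || score > best.score) {
      best = { tier: tier.name, html, score, accepted: false };
    }
  }
  if (!best) throw new Error("no tiers configured");
  return best; // assumption: fall back to the best attempt seen
}
```

For a text page, tiers would be ordered Marker, Gemini Flash, Claude Sonnet, matching the table above; the cost saving comes from the early return.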
Budget Mode
When budgetMode: true:
- Marker-only, no escalation to vision models.
- Hard cost cap enforced per page (maxCostUsd).
Concurrency: Up to 8 pages in parallel (maxPagesParallel). Uses a pool where each slot refills as soon as a page finishes, so slow vision pages do not hold up fast text pages.
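A slot-refilling pool of this kind can be sketched in a few lines. mapWithPool is a hypothetical helper, not the converter's actual implementation; the limit argument plays the role of maxPagesParallel.

```typescript
// Runs worker over items with at most `limit` calls in flight.
// Each slot pulls the next item as soon as its current one finishes.
async function mapWithPool<T, R>(
  items: T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0; // index of the next unclaimed item

  async function runSlot(): Promise<void> {
    while (next < items.length) {
      const index = next++; // claimed synchronously, so no two slots share an item
      results[index] = await worker(items[index]!, index);
    }
  }

  const slotCount = Math.min(limit, items.length);
  const slots = Array.from({ length: slotCount }, () => runSlot());
  await Promise.all(slots);
  return results; // same order as items, regardless of completion order
}
```

Compared with processing pages in fixed batches of 8, a batch waits for its slowest page before starting the next one, whereas here a finished slot starts new work immediately. In this sketch a rejected worker rejects the whole call; per-page error handling is left out.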
6.6 Other Parsers
| Parser | Source | Trigger | AI Involved |
|---|---|---|---|
| claude-vision (explicit) | claude-converter.ts | User explicitly selects "Claude Vision" | Claude Sonnet single-pass (no iteration) |
| segmented | convert.ts | User selects; requires Mathpix | Mathpix (structure) + vision models (images) |
| vision-tables | convert.ts | User selects table extraction | detectComplexContentPerPage() to find table pages, then Claude vision per page |
| DOCX | convert.ts | .docx file uploaded | mammoth.js (local, no AI) |
| Image passthrough | convert.ts | Image file uploaded | Vision model for alt text only |
7. Post-Processing Pipeline
Applied after every converter, in this order. No matter which route a file took, it goes through the same post-processing.
Source: routes/convert.ts (lines 1497-1833), services/chunk-assembler.ts
Step-by-Step
| Step | Function | What It Does | AI? |
|---|---|---|---|
| 1 | enhanceImagesInHtml() | For each extracted image, calls a vision model to generate descriptive alt text. Uses isAltTextAcceptable() blocklist to reject generic captions like "diagram". | Yes: Gemini Flash, Claude, or GPT-4o-mini |
| 2 | storeAndEmbedImages() | Stores images to R2 and embeds as data URIs in HTML | No |
| 3 | Image extraction fallback | If converter produced no images but classification shows image/mixed pages, extracts embedded PDF image objects via extractImagesFromPdfPages(). If that fails, falls back to full-page Puppeteer screenshots via renderPdfPagesAsDataUris(). | No |
| 4 | structurePages() | Adds page header/footer banners to each <section class="pdf-page">, wraps in page-numbered sections | No |
| 5 | optimizeDeterministic() | Pure HTML transforms (no AI): adds <thead>/<tbody> to tables, promotes first row to <th>, adds scope attributes, converts <br> sequences to <p>, adds aria-label/role="img" to SVGs, cleans unnecessary wrapper <div>s, converts LaTeX to MathML via temml | No |
| 6 | enhanceAccessibility() | Adds lang attribute, <title>, viewport meta, skip-link, source document banner, ensures DOCTYPE | No |
| 7 | validateAndFix() | Custom WCAG rule checker that auto-fixes: missing alt text, empty links/buttons, missing table headers, duplicate IDs, empty headings, invalid scope attributes. Up to 3 fix passes. | No |
| 8 | runAxeAudit() | Full browser-based accessibility audit using axe-core in Puppeteer. Non-blocking; skipped if browser unavailable. | No |
| 9 | runAxeFixLoop() | If fixable violations remain after step 7, runs automated DOM manipulation fixes. | No |
| 10 | wrapInDocument() | Wraps in <!DOCTYPE html> skeleton with responsive CSS. High-fidelity mode adds serif fonts, table borders, figure styling. | No |
Final Steps
| Step | What Happens |
|---|---|
| Storage | Final HTML → R2 at users/{userId}/output/{fileId}/index.html |
| Credit deduction | 1 credit per page via Supabase RPC deduct_credits |
| WCAG failure alert | If wcagStatus.passed === false, email alert sent to ALERT_EMAIL (rate-limited: once per fileId per 24 hours) |
| Notification | Completion email + web push notification to user |
8. AI Services Reference
| Service | Model ID | Provider | Role | When Used | Approx. Cost |
|---|---|---|---|---|---|
| Claude Sonnet | claude-sonnet-4-6 | Anthropic | Primary vision converter for complex/mixed/image pages; iterative refinement with screenshot feedback | Route 3 (complex pages), Smart Cascade Tier 3, high-fidelity mode | ~$3/MTok in, $15/MTok out |
| Claude Haiku | claude-haiku-4-5-20251001 | Anthropic | Cheaper vision for text/table pages; API key pre-flight validation | Route 3 (simple pages), alt text generation fallback | ~$0.80/MTok in, $4/MTok out |
| Gemini Flash | gemini-2.5-flash | Google | First-pass converter in chunk pipeline; layout quality scoring; image pages in cascade | Route 3 primary strategy, Smart Cascade Tier 2, layout scorer | Variable |
| Mathpix | Mathpix API | Mathpix | Native math/equation extraction → LaTeX, MathML output | Route 1 (math-detected PDFs), Smart Cascade Tier 2 for math pages | ~$0.005/page |
| Marker / Surya | Datalab API | Datalab | Text extraction OCR for text-only PDFs | Route 2 (pure text), Smart Cascade Tier 1 | ~$0.006/page |
| Gemini Flash (images) | gemini-flash | Google | Alt text generation for extracted images | Post-processing step 1 | ~$0.0003/image |
| GPT-4o-mini | gpt-4o-mini | OpenAI | Optional alt text generation (if configured as imageModel) | Post-processing step 1 (optional) | ~$0.0004/image |
| Claude (images) | Haiku or Sonnet | Anthropic | Alt text generation when Gemini unavailable | Post-processing step 1 (fallback) | $0.002-$0.01/image |
| temml | N/A | Local library | LaTeX → MathML rendering (no external calls) | Post-processing step 5, Marker+temml path | Free |
9. Key Files Reference
| File | Role |
|---|---|
| workers/api/src/routes/convert.ts | Main entry: pre-checks, routing decisions, inline pipeline orchestration |
| workers/api/src/routes/files.ts | File upload, URL ingestion, file management |
| workers/api/src/utils/file-list.ts | File metadata CRUD: Supabase files table with camelCase↔snake_case mapping |
| workers/api/src/routes/convert-stream.ts | SSE endpoint for real-time chunk progress |
| workers/api/src/services/pdf-complexity-detector.ts | Zero-cost PDF structure analysis: images, tables, math fonts per page |
| workers/api/src/services/pdf-preflight.ts | Pre-flight checks: encryption, image-only, corruption, JavaScript |
| workers/api/src/services/chunk-boundary-detector.ts | Section break detection using PDF outline and heading patterns |
| workers/api/src/scheduler/chunk-scheduler.ts | Background job runner: claims, processes, assembles chunks |
| workers/api/src/services/chunk-processor.ts | Single chunk processing using Gemini-first/Claude-fallback per-page |
| workers/api/src/services/chunk-assembler.ts | Stitches chunk fragments into final WCAG-compliant HTML document |
| workers/api/src/services/agentic-vision-converter.ts | Core iterative AI converter: initial pass + screenshot feedback loop |
| workers/api/src/services/smart-cascade-converter.ts | Per-page tiered routing: Marker → Gemini → Claude |
| workers/api/src/services/marker-converter.ts | Datalab Marker/Surya API client |
| workers/api/src/services/mathpix-pdf.ts | Mathpix API client for math-heavy PDFs |
| workers/api/src/services/wcag-validator.ts | WCAG rule checker + auto-fixer (no AI) |
| workers/api/src/services/image-enhancer.ts | AI-powered alt text generation |
| workers/api/src/server.ts | Node.js server entry with ChunkScheduler startup |
| packages/shared/src/types.ts | UploadedFile, ParserOptions, FileStatus, QualityTier types |
| packages/shared/src/constants.ts | TARGET_CHUNK_SIZE_PAGES (20), MAX_CHUNK_SIZE_PAGES (30), CONTEXT_TAIL_CHARS (2500) |