Image Embedding Strategy

Decision

All images in converted HTML output are embedded as inline base64 data URIs. The final deliverable is a single self-contained .html file with no external dependencies.

How It Works

The pipeline handles images at three points:

Extracted images (Marker, MathPix) — storeAndEmbedImages() converts each image to a data:image/png;base64,... URI and replaces src references in the HTML. Images are also stored to R2 for archival, but the R2 copies are never referenced from the HTML.
Page screenshots (vision converters) — When vision converters produce <img> tags but no actual image data (extractedImages.length === 0), renderPdfPagesAsDataUris() renders the original PDF pages as PNGs via Browser Rendering + pdf.js, then embedPageScreenshots() injects them as data URIs into the matching <section data-page-number="N"> blocks.
AI-enhanced alt text — enhanceImagesInHtml() runs before embedding, so images still have their original filenames for matching. Alt text is written to the <img> tags, then embedding replaces the src with a data URI.
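
The two embedding paths above can be sketched roughly as follows. This is a minimal illustration, not the code in convert.ts: the function names carry a Sketch suffix because the real signatures of storeAndEmbedImages(), embedPageScreenshots(), toDataUri(), and uint8ToBase64() are not shown in this document, and the choice to place each screenshot directly after the opening <section> tag (with a generic alt text) is an assumption.

```typescript
// Hypothetical stand-ins: the real toDataUri() / uint8ToBase64() signatures
// are not shown in this document, so these names and shapes are assumptions.
function uint8ToBase64Sketch(bytes: Uint8Array): string {
  let binary = "";
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary); // btoa is a global in Workers, browsers, and Node 16+
}

function toDataUriSketch(bytes: Uint8Array, mime: string = "image/png"): string {
  return "data:" + mime + ";base64," + uint8ToBase64Sketch(bytes);
}

// Path 1 (extracted images): swap each <img src="filename"> for a data URI
// when bytes for that filename are available; leave other tags untouched.
function embedImagesSketch(html: string, images: Record<string, Uint8Array>): string {
  return html.replace(
    /(<img\b[^>]*?\bsrc=")([^"]+)(")/g,
    (match: string, pre: string, src: string, post: string) => {
      const bytes = Object.prototype.hasOwnProperty.call(images, src) ? images[src] : undefined;
      return bytes ? pre + toDataUriSketch(bytes) + post : match;
    }
  );
}

// Path 2 (page screenshots): insert one <img> right after the opening tag of
// each <section data-page-number="N"> that has a rendered page (assumed placement).
function embedPageScreenshotsSketch(html: string, pageUris: Record<number, string>): string {
  return html.replace(
    /(<section\b[^>]*\bdata-page-number="(\d+)"[^>]*>)/g,
    (match: string, openTag: string, page: string) => {
      const uri = pageUris[Number(page)];
      return uri ? openTag + '<img src="' + uri + '" alt="Page ' + page + '">' : match;
    }
  );
}
```

Because path 1 matches on the original filename in src, any step that needs that filename (such as alt-text enhancement) has to run before this replacement, which is the ordering described above.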

Scale factor

Page screenshots use a scale factor of 1.5 (150% of default viewport). This balances quality against file size: scale 2.0 produces sharper images but renders about 1.78 times as many pixels ((2.0 / 1.5)^2), which roughly doubles the PNG byte count.
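
The pixel arithmetic behind this trade-off can be checked with a short sketch. It assumes a US Letter page (612 x 792 PDF points) and that the scale factor multiplies both page dimensions, as a pdf.js viewport does; the helper name is illustrative and not part of the codebase. PNG byte counts depend on content and compression, so pixel count is only a proxy for file size.

```typescript
// Pixel count of a rendered page at a given scale.
// 612 x 792 points is a US Letter page (an assumed example size).
function renderedPixels(widthPt: number, heightPt: number, scale: number): number {
  return Math.round(widthPt * scale) * Math.round(heightPt * scale);
}

const pixelsAt15 = renderedPixels(612, 792, 1.5); // 918 x 1188 = 1,090,584
const pixelsAt20 = renderedPixels(612, 792, 2.0); // 1224 x 1584 = 1,938,816
const pixelRatio = pixelsAt20 / pixelsAt15;       // (2.0 / 1.5)^2, about 1.78
```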

Why Inline Data URIs

Benefit	Detail
Single-file portability	Users download one `.html` file and open it anywhere — no broken images, no server dependency
Offline access	Works completely offline after download
No asset management	No CDN, no signed URLs, no expiring links, no CORS
Accessibility tools	Screen readers and assistive tech work identically whether the file is local or hosted
Simplicity	No need to coordinate HTML + image uploads or generate asset manifests

Trade-offs

Cost	Detail
File size	Base64 encodes every 3 bytes as 4 characters, adding ~33% overhead. A full-page PNG at scale 1.5 is typically 500KB–2MB. A 20-page image-heavy PDF can produce a 20–40MB HTML file.
Browser performance	Very large data URIs (>50MB total) can slow DOM parsing and increase memory usage
Redundant R2 storage	`storeAndEmbedImages()` both stores images to R2 and embeds them inline; the R2 copies exist only for archival and debugging
No incremental loading	Image bytes travel inside the HTML stream, so content after a large image cannot render until that image's bytes arrive, and images cannot be lazy-loaded the way external ones can
Cache inefficiency	If the same image appears in multiple conversions, each HTML file embeds its own copy rather than sharing a cached asset
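
The file-size row can be sanity-checked with a small estimate. This is a rough sketch under stated assumptions: one full-page PNG per page, 1KB taken as 1000 bytes, and markup and text ignored; the helper names are illustrative only. With 500KB–2MB per page, the bounds for 20 pages come out at roughly 13MB and 53MB, which brackets the 20–40MB figure quoted in the table.

```typescript
const DATA_URI_PREFIX = "data:image/png;base64,";

// Base64 emits 4 output characters for every 3 input bytes, padded up.
function base64Length(byteCount: number): number {
  return 4 * Math.ceil(byteCount / 3);
}

// Rough size contributed by embedded page images alone
// (assumption: one PNG per page; markup and text are not counted).
function estimateEmbeddedBytes(pageCount: number, pngBytesPerPage: number): number {
  return pageCount * (DATA_URI_PREFIX.length + base64Length(pngBytesPerPage));
}

const lowEstimate = estimateEmbeddedBytes(20, 500000);   // 13,333,800 bytes, about 13.3MB
const highEstimate = estimateEmbeddedBytes(20, 2000000); // 53,333,800 bytes, about 53.3MB
```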

Alternative: External R2 References

If the product evolves to include a hosted preview mode (viewing converted documents in-browser without downloading), switching to external R2-referenced URLs would reduce HTML file sizes significantly.

The R2 path users/{userId}/output/{fileId}/assets/ already exists but is unused. An external-image approach would:

Store images to R2 at assets/page-{N}.png
Reference them as <img src="/api/assets/{fileId}/page-{N}.png">
Require signed URLs or auth middleware for private documents
Require a zip download option (HTML + images folder) for offline use
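
For illustration only, the naming scheme for that alternative might look like the sketch below. None of this is implemented: the helper names are hypothetical, and only the R2 path layout and the /api/assets/ URL shape come from the description above. Auth, signed URLs, and the zip export are out of scope for the sketch.

```typescript
// Hypothetical helpers for the external-reference alternative (not implemented).

// R2 object key under the existing (currently unused) assets/ prefix.
function assetR2Key(userId: string, fileId: string, page: number): string {
  return "users/" + userId + "/output/" + fileId + "/assets/page-" + page + ".png";
}

// Public-facing URL the HTML would reference instead of a data URI.
function assetUrl(fileId: string, page: number): string {
  return "/api/assets/" + encodeURIComponent(fileId) + "/page-" + page + ".png";
}

// An <img> tag pointing at the external asset; the alt text is a placeholder.
function externalPageImg(fileId: string, page: number): string {
  return '<img src="' + assetUrl(fileId, page) + '" alt="Page ' + page + '">';
}
```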

Current decision: Not pursuing this. The single-file download model is simpler and meets current user needs. Revisit if file sizes become a user complaint or if a hosted preview feature is added.

Key Files

File	Role
`workers/api/src/routes/convert.ts`	`storeAndEmbedImages()`, `embedPageScreenshots()`, pipeline orchestration
`workers/api/src/utils/pdf-to-png.ts`	`renderPdfPageToPng()`, `renderPdfPagesAsDataUris()`
`workers/api/src/utils/html.ts`	`toDataUri()`, `uint8ToBase64()`
`workers/api/src/services/image-enhancer.ts`	AI alt-text generation (runs before embedding)