
Image Embedding Strategy

Decision

All images in converted HTML output are embedded as inline base64 data URIs. The final deliverable is a single self-contained .html file with no external dependencies.

How It Works

The pipeline handles images at three points:

  1. Extracted images (Marker, MathPix) — storeAndEmbedImages() converts each image to a data:image/png;base64,... URI and replaces src references in the HTML. Images are also stored to R2 for archival, but the R2 copies are never referenced from the HTML.

  2. Page screenshots (vision converters) — When vision converters produce <img> tags but no actual image data (extractedImages.length === 0), renderPdfPagesAsDataUris() renders the original PDF pages as PNGs via Browser Rendering + pdf.js, then embedPageScreenshots() injects them as data URIs into the matching <section data-page-number="N"> blocks.

  3. AI-enhanced alt text — enhanceImagesInHtml() runs before embedding, so images still have their original filenames for matching. Alt text is written to the <img> tags, then embedding replaces the src with a data URI.
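
The first two steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in convert.ts or html.ts: the function names toDataUri() and uint8ToBase64() come from this page, but their signatures here are guesses, and embedExtractedImages() and embedPageScreenshotsSketch() are hypothetical stand-ins that match tags with regular expressions for brevity. Alt-text enhancement is assumed to have already run, so src values still carry the original filenames.

```typescript
// Sketch only: signatures are assumptions, not the real helpers.

// Base64-encode raw bytes. btoa is available in Workers, browsers, and Node 16+.
function uint8ToBase64(bytes: Uint8Array): string {
  let binary = "";
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary);
}

function toDataUri(bytes: Uint8Array, mime: string = "image/png"): string {
  return `data:${mime};base64,${uint8ToBase64(bytes)}`;
}

// Step 1 (assumed behaviour): swap each <img src> whose filename matches an
// extracted image for an inline data URI; unmatched tags are left untouched.
function embedExtractedImages(html: string, images: Record<string, Uint8Array>): string {
  return html.replace(
    /(<img\b[^>]*?\ssrc=")([^"]+)(")/g,
    (match: string, pre: string, src: string, post: string) => {
      const name = src.split("/").pop() ?? src;
      const bytes = images[name];
      return bytes ? `${pre}${toDataUri(bytes)}${post}` : match;
    },
  );
}

// Step 2 (assumed behaviour): inside each <section data-page-number="N">,
// point every <img> at the screenshot data URI rendered for page N.
function embedPageScreenshotsSketch(html: string, pageUris: Record<number, string>): string {
  return html.replace(
    /<section\b[^>]*\sdata-page-number="(\d+)"[^>]*>[\s\S]*?<\/section>/g,
    (section: string, page: string) => {
      const uri = pageUris[Number(page)];
      if (!uri) return section;
      return section.replace(
        /(<img\b[^>]*?\ssrc=")[^"]*(")/g,
        (_m: string, pre: string, post: string) => pre + uri + post,
      );
    },
  );
}
```

A production implementation would more likely use an HTML parser or a streaming rewriter than regular expressions, since nested sections or unusual attribute quoting would defeat these patterns.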

Scale factor

Page screenshots use a scale factor of 1.5 (150% of the default viewport). This balances quality against file size: scale 2.0 produces sharper images but renders about 1.78× as many pixels ((2.0 / 1.5)² ≈ 1.78), which roughly doubles the PNG byte count.
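
The effect of the scale factor on pixel count can be worked through with a small calculation. The sketch below assumes the scale is applied like a pdf.js viewport scale, where scale 1.0 maps one PDF point to one pixel, and uses a US Letter page (612 × 792 points) as the example; renderedPixels() is a hypothetical helper, not part of the pipeline.

```typescript
// Assumption: scale 1.0 renders a US Letter page at 612 x 792 px (1 PDF point = 1 px).
function renderedPixels(scale: number, widthPt: number = 612, heightPt: number = 792) {
  const width = Math.floor(widthPt * scale);
  const height = Math.floor(heightPt * scale);
  return { width, height, pixels: width * height };
}

const at15 = renderedPixels(1.5); // 918 x 1188
const at20 = renderedPixels(2.0); // 1224 x 1584

// Pixel count grows with the square of the scale factor.
console.log((at20.pixels / at15.pixels).toFixed(2)); // 1.78
```

PNG size does not track pixel count exactly because compression depends on page content, so the ratio is a guide rather than a prediction.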

Why Inline Data URIs

| Benefit | Detail |
|---|---|
| Single-file portability | Users download one .html file and open it anywhere: no broken images, no server dependency |
| Offline access | Works completely offline after download |
| No asset management | No CDN, no signed URLs, no expiring links, no CORS |
| Accessibility tools | Screen readers and assistive tech work identically whether the file is local or hosted |
| Simplicity | No need to coordinate HTML + image uploads or generate asset manifests |

Trade-offs

| Cost | Detail |
|---|---|
| File size | Base64 encoding adds ~33% overhead. A full-page PNG at scale 1.5 is typically 500 KB–2 MB. A 20-page image-heavy PDF can produce a 20–40 MB HTML file. |
| Browser performance | Very large data URIs (>50 MB total) can slow DOM parsing and increase memory usage |
| Redundant R2 storage | storeAndEmbedImages() both stores images to R2 and embeds them inline; the R2 copies exist only for archival and debugging |
| No incremental loading | Images cannot be lazy-loaded or fetched in parallel. All image bytes arrive inline with the HTML, so content after a large image cannot render until that image has been received |
| Cache inefficiency | If the same image appears in multiple conversions, each HTML file embeds its own copy rather than sharing a cached asset |
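
The file-size cost can be estimated directly from the base64 rule that every 3 raw bytes become 4 characters. The sketch below is a back-of-the-envelope calculator; base64Length() and embeddedImageBytes() are hypothetical helpers, and the 20 × 1 MB document is an invented example rather than measured data.

```typescript
// Length of the base64 text for n raw bytes: every 3 bytes become 4 characters,
// and a final partial group is padded out to 4.
function base64Length(byteCount: number): number {
  return 4 * Math.ceil(byteCount / 3);
}

const DATA_URI_PREFIX = "data:image/png;base64,";

// Bytes contributed to the HTML by the embedded images alone (markup excluded).
function embeddedImageBytes(pngSizes: number[]): number {
  return pngSizes.reduce(
    (sum: number, n: number) => sum + DATA_URI_PREFIX.length + base64Length(n),
    0,
  );
}

// Hypothetical document: 20 page screenshots of 1,000,000 bytes each.
const pageSizes: number[] = [];
for (let i = 0; i < 20; i++) pageSizes.push(1000000);

console.log((embeddedImageBytes(pageSizes) / 1e6).toFixed(1) + " MB"); // 26.7 MB
```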

Alternative: External R2 References

If the product evolves to include a hosted preview mode (viewing converted documents in-browser without downloading), switching to external R2-referenced URLs would reduce HTML file sizes significantly.

The R2 path users/{userId}/output/{fileId}/assets/ already exists but is unused. An external-image approach would:

  • Store images to R2 at assets/page-{N}.png
  • Reference them as <img src="/api/assets/{fileId}/page-{N}.png">
  • Require signed URLs or auth middleware for private documents
  • Require a zip download option (HTML + images folder) for offline use
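
A rough sketch of what the external-reference variant might look like is shown below. Nothing here exists in the codebase as far as this page states: assetKey(), assetUrl(), and externalizePageImages() are hypothetical, combining the R2 prefix with the page-{N}.png filename is an assumption, and the regular expressions are a shortcut for illustration. Auth and signed URLs are left out.

```typescript
// Hypothetical helpers for the external-reference alternative.

// R2 object key, assuming page images live directly under the existing assets/ prefix.
function assetKey(userId: string, fileId: string, page: number): string {
  return `users/${userId}/output/${fileId}/assets/page-${page}.png`;
}

// Public route the HTML would reference instead of a data URI.
function assetUrl(fileId: string, page: number): string {
  return `/api/assets/${encodeURIComponent(fileId)}/page-${page}.png`;
}

// Replace inline data URIs inside each <section data-page-number="N">
// with the external URL for page N. Non-data-URI sources are left alone.
function externalizePageImages(html: string, fileId: string): string {
  return html.replace(
    /<section\b[^>]*\sdata-page-number="(\d+)"[^>]*>[\s\S]*?<\/section>/g,
    (section: string, page: string) =>
      section.replace(
        /(<img\b[^>]*?\ssrc=")data:[^"]*(")/g,
        (_m: string, pre: string, post: string) => pre + assetUrl(fileId, Number(page)) + post,
      ),
  );
}
```

The same rewritten HTML could be paired with the stored PNGs to build the zip download mentioned above, with src values switched to relative paths.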

Current decision: Not pursuing this. The single-file download model is simpler and meets current user needs. Revisit if file sizes become a user complaint or if a hosted preview feature is added.

Key Files

| File | Role |
|---|---|
| workers/api/src/routes/convert.ts | storeAndEmbedImages(), embedPageScreenshots(), pipeline orchestration |
| workers/api/src/utils/pdf-to-png.ts | renderPdfPageToPng(), renderPdfPagesAsDataUris() |
| workers/api/src/utils/html.ts | toDataUri(), uint8ToBase64() |
| workers/api/src/services/image-enhancer.ts | AI alt-text generation (runs before embedding) |