source.html — Debugging Converter Output

For every successful conversion the platform stores three HTML artifacts in R2:

| Artifact | Path | What it is |
| --- | --- | --- |
| index.html | users/{userId}/output/{fileId}/index.html | Final user-facing HTML, after all post-processing (ux-optimizer, mathml-validator, axe-fix-loop, visual-polish, etc.). This is what the dashboard preview and the WeasyPrint pipeline see. |
| ir.xhtml | users/{userId}/output/{fileId}/ir.xhtml | XHTML 1.0 Strict version of the same content (EPUB-ready). |
| source.html | users/{userId}/output/{fileId}/source.html | Raw converter output before any post-processing. This is what Claude vision / Mathpix / Marker actually emitted, before the deterministic clean-up passes mutated it. |

source.html is the most important debugging artifact in the system. Use it whenever a conversion’s output looks wrong.

Why source.html exists

The post-processing passes are defensive — they catch malformed output and try to make it presentable. That defense erases evidence of the original bug:

  • mathml-validator wraps <math>raw LaTeX</math> blocks in <code class="math-fallback"> — at which point you can no longer tell whether the converter emitted the bad math or whether some upstream step corrupted it.
  • wrapBareLatex wraps any plaintext with two or more LaTeX commands in the same <code class="math-fallback"> shape — the original surrounding context is gone.
  • convertBareLatexToMath (#527) promotes bare LaTeX to a real <math> element — useful, but means you can’t tell from index.html whether Claude emitted the <math> or whether we synthesised it.
  • Image extraction renames files to images/page-N-img-M.png — if Claude referenced images/foo.png in its raw markup, that’s lost.

source.html is captured before any of those passes run, so it shows exactly the shape Claude (or whatever converter ran) produced. Most converter regressions are diagnosed by diffing source.html against index.html.
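One quick check that follows from the image-renaming bullet is to list the image paths the converter originally referenced. The following is a minimal sketch, assuming the renamed files follow the images/page-N-img-M.png pattern named above. The function names and regular expressions are hypothetical heuristics for illustration, not the platform's extraction code.

```typescript
// Matches the post-extraction naming scheme described above.
const RENAMED_IMAGE = /^images\/page-\d+-img-\d+\.png$/;

// Collect the src attribute of every <img> tag (quoted values only).
function imageSources(html: string): string[] {
  const tags = html.match(/<img\b[^>]*>/gi) ?? [];
  const sources: string[] = [];
  for (const tag of tags) {
    const src = /\ssrc\s*=\s*(?:"([^"]*)"|'([^']*)')/i.exec(tag);
    if (src) sources.push(src[1] ?? src[2] ?? "");
  }
  return sources;
}

// Image paths in source.html that do not use the renamed scheme,
// i.e. names the converter chose itself.
function converterChosenImages(sourceHtml: string): string[] {
  return imageSources(sourceHtml).filter((src) => !RENAMED_IMAGE.test(src));
}
```

Running converterChosenImages on the contents of source.html returns the raw names (such as images/foo.png) that no longer appear in index.html.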

How to retrieve source.html

Option 1 — Export route (file owner)

The export router exposes it at:

GET /api/export/source-html/:fileId

Requests must be authenticated as the file’s owner. The response body is the raw HTML, served with the proper MIME type and the standard Cache-Control and CORS headers.
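A minimal sketch of calling the route from a script. Only the route path and the ownership requirement come from this page. The base URL, the bearer-token header, and the helper names are assumptions for illustration, and the HTTP client is passed in so the sketch does not depend on a particular runtime.

```typescript
// Minimal shape of the parts of a fetch-like client this sketch uses.
type HttpResponse = { ok: boolean; status: number; text(): Promise<string> };
type HttpGet = (
  url: string,
  init: { headers: Record<string, string> },
) => Promise<HttpResponse>;

// Build the route URL documented above. `baseUrl` is a hypothetical deployment origin.
function sourceHtmlUrl(baseUrl: string, fileId: string): string {
  const origin = baseUrl.replace(/\/+$/, ""); // drop trailing slashes
  return `${origin}/api/export/source-html/${encodeURIComponent(fileId)}`;
}

// Fetch source.html as text. The Authorization scheme is an assumption;
// substitute whatever session or token mechanism the deployment uses.
async function fetchSourceHtml(
  get: HttpGet,
  baseUrl: string,
  fileId: string,
  token: string,
): Promise<string> {
  const res = await get(sourceHtmlUrl(baseUrl, fileId), {
    headers: { Authorization: `Bearer ${token}` },
  });
  if (!res.ok) {
    throw new Error(`source-html request failed with HTTP ${res.status}`);
  }
  return res.text();
}
```

In Node 18+ or a browser, the global fetch can be passed as the get argument.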

Option 2 — Direct from R2 (admin)

Build the path with the shared helper:

import { R2_PATHS } from '@accessible-pdf/shared';
const key = R2_PATHS.sourceHtml(userId, fileId);
// users/{userId}/output/{fileId}/source.html

Then pull via the AWS S3 client pointed at the R2 endpoint:

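# {userId} and {fileId} are placeholders to fill in by hand.
# The remote command is double-quoted, so $R2_ACCESS_KEY_ID and
# $R2_SECRET_ACCESS_KEY are expanded by the local shell before ssh runs and
# must be set locally. The copy lands in /tmp/source.html on 10.1.1.4.
ssh -i ~/.ssh/nightly-audit larry@10.1.1.4 \
  "AWS_ACCESS_KEY_ID=$R2_ACCESS_KEY_ID \
  AWS_SECRET_ACCESS_KEY=$R2_SECRET_ACCESS_KEY \
  aws s3 cp \
  s3://accessible-pdf-files/users/{userId}/output/{fileId}/source.html \
  /tmp/source.html \
  --endpoint-url https://c6cce84d1636ec85ec946a19edef0103.r2.cloudflarestorage.com"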

R2 credentials live in /home/larry/accessible/.env.node-server on 10.1.1.4.

Option 3 — From a recent conversion (quick check)

List artifacts for any file:

aws s3 ls \
s3://accessible-pdf-files/users/{userId}/output/{fileId}/ \
--endpoint-url https://{accountId}.r2.cloudflarestorage.com

Expected listing: accessible.pdf, cost-analysis.html, index.html, ir.xhtml, source.html, styled.html, plus per-conversion images under assets/.

If source.html is missing for a recent conversion, file an issue — its absence is the bug.
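The listing check can be scripted. The following is a minimal sketch, assuming the listing has already been reduced to an array of object names relative to the output prefix. The constant and function names are hypothetical; the expected names are taken from the listing above.

```typescript
// Artifact names from the expected listing above.
const EXPECTED_ARTIFACTS = [
  "accessible.pdf",
  "cost-analysis.html",
  "index.html",
  "ir.xhtml",
  "source.html",
  "styled.html",
] as const;

// Return the expected artifacts that are absent from a listing.
// `names` holds object names relative to users/{userId}/output/{fileId}/.
function missingArtifacts(names: readonly string[]): string[] {
  const present = new Set(names.map((n) => n.trim()));
  return EXPECTED_ARTIFACTS.filter((name) => !present.has(name));
}
```

A result that includes source.html for a recent conversion is the case worth filing as an issue.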

How to read it

source.html is what Claude / Mathpix / Marker actually emitted, not a debug rendering. It is valid HTML and can be opened directly in a browser. Diff it against index.html to see what each post-processing pass changed:

diff <(curl -s .../source.html) <(curl -s .../index.html) | less
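Before reading a full diff, it can help to compare how often a few telltale markers appear in each file. The sketch below is a rough triage aid and not part of the platform. The marker list and function names are assumptions, and the regular expressions are heuristics rather than an HTML parser.

```typescript
// Markers worth counting when comparing source.html with index.html.
const MARKERS: Record<string, RegExp> = {
  math: /<math[\s>]/gi,
  mathFallback: /class="math-fallback"/gi,
  img: /<img[\s>\/]/gi,
  table: /<table[\s>]/gi,
};

// Count each marker in one HTML string.
function countMarkers(html: string): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const name of Object.keys(MARKERS)) {
    counts[name] = (html.match(MARKERS[name]) ?? []).length;
  }
  return counts;
}

// Report index-minus-source deltas; a positive number means
// post-processing added that marker.
function markerDeltas(sourceHtml: string, indexHtml: string): Record<string, number> {
  const before = countMarkers(sourceHtml);
  const after = countMarkers(indexHtml);
  const deltas: Record<string, number> = {};
  for (const name of Object.keys(MARKERS)) {
    deltas[name] = after[name] - before[name];
  }
  return deltas;
}
```

A positive mathFallback delta with an unchanged math count, for example, suggests that bare LaTeX was wrapped rather than an existing math element.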

Common diagnoses you can read directly from source.html:

| Symptom in index.html | What to check in source.html |
| --- | --- |
| <code class="math-fallback"> wrapping a math element | Did the converter emit <math>raw LaTeX</math> (validator wrapped it) or bare \sum…\binom… text in a <p> (wrapBareLatex wrapped it)? |
| Image at the wrong location or missing | What src attribute does the source <img> use? Does the file actually exist in uploads/{fileId}/images/? |
| Table rendered as a list of <p> paragraphs | Did the converter emit a <table> at all, or did it flatten the layout to flowing text? |
| Math renders wrong only in the PDF | Inspect source.html (Claude’s output) and check the converter’s MathML shape; quirky attributes like tml-med-pad can survive the HTML but trip the Node-side prerender. |
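The math-fallback diagnosis can be partly automated. The sketch below scans source.html for the two shapes that end up wrapped: a <math> element whose text content contains LaTeX commands, and a <p> containing two or more LaTeX commands. The regular expressions and names are assumptions for illustration. They approximate, and do not reproduce, what mathml-validator and wrapBareLatex actually check.

```typescript
type FallbackCause = "latex-inside-math" | "bare-latex-in-paragraph";
type Finding = { cause: FallbackCause; snippet: string };

// A LaTeX command: a backslash followed by letters, e.g. \sum or \binom.
const LATEX_COMMAND = /\\[A-Za-z]+/g;

// Count LaTeX commands in the text content of a fragment (tags stripped).
function latexCommandCount(fragment: string): number {
  const text = fragment.replace(/<[^>]*>/g, "");
  return (text.match(LATEX_COMMAND) ?? []).length;
}

function findFallbackCauses(sourceHtml: string): Finding[] {
  const findings: Finding[] = [];

  // Shape 1: <math>...</math> whose text content contains a LaTeX command.
  const mathBlocks = sourceHtml.match(/<math\b[^>]*>[\s\S]*?<\/math>/gi) ?? [];
  for (const block of mathBlocks) {
    if (latexCommandCount(block) >= 1) {
      findings.push({ cause: "latex-inside-math", snippet: block.slice(0, 80) });
    }
  }

  // Shape 2: <p>...</p> with two or more LaTeX commands outside any <math>.
  const paragraphs = sourceHtml.match(/<p\b[^>]*>[\s\S]*?<\/p>/gi) ?? [];
  for (const p of paragraphs) {
    const withoutMath = p.replace(/<math\b[\s\S]*?<\/math>/gi, "");
    if (latexCommandCount(withoutMath) >= 2) {
      findings.push({ cause: "bare-latex-in-paragraph", snippet: p.slice(0, 80) });
    }
  }
  return findings;
}
```

An empty result for a file whose index.html does contain math-fallback wrappers is itself informative: the wrapped content then did not come from either of these two shapes.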

When the converter pipeline isn’t chunked-vision

source.html is written by every conversion path that goes through storeIrAndHtml — including chunk-assembler (chunked-vision) and the struct-table route (#525, fixed in #529). Mathpix and pure-Marker conversions that take a different storage path may not produce a source.html. If you’re investigating a non-vision conversion and there’s no source.html, that’s a coverage gap to file as an issue.

Don’t expose source.html to end users

source.html is a developer/admin artifact. It contains the unsanitised converter output (including any prompt-injected content if the source PDF was hostile). The export route requires authentication and ownership, but the artifact itself isn’t styled for users — keep it scoped to debugging tools.
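If a debugging tool needs to show source.html inside another page, displaying it as escaped text avoids rendering or executing whatever the converter emitted. The following is a minimal sketch; the function name is hypothetical and nothing here describes how the platform's own tools display the artifact.

```typescript
// Escape the five characters that are significant in HTML text and
// attribute values, so raw converter output is shown as source.
function escapeHtml(raw: string): string {
  return raw
    .replace(/&/g, "&amp;") // must run first so later entities are not re-escaped
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}
```

The escaped string can then be placed inside a <pre> element in an admin-only view.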

Related issues

  • #525 — original tracking issue (“persist pre-validator HTML for post-mortem debugging”)
  • #529 — closed coverage gap in struct-table conversion path
  • #527 — equation rendering resilience (used source.html to diagnose Claude vision variation)
  • #530, #531, #532 — image and table issues diagnosed from the source.html for file 11c5124a-6048-4390-9bd3-e93affa0f7fd