TODO: Fix Inline Math Rendering in Chunk-Assembled Documents

Status

Unresolved as of 2026-03-09. Border duplication fixed. Math centering not fixed.


Problem Statement

Equations in math-heavy PDFs (e.g. calculus textbooks) render as centered block elements even when they should flow inline within sentences. Example:

"disappears at"
x = 0 ← should be: "disappears at x = 0"
": The differential"
6(u − u²)du ← should be: ": The differential 6(u − u²)du was chosen..."

Root Cause (Confirmed: Read the HTML)

The actual converted HTML (/Users/larryanglin/Downloads/onepagemath.html) shows that the model output IS correct. Math elements ARE tagged display="inline" and ARE correctly embedded in <p> tags alongside surrounding text, e.g.:

<p>We need a competitor... his rule is <math display="inline">...</math> Those
"Gauss points" <math display="inline">...</math> and ...</p>

The math is inline in the HTML. It renders as block because of a CSS cascade conflict.

CSS Conflict Chain

  1. enhanceAccessibility() (wcag-validator.ts ~line 216) injects this CSS:

    .math, .MathJax, math, [class*="equation"] {
      margin: 0.5em 0;
      display: block; /* ← forces ALL <math> to block, ignoring display attr */
    }
  2. optimizeDeterministic() (ux-optimizer.ts lines 208–212) tries to inject counter-rules from UX_CSS:

    math[display="inline"], math:not([display]) { display: inline; vertical-align: middle; }
    math[display="block"] { display: block; margin: auto; text-align: center; }

    But this injection looks for </head> or <body> tags:

    if (result.includes('</head>')) {
      result = result.replace('</head>', `<style>${UX_CSS}</style>\n</head>`);
    } else if (result.includes('<body')) {
      result = result.replace(/<body/, `<style>${UX_CSS}</style>\n<body`);
    }
    // ← If neither tag exists, nothing is injected. Silently a no-op.
  3. Chunk-assembled HTML has no <head> or <body> tags when optimizeDeterministic runs. The assembled content is raw fragment HTML:

    <section class="pdf-chunk" data-chunk-index="0">
    ...content...
    </section>

    So optimizeDeterministic never injects UX_CSS for large PDFs.

  4. wrapInDocument() (html.ts) injects DOCUMENT_STYLES, which has layout, typography, and table CSS, but no math display overrides.

  5. Net result: The only CSS that ever applies to <math> elements in chunk-assembled documents is the accessibility CSS math { display: block; }. The HTML attribute display="inline" is overridden by the CSS rule.
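
The silent no-op in step 2 can be reproduced in isolation. The sketch below mirrors the injection branch quoted above; the function name injectUxCss, the shortened UX_CSS value, and the two sample inputs are stand-ins for illustration, not the real code in ux-optimizer.ts.

```typescript
// Hypothetical, shortened stand-in for the UX_CSS constant in ux-optimizer.ts.
const UX_CSS =
  'math[display="inline"], math:not([display]) { display: inline; vertical-align: middle; }';

// Mirrors the quoted injection logic: it needs a </head> or <body tag to anchor on.
function injectUxCss(html: string): string {
  if (html.includes("</head>")) {
    return html.replace("</head>", `<style>${UX_CSS}</style>\n</head>`);
  }
  if (html.includes("<body")) {
    return html.replace(/<body/, `<style>${UX_CSS}</style>\n<body`);
  }
  // Neither tag present: the input comes back unchanged and nothing is logged.
  return html;
}

// Single-pass shape: a full document.
const fullDocument = "<html><head></head><body><p>text</p></body></html>";
// Chunk-assembled shape: a bare fragment, as in step 3.
const chunkFragment =
  '<section class="pdf-chunk" data-chunk-index="0"><p>text</p></section>';

const fullResult = injectUxCss(fullDocument);
const fragmentResult = injectUxCss(chunkFragment);
```

With these inputs, fullResult gains a <style> element while fragmentResult is identical to chunkFragment, which is the behavior described for chunk-assembled documents.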

Why It Only Affects Large/Chunked PDFs

  • Small single-pass PDFs: processConversion calls the vision converter which returns a full HTML document (with <head>/<body>) → optimizeDeterministic successfully injects UX_CSS → math displays correctly.
  • Large chunked PDFs: chunk-assembler gets raw fragment HTML → UX_CSS injection fails silently → math always block.

The Fix (Two CSS Rules)

Add to DOCUMENT_STYLES in workers/api/src/utils/html.ts (after the section rules, before the media queries):

/* Override the accessibility CSS blanket 'math { display: block }' rule.
The attribute selectors below have specificity (0,1,1), higher than the
bare 'math' selector at (0,0,1), so they restore inline math rendering.
Math tagged display="block" is untouched and stays block. */
math[display="inline"] { display: inline; vertical-align: middle; }
math:not([display]) { display: inline; vertical-align: middle; }

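DOCUMENT_STYLES is always injected by wrapInDocument, which runs as the last step regardless of whether the HTML came from the chunked or the single-pass path. math[display="inline"] and math:not([display]) each have specificity (0,1,1), versus (0,0,1) for the bare math selector, so they win without needing !important. The class and attribute selectors in the same accessibility rule (.math, .MathJax, [class*="equation"]) are (0,1,0), which is also lower.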
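
The specificity arithmetic can be checked with a small calculator. This is a minimal sketch that only handles the simple selectors used in this note (type, class, ID, attribute, and :not()); the names specificity and compareSpecificity are hypothetical and nothing like them exists in the codebase as far as this note shows.

```typescript
// Specificity as (ids, classes + attributes + pseudo-classes, types).
type Specificity = [number, number, number];

// Simplified calculator: covers type, class, ID, attribute selectors and :not().
// Other pseudo-classes, pseudo-elements, and '#' or '.' inside attribute values
// are not handled.
function specificity(selector: string): Specificity {
  // :not() contributes the specificity of its argument, so unwrap it.
  const s = selector.replace(/:not\(([^)]*)\)/g, "$1");
  const ids = (s.match(/#[\w-]+/g) ?? []).length;
  const attrs = (s.match(/\[[^\]]*\]/g) ?? []).length;
  // Drop attribute selectors so their contents are not miscounted below.
  const withoutAttrs = s.replace(/\[[^\]]*\]/g, "");
  const classes = (withoutAttrs.match(/\.[\w-]+/g) ?? []).length;
  const types = (withoutAttrs.match(/(^|[\s>+~])[a-zA-Z][\w-]*/g) ?? []).length;
  return [ids, attrs + classes, types];
}

// Positive when a outranks b, negative when b outranks a, 0 on a tie.
function compareSpecificity(a: Specificity, b: Specificity): number {
  return a[0] - b[0] || a[1] - b[1] || a[2] - b[2];
}

const overrideInline = specificity('math[display="inline"]');
const overrideNoAttr = specificity("math:not([display])");
const blanketType = specificity("math");
const blanketClass = specificity(".math");
```

Both override selectors compare greater than the bare math selector and than the class selectors from the accessibility rule, so the overrides apply regardless of stylesheet order.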


Secondary Issue: Math Merger (mergeIsolatedBlockMath)

There is ALSO a structural issue where the model sometimes outputs math elements on their own lines (not inside sentence <p> tags). The mergeIsolatedBlockMath function in ux-optimizer.ts attempts to fix this post-hoc.

Current Merger Status

  • Correctly merges <p>text</p> <math> <p>continuation</p> → <p>text math continuation</p>
  • Correctly handles p-wrapped math (Step 0 unwrap)
  • Correctly handles consecutive bare math runs (added 2026-03-09)
  • Still limited to <p> on both sides: can’t merge if there’s no <p> before/after
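
The first and last bullets can be illustrated with a regex-based sketch. It is not the actual mergeIsolatedBlockMath implementation: the name mergeIsolatedMathSketch and the pattern are assumptions, it only handles a single bare <math> between two attribute-free <p> elements, and it does not cover the Step 0 unwrap or consecutive math runs.

```typescript
// Matches: <p>before</p> <math ...>...</math> <p>after</p>
// The (?:(?!<\/?p\b)[\s\S])* groups keep each capture inside one paragraph,
// and (?:(?!<\/math>)[\s\S])* keeps the middle capture to a single math element.
const ISOLATED_MATH =
  /<p>((?:(?!<\/?p\b)[\s\S])*)<\/p>\s*(<math\b[^>]*>(?:(?!<\/math>)[\s\S])*<\/math>)\s*<p>((?:(?!<\/?p\b)[\s\S])*)<\/p>/;

function mergeIsolatedMathSketch(html: string): { html: string; merged: number } {
  let merged = 0;
  let current = html;
  // Repeat until nothing matches: each merge removes one paragraph, so the
  // loop terminates, and a merged paragraph can take part in the next merge.
  while (ISOLATED_MATH.test(current)) {
    current = current.replace(
      ISOLATED_MATH,
      (_match, before: string, math: string, after: string) => {
        merged += 1;
        return `<p>${before.trim()} ${math} ${after.trim()}</p>`;
      }
    );
  }
  return { html: current, merged };
}
```

Input with no <p> before the math element does not match the pattern and is returned unchanged, which corresponds to the limitation in the last bullet.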

From the actual HTML output, the merger is working for most cases

The math elements in the sample file ARE correctly inline in <p> tags. The centering is purely the CSS bug above, NOT a structural merger failure.

Merger logs from last test run (2026-03-09):

[assembler-diag] block math elements before ux-optimizer: 4
[assembler-diag] total math elements: 46
[mergeBlockMath] pass merged 2, block math remaining: 1
[mergeBlockMath] pass merged 0, block math remaining: 1

Only 2 were merged because only 2 needed structural merging. The other 44 were already correctly inline in <p> tags; they just rendered wrong due to the CSS.


PDF Truncation Issue (Separate Problem)

Large PDFs (e.g. 54-page calculus textbook) consistently stop 10–12 pages from the end. The exact cutoff varies per run. Ruled out so far:

  • NOT a hard timeout (processing time varies widely)
  • NOT a fixed page-count limit (cutoff page varies)

Current safeguards

  • MIN_OUTPUT_TOKENS_PER_PAGE = 150: escalates from Gemini to Claude if output is thin
  • MIN_HTML_CHARS_PER_PAGE = 400: second density check
  • Gemini thinking token subtraction: actualOutputTokens = candidatesTokenCount - thoughtsTokenCount
  • Multi-pass chunk processing with context tail handoff
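
The two density checks and the thinking-token subtraction combine roughly as follows. This is a sketch under assumptions: the function name isThinOutput, the GeminiUsage shape, and the choice to escalate when either check fails are illustrative. Only the two constants and the subtraction formula come from this note.

```typescript
const MIN_OUTPUT_TOKENS_PER_PAGE = 150;
const MIN_HTML_CHARS_PER_PAGE = 400;

// Assumed shape: just the two usage fields named in the subtraction formula.
interface GeminiUsage {
  candidatesTokenCount: number;
  thoughtsTokenCount?: number;
}

// Returns true when the output looks too thin for the number of pages.
// Assumption: failing either check is enough to trigger escalation.
function isThinOutput(usage: GeminiUsage, html: string, pageCount: number): boolean {
  if (pageCount <= 0) return true; // Assumption: treat a bad page count as a failure.
  // Thinking tokens are subtracted so they do not count as real output.
  const actualOutputTokens =
    usage.candidatesTokenCount - (usage.thoughtsTokenCount ?? 0);
  const tokensPerPage = actualOutputTokens / pageCount;
  const charsPerPage = html.length / pageCount;
  return (
    tokensPerPage < MIN_OUTPUT_TOKENS_PER_PAGE ||
    charsPerPage < MIN_HTML_CHARS_PER_PAGE
  );
}
```

Without the subtraction, a response that spent most of its token budget on thinking would look dense enough by raw token count even though little HTML was produced.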

Suspected cause

The escalation from Gemini to Claude is working, but Claude itself may hit its own output token limit for very dense math pages. The agentic loop runs up to maxIterationsPerPage iterations, but if Claude’s context window fills with dense MathML, later chunks get degraded output or stop early.

Investigation needed

  1. Enable more detailed per-chunk logging: record which chunk index fails, plus the actual token count and output length for each failing chunk
  2. Check whether the last processed chunk shows thin output that passes thresholds
  3. Consider adding a third check: compare r2ChunkKey output length vs expected
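
One possible shape for items 1 to 3 is a per-chunk diagnostic record plus a relative density check. Everything in this sketch is hypothetical: the ChunkResult fields, the function names, and the choice of the median chars-per-page across chunks as the "expected" length. It does not use r2ChunkKey or any real storage API.

```typescript
// Hypothetical per-chunk record; field names are assumptions.
interface ChunkResult {
  chunkIndex: number;
  pageCount: number;
  outputTokens: number; // after subtracting thinking tokens
  htmlLength: number;
}

interface ChunkDiagnostic extends ChunkResult {
  tokensPerPage: number;
  charsPerPage: number;
  relativeDensity: number; // charsPerPage divided by the median across chunks
  suspicious: boolean;
}

function median(values: number[]): number {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  const upper = sorted[mid] ?? 0;
  const lower = sorted[mid - 1] ?? upper;
  return sorted.length % 2 === 0 ? (lower + upper) / 2 : upper;
}

// Flags chunks whose chars-per-page falls below minRelativeDensity times the
// median, even if they pass the absolute thresholds.
function diagnoseChunks(
  chunks: ChunkResult[],
  minRelativeDensity = 0.5
): ChunkDiagnostic[] {
  const med = median(chunks.map((c) => c.htmlLength / Math.max(c.pageCount, 1)));
  return chunks.map((c) => {
    const pages = Math.max(c.pageCount, 1);
    const charsPerPage = c.htmlLength / pages;
    const relativeDensity = med > 0 ? charsPerPage / med : 0;
    return {
      ...c,
      tokensPerPage: c.outputTokens / pages,
      charsPerPage,
      relativeDensity,
      suspicious: relativeDensity < minRelativeDensity,
    };
  });
}
```

Logging one such record per chunk would show which chunk index degrades first, and the relative check targets the case in item 2 where a chunk is thin compared with its neighbors but still clears the fixed per-page minimums.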

Files to Modify

File | Change
workers/api/src/utils/html.ts | Add math[display="inline"] and math:not([display]) to DOCUMENT_STYLES
workers/api/src/services/wcag-validator.ts | Optionally fix the accessibility CSS to not set display: block on math[display="inline"] (belt-and-suspenders)

Do NOT modify ux-optimizer.ts UX_CSS for this: it’s never injected for chunk documents, and fixing the injection point would be more invasive.


Test File

/Users/larryanglin/Downloads/onepagemath.html: actual output from 2026-03-09 run. Math elements ARE display="inline" in the HTML; visual rendering is broken by CSS. Convert onepagemath.pdf (1-page calculus excerpt) to reproduce.