
The Equation Rendering Problem

TL;DR

We have two PDF-to-HTML converters. MathPix renders equations beautifully but produces mangled HTML structure. Marker produces excellent HTML structure but outputs equations as raw LaTeX strings that browsers can’t render. Neither produces a complete, accessible result for math-heavy documents. Our hybrid approach (Marker structure + MathPix per-image equation OCR) fails because Marker doesn’t output equations as images — it outputs them as raw LaTeX text in <math> tags, giving the equation renderer nothing to work with.


The Two Converters

Marker (Datalab Surya OCR)

What it does well:

  • Clean semantic HTML: proper <h1>, <h2>, <ol>, <li> nesting
  • No page-number artifacts — questions numbered correctly
  • Preserves all content including equation answer choices
  • Extracts real images (photos, diagrams) as separate files with alt text

What it gets wrong — equations rendered as raw LaTeX:

Marker outputs equations inside <math> tags, but the content is raw LaTeX, not valid MathML. Browsers cannot render this:

<!-- Marker output for "5.021 × 10⁴" -->
<math>
5.021 \times 10^4
</math>

A browser just shows the literal string 5.021 \times 10^4. For the chemistry quiz PDF, Marker has 22 <math> tags containing raw LaTeX — all unrenderable.

Marker’s full question 9 (all answer choices present, equations as LaTeX):

<li>
9) The correct scientific notation for the number 0.00050210 is:
<br/>
A) <math>5.021 \times 10^4</math>
B) <math>5.021 \times 10^{-4}</math>
C) <math>5.0210 \times 10^4</math>
D) <math>5.0210 \times 10^{-4}</math>
E) none of the above
</li>

The structure is perfect. The content is all there. But the equations don’t render.
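To audit a converted document for this problem, one option is to list every <math> element whose body is plain text rather than MathML child elements. The helper below is a heuristic sketch, not part of either converter; the name findRawLatexMath is hypothetical, and it assumes the HTML is available as a string and that any literal "<" inside the LaTeX has been escaped.

```javascript
// Hypothetical helper: list <math> elements whose body holds no child tags,
// i.e. plain text (raw LaTeX) instead of MathML elements like <mn> or <mo>.
function findRawLatexMath(html) {
  const pattern = /<math\b[^>]*>([\s\S]*?)<\/math>/g;
  const raw = [];
  for (const match of html.matchAll(pattern)) {
    const body = match[1].trim();
    // Presentation MathML wraps every token in an element; bare text does not.
    if (body !== '' && !/<[a-zA-Z]/.test(body)) {
      raw.push(body);
    }
  }
  return raw;
}

const sample = String.raw`A) <math>5.021 \times 10^4</math>
B) <math><mn>5.021</mn><mo>×</mo><msup><mn>10</mn><mn>4</mn></msup></math>`;

// Only choice A is reported; choice B already contains MathML elements.
console.log(findRawLatexMath(sample));
```

Running this over the full Marker output should report one entry per unrenderable equation, which gives a simple regression check for any fix.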

MathPix (Full-PDF conversion)

What it does well:

  • Renders equations as MathJax SVGs that display correctly in browsers
  • Includes MathML and LaTeX metadata for accessibility
  • Handles complex notation: fractions, superscripts, scientific notation

What it gets wrong — five separate problems:

1. Mangled HTML structure. Output is flat <p> tags instead of semantic lists. No proper <h1>/<h2> hierarchy for sections. Everything wrapped in <div id="preview"><div id="preview-content">...</div></div> with high-specificity CSS selectors.

2. Page-number double-numbering. MathPix prepends page numbers to question numbers:

<p class="question">3. 9) The correct scientific notation...</p>
<!-- "3." = page 3, "9)" = question 9 — should just be "9)" -->
<p class="question">4.10) The wavelength of blue light...</p>
<!-- "4." = page 4, "10)" = question 10 -->
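If MathPix output were kept, the page prefix could in principle be stripped with a pattern match. This is a heuristic sketch under the assumption that the prefix always has the form "<page>." directly before a "<question>)" label, as in the two samples above; stripPagePrefix is a hypothetical name, and a question label that legitimately looks like "2.5)" would be damaged by it.

```javascript
// Heuristic sketch: drop a leading "<page>." when it is immediately followed
// by a "<question>)" label, as in "3. 9) ..." or "4.10) ...".
function stripPagePrefix(text) {
  return text.replace(/^\s*\d+\.\s*(?=\d+\))/, '');
}

console.log(stripPagePrefix('3. 9) The correct scientific notation...'));
// 9) The correct scientific notation...
console.log(stripPagePrefix('4.10) The wavelength of blue light...'));
// 10) The wavelength of blue light...
console.log(stripPagePrefix('8) If a 20.0 mL test tube...'));
// 8) If a 20.0 mL test tube... (unchanged, no page prefix)
```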

3. Lost equation-content answer choices. When answer choices contain equations, MathPix frequently drops them entirely:

<!-- Question 9: should have 5 answer choices (A-E), only 1 survived -->
<p class="question">3. 9) The correct scientific notation for the number 0.00050210 is:</p>
<ol type="A" class="answer-choices"><li>none of the above</li></ol>
<!-- Choices A-D contained equations like "5.021 × 10⁴" — all gone -->

4. Residual text fragments. When equations are partially stripped, orphaned units and choice labels remain:

<!-- Question 10: each answer was "4.5 × 10⁻⁷ m" etc., but equations dropped -->
<p class="question">4.10) The wavelength of blue light is 0.00000045 m.
Express this wavelength in scientific notation. m B) m C) m D) m E) m</p>
<!-- All that survived from each answer is the trailing "m" unit -->
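Questions damaged this way can be flagged automatically, because choice labels end up inside the question paragraph instead of in a list. The sketch below is a heuristic with hypothetical function names; it assumes choices are labelled A) through E) and treats two or more stray labels as suspicious.

```javascript
// Heuristic sketch: count answer-choice labels ("B)", "C)", ...) that appear
// inside the question text itself instead of inside a list.
function countInlineChoiceLabels(questionText) {
  const labels = questionText.match(/(?:^|\s)[A-E]\)(?=\s|$)/g);
  return labels ? labels.length : 0;
}

// Two or more stray labels is treated as a sign that choices were flattened.
function looksLikeStrippedChoices(questionText) {
  return countInlineChoiceLabels(questionText) >= 2;
}

const q10 = '4.10) The wavelength of blue light is 0.00000045 m.\n' +
  'Express this wavelength in scientific notation. m B) m C) m D) m E) m';
const q8 = '2. 8) If a 20.0 mL test tube measures 15.0 cm, what is the length in meters?';

console.log(looksLikeStrippedChoices(q10)); // true
console.log(looksLikeStrippedChoices(q8));  // false
```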

5. Massive SVG payload. Each equation is a full inline SVG with path data. The chemistry quiz has ~60 mjx-container elements, each containing dozens of SVG child elements (<g>, <path>, <rect>, <use>). This bloats the HTML and distorts our semantic quality metrics.

MathPix question 8 (text-only answers — works fine):

<p class="question">2. 8) If a 20.0 mL test tube measures 15.0 cm, what is the length in meters?</p>
<ol type="A" class="answer-choices">
<li>1500 m</li><li>0.150 m</li><li>1.50 m</li><li>15.0 m</li><li>none of the above</li>
</ol>

This is correct — because none of the answer choices contain equations.


The Hybrid Approach (and Why It Fails)

Our hybrid pipeline was designed to combine the best of both:

  1. Marker converts the PDF → clean HTML + extracted images
  2. Equation Renderer sends each extracted image to MathPix’s processImage() endpoint
  3. MathPix returns MathML for equation images → swap <img> tags for <math> elements
  4. Non-equation images stay as <img> with data URIs

The fundamental problem: Marker doesn’t output equations as <img> tags. It outputs them as raw LaTeX in <math> tags. The equation renderer scans for <img> elements to send to MathPix, but finds almost none (only 1 real photo of graduated cylinders vs. 22 equations). The equation images simply don’t exist in Marker’s output for MathPix to process.
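The mismatch can be reproduced on a fragment of Marker's output. The functions below are a hypothetical stand-in for the renderer's scan step, not its actual code: one collects <img> sources, the other counts <math> tags.

```javascript
// Hypothetical stand-in for the equation renderer's scan step: it collects
// <img> sources and never looks at <math> elements.
function collectImageSources(html) {
  return [...html.matchAll(/<img\b[^>]*\bsrc="([^"]*)"/g)].map(m => m[1]);
}

function countMathTags(html) {
  return [...html.matchAll(/<math\b[^>]*>/g)].length;
}

const markerFragment = String.raw`<li>
9) The correct scientific notation for the number 0.00050210 is:
<br/>
A) <math>5.021 \times 10^4</math>
B) <math>5.021 \times 10^{-4}</math>
</li>`;

console.log(collectImageSources(markerFragment).length); // 0
console.log(countMathTags(markerFragment));              // 2
```

With zero images found, the renderer makes no MathPix calls for this question, even though it contains equations.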

The hybrid output ends up being essentially the same as Marker-only, except the <math> tags with raw LaTeX get stripped during downstream HTML processing, leaving empty gaps where equations should be — the worst of both worlds.


Side-by-Side: Question 9

|                   | Marker                                               | MathPix              | Hybrid (current)   |
|-------------------|------------------------------------------------------|----------------------|--------------------|
| Structure         | <li> in <ol>                                         | <p class="question"> | <p> (lost <li>)    |
| Numbering         | 9)                                                   | 3. 9)                | 3. 9)              |
| Choice A          | <math>5.021 \times 10^4</math> (raw LaTeX)           | missing              | missing            |
| Choice B          | <math>5.021 \times 10^{-4}</math> (raw LaTeX)        | missing              | missing            |
| Choice C          | <math>5.0210 \times 10^4</math> (raw LaTeX)          | missing              | missing            |
| Choice D          | <math>5.0210 \times 10^{-4}</math> (raw LaTeX)       | missing              | missing            |
| Choice E          | none of the above                                    | none of the above    | none of the above  |
| Equations render? | No (raw LaTeX)                                       | N/A (content lost)   | N/A (content lost) |

Possible Paths Forward

1. Post-process Marker’s raw LaTeX <math> tags

Marker already has the equation content — it’s just in the wrong format. Instead of sending images to MathPix, we could:

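  • Client-side rendering: Add MathJax or KaTeX to the output HTML. Both libraries can render LaTeX strings. Note that MathJax treats the contents of <math> tags as MathML and only looks for TeX inside delimiters such as \( \), so Marker’s <math> wrappers would first need to be replaced with those delimiters; KaTeX needs an explicit render call per element. MathJax can then render the expressions as accessible SVG+MathML. This is essentially what MathPix does server-side, but we’d do it client-side from Marker’s LaTeX source.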
  • Server-side LaTeX → MathML conversion: Use a library like temml, mathjax-node, or MathPix’s processText() API to convert each LaTeX string to MathML before serving the HTML. This produces static, accessible <math> elements with proper MathML that browsers render natively.
  • Hybrid text approach: Send Marker’s extracted LaTeX strings to MathPix’s text OCR endpoint (not image OCR) to get back MathML.

Pros: Uses Marker’s excellent structure + complete content. No content loss. No double-numbering.
Cons: Requires a LaTeX→MathML conversion step. Client-side rendering adds JS payload. Server-side conversion needs a math rendering library compatible with Cloudflare Workers.
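For the client-side variant, a TeX-aware renderer such as MathJax looks for LaTeX between delimiters like \( and \), not inside <math> elements. A minimal sketch of that rewrite follows; mathTagsToTexDelimiters is a hypothetical name, and the sketch assumes every equation is inline and that the LaTeX body is already safe to appear as HTML text.

```javascript
// Sketch: rewrite <math>LATEX</math> as \(LATEX\) so a TeX-aware client-side
// renderer can find it. The body is passed through unchanged (only trimmed).
function mathTagsToTexDelimiters(html) {
  return html.replace(
    /<math\b[^>]*>([\s\S]*?)<\/math>/g,
    (_, latex) => `\\(${latex.trim()}\\)`
  );
}

const choiceB = String.raw`B) <math>5.021 \times 10^{-4}</math>`;
console.log(mathTagsToTexDelimiters(choiceB));
// B) \(5.021 \times 10^{-4}\)
```

Display equations would need a second delimiter pair such as \[ and \]; whether Marker distinguishes them in its output is not covered here.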

2. Use MathPix for equations only, merge into Marker’s DOM

Instead of replacing <img> tags, we could:

  • Parse Marker’s HTML to find <math> tags containing raw LaTeX
  • Send each LaTeX string to MathPix’s API to get rendered MathML back
  • Replace the raw-LaTeX <math> tags with proper MathML <math> tags

This is similar to option 1 but uses MathPix as the rendering engine for individual expressions rather than a general-purpose library.

Pros: MathPix’s rendering quality. Marker’s structure. Low per-expression cost ($0.002/equation).
Cons: API cost scales with equation count. Network latency for each equation.
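A rough cost estimate follows directly from the per-equation price. The sketch below also deduplicates LaTeX strings first, on the assumption (not confirmed by this document) that identical expressions could share one API result; the function names are hypothetical.

```javascript
// Assumption: each distinct LaTeX string needs only one API call, and the
// price is the $0.002 per equation quoted above.
const PRICE_PER_EQUATION_USD = 0.002;

function uniqueLatexStrings(html) {
  const seen = new Set();
  for (const match of html.matchAll(/<math\b[^>]*>([\s\S]*?)<\/math>/g)) {
    seen.add(match[1].trim());
  }
  return [...seen];
}

function estimateCostUsd(equationCount) {
  return equationCount * PRICE_PER_EQUATION_USD;
}

// The chemistry quiz has 22 <math> tags in Marker's output.
console.log(estimateCostUsd(22).toFixed(3)); // 0.044
```

At that price the quiz costs under five cents per conversion, so latency from one request per equation is likely the larger concern.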

3. Inject KaTeX/MathJax as a <script> in the output HTML

The simplest approach: include a KaTeX or MathJax <script> tag in the output HTML and let it render Marker’s LaTeX on the client.

For KaTeX (smaller, faster):

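<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/katex/dist/katex.min.css">
<script src="https://cdn.jsdelivr.net/npm/katex/dist/katex.min.js"></script>
<script>
// Run after the document is parsed so every <math> placeholder exists.
document.addEventListener('DOMContentLoaded', () => {
  document.querySelectorAll('math').forEach(el => {
    // Render into a fresh <span> and swap it in; rendering into the
    // placeholder itself would nest KaTeX's <math> inside Marker's <math>.
    const span = document.createElement('span');
    katex.render(el.textContent.trim(), span, { throwOnError: false, output: 'mathml' });
    el.replaceWith(span);
  });
});
</script>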

Pros: Zero server-side cost. Marker already has the LaTeX. Simple to implement.
Cons: Requires JavaScript (not accessible in all contexts). External CDN dependency. Output isn’t a self-contained static HTML file.

4. Server-side LaTeX → MathML with temml

Temml is a lightweight LaTeX→MathML converter (~200KB) that outputs pure MathML without JavaScript dependencies. The output is static HTML that browsers render natively.

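import temml from 'temml';

// Replace each raw-LaTeX <math> placeholder with the MathML that temml generates.
// The pattern also accepts attributes on the opening tag, e.g. <math display="block">.
html = html.replace(/<math\b[^>]*>([\s\S]*?)<\/math>/g, (_, latex) => {
  return temml.renderToString(latex.trim(), { displayMode: false });
});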

Pros: Static output, no client JS needed. Small library. Produces accessible MathML. Works in Cloudflare Workers (pure JS).
Cons: May not support all LaTeX commands. Less battle-tested than MathJax/KaTeX.
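Because a converter may not support every LaTeX command, the replacement step can be wrapped so that a failed conversion keeps the LaTeX visible instead of dropping the equation. The sketch below takes the converter as a parameter so it does not depend on any particular library; convertMathTags and the latex-fallback class are hypothetical names, and it assumes the converter is configured to throw on input it cannot handle.

```javascript
// Sketch: convert every <math> placeholder with a caller-supplied function
// (for example a thin wrapper around a LaTeX-to-MathML library). When the
// converter throws, keep the LaTeX visible in <code> instead of losing it.
function convertMathTags(html, latexToMathml) {
  const failures = [];
  const output = html.replace(/<math\b[^>]*>([\s\S]*?)<\/math>/g, (_, body) => {
    const latex = body.trim();
    try {
      return latexToMathml(latex);
    } catch (err) {
      // Record the failure so it can be logged or reviewed later.
      failures.push(latex);
      return `<code class="latex-fallback">${latex}</code>`;
    }
  });
  return { output, failures };
}
```

The failures array gives a per-document list of expressions to review, which helps measure how often the chosen library falls short on real PDFs.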

5. Full merge: Marker structure + MathPix equation HTML

Run both converters on the full PDF, then merge:

  • Use Marker’s HTML as the structural skeleton (headings, lists, tables)
  • For each <math> tag in Marker’s output, find the corresponding equation in MathPix’s output and transplant the rendered MathJax SVG

This is complex because it requires aligning content between two different HTML trees, but it would combine Marker’s structure with MathPix’s rendering without any per-equation API calls or new libraries.

Pros: Best rendering quality (MathPix SVGs). Best structure (Marker HTML). No new dependencies.
Cons: Complex DOM alignment. Double the API cost (both converters on full PDF). Fragile if the two outputs diverge.
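One possible alignment key is the LaTeX itself, since MathPix includes LaTeX metadata with each rendered equation. The sketch below pairs the two outputs by whitespace-normalized LaTeX. The { latex, html } record shape and both function names are assumptions for illustration, and exact string matching will miss equations that the two tools transcribe differently.

```javascript
// Sketch: pair Marker's LaTeX strings with MathPix equations by comparing
// whitespace-normalized LaTeX. The { latex, html } record shape is assumed.
function normalizeLatex(latex) {
  return latex.replace(/\s+/g, '');
}

function alignEquations(markerLatex, mathpixEquations) {
  const byLatex = new Map();
  for (const eq of mathpixEquations) {
    const key = normalizeLatex(eq.latex);
    if (!byLatex.has(key)) byLatex.set(key, eq.html);
  }
  const matched = [];
  const unmatched = [];
  for (const latex of markerLatex) {
    const html = byLatex.get(normalizeLatex(latex));
    if (html === undefined) unmatched.push(latex);
    else matched.push({ latex, html });
  }
  return { matched, unmatched };
}
```

Given that MathPix drops equation answer choices in cases like question 9, the unmatched list would likely be non-empty for this quiz, so a merge would still need a fallback for those equations.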


Recommendation

Option 1 or 4 (server-side LaTeX→MathML from Marker’s output) is the most promising path. Marker already captures the equation content as LaTeX — the data is there, it just needs to be converted to a browser-renderable format. A server-side conversion step (temml, mathjax-node, or MathPix text API) would:

  • Preserve Marker’s excellent HTML structure
  • Preserve all equation content (no losses)
  • Produce static, accessible MathML
  • Eliminate the double-numbering problem entirely
  • Eliminate the need for client-side JavaScript
  • Keep the output as a self-contained HTML file