Word Document (.docx) Support
Overview
The converter accepts .docx files (Word 2007+, Google Docs export, LibreOffice export) and converts them to accessible HTML using mammoth.js. The converted HTML then flows through the same accessibility pipeline as PDF conversions: WCAG validation, UX optimization, axe-core auto-fixes, and R2 storage.
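Unlike PDF conversion, the DOCX conversion step itself requires no external API keys, because mammoth runs entirely locally. This makes it the cheapest conversion path at $0/document. The only step that needs a key is the optional Claude Vision alt-text generation described under Image Handling.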
How It Works
Conversion Pipeline
    .docx upload
    -> mammoth.js (DOCX -> semantic HTML + image extraction)
    -> structurePages (page header/footer detection)
    -> optimizeDeterministic (CSS injection, table headers, SVG labels, LaTeX->MathML)
    -> enhanceAccessibility (DOCTYPE, lang, title, skip-link, landmarks)
    -> validateAndFix (WCAG 2.1 AA validation + auto-fix loop)
    -> R2 storage

Why mammoth.js
Mammoth converts based on semantic meaning, not visual formatting. It reads the document’s style metadata (Heading 1, Quote, List, etc.) and maps each style to the corresponding HTML element (<h1>, <blockquote>, <ul>, etc.). This produces much cleaner HTML than a PDF-to-HTML conversion of the same document, because a DOCX file retains the original document structure that PDF flattens away.
Image Handling
Embedded images are extracted from the DOCX file and routed through the standard pipeline:
- Mammoth extracts each image and assigns it a filename (image-1.png, image-2.jpeg, etc.)
- Images are returned as ConvertedImage[] alongside the HTML
- If enhanceImages is enabled and an Anthropic API key is set, Claude Vision generates alt text
- Images are stored in R2 and embedded as base64 data URIs in the final HTML
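The naming and embedding steps in this list can be sketched as follows. This is a minimal illustration, not the converter's actual code: the fields of ConvertedImage, and the helpers imageFilename and toDataUri, are hypothetical names chosen for the example, and the extension is assumed to come straight from the MIME subtype.

```typescript
// Hypothetical shape; the real ConvertedImage type in the codebase may differ.
interface ConvertedImage {
  filename: string;    // e.g. "image-1.png"
  contentType: string; // e.g. "image/png"
  data: Uint8Array;    // raw image bytes
  alt?: string;        // filled in later if alt-text generation runs
}

// Derive "image-N.ext" from a 1-based index and a MIME type.
// Assumption: the extension is simply the MIME subtype.
function imageFilename(index: number, contentType: string): string {
  const ext = contentType.split("/")[1] ?? "bin";
  return `image-${index}.${ext}`;
}

// Encode the raw bytes as a base64 data URI suitable for an <img src>.
function toDataUri(image: ConvertedImage): string {
  let binary = "";
  image.data.forEach((byte) => {
    binary += String.fromCharCode(byte);
  });
  return `data:${image.contentType};base64,${btoa(binary)}`;
}
```

Embedding images as data URIs keeps the final HTML self-contained, at the cost of roughly a one-third size increase over the raw bytes from base64 encoding.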
Custom Style Mappings
Mammoth ships with default mappings for headings (1-6), lists (ordered/unordered, 5 levels deep), bold (<strong>), italic (<em>), strikethrough (<s>), and basic paragraph styles.
We add custom mappings in docx-converter.ts to cover common Word styles that have clear semantic HTML equivalents but aren’t in mammoth’s defaults:
| Word Style | HTML Output | Why |
|---|---|---|
| Quote | <blockquote><p> | Semantic quotation markup |
| Intense Quote | <blockquote><p> | Same — Word has two quote styles |
| Block Text | <blockquote><p> | Another common quote variant |
| Caption | <figcaption> | Associates captions with figures/tables |
| Title | <h1> | Document title should be the primary heading |
| Subtitle | <p class="doc-subtitle"> | Visually distinct but not a heading |
| Code / Code Block / HTML Code | <code> or <pre> | Preserves code semantics for screen readers |
| TOC Heading | <h2 class="toc-heading"> | Heading for table of contents section |
| toc 1 / toc 2 / toc 3 | <p class="toc-entry toc-N"> | Structured TOC entries |
| Emphasis | <em> | Character-level emphasis |
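As a rough illustration of how rows in this table translate into mammoth style-map strings, the sketch below reconstructs a handful of them. The exact strings in docx-converter.ts are not shown in this document, so every entry is an assumption derived from the table and from mammoth's style-map syntax; isWellFormedMapping is a hypothetical helper that only checks for the separator.

```typescript
// Illustrative reconstruction only; the real array in docx-converter.ts
// may use different selectors, modifiers, or ordering.
const ACCESSIBILITY_STYLE_MAP: string[] = [
  "p[style-name='Quote'] => blockquote > p:fresh",
  "p[style-name='Intense Quote'] => blockquote > p:fresh",
  "p[style-name='Block Text'] => blockquote > p:fresh",
  "p[style-name='Caption'] => figcaption:fresh",
  "p[style-name='Title'] => h1:fresh",
  "p[style-name='Subtitle'] => p.doc-subtitle:fresh",
  "p[style-name='TOC Heading'] => h2.toc-heading:fresh",
  "r[style-name='Emphasis'] => em",
];

// Naive sanity check: a rule needs a non-empty selector and a non-empty
// target on either side of " => ". It does not validate the full grammar.
function isWellFormedMapping(rule: string): boolean {
  const parts = rule.split(" => ");
  return parts.length === 2 && parts.every((part) => part.trim().length > 0);
}
```

An array like this is the shape mammoth accepts for its styleMap conversion option, so it can be handed to the converter without further processing.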
Adding More Custom Mappings
To add a mapping, edit the ACCESSIBILITY_STYLE_MAP array in workers/api/src/services/docx-converter.ts. The syntax is:

    "selector => html-element:modifier"

Selectors:
- p[style-name='...']: paragraph style (by display name)
- r[style-name='...']: run/character style (inline)
- p.StyleId: paragraph style (by internal ID)
- b / i / u / strike: formatting
- p:ordered-list(N) / p:unordered-list(N): list at nesting level N
HTML targets:
- Any HTML element: h1, p, blockquote, pre, code, em, strong, etc.
- With classes: p.my-class
- Nested: blockquote > p:fresh
- :fresh modifier: always creates a new element instead of reusing one that is already open (prevents sibling paragraphs from merging)
- !: ignore/suppress the content entirely
Example: To map a custom Word style called “Legal Note” to an <aside>:

    "p[style-name='Legal Note'] => aside > p:fresh"

Per-conversion custom mappings can also be passed via the DocxConverterConfig.styleMap option, which takes precedence over the built-in accessibility mappings.
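One way to get that precedence is to place the per-conversion rules ahead of the built-in ones. The sketch below assumes overlapping rules are resolved first-match-wins, which is my understanding of mammoth's behaviour but is not confirmed by this document. The DocxConverterConfig shape shown is a hypothetical slice containing only styleMap, and BUILT_IN_MAPPINGS and resolveStyleMap are illustrative names.

```typescript
// Hypothetical slice of the config; only styleMap is named in the text above.
interface DocxConverterConfig {
  styleMap?: string[];
}

// Stand-in for the built-in accessibility mappings (illustrative entries).
const BUILT_IN_MAPPINGS: string[] = [
  "p[style-name='Quote'] => blockquote > p:fresh",
  "p[style-name='Title'] => h1:fresh",
];

// Per-conversion rules go first so that, under first-match-wins ordering,
// they override any built-in rule that targets the same style.
function resolveStyleMap(
  config: DocxConverterConfig,
  builtIn: string[] = BUILT_IN_MAPPINGS,
): string[] {
  return (config.styleMap ?? []).concat(builtIn);
}
```

For example, passing a styleMap that maps Title to h2 would shadow the built-in Title rule without removing it from the list.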
Usage
Upload & Convert
Upload and conversion work identically to PDF files — the frontend auto-detects the file type from the MIME type. The converter options (parser selection) are ignored for DOCX files since mammoth is the only conversion backend.
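A minimal sketch of MIME-based detection is shown below. The function and type names (detectSourceKind, SourceKind, effectiveParserOptions) are hypothetical, and the real frontend may apply additional checks; only the DOCX MIME type is taken from this document, while application/pdf is the standard PDF type.

```typescript
type SourceKind = "docx" | "pdf" | "unsupported";

const DOCX_MIME =
  "application/vnd.openxmlformats-officedocument.wordprocessingml.document";
const PDF_MIME = "application/pdf";

// Classify an upload by its MIME type, ignoring case and any parameters
// such as "; charset=...".
function detectSourceKind(mimeType: string): SourceKind {
  const base = (mimeType.split(";")[0] ?? "").trim().toLowerCase();
  if (base === DOCX_MIME) return "docx";
  if (base === PDF_MIME) return "pdf";
  return "unsupported";
}

// Parser options only apply to PDFs; for DOCX they are dropped because
// mammoth is the only conversion backend.
function effectiveParserOptions<T>(kind: SourceKind, options: T): T | undefined {
  return kind === "pdf" ? options : undefined;
}
```

Dropping the options explicitly, rather than passing them through and ignoring them downstream, makes the "ignored for DOCX" behaviour visible at the call site.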
API
    POST /api/files/upload
    { fileName: "report.docx", fileType: "application/vnd.openxmlformats-officedocument.wordprocessingml.document", fileSize: 12345 }

    PUT /api/files/:fileId/upload-data
    [binary .docx data]

    POST /api/convert/:fileId
    {}  (parser options are ignored for DOCX)

Limitations
Format Support
- Only .docx (Office Open XML) is supported, not .doc (legacy binary format), .odt, or .rtf.
- Documents must be well-formed OOXML. Corrupted or partially-written files will fail.
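A cheap pre-check can reject some unsupported formats before mammoth runs, by looking at the first few bytes of the upload. The sketch below is an optional illustration and not something this document says the converter does; sniffContainer and ContainerGuess are hypothetical names. A .docx file is a ZIP archive, a legacy .doc is an OLE compound file, and RTF starts with the text {\rtf.

```typescript
type ContainerGuess = "zip" | "ole-doc" | "rtf" | "unknown";

// Guess the container format from leading magic bytes.
function sniffContainer(bytes: Uint8Array): ContainerGuess {
  const startsWith = (signature: number[]): boolean =>
    signature.every((value, index) => bytes[index] === value);

  if (startsWith([0x50, 0x4b, 0x03, 0x04])) return "zip";       // "PK\x03\x04"
  if (startsWith([0xd0, 0xcf, 0x11, 0xe0])) return "ole-doc";   // OLE compound file
  if (startsWith([0x7b, 0x5c, 0x72, 0x74, 0x66])) return "rtf"; // "{\rtf"
  return "unknown";
}
```

A "zip" result does not prove the file is a valid .docx, since .odt files are also ZIP archives and a truncated upload can still begin with the right signature. It only filters out the obvious mismatches early; well-formedness is still decided by the conversion itself.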
Structural Fidelity
Mammoth converts based on semantic styles, not visual layout. This means:
- Direct formatting without styles is partially preserved — bold, italic, and strikethrough applied directly (not via a named style) are converted. Underline is intentionally ignored because underlined text is easily confused with hyperlinks, which is an accessibility problem.
- Visual-only formatting is lost — font sizes, colors, margins, custom spacing, and decorative borders are ignored. This is by design: the output relies on our UX optimizer CSS for consistent, accessible styling.
- Headers and footers are not included — mammoth does not extract running headers/footers from the DOCX document sections.
- Complex page layouts (multi-column, text wrapping around images, absolute positioning) are flattened to linear flow. DOCX documents with complex visual layouts will produce correct content but in a single-column reading order.
- Table formatting (borders, cell colors, column widths) is discarded. The table structure (rows, cells) and text content are preserved. Our UX optimizer CSS applies consistent table styling.
- Embedded objects (charts, SmartArt, ActiveX controls, embedded Excel sheets) are not converted. Only raster images (PNG, JPEG, GIF, TIFF) and EMF/WMF images are extracted.
- Equations — Word’s native equation editor (OMML) is not supported by mammoth. If the document contains MathType or OMML equations, they will appear as images (if embedded) or be missing. For math-heavy documents, PDF conversion via Mathpix or the hybrid pipeline remains the better choice.
- Comments and tracked changes — Comments are ignored by default. Tracked changes (revision marks) are accepted as-is; the “final” version of the text is what gets converted.
- Form fields (checkboxes, dropdowns, text inputs) are not converted to HTML form elements.
Security
Mammoth does not sanitize the source document. Our downstream pipeline (WCAG validator, UX optimizer) handles HTML cleanup, but be aware that a maliciously crafted DOCX could inject HTML through style names or text content. The existing enhanceAccessibility and validateAndFix pipeline mitigates most risks, but this is worth noting for defense-in-depth.
When to Use PDF Conversion Instead
DOCX conversion works best for text-heavy documents with proper style usage (headings, lists, tables). Prefer PDF conversion for:
- Scanned documents (need OCR)
- Math-heavy papers (need MathML via Mathpix)
- Documents where visual layout fidelity matters
- Documents originally authored as PDFs (forms, brochures, slide decks exported to PDF)
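The guidance above can be condensed into a small decision helper. This is only a sketch of the checklist: DocumentTraits, pdfReasons, and preferredPath are hypothetical names, and in practice the traits would have to be supplied by the user or by some upstream heuristic that this document does not describe.

```typescript
interface DocumentTraits {
  scanned: boolean;        // needs OCR
  mathHeavy: boolean;      // needs MathML via Mathpix
  layoutCritical: boolean; // visual layout fidelity matters
  authoredAsPdf: boolean;  // forms, brochures, slide decks exported to PDF
}

// Collect the reasons, if any, to prefer the PDF pipeline.
function pdfReasons(traits: DocumentTraits): string[] {
  const reasons: string[] = [];
  if (traits.scanned) reasons.push("scanned document (needs OCR)");
  if (traits.mathHeavy) reasons.push("math-heavy content (needs MathML)");
  if (traits.layoutCritical) reasons.push("visual layout fidelity matters");
  if (traits.authoredAsPdf) reasons.push("originally authored as a PDF");
  return reasons;
}

// DOCX is the default; any single reason tips the choice to PDF.
function preferredPath(traits: DocumentTraits): "docx" | "pdf" {
  return pdfReasons(traits).length > 0 ? "pdf" : "docx";
}
```

Returning the reasons separately from the decision makes it possible to show the user why PDF conversion was suggested.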