Segment CLI
A segmented PDF-to-HTML pipeline that converts PDFs to accessible HTML using a decompose → specialize → recompose approach.
Overview
Unlike monolithic PDF conversion tools, Segment CLI breaks down each PDF page into distinct content regions and processes each with a specialized handler. This produces higher-quality accessible output, especially for documents containing mixed content like equations, diagrams, tables, and handwritten notes.
Architecture

    ┌───────┐    ┌─────────────┐    ┌──────────────────┐
    │  PDF  │───▶│ Page Images │───▶│ Region Detection │
    │ Input │    │  (300 DPI)  │    │ (Claude Vision)  │
    └───────┘    └─────────────┘    └─────────┬────────┘
                                              │
                          ┌───────────────────┼───────────────────┐
                          ▼                   ▼                   ▼
                  ┌────────────────┐  ┌────────────────┐  ┌────────────────┐
                  │  Text Region   │  │ Equation Region│  │ Diagram Region │
                  │  (Claude OCR)  │  │   (MathPix)    │  │ (Claude + Copy)│
                  └───────┬────────┘  └───────┬────────┘  └───────┬────────┘
                          └───────────────────┼───────────────────┘
                                              ▼
                                    ┌──────────────────┐
                                    │  HTML Assembler  │
                                    │ (Reading Order)  │
                                    └─────────┬────────┘
                                              ▼
                                    ┌──────────────────┐
                                    │ Accessible HTML  │
                                    │ + Images + Report│
                                    └──────────────────┘

Pipeline Stages
1. PDF to Images
Converts each PDF page to a high-resolution PNG (default 300 DPI) using Puppeteer's PDF rendering. This ensures consistent image quality for downstream vision models.
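As a rough illustration of what the DPI setting means for image size, the sketch below converts a page size in PDF points (72 per inch) to pixel dimensions. The function name `pagePixelSize` is hypothetical and this is not taken from the tool's source; it only shows the arithmetic behind the `--dpi` value.

```typescript
// PDF page sizes are expressed in points; one point is 1/72 inch.
const POINTS_PER_INCH = 72;

interface PixelSize {
  width: number;
  height: number;
}

// Convert a page size in points to pixel dimensions at the given DPI.
// (Hypothetical helper for illustration only.)
function pagePixelSize(widthPt: number, heightPt: number, dpi: number = 300): PixelSize {
  const scale = dpi / POINTS_PER_INCH;
  return {
    width: Math.round(widthPt * scale),
    height: Math.round(heightPt * scale),
  };
}

// US Letter is 612 x 792 points.
const letterAt300 = pagePixelSize(612, 792);      // { width: 2550, height: 3300 }
const letterAt150 = pagePixelSize(612, 792, 150); // { width: 1275, height: 1650 }
console.log(letterAt300, letterAt150);
```

Halving the DPI halves each dimension, so the image sent to the vision model has a quarter of the pixels.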
2. Region Detection
Uses Claude Vision (claude-sonnet-4-20250514) to analyze each page image and identify distinct content regions:
| Region Type | Description |
|---|---|
| printed-text | Regular printed text (paragraphs, headings, lists) |
| diagram | Graphics, figures, charts, drawings, photos |
| equation | Mathematical equations or formulas |
| handwritten | Handwritten notes or annotations |
| table | Tabular data |
For each region, the detector outputs:
- Bounding box (normalized 0-1 coordinates)
- Confidence score
- Reading order (logical sequence for accessibility)
- Description (brief content summary)
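The detector output listed above could be modelled roughly as follows. This is a sketch under assumptions: the field names (`kind`, `bbox`, `readingOrder`, and so on) and the `x`/`y`/`width`/`height` box layout are illustrative, not the tool's actual schema. The helper shows how a normalized bounding box maps onto a rendered page image.

```typescript
type RegionKind = "printed-text" | "diagram" | "equation" | "handwritten" | "table";

// Hypothetical shape; field names are illustrative, not the tool's schema.
interface DetectedRegion {
  kind: RegionKind;
  // Normalized coordinates in [0, 1], relative to page width and height.
  bbox: { x: number; y: number; width: number; height: number };
  confidence: number;
  readingOrder: number;
  description: string;
}

interface PixelBox {
  left: number;
  top: number;
  width: number;
  height: number;
}

function clamp01(value: number): number {
  return Math.min(1, Math.max(0, value));
}

// Scale a normalized box to pixel coordinates, clamping it to the page bounds.
function toPixelBox(region: DetectedRegion, pageWidthPx: number, pageHeightPx: number): PixelBox {
  const leftPx = Math.round(clamp01(region.bbox.x) * pageWidthPx);
  const topPx = Math.round(clamp01(region.bbox.y) * pageHeightPx);
  const rightPx = Math.round(clamp01(region.bbox.x + region.bbox.width) * pageWidthPx);
  const bottomPx = Math.round(clamp01(region.bbox.y + region.bbox.height) * pageHeightPx);
  return { left: leftPx, top: topPx, width: rightPx - leftPx, height: bottomPx - topPx };
}

const sampleRegion: DetectedRegion = {
  kind: "equation",
  bbox: { x: 0.1, y: 0.25, width: 0.8, height: 0.1 },
  confidence: 0.93,
  readingOrder: 2,
  description: "Displayed equation",
};

// On a 2550 x 3300 px page: { left: 255, top: 825, width: 2040, height: 330 }
console.log(toPixelBox(sampleRegion, 2550, 3300));
```

Clamping guards against a model returning a box that slightly overshoots the page edge.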
3. Specialized Processing
Each region type is routed to a specialized processor:
Text Processor
- Uses Claude Vision for OCR
- Extracts semantic HTML (headings, paragraphs, lists)
- Preserves document structure
Equation Processor
- Uses MathPix API for equation recognition
- Outputs native MathML for accessibility
- Includes LaTeX source as fallback
Diagram Processor
- Uses Claude Vision to generate alt text
- Copies original image to output
- Creates a `<figure>` with a semantic caption
Table Processor
- Uses Claude Vision for table extraction
- Outputs semantic HTML tables with `<th>`, `<td>`, `<caption>`
- Preserves header structure for screen readers
Handwritten Processor
- Uses MathPix API (handles handwritten math well)
- Falls back to Claude Vision for pure text
- Styled distinctly in output
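The routing described above can be pictured as a lookup from region type to an ordered list of backends. The sketch below is an assumption about structure, not the tool's code: the labels `claude-vision` and `mathpix` and the function `selectBackends` are illustrative. In particular, what the tool does with equation regions when MathPix is not configured is not stated in this README, so the sketch simply returns an empty list in that case.

```typescript
type RegionKind = "printed-text" | "diagram" | "equation" | "handwritten" | "table";

// Backends named in this README; the string labels themselves are illustrative.
type Backend = "claude-vision" | "mathpix";

// Ordered list of backends to try for each region type.
const BACKENDS_BY_KIND: Record<RegionKind, Backend[]> = {
  "printed-text": ["claude-vision"],
  diagram: ["claude-vision"],
  equation: ["mathpix"],
  // MathPix first, then Claude Vision as the fallback for pure text.
  handwritten: ["mathpix", "claude-vision"],
  table: ["claude-vision"],
};

// Return the backends to try, in order, dropping MathPix when it is not configured.
function selectBackends(kind: RegionKind, hasMathpix: boolean): Backend[] {
  return BACKENDS_BY_KIND[kind].filter((backend) => backend !== "mathpix" || hasMathpix);
}

console.log(selectBackends("handwritten", true));  // [ 'mathpix', 'claude-vision' ]
console.log(selectBackends("handwritten", false)); // [ 'claude-vision' ]
```

Keeping the mapping in one table makes it easy to see at a glance which region types depend on MathPix credentials.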
4. HTML Assembly
Combines all processed regions into a single accessible HTML document:
- Sorts by reading order
- Inserts page breaks for multi-page documents
- Applies responsive, accessible CSS
- Supports dark mode
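The ordering and page-break steps above might look something like the following. This is a minimal sketch under assumptions: the `ProcessedRegion` shape, the `assembleBody` name, and the `<hr class="page-break">` marker are all hypothetical stand-ins for whatever the assembler actually emits.

```typescript
// Hypothetical shape of a region after its processor has run.
interface ProcessedRegion {
  page: number;         // 1-based page number
  readingOrder: number; // position within the page
  html: string;         // HTML fragment produced by a region processor
}

// Sort by page, then by reading order, and insert a break marker between pages.
function assembleBody(regions: ProcessedRegion[]): string {
  const sorted = regions
    .slice() // copy so the caller's array is not reordered
    .sort((a, b) => a.page - b.page || a.readingOrder - b.readingOrder);

  const parts: string[] = [];
  let currentPage: number | null = null;
  for (const region of sorted) {
    if (currentPage !== null && region.page !== currentPage) {
      parts.push('<hr class="page-break">');
    }
    currentPage = region.page;
    parts.push(region.html);
  }
  return parts.join("\n");
}

console.log(
  assembleBody([
    { page: 2, readingOrder: 1, html: "<p>Second page</p>" },
    { page: 1, readingOrder: 2, html: "<p>Intro</p>" },
    { page: 1, readingOrder: 1, html: "<h1>Title</h1>" },
  ])
);
```

Because the DOM order follows the sorted sequence, assistive technology reads the content in the intended order without extra attributes.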
Installation

    cd tools/segment-cli
    npm install

Configuration
Set required API keys in .env (project root or tools/segment-cli/):

    ANTHROPIC_API_KEY=sk-ant-...
    MATHPIX_APP_ID=your_app_id
    MATHPIX_APP_KEY=your_app_key

Note: MathPix credentials are optional but strongly recommended for equation/handwritten content.
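One way to express the required/optional split between these variables is sketched below. The `ApiConfig` shape and `readApiConfig` function are hypothetical; only the three variable names come from this README. The function takes a plain object so it can be exercised without a real environment; in a Node process one would typically pass `process.env` after loading the `.env` file.

```typescript
interface ApiConfig {
  anthropicApiKey: string;
  // null when MathPix credentials are absent or incomplete.
  mathpix: { appId: string; appKey: string } | null;
}

// Read the three variables from an environment-like object.
// (Hypothetical helper; not the tool's actual configuration loader.)
function readApiConfig(env: Record<string, string | undefined>): ApiConfig {
  const anthropicApiKey = env["ANTHROPIC_API_KEY"];
  if (!anthropicApiKey) {
    throw new Error("ANTHROPIC_API_KEY is required");
  }
  const appId = env["MATHPIX_APP_ID"];
  const appKey = env["MATHPIX_APP_KEY"];
  // MathPix is optional: treat it as configured only when both values are present.
  const mathpix = appId && appKey ? { appId, appKey } : null;
  return { anthropicApiKey, mathpix };
}

const withoutMathpix = readApiConfig({ ANTHROPIC_API_KEY: "sk-ant-example" });
console.log(withoutMathpix.mathpix); // null
```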
Usage

    # Basic usage
    npm start -- -i input.pdf -o ./output

    # With verbose logging
    npm start -- -i input.pdf -o ./output -v

    # Custom DPI (default 300)
    npm start -- -i input.pdf -o ./output --dpi 150

Output Structure

    output/
    ├── result.html          # Final accessible HTML
    ├── images/              # Extracted diagrams (with alt text)
    │   ├── page1_region2.png
    │   └── page2_region1.png
    └── report.json          # Processing report with timing and stats

Example Report

    {
      "inputFile": "/path/to/input.pdf",
      "outputDir": "/path/to/output",
      "totalDurationMs": 12500,
      "summary": {
        "totalPages": 3,
        "totalRegions": 12,
        "regionsByType": {
          "printed-text": 6,
          "diagram": 2,
          "equation": 3,
          "handwritten": 1,
          "table": 0
        }
      }
    }

Accessibility Features
The output HTML includes:
- Semantic markup (headings, paragraphs, figures, tables)
- Native MathML for equations (screen reader compatible)
- Alt text for all images
- Reading order preserved via DOM structure
- Responsive design for various devices
- Dark mode support via `prefers-color-scheme`
Cost Considerations
| Component | Approximate Cost |
|---|---|
| Claude Vision (region detection) | ~$0.01-0.02 per page |
| Claude Vision (text/diagram processing) | ~$0.01-0.03 per region |
| MathPix API (equations) | ~$0.01 per equation |
A typical document page with 3-5 regions costs approximately $0.03-0.10 to process.
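To see how the per-unit figures in the table combine, the sketch below adds them up for a single page. The function and constant names are hypothetical, and amounts are kept in whole US cents to avoid floating-point rounding. Since the table values are themselves approximate, the result is only a rough bracket and can fall slightly outside the typical range quoted above for pages with many regions.

```typescript
interface CentsRange {
  min: number;
  max: number;
}

// Per-unit cost ranges in US cents, taken from the table above.
const DETECTION_PER_PAGE: CentsRange = { min: 1, max: 2 };
const CLAUDE_PER_REGION: CentsRange = { min: 1, max: 3 };
const MATHPIX_PER_EQUATION_CENTS = 1;

// Estimate the cost of one page given how many regions go to each backend.
function estimatePageCostCents(claudeRegions: number, mathpixEquations: number): CentsRange {
  const mathpixCents = mathpixEquations * MATHPIX_PER_EQUATION_CENTS;
  return {
    min: DETECTION_PER_PAGE.min + claudeRegions * CLAUDE_PER_REGION.min + mathpixCents,
    max: DETECTION_PER_PAGE.max + claudeRegions * CLAUDE_PER_REGION.max + mathpixCents,
  };
}

// A page with two Claude-processed regions and one equation:
// min = 1 + 2 + 1 = 4 cents, max = 2 + 6 + 1 = 9 cents.
console.log(estimatePageCostCents(2, 1)); // { min: 4, max: 9 }
```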
Limitations
- No support for scanned/rotated pages (assumes clean, upright PDFs)
- Handwritten text quality depends on legibility
- Very complex layouts may confuse region detection
- No OCR confidence scores exposed (future enhancement)
Benchmarks
The benchmark-output/ directory contains sample outputs from test PDFs:

    ls benchmark-output/
    # sample1/ sample2/ sample3/ sample4/ sample5/

Each benchmark folder contains result.html, images/, and report.json.
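When comparing benchmark runs, a quick consistency check on the `summary` block of report.json can be handy. The sketch below assumes, as the example report suggests but does not state, that `totalRegions` should equal the sum of the `regionsByType` counts; `regionCountMismatch` is a hypothetical helper, and the data is copied from the example report earlier in this README.

```typescript
interface ReportSummary {
  totalPages: number;
  totalRegions: number;
  regionsByType: Record<string, number>;
}

// Difference between the reported total and the sum of per-type counts.
// Zero means the summary is internally consistent.
function regionCountMismatch(summary: ReportSummary): number {
  let counted = 0;
  for (const kind of Object.keys(summary.regionsByType)) {
    counted += summary.regionsByType[kind];
  }
  return summary.totalRegions - counted;
}

// Summary block copied from the example report.
const exampleSummary: ReportSummary = {
  totalPages: 3,
  totalRegions: 12,
  regionsByType: {
    "printed-text": 6,
    diagram: 2,
    equation: 3,
    handwritten: 1,
    table: 0,
  },
};

console.log(regionCountMismatch(exampleSummary)); // 0
```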