Segment CLI

A segmented PDF-to-HTML pipeline that converts PDFs to accessible HTML using a decompose → specialize → recompose approach.

Overview

Unlike monolithic PDF conversion tools, Segment CLI breaks down each PDF page into distinct content regions and processes each with a specialized handler. This produces higher-quality accessible output, especially for documents containing mixed content like equations, diagrams, tables, and handwritten notes.

Architecture

┌─────────────┐     ┌─────────────────┐     ┌────────────────────┐
│     PDF     │────▶│   Page Images   │────▶│  Region Detection  │
│    Input    │     │    (300 DPI)    │     │  (Claude Vision)   │
└─────────────┘     └─────────────────┘     └────────────────────┘
                                                      │
                                ┌─────────────────────┼─────────────────────┐
                                │                     │                     │
                                ▼                     ▼                     ▼
                        ┌────────────────┐    ┌────────────────┐    ┌────────────────┐
                        │  Text Region   │    │ Equation Region│    │ Diagram Region │
                        │  (Claude OCR)  │    │   (MathPix)    │    │ (Claude + Copy)│
                        └────────────────┘    └────────────────┘    └────────────────┘
                                │                     │                     │
                                └─────────────────────┼─────────────────────┘
                                                      │
                                                      ▼
                                            ┌────────────────────┐
                                            │   HTML Assembler   │
                                            │  (Reading Order)   │
                                            └────────────────────┘
                                                      │
                                                      ▼
                                            ┌────────────────────┐
                                            │  Accessible HTML   │
                                            │ + Images + Report  │
                                            └────────────────────┘

Pipeline Stages

1. PDF to Images

Converts each PDF page to a high-resolution PNG (default 300 DPI) using Puppeteer's PDF rendering. This ensures consistent image quality for downstream vision models.
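
To make the DPI setting concrete, here is a small sketch of the arithmetic that relates a page's size to the pixel size of its rendered image. It relies on the fact that PDF page sizes are given in points (72 per inch); the helper name `pagePixelSize` is hypothetical and not part of Segment CLI.

```typescript
// PDF page dimensions are expressed in points; there are 72 points per inch.
const POINTS_PER_INCH = 72;

interface PixelSize {
  width: number;
  height: number;
}

// Hypothetical helper: pixel dimensions of a page rendered at a given DPI.
function pagePixelSize(widthPt: number, heightPt: number, dpi: number = 300): PixelSize {
  if (!(dpi > 0)) {
    throw new RangeError("dpi must be a positive number");
  }
  return {
    width: Math.round((widthPt * dpi) / POINTS_PER_INCH),
    height: Math.round((heightPt * dpi) / POINTS_PER_INCH),
  };
}
```

A US Letter page (612 x 792 points) comes out at 2550 x 3300 pixels at 300 DPI and 1275 x 1650 pixels at 150 DPI, so halving the DPI cuts the pixel count to a quarter.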

2. Region Detection

Uses Claude Vision (claude-sonnet-4-20250514) to analyze each page image and identify distinct content regions:

Region Type    Description
printed-text   Regular printed text (paragraphs, headings, lists)
diagram        Graphics, figures, charts, drawings, photos
equation       Mathematical equations or formulas
handwritten    Handwritten notes or annotations
table          Tabular data

For each region, the detector outputs:

  • Bounding box (normalized 0-1 coordinates)
  • Confidence score
  • Reading order (logical sequence for accessibility)
  • Description (brief content summary)
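
The detector output listed above could be modeled roughly as follows. The field names (`bbox`, `readingOrder`, and so on) and the `x`/`y`/`width`/`height` box layout are assumptions for illustration; the tool's actual schema may differ. The helper shows how a normalized box maps back to pixel coordinates for cropping.

```typescript
type RegionType = "printed-text" | "diagram" | "equation" | "handwritten" | "table";

// Assumed layout: top-left corner plus size, all normalized to the 0-1 range.
interface BoundingBox {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Assumed shape of one detected region (field names are illustrative).
interface DetectedRegion {
  type: RegionType;
  bbox: BoundingBox;
  confidence: number;
  readingOrder: number;
  description: string;
}

interface PixelRect {
  left: number;
  top: number;
  width: number;
  height: number;
}

// Convert a normalized box to a pixel rectangle, clamping to the page bounds.
function toPixelRect(bbox: BoundingBox, pageWidthPx: number, pageHeightPx: number): PixelRect {
  const clamp = (v: number): number => Math.min(1, Math.max(0, v));
  const left = Math.round(clamp(bbox.x) * pageWidthPx);
  const top = Math.round(clamp(bbox.y) * pageHeightPx);
  const right = Math.round(clamp(bbox.x + bbox.width) * pageWidthPx);
  const bottom = Math.round(clamp(bbox.y + bbox.height) * pageHeightPx);
  return { left, top, width: right - left, height: bottom - top };
}
```

Clamping guards against boxes that extend slightly past the page edge, which can plausibly happen with model-generated coordinates.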

3. Specialized Processing

Each region type is routed to a specialized processor:

Text Processor

  • Uses Claude Vision for OCR
  • Extracts semantic HTML (headings, paragraphs, lists)
  • Preserves document structure

Equation Processor

  • Uses MathPix API for equation recognition
  • Outputs native MathML for accessibility
  • Includes LaTeX source as fallback
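
One common way to carry a LaTeX source alongside MathML is the standard `<semantics>`/`<annotation>` pairing. The sketch below assumes that approach; the source does not say how Segment CLI embeds the fallback, and `equationMarkup` is a hypothetical helper.

```typescript
// Escape text so it can be placed safely inside HTML content or attributes.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// mathmlBody is assumed to be well-formed presentation MathML (e.g. from MathPix).
// The LaTeX string is attached as an annotation for tools that prefer it.
function equationMarkup(mathmlBody: string, latex: string): string {
  return [
    '<math xmlns="http://www.w3.org/1998/Math/MathML" display="block">',
    "  <semantics>",
    `    <mrow>${mathmlBody}</mrow>`,
    `    <annotation encoding="application/x-tex">${escapeHtml(latex)}</annotation>`,
    "  </semantics>",
    "</math>",
  ].join("\n");
}
```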

Diagram Processor

  • Uses Claude Vision to generate alt text
  • Copies original image to output
  • Creates <figure> with semantic caption
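
As an illustration of the markup this step produces, here is a minimal sketch that wraps a copied image in a `<figure>` with alt text and a caption. The function name and parameters are hypothetical; the escaping matters because model-generated alt text may contain quotes or angle brackets.

```typescript
// Escape text so it can be placed safely inside HTML content or attributes.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Hypothetical helper: build a <figure> for one diagram region.
function figureMarkup(src: string, altText: string, caption: string): string {
  return [
    "<figure>",
    `  <img src="${escapeHtml(src)}" alt="${escapeHtml(altText)}">`,
    `  <figcaption>${escapeHtml(caption)}</figcaption>`,
    "</figure>",
  ].join("\n");
}
```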

Table Processor

  • Uses Claude Vision for table extraction
  • Outputs semantic HTML tables with <th>, <td>, <caption>
  • Preserves header structure for screen readers
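
A sketch of the kind of table markup described above, assuming the extraction step yields a caption, a header row, and body rows as plain strings (the `TableData` shape is an assumption). Marking header cells with `scope="col"` is one standard way to keep the header relationship available to screen readers.

```typescript
// Assumed intermediate shape for an extracted table.
interface TableData {
  caption: string;
  headers: string[];
  rows: string[][];
}

// Escape text so it can be placed safely inside HTML content.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Hypothetical helper: render the table with a caption and column headers.
function tableMarkup(table: TableData): string {
  const head = table.headers
    .map((h) => `<th scope="col">${escapeHtml(h)}</th>`)
    .join("");
  const body = table.rows
    .map((row) => `    <tr>${row.map((c) => `<td>${escapeHtml(c)}</td>`).join("")}</tr>`)
    .join("\n");
  return [
    "<table>",
    `  <caption>${escapeHtml(table.caption)}</caption>`,
    `  <thead><tr>${head}</tr></thead>`,
    "  <tbody>",
    body,
    "  </tbody>",
    "</table>",
  ].join("\n");
}
```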

Handwritten Processor

  • Uses MathPix API (handles handwritten math well)
  • Falls back to Claude Vision for pure text
  • Styled distinctly in output
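
The routing described in this stage can be summarized as a lookup from region type to an ordered list of backends, preferred first. This is a sketch only: the `Backend` names and `backendsFor` are hypothetical, and the behavior when MathPix is not configured (falling back to Claude Vision for equations) is an assumption based on the credentials being optional.

```typescript
type RegionType = "printed-text" | "diagram" | "equation" | "handwritten" | "table";
type Backend = "claude-vision" | "mathpix";

// Preferred backend first; later entries are fallbacks.
const BACKENDS: Record<RegionType, Backend[]> = {
  "printed-text": ["claude-vision"],
  equation: ["mathpix"],
  diagram: ["claude-vision"],
  table: ["claude-vision"],
  handwritten: ["mathpix", "claude-vision"],
};

// Return the backends to try, in order, for a region of the given type.
function backendsFor(type: RegionType, mathpixConfigured: boolean): Backend[] {
  const candidates = BACKENDS[type].filter(
    (backend) => backend !== "mathpix" || mathpixConfigured
  );
  // Assumption: with no MathPix credentials, Claude Vision handles the region.
  return candidates.length > 0 ? candidates : ["claude-vision"];
}
```

Keeping the table as data makes it easy to see at a glance which region types depend on MathPix.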

4. HTML Assembly

Combines all processed regions into a single accessible HTML document:

  • Sorts by reading order
  • Inserts page breaks for multi-page documents
  • Applies responsive, accessible CSS
  • Supports dark mode
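
The ordering and page-break steps above can be sketched as follows. The `ProcessedRegion` shape and the `page-break` class name are assumptions for illustration; the real assembler also wraps the result in a full document with CSS, which is omitted here.

```typescript
// Assumed shape of a region after specialized processing.
interface ProcessedRegion {
  page: number;
  readingOrder: number;
  html: string;
}

// Sort by page, then reading order, and insert a separator between pages.
function assembleBody(regions: ProcessedRegion[]): string {
  const sorted = regions
    .slice()
    .sort((a, b) => a.page - b.page || a.readingOrder - b.readingOrder);
  const parts: string[] = [];
  let previousPage: number | undefined;
  for (const region of sorted) {
    if (previousPage !== undefined && region.page !== previousPage) {
      parts.push('<hr class="page-break">');
    }
    parts.push(region.html);
    previousPage = region.page;
  }
  return parts.join("\n");
}
```

Because reading order is expressed purely through DOM order, assistive technology reads the content in the intended sequence without extra attributes.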

Installation

cd tools/segment-cli
npm install

Configuration

Set required API keys in .env (project root or tools/segment-cli/):

ANTHROPIC_API_KEY=sk-ant-...
MATHPIX_APP_ID=your_app_id
MATHPIX_APP_KEY=your_app_key

Note: MathPix credentials are optional but strongly recommended for equation/handwritten content.
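
A minimal sketch of how these variables might be validated at startup, treating the Anthropic key as required and the MathPix pair as optional. `loadConfig` and `SegmentConfig` are hypothetical names; the environment is passed in as a plain object so the function stays easy to test.

```typescript
interface SegmentConfig {
  anthropicApiKey: string;
  // Present only when both MathPix variables are set.
  mathpix?: { appId: string; appKey: string };
}

// Hypothetical helper: validate environment variables and build a config.
function loadConfig(env: Record<string, string | undefined>): SegmentConfig {
  const anthropicApiKey = env["ANTHROPIC_API_KEY"];
  if (!anthropicApiKey) {
    throw new Error("ANTHROPIC_API_KEY is required");
  }
  const appId = env["MATHPIX_APP_ID"];
  const appKey = env["MATHPIX_APP_KEY"];
  if (appId && appKey) {
    return { anthropicApiKey, mathpix: { appId, appKey } };
  }
  return { anthropicApiKey };
}
```

In a Node.js program this would typically be called as `loadConfig(process.env)` after the .env file has been loaded.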

Usage

# Basic usage
npm start -- -i input.pdf -o ./output
# With verbose logging
npm start -- -i input.pdf -o ./output -v
# Custom DPI (default 300)
npm start -- -i input.pdf -o ./output --dpi 150
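
For readers curious how the flags shown above map to options, here is a hand-rolled parsing sketch. It covers only the four flags that appear in the examples (`-i`, `-o`, `-v`, `--dpi`); treating `-i` and `-o` as required is an assumption, and this is not the tool's actual argument parser.

```typescript
interface CliOptions {
  input: string;
  output: string;
  verbose: boolean;
  dpi: number;
}

// Hypothetical parser for the arguments that follow `npm start --`.
function parseArgs(argv: string[]): CliOptions {
  let input: string | undefined;
  let output: string | undefined;
  let verbose = false;
  let dpi = 300; // documented default
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (arg === "-i") {
      input = argv[++i];
    } else if (arg === "-o") {
      output = argv[++i];
    } else if (arg === "-v") {
      verbose = true;
    } else if (arg === "--dpi") {
      dpi = Number(argv[++i]);
    } else {
      throw new Error(`Unknown argument: ${arg}`);
    }
  }
  if (!input || !output) {
    throw new Error("Both -i and -o are required");
  }
  if (!isFinite(dpi) || dpi <= 0) {
    throw new Error("--dpi must be a positive number");
  }
  return { input, output, verbose, dpi };
}
```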

Output Structure

output/
├── result.html          # Final accessible HTML
├── images/              # Extracted diagrams (with alt text)
│   ├── page1_region2.png
│   └── page2_region1.png
└── report.json          # Processing report with timing and stats

Example Report

{
  "inputFile": "/path/to/input.pdf",
  "outputDir": "/path/to/output",
  "totalDurationMs": 12500,
  "summary": {
    "totalPages": 3,
    "totalRegions": 12,
    "regionsByType": {
      "printed-text": 6,
      "diagram": 2,
      "equation": 3,
      "handwritten": 1,
      "table": 0
    }
  }
}
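
The `regionsByType` and `totalRegions` fields in the report above are simple tallies. A sketch of how they could be computed from a list of region types (the function names are hypothetical):

```typescript
type RegionType = "printed-text" | "diagram" | "equation" | "handwritten" | "table";

// Count regions per type, starting every type at zero so absent types still appear.
function countRegionsByType(types: RegionType[]): Record<RegionType, number> {
  const counts: Record<RegionType, number> = {
    "printed-text": 0,
    diagram: 0,
    equation: 0,
    handwritten: 0,
    table: 0,
  };
  for (const type of types) {
    counts[type] += 1;
  }
  return counts;
}

// Total across all types; should equal the report's totalRegions.
function totalRegions(counts: Record<RegionType, number>): number {
  return (
    counts["printed-text"] +
    counts.diagram +
    counts.equation +
    counts.handwritten +
    counts.table
  );
}
```

Initializing every type to zero is why the example report lists `"table": 0` even though no tables were found.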

Accessibility Features

The output HTML includes:

  • Semantic markup (headings, paragraphs, figures, tables)
  • Native MathML for equations (screen reader compatible)
  • Alt text for all images
  • Reading order preserved via DOM structure
  • Responsive design for various devices
  • Dark mode support via prefers-color-scheme

Cost Considerations

Component                                  Cost (approx.)
Claude Vision (region detection)           ~$0.01-0.02 per page
Claude Vision (text/diagram processing)    ~$0.01-0.03 per region
MathPix API (equations)                    ~$0.01 per equation

A typical document page with 3-5 regions costs approximately $0.03-0.10 to process.
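
To see how the per-component figures combine, the sketch below adds one detection call to a per-region charge, working in whole cents to avoid floating-point rounding. It simplifies by pricing every region at the Claude Vision range of 1-3 cents (MathPix equations sit at the low end), so it yields a bounding range, not a prediction. The names are hypothetical.

```typescript
interface CentRange {
  low: number;
  high: number;
}

// Figures from the cost table, expressed in cents.
const DETECTION_CENTS: CentRange = { low: 1, high: 2 }; // per page
const PER_REGION_CENTS: CentRange = { low: 1, high: 3 }; // per region

// Estimated cost range for one page with the given number of regions.
function pageCostCents(regionCount: number): CentRange {
  if (regionCount < 0 || Math.floor(regionCount) !== regionCount) {
    throw new RangeError("regionCount must be a non-negative integer");
  }
  return {
    low: DETECTION_CENTS.low + regionCount * PER_REGION_CENTS.low,
    high: DETECTION_CENTS.high + regionCount * PER_REGION_CENTS.high,
  };
}
```

Under these assumptions a 3-region page falls between 4 and 11 cents and a 5-region page between 6 and 17 cents, so the figure quoted above is best read as a typical value, with region-heavy pages costing more.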

Limitations

  • No support for scanned/rotated pages (assumes clean, upright PDFs)
  • Handwritten text quality depends on legibility
  • Very complex layouts may confuse region detection
  • No OCR confidence scores exposed (future enhancement)

Benchmarks

The benchmark-output/ directory contains sample outputs from test PDFs:

ls benchmark-output/
# sample1/ sample2/ sample3/ sample4/ sample5/

Each benchmark folder contains result.html, images/, and report.json.