Segment CLI

A segmented pipeline that converts PDFs to accessible HTML using a decompose → specialize → recompose approach.

Overview

Unlike monolithic PDF conversion tools, Segment CLI breaks each PDF page into distinct content regions and routes each region to a specialized handler. This is intended to produce more accurate accessible output, especially for documents that mix content types such as equations, diagrams, tables, and handwritten notes.

Architecture

┌─────────────┐     ┌─────────────────┐     ┌────────────────────┐
│   PDF       │────▶│  Page Images    │────▶│  Region Detection  │
│   Input     │     │  (300 DPI)      │     │  (Claude Vision)   │
└─────────────┘     └─────────────────┘     └────────────────────┘
                                                      │
                    ┌─────────────────────────────────┼─────────────────────────────────┐
                    │                                 │                                 │
                    ▼                                 ▼                                 ▼
           ┌────────────────┐               ┌────────────────┐               ┌────────────────┐
           │  Text Region   │               │ Equation Region│               │ Diagram Region │
           │  (Claude OCR)  │               │   (MathPix)    │               │ (Claude + Copy)│
           └────────────────┘               └────────────────┘               └────────────────┘
                    │                                 │                                 │
                    └─────────────────────────────────┼─────────────────────────────────┘
                                                      │
                                                      ▼
                                            ┌────────────────────┐
                                            │   HTML Assembler   │
                                            │ (Reading Order)    │
                                            └────────────────────┘
                                                      │
                                                      ▼
                                            ┌────────────────────┐
                                            │  Accessible HTML   │
                                            │  + Images + Report │
                                            └────────────────────┘

Pipeline Stages

1. PDF to Images

Converts each PDF page to a high-resolution PNG (default 300 DPI) using Puppeteer’s PDF rendering. This ensures consistent image quality for downstream vision models.
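
To make the DPI setting concrete, here is a small sketch of the arithmetic that relates page size to image size. It relies only on the standard definition of a PDF point (1/72 inch); the function and type names are illustrative and not taken from the tool's source.

```typescript
// PDF page sizes are expressed in points, where 1 point = 1/72 inch.
interface PageSizePt {
  widthPt: number;
  heightPt: number;
}

// Pixel dimensions of a page rendered at a given DPI (hypothetical helper).
function renderedPixels(size: PageSizePt, dpi: number = 300): { width: number; height: number } {
  return {
    width: Math.round((size.widthPt * dpi) / 72),
    height: Math.round((size.heightPt * dpi) / 72),
  };
}

// US Letter is 612 x 792 pt (8.5 x 11 in).
const letter: PageSizePt = { widthPt: 612, heightPt: 792 };
const at300 = renderedPixels(letter);      // { width: 2550, height: 3300 }
const at150 = renderedPixels(letter, 150); // { width: 1275, height: 1650 }
```

Halving the DPI quarters the pixel count, which is worth keeping in mind when trading image quality against processing cost.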

2. Region Detection

Uses Claude Vision (claude-sonnet-4-20250514) to analyze each page image and identify distinct content regions:

Region Type	Description
`printed-text`	Regular printed text (paragraphs, headings, lists)
`diagram`	Graphics, figures, charts, drawings, photos
`equation`	Mathematical equations or formulas
`handwritten`	Handwritten notes or annotations
`table`	Tabular data

For each region, the detector outputs:

Bounding box (normalized 0-1 coordinates)
Confidence score
Reading order (logical sequence for accessibility)
Description (brief content summary)
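
The detector output listed above could be modeled roughly as follows. This is a sketch: the field names, the x/y/width/height layout of the bounding box, and the sample values are assumptions, since this README only names the four pieces of information. The helper shows how a normalized box maps back onto a rendered page image.

```typescript
// Field names are hypothetical; only the region types come from the table above.
type RegionType = "printed-text" | "diagram" | "equation" | "handwritten" | "table";

interface BoundingBox {
  x: number;      // left edge, as a 0-1 fraction of page width
  y: number;      // top edge, as a 0-1 fraction of page height
  width: number;  // 0-1 fraction of page width
  height: number; // 0-1 fraction of page height
}

interface DetectedRegion {
  type: RegionType;
  bbox: BoundingBox;
  confidence: number;
  readingOrder: number;
  description: string;
}

interface PixelBox {
  left: number;
  top: number;
  width: number;
  height: number;
}

// Convert a normalized box to pixel coordinates, clamping to the page edges.
function toPixelBox(bbox: BoundingBox, pageWidthPx: number, pageHeightPx: number): PixelBox {
  const clamp01 = (v: number): number => Math.min(1, Math.max(0, v));
  const left = Math.round(clamp01(bbox.x) * pageWidthPx);
  const top = Math.round(clamp01(bbox.y) * pageHeightPx);
  const right = Math.round(clamp01(bbox.x + bbox.width) * pageWidthPx);
  const bottom = Math.round(clamp01(bbox.y + bbox.height) * pageHeightPx);
  return { left, top, width: right - left, height: bottom - top };
}

// Made-up sample region on a 2550 x 3300 px page (US Letter at 300 DPI).
const region: DetectedRegion = {
  type: "equation",
  bbox: { x: 0.1, y: 0.2, width: 0.5, height: 0.25 },
  confidence: 0.93,
  readingOrder: 2,
  description: "Displayed equation",
};
const crop = toPixelBox(region.bbox, 2550, 3300);
// crop is { left: 255, top: 660, width: 1275, height: 825 }
```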

3. Specialized Processing

Each region type is routed to a specialized processor:

Text Processor

Uses Claude Vision for OCR
Extracts semantic HTML (headings, paragraphs, lists)
Preserves document structure

Equation Processor

Uses MathPix API for equation recognition
Outputs native MathML for accessibility
Includes LaTeX source as fallback
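
One common way to ship MathML together with its LaTeX source is a `<semantics>` wrapper carrying an `<annotation encoding="application/x-tex">` child. Whether Segment CLI uses exactly this layout is an assumption; the sketch below only illustrates the pattern, and `wrapEquation` is a hypothetical helper.

```typescript
// Escape characters that are special in XML text content.
function escapeXml(text: string): string {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Wrap presentation MathML and attach the LaTeX source as an annotation.
// mathmlBody is assumed to be already-valid MathML markup (e.g. from the recognizer).
function wrapEquation(mathmlBody: string, latex: string): string {
  return (
    '<math xmlns="http://www.w3.org/1998/Math/MathML" display="block">' +
    "<semantics>" +
    `<mrow>${mathmlBody}</mrow>` +
    `<annotation encoding="application/x-tex">${escapeXml(latex)}</annotation>` +
    "</semantics>" +
    "</math>"
  );
}

const equationHtml = wrapEquation("<mi>x</mi><mo>&lt;</mo><mn>1</mn>", "x < 1");
```

Screen readers consume the MathML, while the annotation keeps the LaTeX available to tools that prefer it.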

Diagram Processor

Uses Claude Vision to generate alt text
Copies original image to output
Creates <figure> with semantic caption
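
A minimal sketch of the figure markup this step describes, assuming the alt text and caption arrive as plain strings. `buildFigure` and its parameters are illustrative names, not the tool's actual API.

```typescript
// Escape text for safe use in HTML content and double-quoted attributes.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Build a <figure> with an <img> carrying generated alt text and a visible caption.
function buildFigure(src: string, altText: string, caption: string): string {
  return (
    `<figure><img src="${escapeHtml(src)}" alt="${escapeHtml(altText)}">` +
    `<figcaption>${escapeHtml(caption)}</figcaption></figure>`
  );
}

const figureHtml = buildFigure("images/page1_region2.png", 'Plot of "y = x"', "Figure 1");
```

Escaping matters here because model-generated alt text can contain quotes or angle brackets that would otherwise break the attribute.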

Table Processor

Uses Claude Vision for table extraction
Outputs semantic HTML tables with <th>, <td>, <caption>
Preserves header structure for screen readers
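
The table markup described above might be assembled along these lines. This is a sketch assuming a single header row and plain-text cells; `buildTable` is a hypothetical helper, and the `scope="col"` attribute is one standard way to tie header cells to their columns for screen readers.

```typescript
// Escape text for safe use in HTML content.
function escapeHtml(text: string): string {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Build a table with a caption, a header row of <th scope="col">, and body rows of <td>.
function buildTable(caption: string, headers: string[], rows: string[][]): string {
  const head = headers.map((h) => `<th scope="col">${escapeHtml(h)}</th>`).join("");
  const body = rows
    .map((row) => `<tr>${row.map((cell) => `<td>${escapeHtml(cell)}</td>`).join("")}</tr>`)
    .join("");
  return (
    `<table><caption>${escapeHtml(caption)}</caption>` +
    `<thead><tr>${head}</tr></thead><tbody>${body}</tbody></table>`
  );
}

const tableHtml = buildTable("Region counts", ["Type", "Count"], [["diagram", "2"], ["equation", "3"]]);
```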

Handwritten Processor

Uses MathPix API (handles handwritten math well)
Falls back to Claude Vision for pure text
Styled distinctly in output
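
The routing across this stage can be summarized as a small lookup. The sketch below encodes only what the processor descriptions state; how the tool decides that a handwritten region is pure text is not documented here, so it is passed in as a hypothetical flag.

```typescript
type RegionType = "printed-text" | "diagram" | "equation" | "handwritten" | "table";
type Backend = "claude-vision" | "mathpix";

// Pick the recognition backend for a region.
// isPureText is a hypothetical flag; it only affects handwritten regions.
function pickBackend(type: RegionType, isPureText: boolean = false): Backend {
  if (type === "equation") return "mathpix";
  if (type === "handwritten") return isPureText ? "claude-vision" : "mathpix";
  return "claude-vision"; // printed-text, diagram, table
}

const allTypes: RegionType[] = ["printed-text", "diagram", "equation", "handwritten", "table"];
const routing = allTypes.map((t) => `${t} -> ${pickBackend(t)}`);
```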

4. HTML Assembly

Combines all processed regions into a single accessible HTML document:

Sorts by reading order
Inserts page breaks for multi-page documents
Applies responsive, accessible CSS
Supports dark mode
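
The ordering and page-break steps above can be sketched as follows. The `ProcessedRegion` shape and the `<hr class="page-break">` separator are assumptions made for illustration; the actual assembler may use different markup.

```typescript
// Hypothetical shape of a region after its processor has produced HTML.
interface ProcessedRegion {
  page: number;         // 1-based page number
  readingOrder: number; // order within the page
  html: string;
}

// Sort by page, then by reading order, inserting a separator between pages.
function assembleBody(regions: ProcessedRegion[]): string {
  const sorted = regions
    .slice() // copy so the caller's array is not reordered
    .sort((a, b) => a.page - b.page || a.readingOrder - b.readingOrder);
  const parts: string[] = [];
  let currentPage: number | null = null;
  for (const region of sorted) {
    if (currentPage !== null && region.page !== currentPage) {
      parts.push('<hr class="page-break">');
    }
    currentPage = region.page;
    parts.push(region.html);
  }
  return parts.join("\n");
}

const body = assembleBody([
  { page: 2, readingOrder: 1, html: "<p>Page two</p>" },
  { page: 1, readingOrder: 2, html: "<p>Second</p>" },
  { page: 1, readingOrder: 1, html: "<h1>First</h1>" },
]);
```

Because the DOM order matches the sorted order, assistive technology reads the content in the intended sequence without extra attributes.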

Installation

cd tools/segment-cli
npm install

Configuration

Set required API keys in .env (project root or tools/segment-cli/):

ANTHROPIC_API_KEY=sk-ant-...
MATHPIX_APP_ID=your_app_id
MATHPIX_APP_KEY=your_app_key

Note: MathPix credentials are optional but strongly recommended for equation/handwritten content.
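
A startup check for these variables might look like the sketch below. It treats `ANTHROPIC_API_KEY` as required and enables MathPix only when both of its credentials are present; `checkConfig` is a hypothetical helper, and what the tool actually does when MathPix is absent is not specified here.

```typescript
type Env = Record<string, string | undefined>;

interface ConfigStatus {
  missingRequired: string[]; // required variables that are unset or blank
  mathpixEnabled: boolean;   // true only when both MathPix credentials are set
}

// Inspect an environment map (for example, process.env in Node).
function checkConfig(env: Env): ConfigStatus {
  const isSet = (name: string): boolean => (env[name] || "").trim().length > 0;
  const missingRequired = ["ANTHROPIC_API_KEY"].filter((name) => !isSet(name));
  const mathpixEnabled = isSet("MATHPIX_APP_ID") && isSet("MATHPIX_APP_KEY");
  return { missingRequired, mathpixEnabled };
}

const status = checkConfig({ ANTHROPIC_API_KEY: "sk-ant-...", MATHPIX_APP_ID: "your_app_id" });
// status.missingRequired is empty; status.mathpixEnabled is false (app key not set)
```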

Usage

# Basic usage
npm start -- -i input.pdf -o ./output

# With verbose logging
npm start -- -i input.pdf -o ./output -v

# Custom DPI (default 300)
npm start -- -i input.pdf -o ./output --dpi 150

Output Structure

output/
├── result.html       # Final accessible HTML
├── images/           # Extracted diagrams (with alt text)
│   ├── page1_region2.png
│   └── page2_region1.png
└── report.json       # Processing report with timing and stats

Example Report

{
  "inputFile": "/path/to/input.pdf",
  "outputDir": "/path/to/output",
  "totalDurationMs": 12500,
  "summary": {
    "totalPages": 3,
    "totalRegions": 12,
    "regionsByType": {
      "printed-text": 6,
      "diagram": 2,
      "equation": 3,
      "handwritten": 1,
      "table": 0
    }
  }
}
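
For scripts that consume report.json, the example above suggests a shape like the following. The interface is inferred from the sample only, so treat it as a sketch; real reports may carry more fields. The helper cross-checks that the per-type counts add up to `totalRegions`.

```typescript
// Shape inferred from the example report; additional fields may exist.
interface ProcessingReport {
  inputFile: string;
  outputDir: string;
  totalDurationMs: number;
  summary: {
    totalPages: number;
    totalRegions: number;
    regionsByType: Record<string, number>;
  };
}

// Sum the per-type region counts.
function sumRegions(report: ProcessingReport): number {
  const counts = report.summary.regionsByType;
  return Object.keys(counts).reduce((sum, key) => sum + counts[key], 0);
}

// The example report from this README.
const sample: ProcessingReport = {
  inputFile: "/path/to/input.pdf",
  outputDir: "/path/to/output",
  totalDurationMs: 12500,
  summary: {
    totalPages: 3,
    totalRegions: 12,
    regionsByType: { "printed-text": 6, diagram: 2, equation: 3, handwritten: 1, table: 0 },
  },
};
const consistent = sumRegions(sample) === sample.summary.totalRegions; // true: 6 + 2 + 3 + 1 + 0 = 12
```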

Accessibility Features

The output HTML includes:

Semantic markup (headings, paragraphs, figures, tables)
Native MathML for equations (screen reader compatible)
Alt text for all images
Reading order preserved via DOM structure
Responsive design for various devices
Dark mode support via prefers-color-scheme

Cost Considerations

Component	Approx. Cost
Claude Vision (region detection)	~$0.01-0.02 per page
Claude Vision (text/diagram processing)	~$0.01-0.03 per region
MathPix API (equations)	~$0.01 per equation

By these figures, a typical page with 3-5 regions costs roughly $0.04-0.17 to process: one detection call ($0.01-0.02) plus one processing call per region ($0.01-0.03 each).
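
A back-of-envelope estimator built only from the per-call figures in the table above. It assumes one detection call per page and one processing call per region, and prices every region in the $0.01-0.03 band, so treat the result as a rough bound rather than a quote. Amounts are tracked in cents to avoid floating-point drift.

```typescript
// Per-call figures from the cost table, in cents.
const DETECTION_CENTS = { low: 1, high: 2 };  // per page
const PER_REGION_CENTS = { low: 1, high: 3 }; // per region

// Estimated cost range in dollars for one page with the given number of regions.
function estimatePageCost(regionCount: number): { low: number; high: number } {
  const lowCents = DETECTION_CENTS.low + regionCount * PER_REGION_CENTS.low;
  const highCents = DETECTION_CENTS.high + regionCount * PER_REGION_CENTS.high;
  return { low: lowCents / 100, high: highCents / 100 };
}

const threeRegions = estimatePageCost(3); // { low: 0.04, high: 0.11 }
const fiveRegions = estimatePageCost(5);  // { low: 0.06, high: 0.17 }
```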

Limitations

No support for scanned/rotated pages (assumes clean, upright PDFs)
Handwritten text quality depends on legibility
Very complex layouts may confuse region detection
No OCR confidence scores exposed (future enhancement)

Benchmarks

The benchmark-output/ directory contains sample outputs from test PDFs:

ls benchmark-output/
# sample1/  sample2/  sample3/  sample4/  sample5/

Each benchmark folder contains result.html, images/, and report.json.