PDF Accessibility Evaluator: Implementation Plan
Goal
Extend the existing pdf-accessibility-scorer.ts into a comprehensive PDF accessibility evaluator that validates not just the presence of accessibility features but their correctness. This is a prerequisite for the "export as accessible PDF" feature: we need to measure quality before we ship.
Current State
workers/api/src/services/pdf-accessibility-scorer.ts performs 6 structural checks via raw byte inspection:
| Check | Deduction | What it measures |
|---|---|---|
| Extractable text | -40 | Is there text at all, or image-only? |
| Tag structure | -20 | Do /MarkInfo + /StructTreeRoot exist? |
| Image alt text | -15 each (max -30) | Do /S /Figure elements have /Alt? |
| Document language | -10 | Does /Lang exist? |
| Document title | -5 | Does /Title exist? |
| Table headers | -10 | Do /S /Table have /S /TH? |
Limitation: These are presence checks only. A PDF could have tags that are completely wrong (headings out of order, tables with no scope, reading order jumbled) and still score 100.
Design Principles
- No AI, no rendering: keep it fast and free (byte-level inspection only)
- Deduction-based scoring: same pattern as existing scorer
- Backward compatible: extend the AccessibilityScore type, don't break callers
- Two tiers: quick mode (existing 6 checks, for freemium) and full mode (all checks, for generated PDFs)
- Actionable results: each check explains what's wrong and how to fix it
New Checks
Tier 1: Structure Correctness (add to existing scorer)
1.1 Heading Hierarchy (-10)
Validate that heading tags follow a logical order without skips.
Scan for /S /H1, /S /H2, ... /S /H6 in the structure tree and build the sequence. Fail if:
- No H1 exists (document has no top-level heading)
- Levels are skipped (H1 → H3 with no H2)
- Multiple H1s exist (unless the document has clear sections)

Byte pattern: Search for /S /H1 through /S /H6, record order of appearance.
Deduction: -10 if hierarchy is invalid, -5 if minor skip (e.g., H2 → H4).
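The heading check could be sketched roughly as follows. This is an illustration, not the scorer's actual code: the CheckResult shape is inferred from the field names listed in this plan, deductions are stored as positive numbers, and the rule that a skip directly below H1 costs 10 while deeper skips cost 5 is one possible reading of the deduction line above. The "multiple H1s" rule is left out because "clear sections" cannot be decided from tag names alone.

```typescript
// Assumed result shape; the plan only names the fields { passed, weight, deduction, detail }.
interface CheckResult {
  passed: boolean;
  weight: number;
  deduction: number;
  detail: string;
}

// Collect heading levels in order of appearance: "/S /H1 ... /S /H3" yields [1, 3].
function extractHeadingLevels(pdfStr: string): number[] {
  const pattern = /\/S\s*\/H([1-6])(?![0-9A-Za-z])/g;
  const levels: number[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(pdfStr)) !== null) {
    levels.push(Number(match[1]));
  }
  return levels;
}

function validateHeadingHierarchy(pdfStr: string): CheckResult {
  const weight = 10;
  const levels = extractHeadingLevels(pdfStr);
  if (levels.length === 0) {
    // Assumption: documents without heading tags are not penalized by this check.
    return { passed: true, weight, deduction: 0, detail: "No heading tags found" };
  }
  if (levels.indexOf(1) === -1) {
    return { passed: false, weight, deduction: 10, detail: "No H1 in the structure tree" };
  }
  let deduction = 0;
  let previous = 0;
  for (const level of levels) {
    if (previous > 0 && level > previous + 1) {
      // Assumption: a skip directly below H1 is a full failure, deeper skips are minor.
      deduction = Math.max(deduction, previous === 1 ? 10 : 5);
    }
    previous = level;
  }
  const detail = deduction === 0 ? "Heading levels are sequential" : "Heading levels are skipped";
  return { passed: deduction === 0, weight, deduction, detail };
}
```

The negative lookahead keeps custom tag names that merely start with H1 to H6 from being counted as headings.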
1.2 Reading Order Validation (-15)
Check that the structure tree tag order is plausible. In a tagged PDF, the order of structure elements determines screen reader reading order.
Extract the sequence of /S /xxx tags from the structure tree. Fail if:
- The document has tags but they appear in a clearly wrong order (e.g., all tags appear in reverse or random page order)
- Content is tagged but paragraphs are interleaved from different columns (multi-column layout issue)

Approach: Compare the order of marked-content IDs (MCIDs) against their page positions. If MCIDs on page 1 reference content that appears after page 2 content, reading order is suspect.
Deduction: -15 if reading order appears scrambled, -5 if minor issues.
Note: This is the hardest check to do via byte inspection alone. A simpler v1 heuristic: verify that structure tree elements reference MCIDs in ascending page order (page 1 MCIDs before page 2 MCIDs, etc.).
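A rough sketch of the simpler v1 heuristic, under two loud assumptions: the object numbers of the page dictionaries are already known in document order (the hypothetical pageObjectNumbers parameter), and structure elements appear in the file in roughly the same order as in the tree, which PDF writers are not required to guarantee. It only counts how often a /Pg reference steps back to an earlier page; turning that count into a -5 or -15 deduction is left to the caller.

```typescript
// Count how often consecutive structure elements step back to an earlier page.
// pageObjectNumbers: object numbers of the page dictionaries in document order (assumed known).
function countBackwardPageJumps(pdfStr: string, pageObjectNumbers: number[]): number {
  const pageIndexByObject: { [objectNumber: number]: number } = {};
  pageObjectNumbers.forEach((objectNumber, index) => {
    pageIndexByObject[objectNumber] = index;
  });

  // Structure elements point at their page with "/Pg <objectNumber> <generation> R".
  const pattern = /\/Pg\s+(\d+)\s+\d+\s+R/g;
  let jumps = 0;
  let previousIndex = -1;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(pdfStr)) !== null) {
    const pageIndex = pageIndexByObject[Number(match[1])];
    if (pageIndex === undefined) continue; // reference to an object that is not a known page
    if (pageIndex < previousIndex) jumps++;
    previousIndex = pageIndex;
  }
  return jumps;
}
```

A result of 0 means the page references never go backwards. It says nothing about order within a page, so multi-column interleaving is not detected by this sketch.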
1.3 Table Header Scope (-10)
Check that table header cells have proper scope attributes.
For each /S /TH element in the structure tree:
- Look for /A dictionary with /Scope attribute
- Valid values: /Column, /Row, /Both

Byte pattern: Find /S /TH, look for /Scope within the attribute dictionary.
Deduction: -10 if tables have TH without scope, -5 if some TH have scope but not all.
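One way this could look, as a sketch only. It assumes each structure element is a separate uncompressed object terminated by endobj and that the attribute dictionary is written inline. Elements stored in compressed object streams, or attributes referenced indirectly, would not be seen by this approach. The CheckResult shape is inferred from the field names in this plan.

```typescript
// Assumed result shape; the plan only names the fields { passed, weight, deduction, detail }.
interface CheckResult {
  passed: boolean;
  weight: number;
  deduction: number;
  detail: string;
}

function validateTableHeaderScope(pdfStr: string): CheckResult {
  const weight = 10;
  // The lookahead keeps /S /THead (a different structure type) from matching.
  const isHeaderCell = /\/S\s*\/TH(?![A-Za-z])/;
  const hasValidScope = /\/Scope\s*\/(Column|Row|Both)(?![A-Za-z])/;

  let headerCells = 0;
  let scopedCells = 0;
  for (const objectBody of pdfStr.split("endobj")) {
    if (!isHeaderCell.test(objectBody)) continue;
    headerCells++;
    if (hasValidScope.test(objectBody)) scopedCells++;
  }

  if (headerCells === scopedCells) {
    // Also covers documents with no TH elements at all.
    return { passed: true, weight, deduction: 0, detail: "All TH elements have a scope" };
  }
  const deduction = scopedCells === 0 ? 10 : 5;
  const detail = scopedCells + " of " + headerCells + " TH elements have a valid /Scope";
  return { passed: false, weight, deduction, detail };
}
```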
1.4 List Structure (-5)
Validate that lists use proper L/LI/Lbl/LBody structure.
- Search for /S /L (list), /S /LI (list item), /S /Lbl (label/bullet), /S /LBody (list body).
- If content appears to be a list (bullet characters in text) but no /S /L tags exist, deduct.
- If /S /L exists, verify it contains /S /LI children.

Byte pattern: Count /S /L, /S /LI; check for bullet characters (•, dashes, numbered patterns) in text without corresponding list tags.
Deduction: -5 if lists exist but aren't tagged.
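A minimal sketch of the counting logic. It deviates from the planned single-argument signature by taking already-extracted text as a second, hypothetical parameter, because page text is usually compressed and not visible in the raw bytes. The threshold of two bullet-like lines and the 5-point deduction for an /S /L without /S /LI children are assumptions, not values from this plan.

```typescript
// Count structure elements of one type; the lookahead keeps "L" from matching LI, Lbl, LBody or Link.
function countTag(pdfStr: string, tagName: string): number {
  const pattern = new RegExp("/S\\s*/" + tagName + "(?![A-Za-z0-9])", "g");
  let count = 0;
  while (pattern.exec(pdfStr) !== null) count++;
  return count;
}

function validateListStructure(
  pdfStr: string,
  extractedText: string, // assumption: text already extracted by the existing scorer
): { passed: boolean; deduction: number; detail: string } {
  const lists = countTag(pdfStr, "L");
  const items = countTag(pdfStr, "LI");
  // Lines starting with a bullet character or "1." / "1)" style numbering.
  const bulletLines = extractedText
    .split("\n")
    .filter((line) => /^\s*(\u2022|\d+[.)])\s+/.test(line)).length;

  if (lists === 0 && bulletLines >= 2) {
    return { passed: false, deduction: 5, detail: "Text looks like a list but no /S /L tags exist" };
  }
  if (lists > 0 && items === 0) {
    return { passed: false, deduction: 5, detail: "/S /L found without /S /LI children" };
  }
  return { passed: true, deduction: 0, detail: "List tagging looks consistent" };
}
```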
1.5 Link Annotations (-5)
Check that hyperlinks are tagged as /Link with meaningful content.
- Count /Subtype /Link annotations.
- Check if corresponding /S /Link structure elements exist.

Byte pattern: Count /Subtype /Link vs /S /Link.
Deduction: -5 if link annotations exist but no /S /Link structure tags.
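The count comparison might be sketched like this. The helper name countMatches and the reduced return shape are illustrative, and the check only compares totals. It does not verify that each annotation is actually referenced from a /Link structure element.

```typescript
// Count non-overlapping matches of a pattern, regardless of the flags it was created with.
function countMatches(text: string, pattern: RegExp): number {
  const globalPattern = new RegExp(pattern.source, "g");
  let count = 0;
  while (globalPattern.exec(text) !== null) count++;
  return count;
}

function validateLinkAnnotations(pdfStr: string): { passed: boolean; deduction: number; detail: string } {
  const annotations = countMatches(pdfStr, /\/Subtype\s*\/Link(?![A-Za-z])/);
  // "/S /Link" cannot match inside "/Subtype /Link" because "/S" must be followed by whitespace or "/".
  const structureElements = countMatches(pdfStr, /\/S\s*\/Link(?![A-Za-z])/);
  const passed = annotations === 0 || structureElements > 0;
  return {
    passed,
    deduction: passed ? 0 : 5,
    detail: annotations + " link annotations, " + structureElements + " /S /Link structure elements",
  };
}
```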
Tier 2: PDF/UA Compliance (for generated PDFs)
2.1 PDF/UA Identifier (-10)
Check for PDF/UA conformance declaration in XMP metadata.
- Search for pdfuaid:part in the metadata stream.
- PDF/UA-1 requires: <pdfuaid:part>1</pdfuaid:part>

Byte pattern: Search for pdfuaid:part in the document.
Deduction: -10 if missing (required for PDF/UA compliance).
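A small sketch of the identifier lookup, assuming the XMP metadata stream is stored uncompressed so that it is visible to byte inspection. Besides the element form shown above, it also accepts the attribute shorthand (pdfuaid:part="1") that some XMP writers emit. Whether the scorer should accept that form is an open choice, and findPdfUaPart is a hypothetical helper name.

```typescript
// Return the declared PDF/UA part number, or null when no declaration is found.
function findPdfUaPart(pdfStr: string): number | null {
  // Element form: <pdfuaid:part>1</pdfuaid:part>
  const element = /<pdfuaid:part>\s*(\d+)\s*<\/pdfuaid:part>/.exec(pdfStr);
  if (element !== null) return Number(element[1]);
  // Attribute form: pdfuaid:part="1" on an rdf:Description element
  const attribute = /pdfuaid:part\s*=\s*["'](\d+)["']/.exec(pdfStr);
  if (attribute !== null) return Number(attribute[1]);
  return null;
}

function checkPdfUaIdentifier(pdfStr: string): { passed: boolean; deduction: number } {
  const passed = findPdfUaPart(pdfStr) !== null;
  return { passed, deduction: passed ? 0 : 10 };
}
```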
2.2 Tab Order (-5)
Each page should specify /Tabs /S (structure order) so keyboard tab order follows the tag tree.
For each /Type /Page dictionary, check for /Tabs /S.

Byte pattern: Within page dictionaries, look for /Tabs /S.
Deduction: -5 if any page lacks /Tabs /S.
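As an illustration, the per-page scan could work on endobj-delimited chunks. This assumes page dictionaries are stored as plain, uncompressed objects. The function name and the returned counts are hypothetical, and the caller would deduct 5 when missing is greater than 0.

```typescript
// Count page dictionaries and how many of them lack "/Tabs /S".
function findPagesWithoutStructureTabOrder(pdfStr: string): { pages: number; missing: number } {
  // The lookahead keeps the page tree node (/Type /Pages) from being counted as a page.
  const isPage = /\/Type\s*\/Page(?![A-Za-z])/;
  const hasStructureTabs = /\/Tabs\s*\/S(?![A-Za-z])/;

  let pages = 0;
  let missing = 0;
  for (const objectBody of pdfStr.split("endobj")) {
    if (!isPage.test(objectBody)) continue;
    pages++;
    if (!hasStructureTabs.test(objectBody)) missing++;
  }
  return { pages, missing };
}
```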
2.3 Bookmarks / Document Outline (-5)
Check that the document has bookmarks derived from headings.
- Look for /Type /Outlines in the catalog.
- If the document has headings (H1-H6 tags) but no outlines, deduct points.

Byte pattern: Search for /Type /Outlines or /Outlines in catalog dictionary.
Deduction: -5 if headings exist but no bookmarks.
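A possible shape for this check, shown as a sketch. It treats either an outline dictionary (/Type /Outlines) or an indirect /Outlines reference as evidence of bookmarks and does not verify that the outline actually contains entries.

```typescript
function checkBookmarks(pdfStr: string): { passed: boolean; deduction: number; detail: string } {
  const hasHeadings = /\/S\s*\/H[1-6](?![0-9A-Za-z])/.test(pdfStr);
  const hasOutlines =
    /\/Type\s*\/Outlines(?![A-Za-z])/.test(pdfStr) || // outline dictionary itself
    /\/Outlines\s+\d+\s+\d+\s+R/.test(pdfStr); // catalog entry pointing at it
  // Documents without headings have nothing to derive bookmarks from, so they pass.
  const passed = !hasHeadings || hasOutlines;
  const detail = passed ? "Bookmarks present or not needed" : "Headings exist but no document outline";
  return { passed, deduction: passed ? 0 : 5, detail };
}
```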
2.4 Artifact Marking (-5)
Decorative elements (headers, footers, page numbers, watermarks) should be marked as artifacts, not tagged as content.
- Look for /Type /Pagination or BMC/BDC artifact operators in content streams.
- A document with many pages but no artifact markers likely has repeated header/footer content polluting the tag tree.

Byte pattern: Search for /Artifact in content stream BDC operators.
Deduction: -5 if multi-page document has no artifact markers.
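A sketch of the operator search. One caveat shapes its signature: content streams are usually Flate-compressed, so the pattern has to run on decoded stream text. The decodedStreams and pageCount parameters are therefore assumptions about what the caller can supply, and they differ from the single pdfStr argument planned for checkArtifactMarking.

```typescript
// Matches "/Artifact BMC", "/Artifact << ... >> BDC" and "/Artifact /Name BDC".
const ARTIFACT_OPERATOR = /\/Artifact\s*(?:<<[\s\S]*?>>|\/[A-Za-z0-9_.]+)?\s*(?:BMC|BDC)(?![A-Za-z])/;

function checkArtifactMarking(
  pageCount: number,
  decodedStreams: string[], // assumption: content streams already decompressed by the caller
): { passed: boolean; deduction: number } {
  const hasArtifacts = decodedStreams.some((stream) => ARTIFACT_OPERATOR.test(stream));
  // Single-page documents are not expected to carry repeated headers or footers.
  const passed = pageCount <= 1 || hasArtifacts;
  return { passed, deduction: passed ? 0 : 5 };
}
```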
2.5 Display Title Flag (-3)
The catalog should specify /ViewerPreferences << /DisplayDocTitle true >> so the title bar shows the document title instead of the filename.
Look for /DisplayDocTitle true in /ViewerPreferences.

Byte pattern: Search for /DisplayDocTitle followed by true.
Deduction: -3 if missing.
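This check reduces to a single pattern, sketched below with a reduced return shape. It assumes the viewer preferences are written inline or in an uncompressed object. It does not confirm that the key sits inside /ViewerPreferences, which seems acceptable since the key has no other standard use.

```typescript
function checkDisplayTitle(pdfStr: string): { passed: boolean; deduction: number } {
  // "/DisplayDocTitle false" or a missing key both count as a failure.
  const passed = /\/DisplayDocTitle\s+true(?![A-Za-z])/.test(pdfStr);
  return { passed, deduction: passed ? 0 : 3 };
}
```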
Scoring Summary
Existing checks (Tier 0): max -115
| Check | Max deduction |
|---|---|
| Extractable text | -40 |
| Tag structure | -20 |
| Image alt text | -30 |
| Document language | -10 |
| Document title | -5 |
| Table headers | -10 |
New Tier 1 checks: max -45
| Check | Max deduction |
|---|---|
| Heading hierarchy | -10 |
| Reading order | -15 |
| Table header scope | -10 |
| List structure | -5 |
| Link annotations | -5 |
New Tier 2 checks (PDF/UA): max -28
| Check | Max deduction |
|---|---|
| PDF/UA identifier | -10 |
| Tab order | -5 |
| Bookmarks | -5 |
| Artifact marking | -5 |
| Display title flag | -3 |
Scoring modes
type ScoringMode = 'quick' | 'full';
// quick: Tier 0 only (existing 6 checks), fast, for freemium intake scoring
// full: Tier 0 + 1 + 2 (all checks), for evaluating generated PDFs

Score remains max(0, 100 - deductions). With more checks the deductions are more granular, and the possible total in full mode (115 + 45 + 28 = 188) exceeds 100, so the score is clamped at 0. A truly inaccessible PDF approaches that floor quickly (no text + no tags = -60 already).
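The arithmetic can be made concrete with a short sketch. The tier field and the computeScore helper are hypothetical additions for illustration, and deductions are stored as positive numbers.

```typescript
type ScoringMode = "quick" | "full";

// Assumed shape: the plan's { passed, weight, deduction, detail } plus a hypothetical tier marker.
interface TieredCheckResult {
  tier: 0 | 1 | 2;
  passed: boolean;
  weight: number;
  deduction: number;
  detail: string;
}

// Quick mode counts only Tier 0 results; full mode counts everything. The score never drops below 0.
function computeScore(results: TieredCheckResult[], mode: ScoringMode): number {
  const counted = results.filter((result) => mode === "full" || result.tier === 0);
  const totalDeduction = counted.reduce((sum, result) => sum + result.deduction, 0);
  return Math.max(0, 100 - totalDeduction);
}
```

For example, a PDF that fails only the extractable-text (40) and tag-structure (20) checks scores 40 in either mode, while one that fails every check scores 0 rather than -88.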
Implementation
File changes
| File | Change |
|---|---|
| workers/api/src/services/pdf-accessibility-scorer.ts | Add new check functions, add mode parameter |
| @accessible-pdf/shared types | Add new check IDs to AccessibilityCheckResult type |
Function signature change
export function scorePdfAccessibility(
  pdfBytes: Uint8Array,
  options?: ScorerOptions & { mode?: 'quick' | 'full' },
): AccessibilityScore;

- mode: 'quick' runs the existing 6 checks (default, backward compatible)
- mode: 'full' runs all checks, including heading hierarchy, reading order, PDF/UA
New helper functions
// Tier 1
function validateHeadingHierarchy(pdfStr: string): CheckResult;
function validateReadingOrder(pdfStr: string): CheckResult;
function validateTableHeaderScope(pdfStr: string): CheckResult;
function validateListStructure(pdfStr: string): CheckResult;
function validateLinkAnnotations(pdfStr: string): CheckResult;

// Tier 2
function checkPdfUaIdentifier(pdfStr: string): CheckResult;
function checkTabOrder(pdfStr: string): CheckResult;
function checkBookmarks(pdfStr: string): CheckResult;
function checkArtifactMarking(pdfStr: string): CheckResult;
function checkDisplayTitle(pdfStr: string): CheckResult;

Each returns the same { passed, weight, deduction, detail } shape.
Usage for PDF Export Validation
Once built, the evaluator integrates into the PDF export pipeline:
1. Generate accessible HTML (existing pipeline)
2. Render HTML → tagged PDF via Puppeteer (generateAccessiblePdfFromHtml)
3. Run scorePdfAccessibility(pdfBytes, { mode: 'full' })
4. If score < 80:
   - Log deficiencies
   - Apply post-processing fixes (pdf-lib) for fixable issues:
     - Set /Lang if missing
     - Set /Title if missing
     - Add /DisplayDocTitle
     - Set /Tabs /S on pages
     - Add PDF/UA identifier to XMP metadata
   - Re-score after fixes
5. Return PDF with score metadata

This creates a feedback loop: generate, measure, fix, verify. The scorer tells us exactly where Chrome's tagged PDF output falls short so we can target post-processing.
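The measure, fix, verify part of the loop might be wired up along these lines. Everything here is a hypothetical sketch: the Scorer and Fixer types stand in for scorePdfAccessibility in full mode and for the pdf-lib post-processing step, and they are injected as parameters so the control flow can be shown without either dependency. A real fixer built on pdf-lib would likely be asynchronous.

```typescript
interface ScoreSummary {
  score: number;
  failedChecks: string[]; // hypothetical: IDs of checks that did not pass
}

type Scorer = (pdfBytes: Uint8Array) => ScoreSummary;
type Fixer = (pdfBytes: Uint8Array, failedChecks: string[]) => Uint8Array;

interface ExportValidation {
  pdfBytes: Uint8Array;
  before: ScoreSummary;
  after: ScoreSummary;
  fixed: boolean;
}

// Score once, apply fixes only below the threshold, then score again to verify.
function validateExport(pdfBytes: Uint8Array, score: Scorer, fix: Fixer, threshold = 80): ExportValidation {
  const before = score(pdfBytes);
  if (before.score >= threshold) {
    return { pdfBytes, before, after: before, fixed: false };
  }
  const fixedBytes = fix(pdfBytes, before.failedChecks);
  const after = score(fixedBytes);
  return { pdfBytes: fixedBytes, before, after, fixed: true };
}
```

Returning both summaries makes it easy to log which deficiencies the post-processing step actually removed.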
Testing
Unit tests
- Test each check function with crafted PDF byte patterns
- Test scoring with known-accessible PDFs (from PAC-verified sources)
- Test scoring with known-inaccessible PDFs (untagged, no lang, etc.)
- Verify backward compatibility: mode: 'quick' produces same results as current scorer
Integration tests
- Generate a PDF via generateAccessiblePdfFromHtml() with known accessible HTML
- Score it with mode: 'full'
- Verify the score and identify any consistent gaps from Chrome's output
- Document which checks Chrome passes/fails so post-processing can target the gaps
Test fixtures
Create a set of PDF test files:
- tagged-accessible.pdf: fully tagged, all checks pass
- untagged.pdf: no structure tree
- scanned.pdf: image-only
- bad-headings.pdf: H1 → H3 skip
- no-table-scope.pdf: TH without scope attributes
- chrome-generated.pdf: output from generateAccessiblePdfFromHtml() to benchmark Chrome's baseline
Effort Estimate
| Phase | Effort |
|---|---|
| Tier 1 checks (heading, reading order, table scope, lists, links) | 2-3 days |
| Tier 2 checks (PDF/UA, tab order, bookmarks, artifacts, display title) | 1-2 days |
| Tests + fixtures | 1-2 days |
| Integration with PDF export pipeline | 1 day |
| Total | 5-8 days |
Sequencing
- Build the enhanced scorer (this plan)
- Run it against Chrome tagged: true output to identify gaps
- Build targeted post-processing fixes for those gaps (pdf-lib)
- Wire up the PDF export endpoint with score validation
- Ship the "Export as Accessible PDF" feature