PDF Accessibility Score
This document explains the accessibility score shown in apps/web (e.g. on the preview, pipeline, report, and LTI course-scanner pages) for every PDF the app ingests or produces.
Compliance framing: WCAG 2.1 AA is the baseline, not PDF/UA
The laws in scope (ADA Title II DOJ rule 2026-07663, Section 508 Revised Standards, EU EN 301 549) all adopt WCAG 2.1 Level AA as the normative technical standard for web content and PDFs. PDF/UA-1 (ISO 14289-1) and the Matterhorn Protocol 1.1 are the standard implementation / validation checklists for applying WCAG to PDFs, but they are not themselves what the law requires.
Our scorer reflects this:
- The headline score and each of the 16 structural checks are labeled with the WCAG 2.1 AA success criteria they support (see `WCAG_CRITERIA_MAP` in `pdf-accessibility-scorer.ts`).
- The underlying byte-level checks derive from PDF/UA-1 / Matterhorn, which is how those SCs are satisfied at the PDF-format level.
- Passing a structural check is evidence toward, not proof of, SC conformance. We do not stamp a PDF/UA-1 claim just because the structural checks pass; that needs a Matterhorn-complete validator (veraPDF / PAC 2024).
1. What the score is
A single integer from 0 to 100 representing how well the PDF meets a set of structural accessibility checks. It is a structural / PDF-UA-oriented score, not a full WCAG page-content audit.
- Implementation: `workers/api/src/services/pdf-accessibility-scorer.ts` (function `scorePdfAccessibility`)
- Shared types / banding: `packages/shared/src/types.ts` (`AccessibilityScore`, `AccessibilityCheckResult`, `getScoreBand`)
- No AI, no rendering. The scorer reads raw PDF bytes as Latin-1, counts dictionary patterns (`/StructTreeRoot`, `/S /Figure`, `/MCID …`, etc.) and deducts points per failing check.
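To make the byte-level approach concrete, here is a minimal sketch. It is not the scorer's actual code: `decodeLatin1` and `countPattern` are hypothetical helper names, and the regular expressions are illustrative guesses at how the dictionary patterns might be matched.

```typescript
// Hypothetical helpers illustrating byte-level pattern counting.
// Latin-1 maps every byte 0-255 to the code point of the same value,
// so decoding never fails and keeps one character per byte.
function decodeLatin1(bytes: Uint8Array): string {
  const parts: string[] = [];
  for (let i = 0; i < bytes.length; i++) {
    parts.push(String.fromCharCode(bytes[i]));
  }
  return parts.join("");
}

// Counts non-overlapping matches of a global regex in the decoded text.
function countPattern(text: string, pattern: RegExp): number {
  const matches = text.match(pattern);
  return matches ? matches.length : 0;
}

// Tiny stand-in for PDF bytes (not a valid PDF).
const sample =
  "<< /StructTreeRoot 5 0 R >> << /S /Figure /Alt (Logo) >> << /S /Figure >>";
const sampleBytes = new Uint8Array(sample.length);
for (let i = 0; i < sample.length; i++) sampleBytes[i] = sample.charCodeAt(i);

const decoded = decodeLatin1(sampleBytes);
const hasStructTree = countPattern(decoded, /\/StructTreeRoot\b/g) > 0;
const figureCount = countPattern(decoded, /\/S\s*\/Figure\b/g);
console.log({ hasStructTree, figureCount }); // { hasStructTree: true, figureCount: 2 }
```

Because this kind of matching only sees what is literally present in the decoded bytes, anything stored inside a compressed stream is invisible to it.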
Score bands (getScoreBand)
| Range | Band | Meaning |
|---|---|---|
| 0-33 | red | Not usable with assistive tech |
| 34-66 | orange | Partial: major gaps |
| 67-99 | green | Good: minor gaps |
| 100 | dark-green | All structural checks pass |
The 50% scores you typically see sit in the orange band: the PDF has text but is missing most of the PDF/UA structural metadata.
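The banding table can be expressed as a small function. This is a sketch derived only from the thresholds in the table; the real `getScoreBand` in `packages/shared/src/types.ts` may be written differently (for example, it may validate its input), and the `ScoreBand` type name and `getScoreBandSketch` function name are assumptions.

```typescript
// Assumed type name; band labels are taken from the table above.
type ScoreBand = "red" | "orange" | "green" | "dark-green";

// Sketch of the banding thresholds: 0-33, 34-66, 67-99, 100.
function getScoreBandSketch(score: number): ScoreBand {
  if (score >= 100) return "dark-green";
  if (score >= 67) return "green";
  if (score >= 34) return "orange";
  return "red";
}

console.log(getScoreBandSketch(50)); // orange
```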
Two scoring modes
scorePdfAccessibility(bytes, { mode }) runs either:
- `quick`: 6 structural presence checks (Tier 0). Used for freemium intake and LTI course scans (`workers/api/src/routes/lti-course.ts:157`).
- `full`: all 16 checks (Tier 0 + Tier 1 correctness + Tier 2 PDF/UA). Used on generated PDFs before export in `workers/api/src/routes/convert.ts:281` and `workers/api/src/scheduler/chunk-scheduler.ts:806`.
A source PDF scored with `quick` and the same PDF scored with `full` can produce different numbers. The `full` score is never higher than the `quick` score, because `full` runs the same six Tier-0 checks plus ten more.
2. How it is calculated
`score = max(0, 100 - sum(deductions from each failing check))`

Each check has a weight (the maximum it can deduct) and a deduction (what it actually took off this run). Checks that don't apply (e.g. no tables, no images) pass for free with deduction 0.
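The formula can be sketched in a few lines. The `CheckResultSketch` shape and the `computeScore` name are assumptions for illustration; the real `AccessibilityCheckResult` type may carry more fields, and clamping each deduction to its weight is a defensive choice of this sketch rather than documented scorer behavior.

```typescript
// Assumed minimal shape; the real AccessibilityCheckResult in
// packages/shared/src/types.ts may carry more fields.
interface CheckResultSketch {
  id: string;
  weight: number; // maximum this check can deduct
  deduction: number; // what it deducted on this run (0 if passed or not applicable)
}

function computeScore(checks: CheckResultSketch[]): number {
  const total = checks.reduce(
    // Defensive clamp (an assumption): a check never deducts more than its weight.
    (sum, c) => sum + Math.min(Math.max(c.deduction, 0), c.weight),
    0
  );
  // Floor at 0 so a heavily failing PDF cannot go negative.
  return Math.max(0, 100 - total);
}

const exampleChecks: CheckResultSketch[] = [
  { id: "extractable_text", weight: 40, deduction: 0 },
  { id: "tag_structure", weight: 20, deduction: 20 },
  { id: "image_alt_text", weight: 30, deduction: 15 }, // one image missing /Alt
  { id: "table_headers", weight: 10, deduction: 0 }, // no tables: passes for free
];
console.log(computeScore(exampleChecks)); // 65
```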
Tier 0: Structural presence (always run, both modes)
| # | Check ID | Weight | Deduction rule |
|---|---|---|---|
| 1 | extractable_text | 40 | -40 if fewer than 3 Tj/TJ text-showing operators found (scanned/image-only PDF) |
| 2 | tag_structure | 20 | -20 if /MarkInfo or /StructTreeRoot is missing |
| 3 | image_alt_text | 30 | -15 per image missing /Alt inside a /S /Figure region, capped at -30 |
| 4 | document_language | 10 | -10 if no /Lang entry (or value shorter than 2 chars) |
| 5 | document_title | 5 | -5 if /Title missing or empty in the info dictionary |
| 6 | table_headers | 10 | -10 if tables exist (/S /Table) but no /S /TH header cells anywhere |
Max Tier-0 deduction: 115, which exceeds 100, so the score is floored at 0. Failing just the first two (no text + no tags) already caps the score at 40. This is why scanned PDFs score so low.
Tier 1: Structure correctness (full mode only)
| # | Check ID | Weight | WCAG SC | Deduction rule |
|---|---|---|---|---|
| 7 | heading_hierarchy | 10 | 1.3.1, 2.4.6, 2.4.10 | -5 for one issue (no H1, or one skipped level); -10 for two or more |
| 8 | reading_order | 15 | 1.3.2 | -15 if no /StructTreeRoot; -15 if /StructTreeRoot has no /ParentTree; -10 if pages have no /StructParents |
| 9 | table_header_scope | 10 | 1.3.1 | -5 if more than half (but not all) of /S /TH cells have /Scope; -10 if fewer |
| 10 | list_structure | 5 | 1.3.1 | -5 if list tags (/S /L) exist with no list items (/S /LI) |
| 11 | link_annotations | 5 | 1.3.1, 2.4.4, 4.1.2 | -5 if the number of /Subtype /Link annotations exceeds /S /Link structure tags |
Reading-order change (2026-04-23): this check used to count MCID integer inversions, which produced false positives on multi-column and floated-figure layouts (valid PDFs with MCIDs emitted out of visual order but read correctly via the structure tree). It now verifies that the structure tree defines the reading order per ISO 32000-1 §14.7: /StructTreeRoot contains a /ParentTree, and pages reference it via /StructParents.
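The current reading-order rule can be sketched as three ordered tests. This is an illustration of the deduction rule in the Tier 1 table, not the real `checkReadingOrder`: the function name `readingOrderDeduction` is hypothetical, and it assumes the PDF has already been decoded to a Latin-1 string and that plain pattern tests are an acceptable stand-in for the actual matching.

```typescript
// Sketch of the reading-order deduction rule (weight 15).
// `pdfText` is assumed to be the PDF decoded as Latin-1.
function readingOrderDeduction(pdfText: string): number {
  // No structure tree at all: reading order is undefined.
  if (!/\/StructTreeRoot\b/.test(pdfText)) return 15;
  // Tree exists but cannot map page content back to structure elements.
  if (!/\/ParentTree\b/.test(pdfText)) return 15;
  // Pages never reference the parent tree (note the plural key;
  // the singular /StructParent is used by annotations, not pages).
  if (!/\/StructParents\b/.test(pdfText)) return 10;
  return 0;
}

const untagged = "<< /Type /Catalog >>";
const noParentTree = "<< /Type /StructTreeRoot /K 4 0 R >>";
const noStructParents = "<< /Type /StructTreeRoot /ParentTree 5 0 R >> << /Type /Page >>";
const wellFormed =
  "<< /Type /StructTreeRoot /ParentTree 5 0 R >> << /Type /Page /StructParents 0 >>";

console.log([untagged, noParentTree, noStructParents, wellFormed].map(readingOrderDeduction));
// [ 15, 15, 10, 0 ]
```

A multi-column PDF with out-of-order MCIDs but a complete parent tree gets a deduction of 0 under this rule, which is the behavior the change was meant to achieve.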
Tier 2: PDF/UA compliance (full mode only)
| # | Check ID | Weight | Deduction rule |
|---|---|---|---|
| 12 | pdfua_identifier | 10 | -10 if XMP metadata does not contain pdfuaid:part |
| 13 | tab_order | 5 | -5 if any page lacks /Tabs /S (fewer /Tabs /S occurrences than estimated pages) |
| 14 | bookmarks | 5 | -5 if the document has headings but no /Type /Outlines (bookmarks) entry |
| 15 | artifact_marking | 5 | -5 if a multi-page document has zero /Artifact markers (headers/footers/page numbers will leak into reading order) |
| 16 | display_doc_title | 3 | -3 if /DisplayDocTitle true is missing from ViewerPreferences |
Worked example: why a healthy-looking PDF scores ~50
A PDF from Word that has text + headings but no tag tree typically fails:
- `tag_structure` (-20)
- `image_alt_text` (-30 if images)
- `document_language` (-10)
- `document_title` (-5)
- In `full` mode, also `pdfua_identifier` (-10), `tab_order` (-5), `bookmarks` (-5), `artifact_marking` (-5), `display_doc_title` (-3), `reading_order` cascade failure (-15)
Quick mode: 100 - 65 = 35. Full mode: 100 - 108 = -8, floored to 0. An "okay but untagged" PDF is the 50-ish case: some of the above fail, most Tier-1 pass.
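The arithmetic of the worked example can be checked mechanically. The snippet below only re-adds the deductions listed for this example (values copied from the tier tables); the variable names are illustrative and nothing here calls the real scorer.

```typescript
// Deductions that apply in both modes for the untagged Word export.
const quickFailures: Record<string, number> = {
  tag_structure: 20,
  image_alt_text: 30,
  document_language: 10,
  document_title: 5,
};
// Additional deductions that only full mode can apply.
const fullOnlyFailures: Record<string, number> = {
  pdfua_identifier: 10,
  tab_order: 5,
  bookmarks: 5,
  artifact_marking: 5,
  display_doc_title: 3,
  reading_order: 15,
};

const sumValues = (d: Record<string, number>): number =>
  Object.keys(d).reduce((acc, key) => acc + d[key], 0);

const quickDeductions = sumValues(quickFailures); // 65
const fullDeductions = quickDeductions + sumValues(fullOnlyFailures); // 108

console.log(Math.max(0, 100 - quickDeductions)); // 35
console.log(Math.max(0, 100 - fullDeductions)); // 0
```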
Known scorer caveats
- Byte-level string matching can miss checks that live inside compressed object streams. A PDF that is genuinely tagged but uses `/ObjStm` compression may score lower than it deserves. (`pdf-struct-cleaner.ts:67` calls this out.)
- `estimatePageCount` counts `/Type /Page` occurrences; a few edge-case PDFs can over- or under-count, which feeds into `tab_order` and `artifact_marking`.
- `document_title`, `document_language`, `pdfua_identifier`, and `display_doc_title` use substring searches against the first 5 MB of the file; large PDFs with late-file metadata may miss (`maxAnalyzeBytes` default = 5 MB).
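The analysis-window caveat is easy to reproduce in miniature. In this sketch, `hasLangEntry` is a hypothetical function, the regular expression only recognizes the literal-string form of `/Lang`, and the window is shrunk to a few bytes so the effect is visible; only the `maxAnalyzeBytes` name comes from the scorer.

```typescript
// Hypothetical illustration of the analysis-window caveat: only the first
// `maxAnalyzeBytes` characters of the Latin-1 decoded file are searched
// (one character per byte, so slicing by characters equals slicing by bytes).
function hasLangEntry(pdfText: string, maxAnalyzeBytes: number): boolean {
  const head = pdfText.slice(0, maxAnalyzeBytes);
  // Simplification: matches /Lang (xx) but not hex-string values.
  return /\/Lang\s*\(/.test(head);
}

// 64 filler characters push /Lang past a deliberately tiny 32-byte window.
const lateMetadata = "x".repeat(64) + "<< /Lang (en-US) >>";

console.log(hasLangEntry(lateMetadata, 32)); // false: the entry exists but lies outside the window
console.log(hasLangEntry(lateMetadata, 1024)); // true
```

The same effect at real scale means a large PDF whose catalog or XMP packet is written near the end of the file can fail `document_language` even though the entry is present.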
3. How we improve the score
We already run a post-processor on every generated PDF that targets exactly these checks:
- `workers/api/src/services/pdf-accessibility-postprocessor.ts`: `postProcessAccessiblePdf` is invoked from `server.ts:698`, `index-aws.ts:470`, `chunk-scheduler.ts:794`, and `routes/convert.ts:269`. It fixes: XMP metadata, `/DisplayDocTitle`, `/Tabs /S`, outline/bookmarks, `/Lang`, `/Title`, artifact marking, bullet normalization, and link `/Contents`. Per its own header comment, it "pushes scores from ~35% to 80%+".
Leverage per check (highest-impact first)
For source PDFs (what the user uploaded; we don't control these, but the score tells the user why remediation is needed):
- `extractable_text` (-40): OCR the PDF. This is the single biggest lever. Scanned PDFs can't score above 60 no matter what else we do.
- `tag_structure` (-20): run through our conversion pipeline; WeasyPrint + post-processor adds the tag tree.
- `image_alt_text` (-30): AI-generated alt text is already part of the pipeline (`image-extractor.ts`, image description prompts). Verify every `<img>` in the converted HTML has a meaningful `alt` value.
For our generated PDFs (what we ship to the user), the remaining gaps to close:
- `pdfua_identifier` (-10): the postprocessor header explicitly says it does not claim PDF/UA-1. To claim it we must also verify all Tier-1 checks pass; adding the XMP claim alone without passing structure checks fails Acrobat's preflight.
- `reading_order` (-15): since the 2026-04-23 change the check looks for `/ParentTree` under `/StructTreeRoot` and `/StructParents` on pages, so make sure the generated PDF carries both. MCID order no longer affects the score, but multi-column layouts and floated figures are still worth auditing by opening a generated PDF in Acrobat Pro → Accessibility → Reading Order.
- `table_header_scope` (-10): emit `<th scope="col">` / `scope="row"` in the converted HTML so WeasyPrint propagates `/Scope` into the PDF tag tree.
- `heading_hierarchy` (-5 to -10): TOC detection already produces an H1; make sure downstream chunks don't restart at H1 or skip from H1 → H3.
- `list_structure` (-5): ensure `<ul>`/`<ol>` survive conversion; don't emit bare `<p>• item</p>`.
- `link_annotations` (-5): verify every `<a href>` in the HTML becomes both a `/Subtype /Link` annotation and a `/S /Link` structure element.
- `artifact_marking` (-5): the postprocessor's "mark untagged content as Artifact" fixup handles this; check it's running for multi-page outputs.
- `document_title`, `document_language`, `display_doc_title`, `bookmarks`, `tab_order`: all fully handled by the postprocessor today. If one regresses, check `postProcessAccessiblePdf` is being called on that code path.
Implemented follow-ups (2026-04-23)
- ✅ Per-check breakdown surfaced in the UI. Each `AccessibilityCheckResult` now carries a `wcagCriteria: string[]`, and `apps/web/src/components/lti/ScoreBreakdown.tsx` renders failing checks with their WCAG 2.1 AA SC tags. Wired into `FileScoreRow` via a "Why?" disclosure.
- ✅ Round-trip regression test. `workers/api/src/__tests__/services/pdf-accessibility-roundtrip.test.ts` builds synthetic "before" PDFs with pdf-lib, runs `postProcessAccessiblePdf`, then re-scores, asserting the score strictly increases, no check regresses from pass to fail, and total deductions monotonically decrease.
- ✅ StructParents-based reading-order check. `checkReadingOrder` no longer uses MCID monotonicity. It now validates that the structure tree actually expresses reading order (via `/ParentTree` + `/StructParents`), which is what ISO 32000-1 §14.7 defines as authoritative.
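The three round-trip invariants can be written as a pure comparison of two score results. This is a sketch of the assertions only, not the contents of the real test file: `CheckSketch`, `ScoreSketch`, and `roundTripViolations` are assumed names, the field names are guesses at the shared types, and "monotonically decrease" is interpreted here as "do not increase".

```typescript
// Assumed minimal shapes for illustration; the real types live in
// packages/shared/src/types.ts.
interface CheckSketch { id: string; passed: boolean; deduction: number; }
interface ScoreSketch { score: number; checks: CheckSketch[]; }

// Returns the violated invariants (an empty list means the round trip is healthy).
function roundTripViolations(before: ScoreSketch, after: ScoreSketch): string[] {
  const violations: string[] = [];

  // 1. The score must strictly increase.
  if (!(after.score > before.score)) violations.push("score did not strictly increase");

  // 2. No check may go from pass to fail.
  const afterById = new Map<string, CheckSketch>();
  for (const c of after.checks) afterById.set(c.id, c);
  for (const b of before.checks) {
    const a = afterById.get(b.id);
    if (b.passed && a !== undefined && !a.passed) violations.push(`check regressed: ${b.id}`);
  }

  // 3. Total deductions must not increase.
  const total = (s: ScoreSketch): number => s.checks.reduce((sum, c) => sum + c.deduction, 0);
  if (total(after) > total(before)) violations.push("total deductions increased");

  return violations;
}

const beforeScore: ScoreSketch = {
  score: 65,
  checks: [
    { id: "extractable_text", passed: true, deduction: 0 },
    { id: "tag_structure", passed: false, deduction: 20 },
    { id: "document_language", passed: false, deduction: 10 },
    { id: "document_title", passed: false, deduction: 5 },
  ],
};
const afterScore: ScoreSketch = {
  score: 80,
  checks: [
    { id: "extractable_text", passed: true, deduction: 0 },
    { id: "tag_structure", passed: false, deduction: 20 },
    { id: "document_language", passed: true, deduction: 0 },
    { id: "document_title", passed: true, deduction: 0 },
  ],
};
console.log(roundTripViolations(beforeScore, afterScore)); // []
```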
Remaining follow-ups
- Gate the PDF/UA-1 claim. The post-processor currently stamps `<pdfuaid:part>1</pdfuaid:part>` in XMP unconditionally (see `injectXmpMetadata` in `pdf-accessibility-postprocessor.ts`). It should only be stamped when all 16 structural checks pass; otherwise downstream validators (veraPDF, Acrobat Preflight) will flag a false claim.
- Run veraPDF / PAC 2024 in CI on a representative output as a Matterhorn-complete external check. Our scorer is a subset.
- Add a WCAG 2.2 AA pass alongside 2.1 AA: backward-compatible, forward-looking for procurement language.
- Uncompress `/ObjStm` before scoring so tag presence isn't missed when writers use object-stream compression.
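One possible shape for the PDF/UA-1 gate is a predicate over a full-mode score result. This is a sketch under assumptions: `GateCheck`, `GateScore`, and `mayStampPdfUaClaim` are hypothetical names, and how the result would be passed into `injectXmpMetadata` is not specified here.

```typescript
// Assumed shapes, for illustration only.
interface GateCheck { id: string; passed: boolean; }
interface GateScore { mode: "quick" | "full"; checks: GateCheck[]; }

const FULL_MODE_CHECK_COUNT = 16;

// True only when a full-mode run executed all 16 checks and every one passed.
// A quick-mode result is never sufficient, because it skips Tier 1 and Tier 2.
function mayStampPdfUaClaim(result: GateScore): boolean {
  return (
    result.mode === "full" &&
    result.checks.length === FULL_MODE_CHECK_COUNT &&
    result.checks.every((c) => c.passed)
  );
}

// Placeholder check ids; the real ids are listed in the tier tables.
const allPassing: GateCheck[] = [];
for (let i = 1; i <= FULL_MODE_CHECK_COUNT; i++) {
  allPassing.push({ id: `check_${i}`, passed: true });
}

console.log(mayStampPdfUaClaim({ mode: "full", checks: allPassing })); // true
console.log(mayStampPdfUaClaim({ mode: "quick", checks: allPassing.slice(0, 6) })); // false
```

Even with such a gate, passing all 16 checks is evidence rather than proof of conformance, so an external validator run remains the stronger basis for the claim.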
4. Quick reference: file map
| Concern | File |
|---|---|
| Scoring logic (16 checks, deductions) | workers/api/src/services/pdf-accessibility-scorer.ts |
| Score-band thresholds + shared types | packages/shared/src/types.ts (getScoreBand, AccessibilityScore) |
| PDF fixups that raise the score | workers/api/src/services/pdf-accessibility-postprocessor.ts |
| Full-mode scoring call sites | routes/convert.ts:281, scheduler/chunk-scheduler.ts:806 |
| Quick-mode scoring call site (LTI scan) | routes/lti-course.ts:157 |
| Tests | workers/api/src/__tests__/services/pdf-accessibility-scorer.test.ts |