PDF Remediation Pipeline & Conformance Report

How the platform takes raw HTML / source documents, produces a tagged PDF, and reports independent and self-attested conformance against the relevant accessibility standards.

Companion to pdf-accessibility-score.md, which covers the user-facing 0–100 score in depth. This doc is the operator/engineer view of the whole pipeline.


1. Standards in scope

| Standard | Role | How we satisfy it |
|---|---|---|
| WCAG 2.1 AA | The legal baseline (ADA Title II 2026-07663, Section 508, EU EN 301 549) | Structural checks + content cleanup. Each check is mapped to the SC(s) it supports. |
| PDF/UA-1 (ISO 14289-1) | Standard implementation recipe for applying WCAG to PDFs | XMP pdfuaid:part="1" claim + structure tree + tags, validated by veraPDF |
| Matterhorn Protocol 1.1 | The 136-failure-condition checklist that operationalizes PDF/UA-1 | Machine-checkable conditions are caught by veraPDF; the ~20% human conditions (alt-text correctness, reading-order sense) are out of scope for any automated scorer |

Bottom line: WCAG 2.1 AA is what the law requires. PDF/UA-1 is how we satisfy it at the PDF format level. Matterhorn is the test plan that proves PDF/UA-1 conformance.


2. The pipeline (HTML → conformant PDF)

┌──────────────────────────────────────────────────────────────────────┐
│ source (PDF / DOCX / image / etc.)                                   │
│     │                                                                │
│     ▼                                                                │
│ conversion cascade ── produces semantic HTML                         │
│     │                                                                │
│     ▼                                                                │
│ weasyprint-generator.ts ── HTML → tagged PDF                         │
│     │                                                                │
│     ▼                                                                │
│ pdf-accessibility-postprocessor.ts ── 11 PDF/UA fixes                │
│     │                                                                │
│     ▼                                                                │
│ verapdf-client.ts ── ISO 14289-1 reference validator (soft-fail)     │
│     │                                                                │
│     ▼                                                                │
│ pdf-accessibility-scorer.ts ── 16 structural checks → 0–100          │
│     │                                                                │
│     ▼                                                                │
│ R2 / S3 + DB row (files.accessible_pdf_*)                            │
└──────────────────────────────────────────────────────────────────────┘

2.1 WeasyPrint (workers/api/src/services/weasyprint-generator.ts)

WeasyPrint is the only HTML-to-PDF engine on the market that produces correctly tagged output by default. Chromium’s tagged-PDF mode wraps unrecognised elements in <NonStruct>, which fails PDF/UA. WeasyPrint maps semantic HTML (h1–h6, p, ul, ol, table, figure, section, article, main, nav) to the correct PDF structure types directly.

What WeasyPrint doesn’t do:

  • Set /DisplayDocTitle in ViewerPreferences
  • Set /Tabs /S on every page (structure-based tab order)
  • Generate a document outline (bookmarks)
  • Inject XMP with the PDF/UA-1 claim
  • Mark out-of-tag artifacts (page numbers, headers/footers) explicitly
  • Add /Contents to link annotations
  • Inject /Alt on /Figure structure elements
  • Inject /Scope on /TH cells
  • Normalize bullet labels to a Unicode bullet

All of those land on the post-processor.

The WeasyPrint sidecar runs at weasyprint:5001 on the pdf-net Docker network and is built from services/weasyprint/.

2.2 Post-processor (pdf-accessibility-postprocessor.ts)

Eleven discrete fixes, all using pdf-lib to mutate the document object graph. Together they push the structural score from ~35% (raw WeasyPrint output) to 80%+. They mirror the fixups Adobe Acrobat’s accessibility preflight applies.

| # | Fix | Why it matters |
|---|---|---|
| 1 | /Title on document info dictionary | Required by PDF/UA; assistive tech reads it as the document name |
| 2 | /Lang on document catalog | Required by PDF/UA; sets pronunciation for screen readers |
| 3 | /DisplayDocTitle true in ViewerPreferences | Tells PDF readers to show the title instead of the filename |
| 4 | /Tabs /S on every page | Tab key follows structure order, not annotation order |
| 5 | XMP metadata (dc:title, dc:language, pdfuaid:part="1", producer) | The conformance claim itself; ISO 16684-1 metadata |
| 6 | Bookmarks from heading structure | Required for documents over a few pages (WCAG 2.4.5 Multiple Ways) |
| 7 | Mark untagged content as /Artifact | Page numbers, headers, footers stay out of reading order |
| 8 | Normalize list bullet labels | Replaces engine-specific glyphs with the Unicode bullet |
| 9 | /Contents on link annotations | Acrobat fixup #3; gives screen readers the link’s spoken name |
| 10 | /Alt on /Figure structure elements | Pulls alt text from source HTML and attaches it to the PDF figure |
| 11 | /Scope on /TH table headers | Marks each header as /Row or /Column (Matterhorn 15-003) |

Important caveat: the post-processor asserts PDF/UA-1 in XMP. It does not prove it. That’s veraPDF’s job (next step).
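As a rough illustration of fix 5, the sketch below assembles an XMP packet carrying the PDF/UA-1 claim. It is not the post-processor's actual code: the function name buildPdfUaXmp, the XmpFields shape, and the choice of element form for pdfuaid:part are assumptions made for this example. The namespace URIs are the standard Dublin Core, Adobe PDF, and PDF/UA identifier namespaces.

```typescript
// Hypothetical input shape; the real post-processor's options are not shown in this doc.
interface XmpFields {
  title: string;
  language: string; // BCP 47 tag, e.g. "en-US"
  producer: string;
}

// Escape the five XML special characters so a title like "Q&A <draft>" stays well-formed.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Build an XMP packet with dc:title, dc:language, pdf:Producer and the PDF/UA-1 claim.
function buildPdfUaXmp(fields: XmpFields): string {
  const title = escapeXml(fields.title);
  const language = escapeXml(fields.language);
  const producer = escapeXml(fields.producer);
  return [
    '<?xpacket begin="\uFEFF" id="W5M0MpCehiHzreSzNTczkc9d"?>',
    '<x:xmpmeta xmlns:x="adobe:ns:meta/">',
    ' <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">',
    '  <rdf:Description rdf:about=""',
    '    xmlns:dc="http://purl.org/dc/elements/1.1/"',
    '    xmlns:pdf="http://ns.adobe.com/pdf/1.3/"',
    '    xmlns:pdfuaid="http://www.aiim.org/pdfua/ns/id/">',
    `   <dc:title><rdf:Alt><rdf:li xml:lang="x-default">${title}</rdf:li></rdf:Alt></dc:title>`,
    `   <dc:language><rdf:Bag><rdf:li>${language}</rdf:li></rdf:Bag></dc:language>`,
    `   <pdf:Producer>${producer}</pdf:Producer>`,
    "   <pdfuaid:part>1</pdfuaid:part>",
    "  </rdf:Description>",
    " </rdf:RDF>",
    "</x:xmpmeta>",
    '<?xpacket end="w"?>',
  ].join("\n");
}
```

The packet only states the claim. Whether the bytes actually meet it is still decided by the validator, which is why the claim and the check are kept as separate pipeline steps.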

2.3 veraPDF (services/verapdf/, workers/api/src/services/verapdf-client.ts), issue #506

veraPDF is the open-source, industry-supported validator that is widely treated as the reference implementation for PDF/UA-1 (and PDF/A) validation. It runs every machine-checkable Matterhorn condition against the bytes and returns a JSON report.

Architecture:

  • Sidecar: services/verapdf/{Dockerfile, server.py}, a Flask process wrapping the verapdf CLI on top of the upstream verapdf/cli image. Listens on container-internal port 5002. Reachable from the API as http://verapdf:5002 over pdf-net.
  • Client: validatePdfUA1(pdf: Uint8Array): Promise<VerapdfResult>. Streams the PDF body to /validate?flavour=ua1, parses the report, returns { passed, failedRules[], durationMs }.
  • Wiring: called in all four export call sites between postProcessAccessiblePdf and scorePdfAccessibility: server.ts, index-aws.ts, routes/convert.ts, scheduler/chunk-scheduler.ts.

Soft-fail by design. A veraPDF outage, timeout, or unexpected report shape MUST NOT block PDF delivery. The catch logs [verapdf] validation skipped: … and the export proceeds. This is the launch policy (issue #506): promote to hard-fail only after the failure rate stabilizes near zero.
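The soft-fail policy can be sketched as a thin wrapper around the validator call. This is an illustration under assumptions, not the code at the four call sites: validateSoftFail, isVerapdfResult, and the injected Validator and Logger types are hypothetical names, and failedRules is typed loosely because its element shape is described later in this doc. The validator is passed in so the wrapper can be exercised without a running sidecar.

```typescript
// Result shape as described for validatePdfUA1 above.
interface VerapdfResult {
  passed: boolean;
  failedRules: unknown[];
  durationMs: number;
}

type Validator = (pdf: Uint8Array) => Promise<VerapdfResult>;
type Logger = (message: string) => void;

// Narrow an unknown value so an unexpected report shape is treated as a skip, not a crash.
function isVerapdfResult(value: unknown): value is VerapdfResult {
  if (typeof value !== "object" || value === null) return false;
  const r = value as Record<string, unknown>;
  return (
    typeof r.passed === "boolean" &&
    Array.isArray(r.failedRules) &&
    typeof r.durationMs === "number"
  );
}

// Run validation; on any error or malformed result, log and return null so the export continues.
async function validateSoftFail(
  pdf: Uint8Array,
  validate: Validator,
  log: Logger = (m) => console.warn(m),
): Promise<VerapdfResult | null> {
  try {
    const result: unknown = await validate(pdf);
    if (!isVerapdfResult(result)) {
      throw new Error("unexpected report shape");
    }
    return result;
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    log(`[verapdf] validation skipped: ${reason}`);
    return null;
  }
}
```

A null return maps naturally onto the "skipped" state: the caller carries on to scoring and simply has no veraPDF summary to persist.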

veraPDF only runs where the sidecar is reachable:

| Runtime | Reachable? | Behavior |
|---|---|---|
| Node fleet on 10.1.1.4 (api-node-1, api-node-2, batch-worker) | Yes, same Docker network | Validates and persists summary |
| AWS Lambda (index-aws.ts → api-pdf.theaccessible.org) | No | Soft-skip; no harm. Lambda almost never runs heavy export anyway |
| EC2 spot fleet | Currently desired=0 | When re-enabled, AMI compose file must include the verapdf service |

2.4 Structural scorer (pdf-accessibility-scorer.ts)

Runs 16 byte-level structural checks (no AI, no rendering). Each check is mapped to the WCAG 2.1 AA success criteria it supports. Scores 0–100; banded display. Full check list is documented in pdf-accessibility-score.md. The scorer is independent of veraPDF; the two cross-check each other.


3. What we test for, mapped to standards

3.1 Tier 0: Structural presence (always run, "quick" mode)

| Check | Asks | Fail mode |
|---|---|---|
| extractable_text | Does the PDF contain real text, not just rasterized images? | Scanned PDFs, image-only exports |
| tag_structure | Is there a /StructTreeRoot? | Untagged PDFs |
| image_alt_text | Do /Figure elements have /Alt? | Decorative-only PDFs, missing alt |
| document_language | /Lang on the catalog? | Missing language |
| document_title | /Title in the info dict? | Engine left it as "untitled" |
| table_headers | Are /TH elements present where /Table exists? | Tables that use /TD for headers |
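To make "byte-level, no rendering" concrete, here is a minimal sketch of three of the Tier 0 presence checks as pattern matches over the raw bytes. It is an assumption about the general technique, not a copy of pdf-accessibility-scorer.ts: quickPresenceChecks and PresenceCheck are hypothetical names, and the patterns are deliberately coarse (they do not distinguish a catalog /Lang from one on a structure element, and they cannot see keys stored inside compressed object streams).

```typescript
// Decode bytes as Latin-1 so every byte maps to exactly one character.
function latin1(bytes: Uint8Array): string {
  let out = "";
  for (let i = 0; i < bytes.length; i++) {
    out += String.fromCharCode(bytes[i]);
  }
  return out;
}

interface PresenceCheck {
  id: string;
  passed: boolean;
}

// Tier 0 style presence checks against uncompressed PDF syntax.
function quickPresenceChecks(pdf: Uint8Array): PresenceCheck[] {
  const text = latin1(pdf);
  return [
    // A structure tree root reference anywhere in the file.
    { id: "tag_structure", passed: /\/StructTreeRoot\b/.test(text) },
    // /Lang followed by a literal "(...)" or hex "<...>" string.
    { id: "document_language", passed: /\/Lang\s*[(<]/.test(text) },
    // /Title followed by a literal or hex string.
    { id: "document_title", passed: /\/Title\s*[(<]/.test(text) },
  ];
}
```

Checks of this kind are cheap enough to run on every export, which is the trade-off the scorer makes against the slower reference validation.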

3.2 Tier 1 + 2: Correctness + PDF/UA (full mode adds 10 more)

heading_hierarchy, reading_order, table_header_scope, list_structure, link_annotations, pdfua_identifier, tab_order, bookmarks, artifact_marking, display_doc_title.

See WCAG_CRITERIA_MAP in pdf-accessibility-scorer.ts for the SC mapping. See pdf-accessibility-score.md for the per-check rubric and deductions.

3.3 What veraPDF adds on top

veraPDF runs the machine-checkable subset of Matterhorn 1.1, about 80% of the 136 failure conditions. Examples our scorer doesn’t specifically check but veraPDF will, grouped by Matterhorn checkpoint:

  • Checkpoint 02 (Role Mapping): structure type mapping integrity (every used type must be defined in /RoleMap or be a standard type)
  • Checkpoint 26 (Security): document permissions don’t suppress assistive tech
  • Checkpoints 10 (Character Mappings) and 31 (Fonts): embedded font character mapping completeness
  • Checkpoint 28 (Annotations): rich annotation requirements
  • PDF version + structure tree consistency rules
  • ISO 32000-1 syntax conformance that PDF/UA inherits

Where the two disagree, veraPDF is authoritative.

3.4 What no automated tool can check

Roughly 20% of Matterhorn is "human-only", that is, judgement-based:

  • Is the alt text actually descriptive of the image’s meaning?
  • Does the reading order make sense to a human reader?
  • Are decorative images correctly marked decorative (not load-bearing ones marked decorative to silence the validator)?
  • Are equations spoken the way a sighted reader sees them?
  • Is the tagging of complex multi-column or multi-table layouts semantically right, even if structurally valid?

veraPDF flags these as "needs human review" rather than pass/fail. For documents that need to defend a conformance claim (e.g., legal filings, course materials shipped to LMS), a human pass is still required. The platform accelerates that pass; it doesn’t replace it.


4. Interpreting the conformance report

The dashboard accessibility report (apps/web preview / pipeline / report pages, plus the LTI course-scanner) shows:

  1. Score (0–100) + band, from the structural scorer
  2. 16 individual check results, each with pass/fail, deduction, and the WCAG SCs it maps to
  3. veraPDF summary (when available): passed/failed-rule count, plus the failed-rule details (clause, test number, description, occurrences)

4.1 The four interpretation cases, plus the skipped case

| Score | veraPDF | Meaning | Action |
|---|---|---|---|
| ≥ 80 | passed | Strong evidence of conformance. Both independent checks agree. | Ship. Manual sample review for high-stakes docs. |
| ≥ 80 | failed | Score over-reports. veraPDF found PDF/UA violations our scorer doesn’t catch (missing role-mapping, font-cmap issue, etc.). | Investigate failed rules. Often a post-processor bug or an upstream HTML quirk. |
| < 80 | passed | Rare. Usually means the post-processor did the structural minimum but the HTML lacked content (no alt text, no headings); veraPDF doesn’t grade content quality, just byte conformance. | Improve source HTML; rerun. |
| < 80 | failed | Both agree the document isn’t ready. Low-effort wins are usually in image_alt_text, heading_hierarchy, bookmarks. | Fix post-processor failures first; rerun veraPDF. |
| any | skipped | Sidecar unreachable or timed out. Score still valid. | Check [verapdf] log line; restart verapdf container if down. |
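The table above is effectively a lookup on two inputs, which the following sketch spells out. The function name interpretReport and the type names are hypothetical; the 80 boundary and the meaning/action wording are condensed from the table, and nothing here is taken from the dashboard code.

```typescript
type VerapdfOutcome = "passed" | "failed" | "skipped";

interface Interpretation {
  meaning: string;
  action: string;
}

// 80 is the score boundary used in the interpretation table.
const SCORE_THRESHOLD = 80;

// Map a (score, veraPDF outcome) pair to the matching row of the table.
function interpretReport(score: number, verapdf: VerapdfOutcome): Interpretation {
  if (verapdf === "skipped") {
    return {
      meaning: "Sidecar unreachable or timed out. Score still valid.",
      action: "Check the [verapdf] log line; restart the verapdf container if down.",
    };
  }
  const high = score >= SCORE_THRESHOLD;
  if (high && verapdf === "passed") {
    return {
      meaning: "Strong evidence of conformance. Both independent checks agree.",
      action: "Ship. Manual sample review for high-stakes docs.",
    };
  }
  if (high) {
    return {
      meaning: "Score over-reports. veraPDF found violations the scorer doesn't catch.",
      action: "Investigate failed rules.",
    };
  }
  if (verapdf === "passed") {
    return {
      meaning: "Structural minimum met, but the source HTML likely lacked content.",
      action: "Improve source HTML; rerun.",
    };
  }
  return {
    meaning: "Both checks agree the document isn't ready.",
    action: "Fix post-processor failures first; rerun veraPDF.",
  };
}
```

Handling "skipped" first keeps the score comparison out of the one case where the veraPDF column carries no verdict.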

4.2 Reading veraPDF failed rules

Each failed rule has:

  • clause: the spec section, e.g. ISO 14289-1
  • testNumber: Matterhorn test ID, e.g. 7.1-2
  • description: human-readable rule
  • occurrences: how many times this rule fired in the document

Map test numbers to Matterhorn checkpoints via the Matterhorn Protocol 1.1 PDF. The clause group (7.1, 7.18, etc.) matches the ISO 14289-1 section that defines the requirement.
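When a report has many failed rules, totalling occurrences per clause group is a quick way to see which ISO 14289-1 section dominates. The sketch below assumes the four fields listed above and that testNumber has the form "<section>-<n>" as in the 7.1-2 example; the helper names sectionOf and occurrencesBySection are hypothetical.

```typescript
// Field names follow the failed-rule description above.
interface FailedRule {
  clause: string;
  testNumber: string;
  description: string;
  occurrences: number;
}

// "7.1-2" -> "7.1"; a test number without a dash is returned unchanged.
function sectionOf(testNumber: string): string {
  const dash = testNumber.lastIndexOf("-");
  return dash === -1 ? testNumber : testNumber.slice(0, dash);
}

// Sum occurrences per clause group so the noisiest section stands out.
function occurrencesBySection(rules: FailedRule[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const rule of rules) {
    const section = sectionOf(rule.testNumber);
    totals.set(section, (totals.get(section) ?? 0) + rule.occurrences);
  }
  return totals;
}
```

A section with a high total usually points at one systematic cause (a single post-processor fix misfiring on every page, for example) rather than many unrelated defects.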

4.3 Headers exposed by HTTP exports

The two HTTP-streaming endpoints (/html-to-pdf on server.ts and on index-aws.ts) expose results as response headers, since there’s no DB row to attach to:

  • X-PDF-Accessibility-Score: 87
  • X-PDF-Accessibility-Band: good
  • X-PDF-Accessibility-Details: <base64 JSON of check array>
  • X-PDF-Verapdf-Passed: true|false (only when veraPDF ran)
  • X-PDF-Verapdf-Failed-Rules: 0 (only when veraPDF ran)
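A caller of the streaming endpoints might read these headers along the following lines. This is a sketch under assumptions: readAccessibilityHeaders, ExportAccessibility, and HeaderSource are hypothetical names, the details payload is left as unknown because its schema lives in pdf-accessibility-score.md, and the base64 value is assumed to be UTF-8 JSON. It relies on the atob and TextDecoder globals available in browsers and recent Node versions.

```typescript
interface ExportAccessibility {
  score: number;
  band: string;
  details: unknown;
  verapdf: { passed: boolean; failedRules: number } | null;
}

// Minimal accessor so the sketch works with fetch's Headers or a plain Map.
interface HeaderSource {
  get(name: string): string | null | undefined;
}

// Decode a base64 header value into parsed JSON, treating the bytes as UTF-8.
function decodeBase64Json(value: string): unknown {
  const binary = atob(value);
  const bytes = Uint8Array.from(binary, (ch) => ch.charCodeAt(0));
  return JSON.parse(new TextDecoder().decode(bytes));
}

// Returns null when the three score headers are not all present.
function readAccessibilityHeaders(headers: HeaderSource): ExportAccessibility | null {
  const score = headers.get("X-PDF-Accessibility-Score");
  const band = headers.get("X-PDF-Accessibility-Band");
  const details = headers.get("X-PDF-Accessibility-Details");
  if (!score || !band || !details) return null;

  const passed = headers.get("X-PDF-Verapdf-Passed");
  const failed = headers.get("X-PDF-Verapdf-Failed-Rules");
  return {
    score: Number(score),
    band,
    details: decodeBase64Json(details),
    // The veraPDF headers are only present when veraPDF ran.
    verapdf: passed ? { passed: passed === "true", failedRules: Number(failed ?? 0) } : null,
  };
}
```

With a fetch response this would be called as readAccessibilityHeaders(response.headers); a null verapdf field corresponds to the soft-skip case.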

4.4 Persisted columns

The async paths (routes/convert.ts, scheduler/chunk-scheduler.ts) write to public.files:

  • accessible_pdf_score: integer
  • accessible_pdf_score_details: JSONB array of check results
  • accessible_pdf_verapdf: JSONB { passed, failedRules[], durationMs }

These power the dashboard’s accessibility-report rendering.


5. Operational notes

5.1 When veraPDF goes down

Symptoms: [verapdf] validation skipped: … log lines on every export, accessible_pdf_verapdf stays NULL on new files. Score is unaffected; exports continue. Fix:

ssh -i ~/.ssh/nightly-audit larry@10.1.1.4
cd ~/accessible
docker compose ps verapdf
docker compose logs --tail 100 verapdf
docker compose restart verapdf # or: docker compose up -d --force-recreate verapdf

5.2 When the score and veraPDF disagree

The scorer is a fast, cheap heuristic: pattern matching against PDF bytes. veraPDF is the slow, correct reference validator for ISO 14289-1. When they disagree, the scorer is wrong (in the precision sense). Open an issue with both reports attached so we can either tighten the structural check or accept the divergence as a known limitation of byte-pattern scoring.

5.3 Promoting veraPDF from soft-fail to hard-fail

Per issue #506, the launch plan is:

  1. Ship soft-fail (current state)
  2. Watch the failure rate for ~1 week: both the veraPDF outage rate AND the PDF/UA failure rate among successful runs
  3. Once the failure-rate metric stabilizes near zero, promote to hard-fail by changing the soft-skip catch into an accessiblePdfStatus = 'failed' write
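Step 2 tracks two different rates with different denominators, which is easy to get wrong. The sketch below shows one way to compute them from a list of per-export outcomes. All names (RunOutcome, rolloutRates, readyForHardFail) and the idea of a single operator-chosen ceiling are assumptions for illustration; issue #506 does not define a numeric threshold in this doc.

```typescript
type RunOutcome = "passed" | "failed" | "skipped";

interface RolloutRates {
  validatedRuns: number;    // runs where veraPDF produced a verdict
  outageRate: number;       // skipped runs / all runs
  pdfUaFailureRate: number; // failed runs / validated runs
}

// Compute both rates; outages are excluded from the PDF/UA failure denominator.
function rolloutRates(outcomes: RunOutcome[]): RolloutRates {
  const total = outcomes.length;
  const skipped = outcomes.filter((o) => o === "skipped").length;
  const failed = outcomes.filter((o) => o === "failed").length;
  const validatedRuns = total - skipped;
  return {
    validatedRuns,
    outageRate: total === 0 ? 0 : skipped / total,
    pdfUaFailureRate: validatedRuns === 0 ? 0 : failed / validatedRuns,
  };
}

// Hypothetical gate: needs at least one validated run and both rates at or below the ceiling.
function readyForHardFail(rates: RolloutRates, ceiling: number): boolean {
  return (
    rates.validatedRuns > 0 &&
    rates.outageRate <= ceiling &&
    rates.pdfUaFailureRate <= ceiling
  );
}
```

Keeping the two rates separate matters because they call for different fixes: a high outage rate is an infrastructure problem, while a high PDF/UA failure rate is a post-processor or source-HTML problem.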

5.4 What the platform deliberately does NOT do

  • We don’t score the visual fidelity of the PDF here
  • We don’t run a Matterhorn human-pass; that’s a service offering, not an automated check
  • We don’t validate PDF/UA-2 (veraPDF can’t yet, and our XMP claims PDF/UA-1)
  • We don’t validate PDF/A flavours by default (the sidecar supports it, e.g. flavour=1b, but no caller enables that)

6. Source-of-truth pointers

| Topic | File |
|---|---|
| WeasyPrint client | workers/api/src/services/weasyprint-generator.ts |
| Post-processor | workers/api/src/services/pdf-accessibility-postprocessor.ts |
| Structural scorer | workers/api/src/services/pdf-accessibility-scorer.ts |
| veraPDF client | workers/api/src/services/verapdf-client.ts |
| veraPDF sidecar | services/verapdf/{Dockerfile,server.py,requirements.txt} |
| Score → WCAG mapping | WCAG_CRITERIA_MAP in scorer |
| Shared types | packages/shared/src/types.ts (AccessibilityScore, VerapdfScoreSummary) |
| Compose | docker-compose.yml (services weasyprint, verapdf) |
| DB columns | migration 20260403_059_accessible_pdf_export.sql (score) + 20260505_096_files_accessible_pdf_verapdf.sql (verapdf summary) |