Build "Accessible Forms": PDF-to-HTML Form Conversion Product
Overview
Build a new product called Accessible Forms within the existing accessible monorepo at /Users/larryanglin/Projects/accessible/. This product converts PDF forms (AcroForms and XFA) into accessible, functional HTML forms with WCAG 2.2 AA compliance. It lives at forms.theaccessible.org (standalone) and theaccessible.org/forms (marketing entry point).
The key innovation over the existing premium-form-converter is a hybrid approach: programmatic extraction of PDF form structure (field types, names, options, values, positions, validation rules) combined with vision-model refinement for layout and styling. This replaces the current 100%-vision approach that guesses form structure from pixels.
Architecture & Stack
Follow the exact same patterns as the existing apps (links, music, photos). This is a monorepo; do not create a separate repository.
New files/directories to create:

    apps/forms/                                        # Next.js 14 App Router frontend
    workers/api/src/routes/forms.ts                    # Cloudflare Worker API routes (light operations)
    workers/api/src/services/acroform-extractor.ts     # AcroForm field extraction
    workers/api/src/services/xfa-extractor.ts          # XFA form extraction
    workers/api/src/services/form-field-mapper.ts      # Map extracted fields → HTML
    workers/api/src/services/form-hybrid-converter.ts  # Hybrid pipeline orchestrator
    packages/shared/src/form-types.ts                  # Shared form domain types
    supabase/migrations/YYYYMMDD_forms_*.sql           # Database tables

Extend existing:

    workers/api/src/index.ts         # Mount new /api/forms/* routes
    workers/api/src/types/env.ts     # Add R2_FORMS_BUCKET binding
    workers/api/wrangler.toml        # Add R2 bucket binding
    packages/shared/src/index.ts     # Export new form types

Stack (matching existing products):
| Layer | Technology | Notes |
|---|---|---|
| Frontend | Next.js 14, App Router, TailwindCSS | apps/forms/ |
| UI Library | @anglinai/ui + @accessible-org/ui | CorporateHeader, CorporateFooter, ThemeProvider |
| Auth | Supabase Auth (same instance as other apps) | Google + email/password, shared auth-context.tsx pattern |
| Database | Supabase PostgreSQL (same instance) | New tables for form jobs, form field metadata |
| Light API | Cloudflare Workers (Hono) | Extend existing workers/api/: add routes/forms.ts |
| Heavy Processing | Existing Node.js worker | Extend with form-specific endpoints; already has Puppeteer, pdf-lib, unpdf |
| Storage | Cloudflare R2 | New bucket accessible-forms for uploaded PDFs + output HTML |
| AI/Vision | Claude API (Anthropic) | For vision refinement passes (2-3 iterations, not 8) |
| Payments | Stripe (existing credit system) | Same credit_balances / credit_transactions tables |
Phase 1: AcroForm Extractor (acroform-extractor.ts)
This is the foundational service. Build it first.
What it does:
Uses pdf-lib (already installed) to read the AcroForm dictionary from a PDF and extract structured metadata for every field.
Output type (FormField in packages/shared/src/form-types.ts):

    export interface FormField {
      /** Unique field name from PDF (e.g., "topmostSubform[0].Page1[0].f1_01[0]") */
      name: string;
      /** Human-readable alternate name / tooltip (from /TU entry) */
      alternativeName?: string;
      /** Field type */
      type: 'text' | 'checkbox' | 'radio' | 'dropdown' | 'listbox' | 'signature' | 'button' | 'barcode';
      /** Current value (pre-filled data) */
      value?: string | boolean | string[];
      /** Default value */
      defaultValue?: string | boolean | string[];
      /** For dropdowns/listboxes: available options */
      options?: { displayValue: string; exportValue: string }[];
      /** Bounding box in PDF coordinates [x1, y1, x2, y2] */
      rect: [number, number, number, number];
      /** 1-based page number */
      page: number;
      /** Tab order index (if specified in PDF) */
      tabIndex?: number;
      /** Validation constraints */
      validation: {
        required: boolean;
        readOnly: boolean;
        maxLength?: number;
        /** Format category from PDF actions (e.g., 'date', 'number', 'ssn', 'zip', 'phone', 'email') */
        formatType?: string;
        /** Raw format mask/pattern */
        formatMask?: string;
      };
      /** For radio buttons: the group name (all radios in group share this) */
      radioGroupName?: string;
      /** For radio buttons: this button's export value within the group */
      radioExportValue?: string;
      /** Font info from default appearance string */
      appearance?: {
        fontSize?: number;
        fontName?: string;
        textColor?: string;
        alignment?: 'left' | 'center' | 'right';
      };
      /** Calculation script (if field is calculated) */
      calculationScript?: string;
    }

    export interface FormExtractionResult {
      fields: FormField[];
      /** Total pages in the PDF */
      pageCount: number;
      /** Whether this PDF uses XFA (vs AcroForm) */
      isXFA: boolean;
      /** Page dimensions for coordinate mapping */
      pageDimensions: { page: number; width: number; height: number }[];
      /** Document-level metadata */
      metadata: {
        title?: string;
        author?: string;
        language?: string;
      };
      /** Warnings encountered during extraction */
      warnings: string[];
    }

Implementation notes:
- Use pdf-lib's PDFDocument.load() and traverse the AcroForm dictionary
- Access field widgets via doc.catalog.lookup(PDFName.of('AcroForm')) and iterate the /Fields array
- Each field's /FT (field type) maps to: /Tx → text, /Btn → checkbox/radio/button, /Ch → dropdown/listbox, /Sig → signature
- Distinguish checkbox vs radio via the /Ff flags (bit 16 = radio, bit 17 = pushbutton; PDF bit positions are 1-based, so bit 16 is the mask 1 << 15)
- Extract /Opt array for dropdown/listbox options (each entry may be a string or an [exportValue, displayValue] pair)
- Extract /V (current value), /DV (default value), /TU (tooltip/alt name), /Rect (position)
- Parse /AA (additional actions) for calculation and validation scripts
- Parse /DA (default appearance) for font/size/color
- Extract /MaxLen for text field max length
- Check /Ff flag bits: bit 1 = readOnly, bit 2 = required
- Handle field hierarchies (parent/child fields in the AcroForm tree): the fully qualified field name joins the partial names with dots
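The flag handling in the notes above can be sketched as a few pure functions. This is a minimal illustration, not the extractor itself: the function names (classifyField, decodeConstraints, normalizeOptions) are hypothetical, /FT is assumed to arrive as a bare name without the leading slash, and the pushbutton (bit 17) and combo (bit 18) flags come from the PDF specification rather than from the notes.

```typescript
// Subset of FormField.type that an AcroForm field can map to.
type AcroFieldType =
  | 'text' | 'checkbox' | 'radio' | 'dropdown' | 'listbox' | 'signature' | 'button';

// PDF flag positions are 1-based, so "bit n" is the mask 1 << (n - 1).
const bit = (position: number): number => 1 << (position - 1);

const FLAG_READ_ONLY = bit(1);
const FLAG_REQUIRED = bit(2);
const FLAG_RADIO = bit(16);
const FLAG_PUSHBUTTON = bit(17); // from the PDF spec, not the notes above
const FLAG_COMBO = bit(18);      // from the PDF spec: set means dropdown

// Hypothetical helper: map /FT plus /Ff to a field type.
function classifyField(ft: string, ff: number): AcroFieldType | undefined {
  switch (ft) {
    case 'Tx':
      return 'text';
    case 'Sig':
      return 'signature';
    case 'Ch':
      return (ff & FLAG_COMBO) !== 0 ? 'dropdown' : 'listbox';
    case 'Btn':
      if ((ff & FLAG_PUSHBUTTON) !== 0) return 'button';
      return (ff & FLAG_RADIO) !== 0 ? 'radio' : 'checkbox';
    default:
      return undefined; // unknown type: caller should record a warning
  }
}

// Hypothetical helper: decode the readOnly/required bits of /Ff.
function decodeConstraints(ff: number): { readOnly: boolean; required: boolean } {
  return {
    readOnly: (ff & FLAG_READ_ONLY) !== 0,
    required: (ff & FLAG_REQUIRED) !== 0,
  };
}

// /Opt entries are either a plain string or an [exportValue, displayValue] pair.
type RawOption = string | [string, string];

function normalizeOptions(opt: RawOption[]): { displayValue: string; exportValue: string }[] {
  return opt.map((entry) =>
    typeof entry === 'string'
      ? { displayValue: entry, exportValue: entry }
      : { exportValue: entry[0], displayValue: entry[1] },
  );
}
```

Keeping these helpers free of pdf-lib types means they can be unit-tested with plain numbers and strings, while the pdf-lib traversal code stays a thin layer on top.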
Test coverage:
- Write tests using realistic PDF form fixtures (create small test PDFs with pdf-lib that cover each field type)
- Test: text fields, checkboxes, radio groups, dropdowns with options, signature fields, required fields, read-only fields, pre-filled values, multi-page forms, nested field hierarchies
Phase 2: XFA Extractor (xfa-extractor.ts)
What it does:
Reads the XFA stream from the PDF catalog, parses the XML, and extracts field definitions into the same FormField[] structure.
Implementation notes:
- Check for /XFA key in the PDF catalog's AcroForm dictionary
- XFA data is stored as XML streams (may be segmented: template, datasets, config, localeSet)
- The template XML contains field definitions: <field>, <subform>, <draw>, <exclGroup> (radio groups)
- Parse with a fast XML parser (add fast-xml-parser as a dependency)
- Map XFA field types to our FormField.type:
  - <field> with <ui><textEdit> → text
  - <field> with <ui><checkButton> → checkbox
  - <field> with <ui><choiceList> → dropdown or listbox
  - <field> with <ui><dateTimeEdit> → text with formatType "date"
  - <field> with <ui><signature> → signature
  - <exclGroup> → radio group
- Extract <items> children for dropdown options
- Extract <validate> elements for validation rules
- Extract <calculate> elements for calculated fields
- Map XFA coordinate system to page coordinates using <contentArea> dimensions
- Handle dynamic XFA (growable subforms, repeatable rows): flag these in warnings since HTML can't fully replicate dynamic XFA behavior
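The type mapping listed above could look roughly like the following once the template XML has been parsed. This is a sketch under assumptions: mapXfaUi is a hypothetical helper that receives the name of the child element under <ui>, and the dropdown-versus-listbox decision uses the choiceList open attribute ("always" and "multiSelect" treated as listbox), which is my reading of the XFA specification and should be verified against real forms.

```typescript
// Subset of FormField.type reachable from a single XFA <field>.
type XfaMappedType = 'text' | 'checkbox' | 'dropdown' | 'listbox' | 'signature';

interface XfaUiMapping {
  type: XfaMappedType;
  formatType?: string;
}

// Hypothetical helper: uiElement is the tag name found inside <ui>,
// open is the optional "open" attribute of <choiceList>.
function mapXfaUi(uiElement: string, open?: string): XfaUiMapping | undefined {
  switch (uiElement) {
    case 'textEdit':
      return { type: 'text' };
    case 'checkButton':
      return { type: 'checkbox' };
    case 'dateTimeEdit':
      return { type: 'text', formatType: 'date' };
    case 'signature':
      return { type: 'signature' };
    case 'choiceList':
      // Assumption: an always-open or multi-select list renders as a listbox.
      return { type: open === 'always' || open === 'multiSelect' ? 'listbox' : 'dropdown' };
    default:
      return undefined; // unsupported widget: caller should add a warning
  }
}
```

Radio groups are not handled here because <exclGroup> wraps several <field> elements; the extractor would need to emit one radio FormField per child and share the group name.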
XFA detection in preflight:
Update pdf-preflight.ts to detect and flag XFA forms separately from AcroForms. Set isXFA: true in the extraction result.
Phase 3: Form Field Mapper (form-field-mapper.ts)
What it does:
Takes a FormField[] array and generates a skeleton HTML form with correct semantic elements, field types, attributes, groupings, and basic CSS positioning.
Output:
A complete <form> HTML string with:
- Proper <input>, <select>, <textarea> elements matching field types
- <label for="id"> associations (using alternativeName or name as label text)
- <fieldset> + <legend> wrapping radio/checkbox groups
- Pre-filled value, checked, selected attributes from extracted data
- required, readonly, maxlength, pattern, type (email/date/tel/number) from validation
- autocomplete attributes based on field name heuristics (name, email, phone, address, etc.)
- inputmode attributes for mobile keyboards
- tabindex matching PDF tab order
- CSS positioning derived from field rect coordinates mapped to relative page layout
- Signature fields rendered with a clear "Sign here" visual treatment and role="img" or canvas placeholder
- DOM order matching visual reading order (top-to-bottom, left-to-right within rows)
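As a rough illustration of the output described above, the following sketch renders one text field with its label. The names TextFieldSpec, escapeHtml, toId, and renderTextField are hypothetical, and only a few of the listed attributes are covered; the real mapper would handle every field type, grouping, and positioning.

```typescript
// Hypothetical, reduced view of a FormField for a single text input.
interface TextFieldSpec {
  name: string;
  label: string;
  value?: string;
  required?: boolean;
  readOnly?: boolean;
  maxLength?: number;
}

// Escape text so extracted PDF values cannot break out of attributes or markup.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Derive an id that is safe in HTML and CSS selectors from the PDF field name.
function toId(name: string): string {
  return 'f-' + name.replace(/[^A-Za-z0-9_-]+/g, '-');
}

function renderTextField(field: TextFieldSpec): string {
  const id = toId(field.name);
  const attrs = ['type="text"', `id="${id}"`, `name="${escapeHtml(field.name)}"`];
  if (field.value !== undefined) attrs.push(`value="${escapeHtml(field.value)}"`);
  if (field.maxLength !== undefined) attrs.push(`maxlength="${field.maxLength}"`);
  if (field.required) attrs.push('required');
  if (field.readOnly) attrs.push('readonly');
  // The label's "for" attribute must match the input id for assistive technology.
  return `<label for="${id}">${escapeHtml(field.label)}</label>\n<input ${attrs.join(' ')}>`;
}
```

Escaping every extracted string matters here because field names, tooltips, and pre-filled values all come from an untrusted uploaded PDF.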
Field name β label heuristic:
PDF field names are often cryptic (f1_01, topmostSubform[0].Page1[0].SSN[0]). Use the alternativeName (tooltip) first. If unavailable, apply heuristics:
- Strip topmostSubform[0].PageN[0]. prefixes
- Split camelCase/PascalCase names into space-separated words
- Strip trailing [0] array indices
- Flag fields with no usable label text; these will need vision-model label extraction
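The fallback heuristics could be sketched as below. This is one possible reading, not a definitive rule set: labelFromFieldName is a hypothetical name, and the test for "no usable label text" (one or two letters followed by digits, such as f1_01) is an assumption that would need tuning against real forms.

```typescript
// Returns a human-readable label, or undefined when the name carries no
// usable text and the field should be flagged for vision-model labelling.
function labelFromFieldName(name: string): string | undefined {
  const stripped = name
    // Strip the topmostSubform[0].PageN[0]. prefix.
    .replace(/^topmostSubform\[\d+\]\.(?:Page\d+\[\d+\]\.)?/, '')
    // Strip [0]-style array indices.
    .replace(/\[\d+\]/g, '');

  // Assumption: names like "f1_01" or "c2" are generated ids, not labels.
  if (stripped === '' || /^[a-z]{1,2}\d+(?:_\d+)*$/i.test(stripped)) {
    return undefined;
  }

  const words = stripped
    .replace(/[._]+/g, ' ')                     // separators become spaces
    .replace(/([a-z0-9])([A-Z])/g, '$1 $2')     // camelCase boundary
    .replace(/([A-Z]+)([A-Z][a-z])/g, '$1 $2')  // acronym followed by a word
    .trim();

  return words === '' ? undefined : words;
}
```

For example, labelFromFieldName('MailingAddress.ZipCode[0]') yields "Mailing Address Zip Code", while a name such as topmostSubform[0].Page1[0].f1_01[0] yields undefined and would be flagged.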
Coordinate mapping:
- PDF coordinates: origin at bottom-left, units in points (1/72 inch)
- HTML coordinates: origin at top-left, units in pixels
- Convert: htmlY = (pageHeight - pdfY) * scale, htmlX = pdfX * scale
- Group fields on the same horizontal band into flex rows
- Use relative positioning within a page container, not absolute positioning
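The conversion and row grouping above can be sketched as follows. The names Rect, HtmlBox, pdfRectToHtmlBox, and groupIntoRows are hypothetical, and the tolerance used to decide that two fields share a horizontal band is an assumed parameter, not something the spec defines.

```typescript
type Rect = [number, number, number, number]; // [x1, y1, x2, y2] in PDF points

interface HtmlBox {
  left: number;
  top: number;
  width: number;
  height: number;
}

// Flip the y axis (PDF origin is bottom-left, HTML origin is top-left) and scale.
function pdfRectToHtmlBox(rect: Rect, pageHeight: number, scale: number): HtmlBox {
  const [x1, y1, x2, y2] = rect;
  const left = Math.min(x1, x2);
  const right = Math.max(x1, x2);
  const bottom = Math.min(y1, y2);
  const topPdf = Math.max(y1, y2); // upper edge in PDF space
  return {
    left: left * scale,
    top: (pageHeight - topPdf) * scale,
    width: (right - left) * scale,
    height: (topPdf - bottom) * scale,
  };
}

// Group boxes whose top edges are within `tolerance` px of the row's first item.
function groupIntoRows<T extends { box: HtmlBox }>(items: T[], tolerance: number): T[][] {
  const sorted = [...items].sort(
    (a, b) => a.box.top - b.box.top || a.box.left - b.box.left,
  );
  const rows: T[][] = [];
  for (const item of sorted) {
    const row = rows[rows.length - 1];
    if (row && Math.abs(row[0].box.top - item.box.top) <= tolerance) {
      row.push(item);
    } else {
      rows.push([item]);
    }
  }
  // Within each row, order left-to-right so DOM order matches reading order.
  for (const row of rows) row.sort((a, b) => a.box.left - b.box.left);
  return rows;
}
```

Since a CSS pixel is 1/96 inch and a PDF point is 1/72 inch, a scale of 96 / 72 renders the page at its nominal size.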
Phase 4: Hybrid Converter (form-hybrid-converter.ts)
What it does:
Orchestrates the two-phase pipeline:
Phase A: Structural extraction (programmatic, fast, cheap):
- Run AcroForm or XFA extractor → get FormField[]
- Run form-field-mapper → generate skeleton HTML form
- This skeleton has correct field types, names, groups, options, values, and validation, but may have imperfect labels and layout
Phase B: Vision refinement (LLM, 1-3 iterations):
- Render skeleton HTML in browser → screenshot
- Send to Claude: [Original PDF] + [Screenshot] + [Skeleton HTML] + [Extracted FormField[] JSON]
- Prompt: "The skeleton HTML was generated from programmatic extraction. The field types, names, options, and values are correct. Your job is to: (a) fix label text using the PDF as reference, (b) adjust layout/alignment to match the PDF, (c) add section headings and visual structure, (d) improve CSS styling. Do NOT change field types, names, option lists, or values; those are authoritative."
- Iterate until NO_CHANGES_NEEDED or max 3 iterations
- Run axe-core WCAG 2.2 AA validation → remediation pass if needed
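The Phase B iteration rule (stop on NO_CHANGES_NEEDED or after three passes) can be sketched with the vision call injected as a function, which also makes the loop easy to test with a mock. RefineStep, RefineResult, and refineUntilStable are hypothetical names; the real step would render a screenshot and call the Claude API.

```typescript
interface RefineResult {
  html: string;
  noChangesNeeded: boolean; // true when the model answers NO_CHANGES_NEEDED
}

// One refinement pass: screenshot + model call in the real pipeline, a mock in tests.
type RefineStep = (html: string, iteration: number) => Promise<RefineResult>;

async function refineUntilStable(
  skeletonHtml: string,
  refine: RefineStep,
  maxIterations = 3,
): Promise<{ html: string; iterations: number }> {
  let html = skeletonHtml;
  let iterations = 0;
  while (iterations < maxIterations) {
    const result = await refine(html, iterations + 1);
    iterations += 1;
    if (result.noChangesNeeded) break; // keep the HTML from the previous pass
    html = result.html;
  }
  return { html, iterations };
}
```

Returning the iteration count alongside the HTML lines up with the refinement_iterations metric that the conversion job records.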
Key differences from existing premium-form-converter:
- Starts from a structurally correct skeleton, not raw converted HTML
- LLM only handles label text + layout + styling (not field type guessing)
- 3 iterations max instead of 8 (structure is already right)
- Passes extracted FormField[] JSON as context so the LLM knows what's authoritative
- Expected to be ~60-70% cheaper per form
Fork the existing code:
Copy premium-form-converter.ts as a starting point. Replace the iteration prompt with the hybrid-specific prompt described above. Keep the axe-core validation pass, progress callbacks, and cost tracking.
Phase 5: Database Schema
Create a migration file supabase/migrations/YYYYMMDD_forms_tables.sql:

    -- Form conversion jobs
    CREATE TABLE public.form_conversions (
      id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
      user_id UUID NOT NULL REFERENCES auth.users(id) ON DELETE CASCADE,
      original_name TEXT NOT NULL,
      file_size_bytes BIGINT NOT NULL,
      page_count INTEGER,
      field_count INTEGER,
      is_xfa BOOLEAN DEFAULT FALSE,

      -- R2 storage keys
      input_r2_key TEXT NOT NULL,
      skeleton_r2_key TEXT,
      output_r2_key TEXT,

      -- Status tracking
      status TEXT NOT NULL DEFAULT 'pending'
        CHECK (status IN ('pending', 'extracting', 'mapping', 'refining', 'validating', 'completed', 'failed')),
      progress INTEGER DEFAULT 0 CHECK (progress >= 0 AND progress <= 100),
      phase TEXT,
      error TEXT,

      -- Conversion metrics
      extraction_duration_ms INTEGER,
      refinement_iterations INTEGER,
      total_duration_ms INTEGER,
      input_tokens INTEGER,
      output_tokens INTEGER,
      estimated_cost_usd NUMERIC(10,6),
      credits_charged INTEGER,

      -- Quality metrics
      wcag_violations_found INTEGER,
      wcag_violations_fixed INTEGER,
      fields_extracted INTEGER,
      fields_in_output INTEGER,

      created_at TIMESTAMPTZ DEFAULT NOW(),
      completed_at TIMESTAMPTZ
    );

    -- Extracted form fields (for analytics and debugging)
    CREATE TABLE public.form_fields (
      id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
      conversion_id UUID NOT NULL REFERENCES public.form_conversions(id) ON DELETE CASCADE,
      field_name TEXT NOT NULL,
      field_type TEXT NOT NULL,
      page_number INTEGER NOT NULL,
      has_label BOOLEAN DEFAULT FALSE,
      has_value BOOLEAN DEFAULT FALSE,
      has_options BOOLEAN DEFAULT FALSE,
      option_count INTEGER DEFAULT 0,
      is_required BOOLEAN DEFAULT FALSE,
      is_readonly BOOLEAN DEFAULT FALSE,
      rect JSONB,
      created_at TIMESTAMPTZ DEFAULT NOW()
    );

    -- Indexes
    CREATE INDEX idx_form_conversions_user ON public.form_conversions(user_id, created_at DESC);
    CREATE INDEX idx_form_conversions_status ON public.form_conversions(status);
    CREATE INDEX idx_form_fields_conversion ON public.form_fields(conversion_id);

    -- RLS
    ALTER TABLE public.form_conversions ENABLE ROW LEVEL SECURITY;
    ALTER TABLE public.form_fields ENABLE ROW LEVEL SECURITY;

    CREATE POLICY "Users can view own conversions"
      ON public.form_conversions FOR SELECT
      USING (auth.uid() = user_id);
    CREATE POLICY "Users can insert own conversions"
      ON public.form_conversions FOR INSERT
      WITH CHECK (auth.uid() = user_id);
    CREATE POLICY "Users can update own conversions"
      ON public.form_conversions FOR UPDATE
      USING (auth.uid() = user_id);

    CREATE POLICY "Users can view own form fields"
      ON public.form_fields FOR SELECT
      USING (
        EXISTS (
          SELECT 1 FROM public.form_conversions fc
          WHERE fc.id = conversion_id AND fc.user_id = auth.uid()
        )
      );
    -- Service role handles inserts/updates to form_fields (from the worker)

Phase 6: API Routes (workers/api/src/routes/forms.ts)
Mount at /api/forms/* in the existing Hono worker.
Endpoints:

    POST   /api/forms/upload            Upload a PDF form → returns jobId
    GET    /api/forms/:jobId            Get conversion status + metadata
    GET    /api/forms/:jobId/download   Download converted HTML
    GET    /api/forms/:jobId/fields     Get extracted field metadata (for debugging/preview)
    DELETE /api/forms/:jobId            Delete a conversion and its R2 files
    GET    /api/forms/history           List user's past conversions (paginated)
    POST   /api/forms/:jobId/retry      Retry a failed conversion

Upload flow:
- Validate file (PDF, under 50MB, not encrypted)
- Run preflight to detect form type (AcroForm vs XFA) and field count
- Calculate credit cost: Math.ceil(pageCount * FORM_CREDIT_MULTIPLIER); define FORM_CREDIT_MULTIPLIER = 3 in shared constants (cheaper than premium-form's 8 because hybrid is more efficient)
- Check credit balance, deduct credits
- Store PDF in R2 at forms/{userId}/{jobId}/input.pdf
- Create form_conversions row with status "pending"
- Dispatch to Node worker for heavy processing (or use waitUntil for Cloudflare background)
- Return { jobId, status: 'pending', fieldCount, isXFA, creditsCharged }
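The credit step of the upload flow can be sketched as below. FORM_CREDIT_MULTIPLIER and the formula come from the flow above; calculateCreditCost and canAfford are hypothetical helper names, and the input validation is an added assumption.

```typescript
// From the spec: 3 credits per page for the hybrid pipeline.
const FORM_CREDIT_MULTIPLIER = 3;

// Hypothetical helper: credits to charge for a form with `pageCount` pages.
function calculateCreditCost(pageCount: number): number {
  if (!Number.isInteger(pageCount) || pageCount < 1) {
    throw new RangeError('pageCount must be a positive integer');
  }
  return Math.ceil(pageCount * FORM_CREDIT_MULTIPLIER);
}

// Hypothetical helper: check the balance before deducting anything.
function canAfford(balance: number, pageCount: number): boolean {
  return balance >= calculateCreditCost(pageCount);
}
```

Math.ceil only matters if the multiplier later becomes fractional; with the integer value 3 the cost is simply three credits per page.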
Processing pipeline (runs async):
- Status → "extracting": Run AcroForm or XFA extractor
- Status → "mapping": Run form-field-mapper to generate skeleton HTML
- Store skeleton in R2 at forms/{userId}/{jobId}/skeleton.html
- Status → "refining": Run hybrid converter (2-3 vision iterations)
- Status → "validating": Run axe-core WCAG validation + remediation
- Status → "completed": Store final HTML in R2 at forms/{userId}/{jobId}/output.html
Phase 7: Frontend (apps/forms/)
Structure:
Follow the exact same pattern as apps/music/ or apps/links/:

    apps/forms/
    ├── src/
    │   ├── app/
    │   │   ├── layout.tsx              # ThemeProvider, AuthProvider, CorporateHeader, CorporateFooter
    │   │   ├── page.tsx                # Landing page (marketing + upload CTA)
    │   │   ├── globals.css             # @anglinai/ui theme imports + Tailwind
    │   │   ├── auth/
    │   │   │   └── callback/route.ts   # Supabase auth callback
    │   │   ├── dashboard/
    │   │   │   ├── page.tsx            # List of past conversions
    │   │   │   └── [jobId]/
    │   │   │       ├── page.tsx        # Conversion detail + download
    │   │   │       └── preview/
    │   │   │           └── page.tsx    # Live preview of converted form
    │   │   ├── pricing/
    │   │   │   └── page.tsx            # Credit packages + pricing
    │   │   └── docs/
    │   │       └── page.tsx            # Documentation / how it works
    │   ├── components/
    │   │   ├── layout/
    │   │   │   ├── AppHeader.tsx       # CorporateHeader with forms nav links
    │   │   │   ├── SiteFooter.tsx      # CorporateFooter
    │   │   │   └── ServiceBanner.tsx
    │   │   ├── upload/
    │   │   │   ├── FormDropZone.tsx    # Drag-and-drop PDF upload
    │   │   │   └── UploadProgress.tsx
    │   │   ├── conversion/
    │   │   │   ├── ConversionStatus.tsx  # Real-time status with progress bar
    │   │   │   ├── FieldPreview.tsx      # Show extracted fields before conversion
    │   │   │   └── FormPreview.tsx       # Iframe preview of converted HTML
    │   │   └── dashboard/
    │   │       └── ConversionHistory.tsx # Table of past conversions
    │   ├── lib/
    │   │   ├── supabase.ts             # Supabase client (copy from music/links)
    │   │   ├── auth-context.tsx        # Auth context (copy from music/links)
    │   │   ├── api.ts                  # API client for /api/forms/*
    │   │   └── strings.ts              # i18n string keys
    │   ├── hooks/
    │   │   ├── useConversion.ts        # Poll conversion status
    │   │   └── useCredits.ts           # Credit balance hook
    │   ├── locales/
    │   │   └── en.json                 # All UI strings externalized
    │   └── __tests__/
    │       ├── components/
    │       ├── a11y/
    │       └── hooks/
    ├── public/
    │   ├── favicon.ico
    │   ├── favicon.svg
    │   └── site.webmanifest
    ├── tailwind.config.js              # @anglinai/ui preset + primary colors
    ├── next.config.js
    ├── tsconfig.json
    ├── package.json
    ├── vitest.config.ts
    └── playwright.config.ts

Landing page features:
- Hero: βConvert PDF Forms to Accessible HTMLβ with upload dropzone
- How it works: 3-step visual (Upload → Extract → Download)
- Feature highlights: AcroForm + XFA support, WCAG 2.2 AA, pre-filled data preservation, field validation
- Before/after comparison slider showing PDF → HTML form
- Pricing section (credit packages)
- FAQ section
Dashboard features:
- Table of past conversions with status, date, page count, field count
- Click to view details: extracted fields, download HTML, preview in iframe
- Upload new form button
Conversion detail page:
- Real-time progress indicator during conversion
- After completion: side-by-side preview (original PDF vs converted HTML)
- Download button for HTML output
- Field extraction summary (X text fields, Y checkboxes, Z dropdowns, etc.)
- WCAG compliance badge (pass/fail with details)
Phase 8: Form Submission & Data Export
The converted HTML form should be functional, not just visual. Add these capabilities to the output HTML:
Client-side (embedded in the HTML output):
- A <script> block at the bottom of the HTML that provides:
  - "Download as JSON" button: serializes all form field values to JSON and triggers download
  - "Download as CSV" button: serializes to CSV
  - "Print" button: triggers window.print() with print-optimized CSS
  - "Reset" button: clears all fields
- These scripts are self-contained (no external dependencies) so the HTML works as a standalone file
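The serialization core behind the CSV button might look like the sketch below. It covers only the data-to-text step, assuming the field values have already been collected into a plain object; the DOM reading and download trigger are omitted, and the names escapeCsvCell and toCsv, as well as joining multi-select values with "; ", are illustrative choices.

```typescript
type FieldValue = string | boolean | string[];

// Quote a cell when it contains a comma, quote, or line break; double inner quotes.
function escapeCsvCell(value: string): string {
  return /[",\r\n]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value;
}

// Produce a two-line CSV: a header row of field names and one row of values.
function toCsv(values: Record<string, FieldValue>): string {
  const names = Object.keys(values);
  const cells = names.map((name) => {
    const value = values[name];
    // Assumption: multi-select values are joined into one cell with "; ".
    return Array.isArray(value) ? value.join('; ') : String(value);
  });
  return [
    names.map(escapeCsvCell).join(','),
    cells.map(escapeCsvCell).join(','),
  ].join('\r\n');
}
```

The JSON export needs no custom code beyond JSON.stringify on the same object, which is why only the CSV path is sketched here.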
Optional webhook (future):
- Allow users to configure a <form action="https://..."> POST target
- Not in MVP; just the client-side export buttons
Cross-Cutting Requirements
Testing (80% coverage minimum):
- Unit tests for AcroForm extractor (test each field type, edge cases)
- Unit tests for XFA extractor (test XML parsing, field mapping)
- Unit tests for form-field-mapper (test HTML generation, label heuristics, coordinate mapping)
- Integration tests for hybrid converter (mock vision model, verify iteration loop)
- API route tests (upload, status polling, download, error handling)
- Frontend component tests (upload flow, status display, preview)
- Accessibility tests (axe-core in Vitest for all rendered components)
- E2E tests with Playwright (upload a PDF, wait for conversion, download result)
- Mobile tests (iPhone 14, iPad, Pixel 7 viewports)
Accessibility (WCAG 2.2 AA):
- The product UI itself must be fully accessible (not just the output)
- All form upload interactions keyboard-navigable
- Progress indicators announced to screen readers (role="progressbar", aria-live)
- Preview iframe has proper title attribute
- Skip links, focus management on route changes
- Color contrast AA on all text
i18n:
- All UI strings in locales/en.json
- Use next-intl or equivalent
- No hardcoded user-facing strings in components
Performance:
- File upload: stream to R2, don't buffer entire file in memory
- Status polling: use exponential backoff (1s → 2s → 4s → 8s, cap at 10s)
- Dashboard: paginate with cursor-based pagination for large histories
- Extracted fields: cache in Supabase, don't re-extract on every view
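The polling schedule above reduces to a one-line delay function. pollDelayMs is a hypothetical name; the base and cap defaults mirror the 1s start and 10s cap stated in the list.

```typescript
// Delay before poll number `attempt` (0-based): 1s, 2s, 4s, 8s, then capped at 10s.
function pollDelayMs(attempt: number, baseMs = 1000, capMs = 10000): number {
  return Math.min(baseMs * 2 ** attempt, capMs);
}
```

A hook such as useConversion could call this with an incrementing attempt counter and reset the counter whenever the job status changes.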
Security:
- Validate uploaded files are actually PDFs (magic bytes check)
- Enforce max file size (50MB)
- Rate limit uploads (10/minute per user)
- Sanitize output HTML (strip any residual <script> from LLM output, except our own export scripts)
- RLS on all database tables
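The magic-bytes check from the list above can be sketched as follows. looksLikePdf is a hypothetical name, and the check is deliberately strict: it requires the "%PDF-" header at offset 0.

```typescript
// "%PDF-" as bytes; every well-formed PDF header starts with this marker.
const PDF_MAGIC = [0x25, 0x50, 0x44, 0x46, 0x2d];

// Returns true when the buffer begins with the PDF header marker.
function looksLikePdf(bytes: Uint8Array): boolean {
  return (
    bytes.length >= PDF_MAGIC.length &&
    PDF_MAGIC.every((expected, index) => bytes[index] === expected)
  );
}
```

Only the first five bytes of the upload are needed, so this check can run on the first chunk of the stream before anything is written to R2. Some PDF readers tolerate a few bytes of junk before the header; whether to accept such files is a policy choice this sketch does not make.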
SEO & Meta:
- Landing page: unique title, description, OG tags
- sitemap.xml via Next.js app/sitemap.ts
- robots.txt (allow landing + docs, block dashboard)
- JSON-LD structured data on landing page
Deployment
DNS & Routing:
- forms.theaccessible.org → Cloudflare Pages (apps/forms)
- theaccessible.org/forms → redirect to forms.theaccessible.org (add to existing web app's next.config.js rewrites/redirects)
- API calls from the frontend go to the existing api-pdf.theaccessible.org worker at /api/forms/*
R2 Bucket:
- Create new bucket accessible-forms in Cloudflare
- Add binding R2_FORMS_BUCKET to workers/api/wrangler.toml
Build Counter:
Register the new app in the tagzen Supabase build_counters table:

    INSERT INTO public.build_counters (app_id, counter, prefix, description)
    VALUES ('accessible-forms', 0, '1.0.0', 'Accessible Forms - PDF to HTML form converter');

Environment Variables (apps/forms/.env.local):

    NEXT_PUBLIC_SUPABASE_URL=<same as other apps>
    NEXT_PUBLIC_SUPABASE_ANON_KEY=<same as other apps>
    NEXT_PUBLIC_API_URL=https://api-pdf.theaccessible.org
    NEXT_PUBLIC_APP_ENV=development

What NOT to build (out of scope for this prompt):
- PDF re-generation (filling converted HTML back into a PDF)
- Real-time collaborative form filling
- Form builder/designer UI
- Custom branding on output forms (beyond basic styling)
- Multi-language form conversion (translate field labels)
- OCR for scanned paper forms (the existing vision pipeline handles this)
Order of Operations
Build in this sequence; each phase depends on the previous:
- Shared types (packages/shared/src/form-types.ts): FormField, FormExtractionResult, etc.
- AcroForm extractor (workers/api/src/services/acroform-extractor.ts) + tests
- XFA extractor (workers/api/src/services/xfa-extractor.ts) + tests
- Form field mapper (workers/api/src/services/form-field-mapper.ts) + tests
- Database migration: form_conversions, form_fields tables
- API routes (workers/api/src/routes/forms.ts): upload, status, download, history
- Hybrid converter (workers/api/src/services/form-hybrid-converter.ts): fork premium-form-converter, integrate extractor + mapper
- Frontend app (apps/forms/): landing page, upload, dashboard, preview
- Data export scripts: embedded JSON/CSV/print in output HTML
- E2E tests: full upload-to-download flow
- Accessibility audit: axe-core on all pages, fix violations
- Mobile tests: Playwright device emulation
- Deploy: Cloudflare Pages, R2 bucket, DNS, build counter registration
Start with phases 1-4 (the extraction engine) since they're the foundation. The frontend and API can be built in parallel once the core services exist.