Data Privacy Tiers: Implementation Roadmap
The Data Privacy page at theaccessible.org/data-privacy describes five escalating levels of data control that customers can exercise. This document is the engineering plan to implement each tier.
Current State
| Tier | Feature | Status |
|---|---|---|
| 1 | Bring Your Own Storage (BYOS) | Not started |
| 2 | Public API for headless use | API exists, needs auth improvements |
| 3 | Customer AWS deployment (CDK) | CDK stacks exist, not customer-facing |
| 4 | Docker self-hosted | Docker Compose exists for dev, not productized |
| 5 | Bring Your Own LLM Keys (BYOK) | Provider abstraction exists, no user-facing config |
Tier 1: Bring Your Own Storage
Goal: Users configure their own S3-compatible storage. Processed files go to their bucket, not ours.
Backend Work
- Database schema: Add the `user_storage_configs` table:

  ```sql
  CREATE TABLE user_storage_configs (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id UUID NOT NULL REFERENCES auth.users(id) ON DELETE CASCADE,
    endpoint TEXT,                    -- S3-compatible endpoint (null for AWS S3)
    region TEXT DEFAULT 'us-east-1',
    bucket TEXT NOT NULL,
    access_key_id TEXT NOT NULL,      -- encrypted via pgcrypto
    secret_access_key TEXT NOT NULL,  -- encrypted via pgcrypto
    is_verified BOOLEAN DEFAULT FALSE,
    verified_at TIMESTAMPTZ,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW(),
    UNIQUE(user_id)
  );
  ```

- API routes in `workers/api/src/routes/user-storage.ts`:
  - `GET /api/user-storage`: get current config (redacted secret)
  - `PUT /api/user-storage`: create/update config
  - `DELETE /api/user-storage`: remove config
  - `POST /api/user-storage/verify`: test connection (put a test object, read it, delete it)
- Credential encryption: Use Supabase Vault or pgcrypto `pgp_sym_encrypt` with a server-side key stored in env vars. Never return raw secrets to the client.
- File pipeline modification: In the conversion pipeline (`workers/api/src/routes/convert.ts` and `workers/api/src/routes/files.ts`):
  - Before writing output, check whether the user has a storage config
  - If yes: write to the user's S3 bucket and store only a reference (bucket + key) in our DB
  - If no: write to our R2/S3 as today
  - On file download: proxy from the user's bucket or our bucket based on config
- Data deletion via `POST /api/user-storage/purge-our-data`:
  - Delete all files from our storage for this user
  - Keep only metadata/references
  - Irreversible: require confirmation
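The routing and redaction rules above can be sketched as two small pure functions. This is a minimal illustration, not the pipeline's actual code: the `UserStorageConfig` and `OutputTarget` types, both function names, and the `userId/fileId` key layout are hypothetical. It also assumes that an unverified config is ignored, which the plan does not state explicitly.

```typescript
// Hypothetical shape mirroring the user_storage_configs columns.
interface UserStorageConfig {
  endpoint: string | null; // null means AWS S3
  region: string;
  bucket: string;
  accessKeyId: string;
  secretAccessKey: string;
  isVerified: boolean;
}

// Where a converted file ends up: the user's bucket or the platform's storage.
type OutputTarget =
  | { kind: "user"; bucket: string; key: string }
  | { kind: "platform"; key: string };

// Decide the write target before output is stored.
// Assumption: only a verified config redirects output to the user's bucket.
function resolveOutputTarget(
  config: UserStorageConfig | null,
  userId: string,
  fileId: string,
): OutputTarget {
  const key = `${userId}/${fileId}`; // hypothetical key layout
  if (config !== null && config.isVerified) {
    return { kind: "user", bucket: config.bucket, key };
  }
  return { kind: "platform", key };
}

// Build the body for GET /api/user-storage: the raw secret is never returned.
function redactConfig(config: UserStorageConfig): UserStorageConfig {
  return { ...config, secretAccessKey: "********" };
}
```

In the user-bucket case only the `bucket` and `key` pair would be persisted in our DB, which is the reference the download proxy needs later.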
Frontend Work
- Settings page: Add a "Storage" tab to `apps/web/src/app/settings/page.tsx`:
  - Form fields: endpoint, region, bucket, access key, secret key
  - "Test Connection" button
  - Status indicator (verified/unverified)
  - "Delete My Data" button with confirmation modal
Estimated Effort: 5-7 days
Tier 2: API for Headless Use
Goal: External developers can use our API without our web UI. Documents go in and out via the API; nothing is stored unless explicitly requested.
Work Items
- API key management: Add to Settings page:
  - Generate/revoke API keys
  - Keys stored as hashed values in DB
  - Rate limiting per key (configurable by plan)
- Stateless conversion endpoint: Ensure `POST /api/convert` supports:
  - File upload via multipart form
  - Webhook callback URL for async results
  - Option to return output directly (sync) for small files
  - `X-No-Store: true` header to prevent any file persistence
- SDK / examples: Create code examples in:
  - Python (requests)
  - Node.js (fetch)
  - cURL
  - Published at `pdf.theaccessible.org/docs/api`
- OpenAPI spec updates: Ensure `public/openapi.yaml` documents all public endpoints with auth, request/response schemas, and error codes
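Per-key rate limiting with plan-configurable limits could look like the fixed-window sketch below. The plan names, the limits, and the `FixedWindowLimiter` class are hypothetical, and the in-memory `Map` is only for illustration: a deployment with several API workers would need a shared store instead.

```typescript
// Hypothetical plans and per-window request limits.
const PLAN_LIMITS = new Map<string, number>([
  ["free", 10],
  ["pro", 120],
]);

interface WindowState {
  windowStart: number; // ms timestamp when the current window opened
  count: number;       // requests seen in the current window
}

class FixedWindowLimiter {
  private readonly windows = new Map<string, WindowState>();
  private readonly windowMs: number;

  constructor(windowMs: number = 60_000) {
    this.windowMs = windowMs;
  }

  // Returns true and records the request if the key is under its plan limit.
  allow(keyId: string, plan: string, nowMs: number): boolean {
    const limit = PLAN_LIMITS.get(plan) ?? 0; // unknown plan: deny everything
    if (limit <= 0) return false;

    const state = this.windows.get(keyId);
    if (state === undefined || nowMs - state.windowStart >= this.windowMs) {
      // First request, or the previous window has elapsed: open a new window.
      this.windows.set(keyId, { windowStart: nowMs, count: 1 });
      return true;
    }
    if (state.count >= limit) return false;
    state.count += 1;
    return true;
  }
}
```

A fixed window is the simplest option and allows short bursts at window boundaries; a token bucket or sliding window would smooth those out at the cost of more state per key.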
Estimated Effort: 3-5 days
Tier 3: Customer AWS Deployment
Goal: Customer runs our full stack in their own AWS account. We provide CDK scripts and a coordination layer.
Work Items
- Parameterize CDK: Modify `infra/cdk/` to accept customer AWS account ID, region, and deployment preferences via a config file:

  ```json
  {
    "accountId": "123456789012",
    "region": "us-east-1",
    "vpcCidr": "10.0.0.0/16",
    "domainName": "accessible.customer.edu",
    "licenseKey": "lic_xxx"
  }
  ```

- Coordination layer: Lightweight Lambda that:
  - Validates license key on startup (calls our API)
  - Reports version info (opt-in telemetry)
  - Checks for available updates
  - Does NOT send any customer data
- Customer onboarding script `npx @accessible/deploy-aws`:
  - Prompts for config values
  - Validates AWS credentials
  - Runs CDK bootstrap + deploy
  - Outputs endpoint URLs and admin credentials
- IAM documentation: Minimum required IAM permissions for deployment:
  - CloudFormation, Lambda, API Gateway, S3, DynamoDB, SQS, EC2, VPC
  - Documented in `docs/admin/customer-aws-iam.md`
- Update mechanism `npx @accessible/update-aws`:
  - Runs CDK diff to show changes
  - Applies update with rollback on failure
  - Sends update status to coordination layer
- GovCloud support: Test and document deployment to AWS GovCloud regions
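The onboarding script would need to reject a malformed config file before running CDK. The sketch below validates the five fields from the example config; `DeployConfig` and `validateDeployConfig` are hypothetical names, the region and domain checks are loose format checks only, and the license key is only checked for presence because its format is not specified here. The /16 to /28 prefix range reflects the CIDR block sizes AWS accepts for a VPC.

```typescript
// Mirrors the fields of the example config file.
interface DeployConfig {
  accountId: string;
  region: string;
  vpcCidr: string;
  domainName: string;
  licenseKey: string;
}

// Returns a list of human-readable problems; an empty list means the config passed.
function validateDeployConfig(cfg: DeployConfig): string[] {
  const errors: string[] = [];

  // AWS account IDs are exactly 12 digits.
  if (!/^\d{12}$/.test(cfg.accountId)) {
    errors.push("accountId must be exactly 12 digits");
  }

  // Loose shape check, e.g. us-east-1 or us-gov-west-1; does not confirm the region exists.
  if (!/^[a-z]{2}(-[a-z]+)+-\d+$/.test(cfg.region)) {
    errors.push("region does not look like an AWS region name");
  }

  // IPv4 CIDR with octets <= 255 and a prefix length AWS allows for a VPC.
  const cidr = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\/(\d{1,2})$/.exec(cfg.vpcCidr);
  const octetsOk = cidr !== null && cidr.slice(1, 5).every((o) => Number(o) <= 255);
  const prefix = cidr !== null ? Number(cidr[5]) : NaN;
  if (!octetsOk || prefix < 16 || prefix > 28) {
    errors.push("vpcCidr must be an IPv4 CIDR with a prefix between /16 and /28");
  }

  // Basic hostname shape: dot-separated labels ending in an alphabetic TLD.
  if (!/^(?=.{1,253}$)([a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,}$/i.test(cfg.domainName)) {
    errors.push("domainName is not a valid hostname");
  }

  if (cfg.licenseKey.trim() === "") {
    errors.push("licenseKey is required");
  }

  return errors;
}
```

Returning every problem at once, rather than throwing on the first, lets the interactive prompt show the customer the full list in one pass.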
Estimated Effort: 20-30 days (see docs/admin/customer-aws-deployment.md for detailed breakdown)
Tier 4: Docker Self-Hosted
Goal: Customer runs everything on their own hardware with Docker Compose. Full air-gap capability.
Work Items
- Productize Docker Compose: Fork `docker-compose.yml` into `docker-compose.production.yml`:
  - Remove dev-only services (LocalStack, Supabase Studio)
  - Add healthchecks to all services
  - Volume mounts for persistent data
  - Environment variable documentation
- `accessible-server` CLI: Finish the management CLI (spec in `docs/admin/on-prem-deployment.md`):
  - `accessible-server setup`: interactive first-run configuration
  - `accessible-server start/stop/restart`: lifecycle management
  - `accessible-server upgrade`: pull new images, apply migrations, restart
  - `accessible-server backup`: database dump + config archive
  - `accessible-server restore`: restore from backup
  - `accessible-server health`: dashboard showing service status
  - `accessible-server logs`: tail aggregated logs
- Container registry: Publish images to `ghcr.io/anglinai/`:
  - `accessible-api` (API + workers)
  - `accessible-web` (Next.js static export + Caddy)
  - `accessible-weasyprint` (PDF engine)
  - `accessible-audiveris` (music OCR)
  - `accessible-marker` (PDF parser)
  - Semantic versioning + `latest` tag
  - Automated builds via GitHub Actions
- License validation: Offline-capable license system:
  - License key encodes: customer ID, expiry date, feature flags
  - Signed with our private key, verified locally with public key
  - Grace period for expired licenses (30 days warning, 60 days read-only)
  - No phone-home required (but optional for update checks)
- Quickstart: Single curl command to bootstrap:

  ```bash
  curl -sSL https://get.theaccessible.org | bash
  ```

  Downloads the CLI, pulls images, and runs `accessible-server setup`.

- Integration tests: CI job that:
  - Builds all images
  - Starts Docker Compose stack
  - Runs smoke tests (upload PDF, convert, download)
  - Validates all healthchecks pass
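The license grace period can be evaluated locally from the expiry date alone, which is what makes it air-gap friendly. The sketch below covers only that state calculation, not signature verification. It rests on one reading of "30 days warning, 60 days read-only" that should be confirmed: full function with a warning for the first 30 days after expiry, read-only until day 60, and locked afterwards. The `LicenseState` type and `licenseState` function are hypothetical names.

```typescript
type LicenseState = "active" | "warning" | "read-only" | "locked";

const DAY_MS = 24 * 60 * 60 * 1000;

// Map an expiry date and the current time to a license state.
// Assumed schedule: days 0-29 past expiry warn, days 30-59 are read-only, then locked.
function licenseState(expiresAt: Date, now: Date): LicenseState {
  const msPast = now.getTime() - expiresAt.getTime();
  if (msPast <= 0) return "active";

  const daysPast = Math.floor(msPast / DAY_MS); // whole days since expiry
  if (daysPast < 30) return "warning";
  if (daysPast < 60) return "read-only";
  return "locked";
}
```

Because the check relies on the host clock, an offline deployment could postpone expiry by setting the clock back; whether that matters is a policy decision outside this sketch.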
Estimated Effort: 15-20 days
Tier 5: Bring Your Own LLM Keys
Goal: Customers provide their own API keys for vision-capable AI models. AI traffic goes directly from their deployment to their provider.
Backend Work
- Database schema: Add the `user_ai_configs` table:

  ```sql
  CREATE TABLE user_ai_configs (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id UUID NOT NULL REFERENCES auth.users(id) ON DELETE CASCADE,
    provider TEXT NOT NULL CHECK (provider IN ('gemini', 'anthropic', 'openai')),
    api_key TEXT NOT NULL,  -- encrypted via pgcrypto
    is_verified BOOLEAN DEFAULT FALSE,
    verified_at TIMESTAMPTZ,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW(),
    UNIQUE(user_id, provider)
  );
  ```

- Key validation via `POST /api/user-ai/verify`:
  - Send a minimal test request to the provider (e.g., describe a 1x1 white pixel)
  - Confirm the key works and has vision capability
  - Store verification status
- Provider cascade modification: In `workers/api/src/utils/gemini-client.ts` and equivalent files:
  - Before using platform keys, check if user has their own key for that provider
  - User keys take priority in the cascade
  - If user key fails, fall back to next provider (user or platform)
  - Log which provider/key was used per conversion step
- Credit tracking: When a user's own key is used:
  - Do NOT debit platform credits for AI processing
  - Still charge for compute/infrastructure costs if applicable
  - Show "Your Key" badge in conversion reports
- Conversion report: Add provider attribution:
  - "Image analysis: Gemini (your key)"
  - "Text extraction: Claude (platform)"
  - Helps users verify their keys are being used
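The cascade, credit, and attribution rules above fit together as sketched below. This is an illustration under assumptions, not the existing client code: `Attempt`, `buildCascade`, `runCascade`, `debitsPlatformCredits`, and `attributionLabel` are hypothetical names, and the actual provider call is injected as a callback so no SDK API is assumed. Each provider contributes one attempt, using the user's key when present and the platform key otherwise, and a failure moves on to the next provider.

```typescript
type Provider = "gemini" | "anthropic" | "openai";
type KeySource = "user" | "platform";

interface Attempt {
  provider: Provider;
  source: KeySource;
  apiKey: string;
}

// One attempt per provider, in cascade order; a user key wins over a platform key.
function buildCascade(
  order: Provider[],
  userKeys: Partial<Record<Provider, string>>,
  platformKeys: Partial<Record<Provider, string>>,
): Attempt[] {
  const attempts: Attempt[] = [];
  for (const provider of order) {
    const userKey = userKeys[provider];
    const platformKey = platformKeys[provider];
    if (userKey !== undefined) {
      attempts.push({ provider, source: "user", apiKey: userKey });
    } else if (platformKey !== undefined) {
      attempts.push({ provider, source: "platform", apiKey: platformKey });
    }
  }
  return attempts;
}

// Try each attempt in order; `call` is the injected provider request.
// Returns the result plus the attempt that succeeded, so the caller can log it.
async function runCascade<T>(
  attempts: Attempt[],
  call: (attempt: Attempt) => Promise<T>,
): Promise<{ result: T; used: Attempt }> {
  let lastError: unknown = new Error("no provider keys configured");
  for (const attempt of attempts) {
    try {
      return { result: await call(attempt), used: attempt };
    } catch (err) {
      lastError = err; // fall through to the next provider
    }
  }
  throw lastError;
}

// Platform AI credits are debited only when a platform key did the work.
function debitsPlatformCredits(used: Attempt): boolean {
  return used.source === "platform";
}

// Report line such as "Image analysis: Gemini (your key)".
function attributionLabel(step: string, used: Attempt): string {
  const names: Record<Provider, string> = {
    gemini: "Gemini",
    anthropic: "Claude",
    openai: "GPT-4o",
  };
  const owner = used.source === "user" ? "your key" : "platform";
  return `${step}: ${names[used.provider]} (${owner})`;
}
```

Returning the `used` attempt from `runCascade` keeps logging, credit tracking, and report attribution driven by a single value per conversion step.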
Frontend Work
- Settings page: Add an "AI Providers" tab:
  - Card per provider (Gemini, Claude, GPT-4o) with:
    - API key input (masked)
    - "Verify Key" button
    - Status badge (verified/unverified/expired)
    - Last used timestamp
  - Explanation: "Provide at least one vision-capable AI key"
Estimated Effort: 5-7 days
Priority Order
| Priority | Tier | Rationale |
|---|---|---|
| 1 | Tier 2 (API) | Closest to done, highest demand from technical users |
| 2 | Tier 5 (BYOK AI) | High demand from privacy-sensitive orgs, moderate effort |
| 3 | Tier 1 (BYOS) | Important for enterprise, moderate effort |
| 4 | Tier 4 (Docker) | Foundation exists, requires productization |
| 5 | Tier 3 (AWS) | Largest effort, fewer customers, needs Tier 4 first |
Total Estimated Effort
- Tier 1: 5-7 days
- Tier 2: 3-5 days
- Tier 3: 20-30 days
- Tier 4: 15-20 days
- Tier 5: 5-7 days
- Total: 48-69 days (not including testing, documentation, and customer pilots)
Dependencies
- Tier 3 and Tier 4 share infrastructure work (container images, license system)
- Tier 5 requires the provider cascade refactor which benefits Tier 3 and 4
- Tier 1 and Tier 5 both require Settings page expansion, so they can share a UI framework
- All tiers need updates to the privacy policy page