
Data Privacy Tiers – Implementation Roadmap

The Data Privacy page at theaccessible.org/data-privacy describes five escalating levels of data control that customers can exercise. This document is the engineering plan to implement each tier.

Current State

| Tier | Feature | Status |
|------|---------|--------|
| 1 | Bring Your Own Storage (BYOS) | Not started |
| 2 | Public API for headless use | API exists, needs auth improvements |
| 3 | Customer AWS deployment (CDK) | CDK stacks exist, not customer-facing |
| 4 | Docker self-hosted | Docker Compose exists for dev, not productized |
| 5 | Bring Your Own LLM Keys (BYOK) | Provider abstraction exists, no user-facing config |

Tier 1: Bring Your Own Storage

Goal: Users configure their own S3-compatible storage. Processed files go to their bucket, not ours.

Backend Work

  1. Database schema – Add user_storage_configs table:

    CREATE TABLE user_storage_configs (
      id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
      user_id UUID NOT NULL REFERENCES auth.users(id) ON DELETE CASCADE,
      endpoint TEXT, -- S3-compatible endpoint (null for AWS S3)
      region TEXT DEFAULT 'us-east-1',
      bucket TEXT NOT NULL,
      access_key_id TEXT NOT NULL, -- encrypted via pgcrypto
      secret_access_key TEXT NOT NULL, -- encrypted via pgcrypto
      is_verified BOOLEAN DEFAULT FALSE,
      verified_at TIMESTAMPTZ,
      created_at TIMESTAMPTZ DEFAULT NOW(),
      updated_at TIMESTAMPTZ DEFAULT NOW(),
      UNIQUE(user_id)
    );
  2. API routes – workers/api/src/routes/user-storage.ts:

    • GET /api/user-storage – get current config (redacted secret)
    • PUT /api/user-storage – create/update config
    • DELETE /api/user-storage – remove config
    • POST /api/user-storage/verify – test connection (put a test object, read it back, delete it)
  3. Credential encryption – Use Supabase Vault or pgcrypto's pgp_sym_encrypt with a server-side key stored in env vars. Never return raw secrets to the client.

  4. File pipeline modification – In the conversion pipeline (workers/api/src/routes/convert.ts and workers/api/src/routes/files.ts):

    • Before writing output, check whether the user has a storage config
    • If yes: write to the user's S3 bucket and store only a reference (bucket + key) in our DB
    • If no: write to our R2/S3 as today
    • On file download: proxy from the user's bucket or ours, based on config
  5. Data deletion – POST /api/user-storage/purge-our-data:

    • Delete all files from our storage for this user
    • Keep only metadata/references
    • Irreversible – require confirmation

Frontend Work

  1. Settings page – Add a "Storage" tab to apps/web/src/app/settings/page.tsx:
    • Form fields: endpoint, region, bucket, access key, secret key
    • "Test Connection" button
    • Status indicator (verified/unverified)
    • "Delete My Data" button with confirmation modal

Estimated Effort: 5-7 days


Tier 2: API for Headless Use

Goal: External developers can use our API without our web UI. Documents go in and out via API – nothing is stored unless explicitly requested.

Work Items

  1. API key management – Add to Settings page:

    • Generate/revoke API keys
    • Keys stored as hashed values in the DB
    • Rate limiting per key (configurable by plan)
  2. Stateless conversion endpoint – Ensure POST /api/convert supports:

    • File upload via multipart form
    • Webhook callback URL for async results
    • Option to return output directly (sync) for small files
    • X-No-Store: true header to prevent any file persistence
  3. SDK / examples – Create code examples in:

    • Python (requests)
    • Node.js (fetch)
    • cURL
    • Published at pdf.theaccessible.org/docs/api
  4. OpenAPI spec updates – Ensure public/openapi.yaml documents all public endpoints with auth, request/response schemas, and error codes
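Item 1's "keys stored as hashed values" could work as follows. This is a sketch under assumptions: the `ak_` prefix, function names, and storage shape are illustrative, not the shipped scheme.

```typescript
import { createHash, randomBytes, timingSafeEqual } from "node:crypto";

// Hypothetical sketch of API-key issuance and verification for item 1.
// Only the SHA-256 hash is persisted, so a DB leak does not expose
// usable credentials; the plaintext key is shown to the user once.

function generateApiKey(): { key: string; hash: string } {
  const key = "ak_" + randomBytes(24).toString("hex"); // returned to the user
  const hash = createHash("sha256").update(key).digest("hex"); // stored in DB
  return { key, hash };
}

function verifyApiKey(presented: string, storedHash: string): boolean {
  const candidate = createHash("sha256").update(presented).digest();
  const stored = Buffer.from(storedHash, "hex");
  // Constant-time comparison avoids timing side channels on lookup.
  return candidate.length === stored.length && timingSafeEqual(candidate, stored);
}
```

A keyed hash (HMAC with a server secret) would be a reasonable hardening step if key entropy were ever user-controlled; with random 24-byte keys, plain SHA-256 is sufficient.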

Estimated Effort: 3-5 days


Tier 3: Customer AWS Deployment

Goal: Customer runs our full stack in their own AWS account. We provide CDK scripts and a coordination layer.

Work Items

  1. Parameterize CDK – Modify infra/cdk/ to accept the customer's AWS account ID, region, and deployment preferences via a config file:

    {
      "accountId": "123456789012",
      "region": "us-east-1",
      "vpcCidr": "10.0.0.0/16",
      "domainName": "accessible.customer.edu",
      "licenseKey": "lic_xxx"
    }
  2. Coordination layer – Lightweight Lambda that:

    • Validates the license key on startup (calls our API)
    • Reports version info (opt-in telemetry)
    • Checks for available updates
    • Does NOT send any customer data
  3. Customer onboarding script – npx @accessible/deploy-aws:

    • Prompts for config values
    • Validates AWS credentials
    • Runs CDK bootstrap + deploy
    • Outputs endpoint URLs and admin credentials
  4. IAM documentation – Minimum required IAM permissions for deployment:

    • CloudFormation, Lambda, API Gateway, S3, DynamoDB, SQS, EC2, VPC
    • Documented in docs/admin/customer-aws-iam.md
  5. Update mechanism – npx @accessible/update-aws:

    • Runs CDK diff to show pending changes
    • Applies the update, with rollback on failure
    • Sends update status to the coordination layer
  6. GovCloud support – Test and document deployment to AWS GovCloud regions
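The onboarding script (item 3) should fail fast on a bad config file before invoking CDK. A minimal validation sketch, following the sample JSON above; `validateDeployConfig` and the specific checks are assumptions, not the real script:

```typescript
// Hypothetical pre-flight validation of the deployment config (item 1),
// run by the onboarding script before `cdk bootstrap` / `cdk deploy`.

interface DeployConfig {
  accountId: string;
  region: string;
  vpcCidr: string;
  domainName: string;
  licenseKey: string;
}

function validateDeployConfig(raw: Partial<DeployConfig>): DeployConfig {
  if (!/^\d{12}$/.test(raw.accountId ?? "")) {
    throw new Error("accountId must be a 12-digit AWS account ID");
  }
  if (!raw.region) throw new Error("region is required");
  if (!/^\d{1,3}(\.\d{1,3}){3}\/\d{1,2}$/.test(raw.vpcCidr ?? "")) {
    throw new Error("vpcCidr must be CIDR notation, e.g. 10.0.0.0/16");
  }
  if (!raw.domainName) throw new Error("domainName is required");
  if (!(raw.licenseKey ?? "").startsWith("lic_")) {
    throw new Error("licenseKey must start with lic_");
  }
  return raw as DeployConfig;
}
```

Surfacing these errors before CDK runs keeps failure messages actionable for customers who have never used CloudFormation.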

Estimated Effort: 20-30 days (see docs/admin/customer-aws-deployment.md for detailed breakdown)


Tier 4: Docker Self-Hosted

Goal: Customer runs everything on their own hardware with Docker Compose. Full air-gap capability.

Work Items

  1. Productize Docker Compose – Fork docker-compose.yml into docker-compose.production.yml:

    • Remove dev-only services (LocalStack, Supabase Studio)
    • Add healthchecks to all services
    • Volume mounts for persistent data
    • Environment variable documentation
  2. accessible-server CLI – Finish the management CLI (spec in docs/admin/on-prem-deployment.md):

    • accessible-server setup – interactive first-run configuration
    • accessible-server start/stop/restart – lifecycle management
    • accessible-server upgrade – pull new images, apply migrations, restart
    • accessible-server backup – database dump + config archive
    • accessible-server restore – restore from backup
    • accessible-server health – dashboard showing service status
    • accessible-server logs – tail aggregated logs
  3. Container registry – Publish images to ghcr.io/anglinai/:

    • accessible-api (API + workers)
    • accessible-web (Next.js static export + Caddy)
    • accessible-weasyprint (PDF engine)
    • accessible-audiveris (music OCR)
    • accessible-marker (PDF parser)
    • Semantic versioning + latest tag
    • Automated builds via GitHub Actions
  4. License validation – Offline-capable license system:

    • License key encodes: customer ID, expiry date, feature flags
    • Signed with our private key, verified locally with the public key
    • Grace period for expired licenses (30 days of warnings, then 60 days read-only)
    • No phone-home required (but optional for update checks)
  5. Quickstart – Single curl command to bootstrap:

    curl -sSL https://get.theaccessible.org | bash

    Downloads the CLI, pulls images, and runs accessible-server setup

  6. Integration tests – CI job that:

    • Builds all images
    • Starts the Docker Compose stack
    • Runs smoke tests (upload PDF, convert, download)
    • Validates that all healthchecks pass

Estimated Effort: 15-20 days


Tier 5: Bring Your Own LLM Keys

Goal: Customers provide their own API keys for vision-capable AI models. AI traffic goes directly from their deployment to their provider.

Backend Work

  1. Database schema – Add user_ai_configs table:

    CREATE TABLE user_ai_configs (
      id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
      user_id UUID NOT NULL REFERENCES auth.users(id) ON DELETE CASCADE,
      provider TEXT NOT NULL CHECK (provider IN ('gemini', 'anthropic', 'openai')),
      api_key TEXT NOT NULL, -- encrypted via pgcrypto
      is_verified BOOLEAN DEFAULT FALSE,
      verified_at TIMESTAMPTZ,
      created_at TIMESTAMPTZ DEFAULT NOW(),
      updated_at TIMESTAMPTZ DEFAULT NOW(),
      UNIQUE(user_id, provider)
    );
  2. Key validation – POST /api/user-ai/verify:

    • Send a minimal test request to the provider (e.g., describe a 1x1 white pixel)
    • Confirm the key works and has vision capability
    • Store the verification status
  3. Provider cascade modification – In workers/api/src/utils/gemini-client.ts and equivalent files:

    • Before using platform keys, check whether the user has their own key for that provider
    • User keys take priority in the cascade
    • If a user key fails, fall back to the next provider (user or platform)
    • Log which provider/key was used per conversion step
  4. Credit tracking – When a user's own key is used:

    • Do NOT debit platform credits for AI processing
    • Still charge for compute/infrastructure costs if applicable
    • Show a "Your Key" badge in conversion reports
  5. Conversion report – Add provider attribution:

    • "Image analysis: Gemini (your key)"
    • "Text extraction: Claude (platform)"
    • Helps users verify their keys are being used

Frontend Work

  1. Settings page – Add an "AI Providers" tab:
    • Card per provider (Gemini, Claude, GPT-4o) with:
      • API key input (masked)
      • "Verify Key" button
      • Status badge (verified/unverified/expired)
      • Last used timestamp
    • Explanation: "Provide at least one vision-capable AI key"

Estimated Effort: 5-7 days


Priority Order

| Priority | Tier | Rationale |
|----------|------|-----------|
| 1 | Tier 2 (API) | Closest to done; highest demand from technical users |
| 2 | Tier 5 (BYOK AI) | High demand from privacy-sensitive orgs; moderate effort |
| 3 | Tier 1 (BYOS) | Important for enterprise; moderate effort |
| 4 | Tier 4 (Docker) | Foundation exists; requires productization |
| 5 | Tier 3 (AWS) | Largest effort, fewer customers; needs Tier 4 first |

Total Estimated Effort

  • Tier 1: 5-7 days
  • Tier 2: 3-5 days
  • Tier 3: 20-30 days
  • Tier 4: 15-20 days
  • Tier 5: 5-7 days
  • Total: 48-69 days (not including testing, documentation, and customer pilots)

Dependencies

  • Tier 3 and Tier 4 share infrastructure work (container images, license system)
  • Tier 5 requires the provider cascade refactor which benefits Tier 3 and 4
  • Tier 1 and Tier 5 both require Settings page expansion β€” can share a UI framework
  • All tiers need updates to the privacy policy page