
SPEC-089 v0.1 — Cloud Customer Onboarding and Auth System

Status: draft (pending Donna ratification; Texi architecture-approved via ADR-53/54/55)
Author: Donna (Engineering — lead, PO)
Architecture: Texi (ADR-53/54/55 + per-PR review)
Release Train: Samantha (smoke + rollback rehearsal + SLO)
Install: Lafonda (install-side env-block plumbing)
Operator: Frank (WorkOS account, secret provisioning, cutover sign-off)
Origin: Lola TaskAssigned 1e9da272-258d-48d2-854d-1f1670fef7b8 2026-05-06; engineering response in journal entry #90 (id 81b9b993-105a-4328-a1a5-63b4d333e069).
Research artifact: docs/research/auth-architecture-engineering-pov-2026-05-06.md (PR #193, superseded by this SPEC + the shipped PRs).

Summary

Ship a cloud-grade human auth system for Prism’s first hosted customer (Frank’s Ntecdev tenant) with provider-agnostic identity binding, parallel-provider runtime, and a clean cutover path between providers. WorkOS is the v0.1 default after evaluation against Auth0; GitHub OAuth remains side-by-side for backwards compatibility and easy rollback. Agent identity stays Prism-internal (api_keys / bcrypt) — the auth provider authenticates WHO (humans), Prism authorizes WHAT (capabilities, agent tokens, project access). This SPEC is the SOR for cloud auth mechanics. It pairs with SPEC-087 (cloud deployment) and references the three Auth ADRs (53/54/55) that lock the cross-cutting invariants.

Problem

Before this SPEC, Prism’s auth system had the following gaps for a hosted customer:
  1. Single hard-coded provider. The OAuthProvider Protocol existed in code (PR #194 era) but only GitHubProvider was implemented. provision_from_identity assumed GitHub’s external_id shape and stored github_user_id directly on users. Adding any second IdP would have hijacked existing rows.
  2. No tenant + identity isolation guarantees. Email-based auto-link logic in upsert_user would have silently rewritten an existing GitHub-bound user’s canonical identity tuple if a WorkOS login presented the same email — surfaced during PR #194 review (Texi rev1 blocking).
  3. No first-class provider routing. /api/v1/auth/github/* routes were hard-coded; no parametric mount for additional IdPs. No /whoami / /logout surface for smoke harness or future Tenant Console.
  4. No cloud cutover ergonomics. Render env mutation patterns weren’t established; no rollback rehearsal SLO; no operational guardrails against destructive config writes.

§1 — Goals & Non-Goals

Goals (v0.1)

  1. Provider-agnostic User identity binding via a child table (user_external_identities) so multiple IdPs coexist on one User row when account-linking lands in v0.2, without rewriting canonical identity in v0.1.
  2. WorkOS provider implementation (AuthKit / User Management API) parallel to GitHub OAuth, both routed via a parametric /auth/{provider}/{start,callback} surface.
  3. /auth/whoami + /auth/logout endpoints for smoke harness, future Tenant Console, and operator probes.
  4. End-to-end Tier-3 hybrid recall (vec + lex + graph + temporal) verified via a WorkOS-authed bearer.
  5. Production-grade Render cutover: parallel-provider mode in cloud, default-flip, rollback rehearsal with measured SLO.
  6. ADR-locked invariants: Authority Boundary, External Directory Mapping, Key Boundary.

Non-Goals (v0.1)

  1. Account linking. Same email across two providers creates two distinct User rows in v0.1. Verified-email-gated auto-attach is Phase 4 / v0.2 work.
  2. SCIM directory sync. ADR-54 locks the mapping shape; the implementation lands in v0.2.
  3. Tenant Console UI. Operator console at the Prism dashboard covers v0.1 needs.
  4. JWT/JWKS verification middleware. WorkOS AuthKit returns JSON over TLS, not id_tokens; middleware re-opens if SSO connections or JWT bearers ever land.
  5. Multi-region failover for the auth provider.

§2 — Architecture

§2.1 Authority Boundary (ADR-53)

Provider authenticates WHO; Prism authorizes WHAT.
The auth provider’s responsibility ends at producing an ExternalIdentity DTO (provider, subject_id, email, email_verified, display_name, optional orgs[]). Prism owns everything downstream: tenant ownership, org boundaries, project membership, agent identity, API key minting, capability tokens, x403 audit, governance lookup. This means the OAuth router NEVER consults provider claims for authorization decisions beyond the bearer’s bound User and tenant_id. The bearer-validation pipeline (require_auth) returns a Prism AuthContext whose tenant_id and user_id are Prism-issued — never the IdP’s subject_id.
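The boundary can be pictured as two small types: one that the provider fills in and one that Prism issues. The following is a minimal sketch, not the shipped code; the field names on ExternalIdentity follow the list above, while ExternalOrg.provider_org_id, build_auth_context, and the shape of user_row are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExternalOrg:
    """Transport-only view of a provider organization (not a Prism org)."""
    provider_org_id: str  # hypothetical field name


@dataclass(frozen=True)
class ExternalIdentity:
    """What the auth provider hands over: WHO, and nothing about WHAT."""
    provider: str
    subject_id: str
    email: str
    email_verified: bool
    display_name: Optional[str] = None
    orgs: tuple = ()  # tuple of ExternalOrg; optional


@dataclass(frozen=True)
class AuthContext:
    """What require_auth returns: Prism-issued identifiers only."""
    tenant_id: str
    user_id: str


def build_auth_context(user_row: dict) -> AuthContext:
    # Authorization inputs come from the Prism user row bound to the bearer,
    # never from provider claims such as subject_id or orgs.
    return AuthContext(tenant_id=user_row["tenant_id"], user_id=user_row["id"])
```

Keeping AuthContext free of any provider field makes it hard for downstream code to branch on IdP claims by accident.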

§2.2 External Directory Mapping (ADR-54)

Provider Organization → Prism tenants. Provider OrgMembership → Prism memberships. SCIM groups → tenant-scoped external group mappings (deferred). Prism orgs and projects stay canonical and are NOT auto-created from provider state.
The cardinality model: a single Prism tenant can hold multiple orgs, multiple projects, multiple users. A WorkOS Organization maps to exactly one Prism tenant. Conflating the two breaks the single-codebase parity invariant (ADR-021): LAN/personal modes have orgs without any provider concept, and the org_id is the canonical memory-namespace key per ADR-018.
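One way to read the cardinality rule is as a lookup keyed by the provider organization that refuses to point the same organization at two tenants. The sketch below is an in-memory illustration under that assumption; OrgTenantMap, bind, and tenant_for are hypothetical names, and nothing here creates Prism orgs or projects.

```python
class OrgTenantMap:
    """Maps (provider, provider organization id) to exactly one Prism tenant."""

    def __init__(self) -> None:
        self._tenant_by_org = {}  # (provider, provider_org_id) -> tenant_id

    def bind(self, provider: str, provider_org_id: str, tenant_id: str) -> None:
        key = (provider, provider_org_id)
        existing = self._tenant_by_org.setdefault(key, tenant_id)
        if existing != tenant_id:
            # A provider Organization may never be re-pointed at a second tenant.
            raise ValueError(
                f"{provider} organization {provider_org_id} is already bound to tenant {existing}"
            )

    def tenant_for(self, provider: str, provider_org_id: str):
        # Returns None for unknown organizations; nothing is auto-created.
        return self._tenant_by_org.get((provider, provider_org_id))
```

Binding the same pair twice is idempotent, which suits login-time provisioning where the same organization shows up on every exchange.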

§2.3 Key Boundary (ADR-55)

Agent / API keys are Prism-internal (bcrypt-hashed api_keys rows). OAuth state-token signing uses a dedicated oauth_state_secret, separate from authforge_secret, agent JWT signing material, capability-token signing, and x403 signing.
Six reasons the agent-key boundary doesn’t move (per journal #90 §c-1):
  1. Self-hosted parity (LAN must work without provider reachability).
  2. Latency (sub-ms bcrypt vs RTT to provider).
  3. Lifecycle velocity (mint/rotate happens on persona-create, daemon-restart).
  4. Cost (M2M-token pricing compounds at 100s of agents).
  5. Audit + scoping (api_keys already tracks tenant scope + last_used_at).
  6. No security benefit (bcrypt-on-disk is not weaker than provider-issued tokens).
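To make the split signing material concrete, here is a minimal sketch of an OAuth state token signed with a dedicated secret using HMAC-SHA256 from the standard library. The token layout, the claim names, and the 600-second maximum age are assumptions for illustration; the SPEC only fixes that oauth_state_secret is separate from every other signing key.

```python
import base64
import hashlib
import hmac
import json


def _b64(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).decode("ascii")


def sign_state(oauth_state_secret: bytes, provider: str, nonce: str, issued_at: int) -> str:
    """Return '<payload>.<mac>' where mac = HMAC-SHA256(secret, payload)."""
    payload = json.dumps({"p": provider, "n": nonce, "iat": issued_at}, sort_keys=True).encode()
    mac = hmac.new(oauth_state_secret, payload, hashlib.sha256).digest()
    return _b64(payload) + "." + _b64(mac)


def verify_state(oauth_state_secret: bytes, token: str, now: int, max_age_s: int = 600):
    """Return the claims dict, or None if the token is malformed, forged, or expired."""
    try:
        payload_b64, mac_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        mac = base64.urlsafe_b64decode(mac_b64)
    except ValueError:  # wrong part count or invalid base64
        return None
    expected = hmac.new(oauth_state_secret, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(mac, expected):  # constant-time comparison
        return None
    claims = json.loads(payload)
    if now - claims["iat"] > max_age_s:
        return None
    return claims
```

A token signed under one secret fails verification under any other, which is the property the key split relies on: leaking agent JWT material would not let an attacker forge state tokens, and vice versa.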

§2.4 Identity binding shape — user_external_identities

user_external_identities (
  id                PK
  user_id           FK → users.id
  provider          VARCHAR(32)    -- "github" | "workos" | "auth0"
  subject_id        VARCHAR(255)   -- IdP's stable subject claim
  email             VARCHAR(255)
  email_verified    BOOLEAN        -- propagated from IdP claim; sticky-True
  last_seen_at      TIMESTAMPTZ
  created_at, updated_at
  UNIQUE(provider, subject_id)
)
Sticky-True on email_verified: once an IdP has attested verification, a later login carrying email_verified=false (downgrade attempt, stale device claim, regression) MUST NOT flip the flag back. Phase-4 account-linking gates auto-attach on this field.
upsert_user keys on (provider, subject_id) via this child table; brand-new external identities create a fresh User row even when the email collides with an existing user. Account linking is deferred surface; v0.1 prefers two distinct Users to a silent identity hijack.
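The two rules above (keying on (provider, subject_id) and sticky-True verification) can be sketched with an in-memory store. This is an illustration only, not the real upsert_user: IdentityStore and IdentityRow are hypothetical names, and a dict stands in for the child table and its UNIQUE(provider, subject_id) constraint.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class IdentityRow:
    user_id: int
    provider: str
    subject_id: str
    email: str
    email_verified: bool


class IdentityStore:
    def __init__(self) -> None:
        self._rows = {}  # (provider, subject_id) -> IdentityRow
        self._next_user_id = count(1)

    def upsert_user(self, provider: str, subject_id: str, email: str,
                    email_verified: bool) -> IdentityRow:
        key = (provider, subject_id)
        row = self._rows.get(key)
        if row is None:
            # Brand-new external identity: always a fresh User, even if the
            # email collides with an existing user (no email-based auto-link).
            row = IdentityRow(next(self._next_user_id), provider, subject_id,
                              email, email_verified)
            self._rows[key] = row
            return row
        row.email = email
        # Sticky-True: a later False claim never downgrades an attested True.
        row.email_verified = row.email_verified or email_verified
        return row
```

In this sketch a GitHub login and a WorkOS login with the same email end up as two users, and a repeat WorkOS login carrying email_verified=false leaves the stored flag at True.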

§3 — Implementation Phases

Phase 1 — Provider abstraction polish

| Slice | Work | Vehicle |
|---|---|---|
| 1.1 | users.email UNIQUE dropped → ordinary index. user_external_identities child table created. Migration 041 backfills one row per existing github_user_id with email_verified=false (GitHub OAuth doesn’t expose verification state without an extra API hop). | PR #194 |
| 1.2 | prism_auth_provider setting ({github, workos, auth0} whitelist). WorkOS config block (workos_api_key, workos_client_id, workos_organization_id, workos_issuer). | PR #194 |
| 1.3 | Inline doc of provider-selection contract — captured in PR #194 commit message + this SPEC §2.4. | PR #194 |

Phase 2 — WorkOS provider implementation

| Slice | Work | Vehicle |
|---|---|---|
| 2.1 | WorkOSProvider calls the WorkOS AuthKit / User Management API directly via httpx — NOT bare OIDC; AuthKit’s /user_management/authenticate returns a single JSON payload (user + organization_id + access_token) rather than the discovery + token + userinfo round-trip pattern. We deliberately skip the workos Python SDK (sync-only; async-mismatched with FastAPI). The exchange’s organization_id is mirrored as an ExternalOrg entry purely as a transport DTO so provision_from_identity reuses its existing ident.orgs iteration; ExternalOrg is NOT a claim about Prism org authority mapping (ADR-54 reserves that — provider Organization → Prism tenant, not Prism org). | PR #195 |
| 2.2 | DEFERRED — JWT/JWKS verification middleware. WorkOS AuthKit’s /user_management/authenticate returns JSON over TLS, not id_tokens. Re-open if SSO connections or JWT bearers ever land. | n/a |
| 2.3 | Parametric /api/v1/auth/{provider}/{start,callback} router. New /whoami (returns User + external_identities[]) and /logout (revokes calling bearer’s api_key + invalidates auth_cache). _redirect_uri_for() uses Settings.oauth_redirect_uri for github (legacy env), request.base_url for others. | PR #195 + #203 |
| 2.4 | GitHubProvider contract tests via httpx.MockTransport seam. Symmetric coverage with WorkOSProvider (9 + 11 contract tests). Smoke harness scaffolds (workos / parallel_providers / rollback_rehearsal / recall_tier3). https-redirect_uri assertion on smoke harness (PR #204). | PR #195 + #196 + #199 + #204 |
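Slice 2.1 describes turning the single AuthKit JSON payload into an ExternalIdentity, with organization_id mirrored as a transport-only ExternalOrg. A minimal sketch of that mapping follows. The top-level keys user and organization_id come from the description above; the nested field names (id, email, email_verified, first_name, last_name) and the function name identity_from_authkit are assumptions and should be checked against the WorkOS API reference.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class ExternalOrg:
    provider_org_id: str  # transport DTO only; not a Prism org


@dataclass(frozen=True)
class ExternalIdentity:
    provider: str
    subject_id: str
    email: str
    email_verified: bool
    display_name: Optional[str] = None
    orgs: tuple = ()


def identity_from_authkit(payload: Mapping[str, Any]) -> ExternalIdentity:
    user = payload["user"]
    # Assumed name fields; join whichever parts are present.
    parts = [p for p in (user.get("first_name"), user.get("last_name")) if p]
    org_id = payload.get("organization_id")
    return ExternalIdentity(
        provider="workos",
        subject_id=user["id"],
        email=user["email"],
        email_verified=bool(user.get("email_verified", False)),
        display_name=" ".join(parts) or None,
        # Mirrored so provisioning can iterate ident.orgs; carries no authority.
        orgs=(ExternalOrg(org_id),) if org_id else (),
    )
```

The access_token in the payload is deliberately ignored here, since Prism mints its own bearer after provisioning.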
Cross-cutting Phase-2 fixes:
  • PR #197 — Redis pattern-subscriber publish-gating (PUBSUB NUMSUB for direct count) — fixes false pushed_to_ws classification when only pattern subscribers are attached.
  • PR #203 — uvicorn --proxy-headers --forwarded-allow-ips '*' so request.base_url returns https:// behind Render’s TLS edge. Fix for /workos/start emitting http:// redirect_uri (caught live during cutover).
  • PR #205 — heartbeat 410 auto-register loop fix + WS stream close on session-hash gone (Amanda mini1 PeerJoined storm root cause).
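The redirect-URI rule from slice 2.3 and the scheme check added after PR #203 can be sketched as two pure functions. This is an illustration under stated assumptions: redirect_uri_for and assert_https are hypothetical stand-ins for _redirect_uri_for() and the smoke-harness assertion, and the base URL is passed in as a plain string rather than read from a request object.

```python
from urllib.parse import urlsplit


def redirect_uri_for(provider: str, oauth_redirect_uri: str, base_url: str) -> str:
    """github keeps the legacy configured URI; other providers derive it from the base URL."""
    if provider == "github":
        return oauth_redirect_uri
    return f"{base_url.rstrip('/')}/api/v1/auth/{provider}/callback"


def assert_https(redirect_uri: str) -> None:
    """Fail fast when the proxy scheme was lost and the URI came out as http://."""
    scheme = urlsplit(redirect_uri).scheme
    if scheme != "https":
        raise ValueError(f"redirect_uri must be https, got {scheme!r}: {redirect_uri}")
```

Because the derived URI depends on what the server believes its own scheme is, the check guards against a regression in proxy-header handling rather than against a bad configuration value.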

Phase 3 — Cloud cutover

| Slice | Work | Vehicle |
|---|---|---|
| 3.1 | Stage WorkOS as parallel provider in cloud (prism_auth_provider=github initially). Effectively in place once Phase 2 deployed — both providers’ routes mount unconditionally. | live |
| 3.2 | End-to-end WorkOS-authed Tier-3 recall: project-create + journal write + POST /api/v1/memory/recall returns all 4 RRF legs (vec / lex / graph / temporal). Verified live 2026-05-07 00:09Z. | verified |
| 3.3 | Rollback rehearsal SLO measurement (env-flip → deploy.status='live' wall-clock; 360s on prism-server cutover, within 600s budget). Default flip to prism_auth_provider=workos (cutover deploy dep-d7tum2l7vvec73b9lung LIVE 2026-05-07 01:35Z). | live |
Cross-cutting Phase-3 fixes:
  • PR #207 — smoke_auth_rollback_rehearsal.sh rewrite: per-key PUT + count-preservation guard + SERVICE_ID self-resolve (postmortem 6b351dc6 — destructive list-PUT wiped prism-server env at 00:27Z; full autonomous recovery).
  • PR #208 — same script, deploy.status='live' as primary SLO + /start as post-live sanity (Samantha PR #207 review observation).
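Slice 3.2 checks that a recall response carries all four RRF legs. For readers unfamiliar with the term, the sketch below shows textbook reciprocal rank fusion plus a leg-presence check of the kind a smoke test might run. It is not Prism's recall code: the constant k=60 is the commonly cited default, and rrf_fuse, missing_legs, and the per-leg dict shape are hypothetical.

```python
from collections import defaultdict

REQUIRED_LEGS = ("vec", "lex", "graph", "temporal")


def rrf_fuse(legs: dict, k: int = 60) -> list:
    """Fuse per-leg rankings: score(d) = sum over legs of 1 / (k + rank of d in that leg)."""
    scores = defaultdict(float)
    for ranking in legs.values():
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def missing_legs(per_leg_scores: dict) -> list:
    """Return the required legs absent from a response's per-leg score map."""
    return [leg for leg in REQUIRED_LEGS if leg not in per_leg_scores]
```

A document ranked well by several legs outranks one that a single leg placed first, which is why a missing leg silently changes the results and is worth asserting on.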

Phase 4 — DEFERRED to v0.2

  • Tenant Console UI (customer-facing org/project/member admin).
  • Account-linking flow (verified-email-gated auto-attach).
  • SCIM directory sync (groups → tenant-scoped external group mappings).
  • WorkOS SSO connections per enterprise customer (per-connection cost accounting).

§4 — Risks

  1. Provider downtime → login outage. AuthKit availability gates new logins. JWKS-style caching is not applicable to AuthKit’s JSON response; existing bearers continue to work via the local bcrypt cache.
  2. Render env mutation pattern footgun. PUT /v1/services/{id}/env-vars (list-level) is full-replace, not patch. Postmortem 6b351dc6 documents the wipe + recovery; the smoke harness now uses per-key PUT + count-preservation guard. Any future tooling that touches Render env MUST follow the safe pattern.
  3. Cross-provider hijack via email. Closed by ADR-54 + the user_external_identities child-table shape: same email across providers yields two distinct User rows. Texi rev1 surfaced this during PR #194 review.
  4. State-token + agent JWT key conflation. Closed by ADR-55: split signing material. State tokens use oauth_state_secret; never shared.
  5. Reverse-proxy scheme drift. Render terminates TLS at the edge; without --proxy-headers, request.base_url returned http:// and WorkOS exact-match redirect-URI whitelist failed. Closed by PR #203 + smoke harness assertion (PR #204).
  6. Account-linking deferral. Operators with the same email at multiple IdPs have two User rows in v0.1. Documented in user-facing release notes; v0.2 brings verified-email-gated linking.

§5 — Test Strategy

Per-phase gates (all closed):
Phase 1 gate
  • ☑ pytest tests/test_user_provider_subject.py tests/test_config_auth_provider.py green.
  • ☑ Migration 041 applies + rolls back cleanly on staging-clone DB.
  • ☑ Existing GitHub OAuth flow unchanged in personal-mode + cloud staging.
Phase 2 gate
  • ☑ WorkOSProvider contract tests + GitHubProvider contract tests pass (httpx.MockTransport seam).
  • ☑ Provider-mock fixture parity across both providers.
  • ☑ Cloud staging: GitHub flow still works alongside WorkOS (parallel-provider mode).
Phase 3 gate
  • ☑ Tier-3 recall green via WorkOS-authed session (vec + lex + graph + temporal scores returned).
  • ☑ Rollback rehearsal SLO measured = 360s within 600s budget.
  • ☑ Tenant-isolation: WorkOS-authed user cannot recall a different tenant’s memories.

§6 — Operational Notes

§6.1 Env-var contract on prism-server (cloud)

Required (yaml-managed via render.yaml or dashboard):
  • PRISM_MODE=cloud
  • ENVIRONMENT=production
  • DATABASE_URL (fromDatabase prism-postgres)
  • PRISM_REDIS_URL (fromService prism-redis)
  • NEO4J_URI=bolt://prism-neo4j:7687, NEO4J_USERNAME, NEO4J_PASSWORD
  • GITHUB_CLIENT_ID, GITHUB_CLIENT_SECRET
  • OAUTH_REDIRECT_URI=https://<host>/api/v1/auth/github/callback
  • OAUTH_STATE_SECRET (32-byte random hex; never shared with agent JWT material)
  • WORKOS_API_KEY, WORKOS_CLIENT_ID, WORKOS_ORGANIZATION_ID, WORKOS_ISSUER
  • PRISM_AUTH_PROVIDER (canonical uppercase form of the prism_auth_provider setting; set to the current cutover-default provider, which is workos in v0.1)
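A startup check over the auth-related part of this contract might look like the sketch below. It is an assumption-laden illustration: validate_auth_env is a hypothetical helper, the provider whitelist is taken from slice 1.2, and "32-byte random hex" is read as 64 hexadecimal characters.

```python
import re
from typing import Mapping

ALLOWED_PROVIDERS = {"github", "workos", "auth0"}
REQUIRED_AUTH_KEYS = (
    "GITHUB_CLIENT_ID", "GITHUB_CLIENT_SECRET", "OAUTH_REDIRECT_URI",
    "OAUTH_STATE_SECRET", "WORKOS_API_KEY", "WORKOS_CLIENT_ID",
    "WORKOS_ORGANIZATION_ID", "WORKOS_ISSUER", "PRISM_AUTH_PROVIDER",
)


def validate_auth_env(env: Mapping[str, str]) -> list:
    """Return a list of human-readable problems; an empty list means the block looks sane."""
    problems = [f"missing {key}" for key in REQUIRED_AUTH_KEYS if not env.get(key)]
    provider = env.get("PRISM_AUTH_PROVIDER", "")
    if provider and provider not in ALLOWED_PROVIDERS:
        problems.append(f"PRISM_AUTH_PROVIDER={provider!r} is not in the whitelist")
    secret = env.get("OAUTH_STATE_SECRET", "")
    # Assumption: 32 random bytes, hex-encoded, gives 64 hex characters.
    if secret and not re.fullmatch(r"[0-9a-fA-F]{64}", secret):
        problems.append("OAUTH_STATE_SECRET is not 64 hex characters")
    redirect = env.get("OAUTH_REDIRECT_URI", "")
    if redirect and not redirect.startswith("https://"):
        problems.append("OAUTH_REDIRECT_URI must be https")
    return problems
```

In use, this would be called with os.environ at process start, and the process would refuse to boot when the list is non-empty.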

§6.2 Cutover & rollback

Default-flip is a single per-key PUT to PRISM_AUTH_PROVIDER. Render auto-redeploys; the new container picks up the new default at startup. Both provider routes remain mounted regardless, so existing bearers continue working uninterrupted. Rollback is the same flip in reverse; rehearsal SLO budget = 600s wall-clock from PUT to deploy.status='live'.
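The SLO measurement amounts to timing the interval between the PUT and the first observation of a live deploy. The sketch below shows that loop with the status source, clock, and sleep injected so it can be exercised without network access. measure_flip_slo is a hypothetical name, the non-live status strings in the usage example are placeholders, and a real script would also want to stop early on failure statuses.

```python
import time
from typing import Callable


def measure_flip_slo(
    fetch_status: Callable[[], str],
    budget_s: float = 600.0,
    poll_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Poll until the deploy reports 'live'; return elapsed seconds or raise on budget overrun."""
    start = clock()  # call this immediately after the per-key PUT returns
    while True:
        status = fetch_status()
        elapsed = clock() - start
        if status == "live":
            return elapsed
        if elapsed >= budget_s:
            raise TimeoutError(f"deploy not live after {elapsed:.0f}s (budget {budget_s:.0f}s)")
        sleep(poll_s)
```

Usage with a fake clock, which is also how the loop can be unit-tested:

```python
fake_now = [0.0]
statuses = iter(["build_in_progress", "update_in_progress", "live"])
elapsed = measure_flip_slo(
    lambda: next(statuses),
    poll_s=120.0,
    clock=lambda: fake_now[0],
    sleep=lambda s: fake_now.__setitem__(0, fake_now[0] + s),
)
```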

§6.3 Render API safe-pattern (mandatory)

Per-key PUT is the mandatory default for any Render env mutation. List-level PUT is destructive-by-design (full-replace) and is allowed only under explicit full-snapshot intent with paired guards. Any tooling that violates this is a postmortem-class defect — not a review nit.
Default (use this for >99% of cases):
  • PUT /v1/services/{id}/env-vars/{KEY} with body {"value":"..."} — upserts a single key in place. Non-destructive; idempotent; the only safe shape for cutover scripts, rotation tooling, and ad-hoc operator fixes.
Allowed exception — full-snapshot intent: PUT /v1/services/{id}/env-vars (list-level) is permitted only when ALL of the following hold:
  1. The caller has already GET-ed the full env-var list (snapshot taken in the SAME script run).
  2. The caller’s PUT body is the full snapshot with all keys preserved + the intended modifications applied in memory — never a single-element list.
  3. The caller has paired pre-call and post-call guards:
    • Count guard — assert post-call key count is greater than or equal to pre-call key count.
    • Key-set diff guard — assert the post-call key set is a superset of the pre-call key set, accounting for intentional adds/removes from the in-memory edits.
  4. The caller documents the full-snapshot intent in a code comment naming postmortem 6b351dc6 so the reader knows the safe-pattern context is acknowledged.
Forbidden (always): PUT /v1/services/{id}/env-vars with a single-element body ([{key, value}] or similar). This is the exact shape that wiped prism-server on 2026-05-07 00:27Z. Render API treats list-level PUT as full-replace, NOT patch. The smoke harness smoke_auth_rollback_rehearsal.sh exemplifies the safe pattern: per-key PUT for the flip and the restore, plus a count-preservation guard at step 7 that catches the bug class even if a future edit reverts the safe pattern.
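The safe pattern can be sketched in two parts: building the per-key PUT, and the paired guards that compare key sets taken before and after a mutation. This is a minimal illustration, not the contents of smoke_auth_rollback_rehearsal.sh. The function names are hypothetical, the base URL and bearer-token header are assumptions about the Render API, and the count guard is relaxed by the number of intentional removes so it does not contradict the key-set diff guard.

```python
import json
import urllib.request
from urllib.parse import quote

RENDER_API = "https://api.render.com/v1"  # assumed base URL


def per_key_put_request(service_id: str, key: str, value: str,
                        api_token: str) -> urllib.request.Request:
    """Build (but do not send) PUT /v1/services/{id}/env-vars/{KEY} with body {"value": ...}."""
    url = f"{RENDER_API}/services/{quote(service_id, safe='')}/env-vars/{quote(key, safe='')}"
    return urllib.request.Request(
        url,
        data=json.dumps({"value": value}).encode(),
        method="PUT",
        headers={"Authorization": f"Bearer {api_token}", "Content-Type": "application/json"},
    )


def check_env_guards(before: set, after: set,
                     intended_adds: frozenset = frozenset(),
                     intended_removes: frozenset = frozenset()) -> None:
    """Raise if a mutation lost keys it was not supposed to touch."""
    removed_on_purpose = before & intended_removes
    # Count guard: key count must not shrink beyond the intentional removes.
    if len(after) < len(before) - len(removed_on_purpose):
        raise RuntimeError(f"env-var count dropped from {len(before)} to {len(after)}")
    # Key-set diff guard: every expected key must still be present.
    lost = ((before | intended_adds) - intended_removes) - after
    if lost:
        raise RuntimeError(f"env-vars lost: {sorted(lost)}")
```

Running check_env_guards after every mutation, including per-key PUTs, is what lets the guard catch a future edit that reintroduces a list-level replace.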

§7 — References

Implements: Lola TaskAssigned 1e9da272; engineering plan in journal #90 (81b9b993-105a-4328-a1a5-63b4d333e069).
Depends on:
  • ADR-53 — Auth Authority Boundary
  • ADR-54 — External Directory Mapping
  • ADR-55 — Key Boundary
  • ADR-018 — RRF hybrid recall (org_id namespace)
  • ADR-021 — Two-bucket env resolution + cross-platform parity
  • SPEC-087 — Prism Cloud Deployment to Render
Related:
  • SPEC-085 v0.2 — Constitutional Governance Vocabulary (independent; namespace clarified)
  • SPEC-088 — Agent Preflight Discipline + Reflexive Rule Gates (governance-side companion)
Postmortems:
  • 6b351dc6 (2026-05-07) — smoke_auth_rollback_rehearsal.sh destructive PUT wiped prism-server env; autonomous recovery in 27 min; structural fix in PR #207 + #208.
Shipped PRs (chronological):
  • #192 — _ensure_default_org per Tenant on OAuth provision (cloud-mode regression)
  • #194 — Phase 1: UEI child table + auth-provider settings
  • #195 — Phase 2: WorkOSProvider + parametric router + /whoami + /logout
  • #196 — Phase 2 smoke harness scaffolds (Samantha)
  • #197 — Redis pattern-subscriber publish-gating fix (Texi)
  • #199 — Smoke harness payload-shape live-fix (Samantha)
  • #203 — uvicorn --proxy-headers for https redirect_uri behind Render TLS edge
  • #204 — Smoke harness https redirect_uri assertion (Samantha)
  • #205 — Heartbeat 410 stale-session resurrection loop fix (Texi)
  • #207 — Rollback rehearsal: per-key PUT + count-preservation guard + SERVICE_ID self-resolve
  • #208 — Rollback rehearsal: deploy.status='live' SLO + ?limit=100 fix
Superseded:
  • PR #193 — docs/research/auth-architecture-engineering-pov-2026-05-06.md. Substance absorbed into this SPEC, ADR-53/54/55, and the shipped PRs above. Closed without merging.
Last modified on June 11, 2026