SPEC-089 v0.1 — Cloud Customer Onboarding and Auth System
Status: draft (pending Donna ratification; Texi architecture-approved via ADR-53/54/55)
Author: Donna (Engineering — lead, PO)
Architecture: Texi (ADR-53/54/55 + per-PR review)
Release Train: Samantha (smoke + rollback rehearsal + SLO)
Install: Lafonda (install-side env-block plumbing)
Operator: Frank (WorkOS account, secret provisioning, cutover sign-off)
Origin: Lola TaskAssigned 1e9da272-258d-48d2-854d-1f1670fef7b8 2026-05-06; engineering response in journal entry #90 (id 81b9b993-105a-4328-a1a5-63b4d333e069).
Research artifact: docs/research/auth-architecture-engineering-pov-2026-05-06.md (PR #193, superseded by this SPEC + the shipped PRs).
Summary
Ship a cloud-grade human auth system for Prism’s first hosted customer (Frank’s Ntecdev tenant) with provider-agnostic identity binding, parallel-provider runtime, and a clean cutover path between providers. WorkOS is the v0.1 default after evaluation against Auth0; GitHub OAuth remains side-by-side for backwards compatibility and easy rollback. Agent identity stays Prism-internal (api_keys / bcrypt) — the auth provider authenticates WHO (humans), Prism authorizes WHAT (capabilities, agent tokens, project access).
This SPEC is the system of record (SOR) for cloud auth mechanics. It pairs with SPEC-087 (cloud deployment) and references the three Auth ADRs (53/54/55) that lock the cross-cutting invariants.
Problem
Before this SPEC, Prism’s auth system had the following gaps for a hosted customer:
- Single hard-coded provider. The OAuthProvider Protocol existed in code (PR #194 era) but only GitHubProvider was implemented. provision_from_identity assumed GitHub’s external_id shape and stored github_user_id directly on users. Adding any second IdP would have hijacked existing rows.
- No tenant + identity isolation guarantees. Email-based auto-link logic in upsert_user would have silently rewritten an existing GitHub-bound user’s canonical identity tuple if a WorkOS login presented the same email — surfaced during PR #194 review (Texi rev1 blocking).
- No first-class provider routing. /api/v1/auth/github/* routes were hard-coded; no parametric mount for additional IdPs. No /whoami / /logout surface for smoke harness or future Tenant Console.
- No cloud cutover ergonomics. Render env mutation patterns weren’t established; no rollback rehearsal SLO; no operational guardrails against destructive config writes.
§1 — Goals & Non-Goals
Goals (v0.1)
- Provider-agnostic User identity binding via a child table (user_external_identities) so multiple IdPs coexist on one User row when account-linking lands in v0.2, without rewriting canonical identity in v0.1.
- WorkOS provider implementation (AuthKit / User Management API) parallel to GitHub OAuth, both routed via a parametric /auth/{provider}/{start,callback} surface.
- /auth/whoami + /auth/logout endpoints for smoke harness, future Tenant Console, and operator probes.
- End-to-end Tier-3 hybrid recall (vec + lex + graph + temporal) verified via a WorkOS-authed bearer.
- Production-grade Render cutover: parallel-provider mode in cloud, default-flip, rollback rehearsal with measured SLO.
- ADR-locked invariants: Authority Boundary, External Directory Mapping, Key Boundary.
Non-Goals (v0.1)
- Account linking. Same email across two providers creates two distinct User rows in v0.1. Verified-email-gated auto-attach is Phase 4 / v0.2 work.
- SCIM directory sync. ADR-54 locks the mapping shape; the implementation lands in v0.2.
- Tenant Console UI. Operator console at the Prism dashboard covers v0.1 needs.
- JWT/JWKS verification middleware. WorkOS AuthKit returns JSON over TLS, not id_tokens; this item re-opens if SSO connections or JWT bearers ever land.
- Multi-region failover for the auth provider.
§2 — Architecture
§2.1 Authority Boundary (ADR-53)
Provider authenticates WHO; Prism authorizes WHAT.
The auth provider’s responsibility ends at producing an ExternalIdentity DTO (provider, subject_id, email, email_verified, display_name, optional orgs[]). Prism owns everything downstream: tenant ownership, org boundaries, project membership, agent identity, API key minting, capability tokens, x403 audit, governance lookup.
This means the OAuth router NEVER consults provider claims for authorization decisions beyond the bearer’s bound User and tenant_id. The bearer-validation pipeline (require_auth) returns a Prism AuthContext whose tenant_id and user_id are Prism-issued — never the IdP’s subject_id.
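A minimal sketch of the two shapes on either side of the boundary, assuming plain dataclasses. The field names on ExternalIdentity follow the DTO list above; ExternalOrg's fields and the build_auth_context helper are hypothetical and only illustrate that nothing provider-issued crosses into the AuthContext.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExternalOrg:
    """Transport-only view of a provider organization (fields are assumed)."""
    external_id: str
    name: Optional[str] = None


@dataclass(frozen=True)
class ExternalIdentity:
    """What the provider hands back: WHO the human is, nothing more."""
    provider: str
    subject_id: str
    email: str
    email_verified: bool
    display_name: Optional[str] = None
    orgs: tuple = ()  # tuple of ExternalOrg; optional per the DTO


@dataclass(frozen=True)
class AuthContext:
    """What bearer validation returns: Prism-issued identifiers only."""
    tenant_id: str
    user_id: str


def build_auth_context(prism_user_id: str, prism_tenant_id: str) -> AuthContext:
    # Hypothetical helper. It deliberately takes no ExternalIdentity argument:
    # once a bearer is validated, provider claims are out of the picture.
    return AuthContext(tenant_id=prism_tenant_id, user_id=prism_user_id)
```

Keeping the IdP's subject_id off AuthContext entirely makes it hard for downstream authorization code to key on a provider claim by accident.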
§2.2 External Directory Mapping (ADR-54)
Provider Organization → Prism tenants. Provider OrgMembership → Prism memberships. SCIM groups → tenant-scoped external group mappings (deferred). Prism orgs and projects stay canonical and are NOT auto-created from provider state.
The cardinality model: a single Prism tenant can hold multiple orgs, multiple projects, multiple users. A WorkOS Organization maps to exactly one Prism tenant. Conflating the two breaks the single-codebase parity invariant (ADR-021): LAN/personal modes have orgs without any provider concept, and the org_id is the canonical memory-namespace key per ADR-018.
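The "exactly one Prism tenant" rule can be sketched as an in-memory mapping that refuses to re-point a provider Organization. ProviderOrgMap and its method names are hypothetical; the real mapping presumably lives in the database, and whether one tenant may hold several provider Organizations is left open here.

```python
from typing import Dict, Optional, Tuple


class ProviderOrgMap:
    """Maps (provider, provider_org_id) to a Prism tenant_id.

    It never creates Prism orgs or projects; those stay canonical in Prism.
    """

    def __init__(self) -> None:
        self._tenant_by_org: Dict[Tuple[str, str], str] = {}

    def bind(self, provider: str, provider_org_id: str, tenant_id: str) -> None:
        key = (provider, provider_org_id)
        existing = self._tenant_by_org.get(key)
        if existing is not None and existing != tenant_id:
            # A provider Organization maps to exactly one tenant.
            raise ValueError(
                f"{provider}:{provider_org_id} is already bound to tenant {existing}"
            )
        self._tenant_by_org[key] = tenant_id

    def tenant_for(self, provider: str, provider_org_id: str) -> Optional[str]:
        return self._tenant_by_org.get((provider, provider_org_id))
```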
§2.3 Key Boundary (ADR-55)
Agent / API keys are Prism-internal (bcrypt-hashed api_keys rows). OAuth state-token signing uses a dedicated oauth_state_secret, separate from authforge_secret, agent JWT signing material, capability-token signing, and x403 signing.
Six reasons the agent-key boundary doesn’t move (per journal #90 §c-1):
- Self-hosted parity (LAN must work without provider reachability).
- Latency (sub-ms bcrypt vs RTT to provider).
- Lifecycle velocity (mint/rotate happens on persona-create, daemon-restart).
- Cost (M2M-token pricing compounds at 100s of agents).
- Audit + scoping (api_keys already tracks tenant scope + last_used_at).
- No security benefit (bcrypt-on-disk is not weaker than provider-issued tokens).
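To illustrate the split signing material, here is a sketch of an HMAC-signed OAuth state token that takes oauth_state_secret as its only key. The token layout (base64url body, dot, base64url signature), the exp claim, and the function names are assumptions for illustration; this SPEC does not define the wire format.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _unb64(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_state(payload: dict, oauth_state_secret: bytes,
               ttl_s: int = 600, now: Optional[float] = None) -> str:
    """Sign a short-lived state token with the dedicated state secret."""
    issued = time.time() if now is None else now
    claims = dict(payload, exp=issued + ttl_s)
    body = _b64(json.dumps(claims, sort_keys=True).encode("utf-8"))
    sig = hmac.new(oauth_state_secret, body.encode("ascii"), hashlib.sha256).digest()
    return f"{body}.{_b64(sig)}"


def verify_state(token: str, oauth_state_secret: bytes,
                 now: Optional[float] = None) -> Optional[dict]:
    """Return the claims if the signature and expiry check out, else None."""
    try:
        body, sig = token.split(".")
    except ValueError:
        return None
    expected = _b64(hmac.new(oauth_state_secret, body.encode(), hashlib.sha256).digest())
    # Constant-time comparison; a token signed with any other key fails here.
    if not hmac.compare_digest(sig.encode(), expected.encode()):
        return None
    claims = json.loads(_unb64(body))
    current = time.time() if now is None else now
    return claims if claims["exp"] >= current else None
```

A token signed with agent JWT material, or any secret other than oauth_state_secret, fails verification, which is the property ADR-55 is after.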
§2.4 Identity binding shape — user_external_identities
user_external_identities (
    id              PK
    user_id         FK → users.id
    provider        VARCHAR(32)   -- "github" | "workos" | "auth0"
    subject_id      VARCHAR(255)  -- IdP's stable subject claim
    email           VARCHAR(255)
    email_verified  BOOLEAN       -- propagated from IdP claim; sticky-True
    last_seen_at    TIMESTAMPTZ
    created_at, updated_at
    UNIQUE(provider, subject_id)
)
Sticky-True on email_verified: once an IdP has attested verification, a later login carrying email_verified=false (downgrade attempt, stale device claim, regression) MUST NOT flip the flag back. Phase-4 account-linking gates auto-attach on this field.
upsert_user keys on (provider, subject_id) via this child table; brand-new external identities create a fresh User row even when the email collides with an existing user. Account linking is a deferred surface; v0.1 prefers two distinct Users to a silent identity hijack.
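The keying and sticky-True rules can be sketched with an in-memory store. IdentityStore, IdentityRow, and is_verified are hypothetical stand-ins for the real tables; only the behavior is taken from this section: lookup is by (provider, subject_id), never by email, and email_verified never flips back to false.

```python
from dataclasses import dataclass
from itertools import count
from typing import Dict, Tuple


@dataclass
class IdentityRow:
    user_id: int
    provider: str
    subject_id: str
    email: str
    email_verified: bool


class IdentityStore:
    """In-memory stand-in for users + user_external_identities."""

    def __init__(self) -> None:
        self._rows: Dict[Tuple[str, str], IdentityRow] = {}
        self._user_ids = count(1)

    def upsert_user(self, provider: str, subject_id: str,
                    email: str, email_verified: bool) -> int:
        key = (provider, subject_id)
        row = self._rows.get(key)
        if row is None:
            # Brand-new external identity: always a fresh User, even if the
            # email collides with an existing one (no account linking in v0.1).
            row = IdentityRow(next(self._user_ids), provider, subject_id,
                              email, email_verified)
            self._rows[key] = row
            return row.user_id
        # Sticky-True: a later False claim must not downgrade the flag.
        row.email_verified = row.email_verified or email_verified
        return row.user_id

    def is_verified(self, provider: str, subject_id: str) -> bool:
        return self._rows[(provider, subject_id)].email_verified
```

For example, a GitHub login and a WorkOS login that share an email come back with two different user ids, and a WorkOS re-login carrying email_verified=false leaves the stored flag at true.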
§3 — Implementation Phases
Phase 1 — Provider abstraction polish
| Slice | Work | Vehicle |
|---|---|---|
| 1.1 | users.email UNIQUE dropped → ordinary index. user_external_identities child table created. Migration 041 backfills one row per existing github_user_id with email_verified=false (GitHub OAuth doesn’t expose verification state without an extra API hop). | PR #194 |
| 1.2 | prism_auth_provider setting ({github, workos, auth0} whitelist). WorkOS config block (workos_api_key, workos_client_id, workos_organization_id, workos_issuer). | PR #194 |
| 1.3 | Inline doc of provider-selection contract — captured in PR #194 commit message + this SPEC §2.4. | PR #194 |
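The whitelist in slice 1.2 amounts to a small validation step. The sketch below assumes the value is trimmed and lower-cased before the check; that normalization, and the names parse_auth_provider and ALLOWED_AUTH_PROVIDERS, are illustrative and not taken from PR #194.

```python
ALLOWED_AUTH_PROVIDERS = frozenset({"github", "workos", "auth0"})


def parse_auth_provider(raw: str) -> str:
    """Validate a prism_auth_provider value against the whitelist."""
    value = raw.strip().lower()  # assumed normalization
    if value not in ALLOWED_AUTH_PROVIDERS:
        allowed = ", ".join(sorted(ALLOWED_AUTH_PROVIDERS))
        raise ValueError(
            f"unsupported auth provider {raw!r}; expected one of: {allowed}"
        )
    return value
```

Failing at settings load, rather than at first login, keeps a typo in the env var from surfacing as a runtime auth error.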
Phase 2 — WorkOS provider implementation
| Slice | Work | Vehicle |
|---|---|---|
| 2.1 | WorkOSProvider calls the WorkOS AuthKit / User Management API directly via httpx — NOT bare OIDC; AuthKit’s /user_management/authenticate returns a single JSON payload (user + organization_id + access_token) rather than the discovery + token + userinfo round-trip pattern. We deliberately skip the workos Python SDK (sync-only; async-mismatched with FastAPI). The exchange’s organization_id is mirrored as an ExternalOrg entry purely as a transport DTO so provision_from_identity reuses its existing ident.orgs iteration; ExternalOrg is NOT a claim about Prism org authority mapping (ADR-54 reserves that — provider Organization → Prism tenant, not Prism org). | PR #195 |
| 2.2 | DEFERRED — JWT/JWKS verification middleware. WorkOS AuthKit’s /user_management/authenticate returns JSON over TLS, not id_tokens. Re-open if SSO connections or JWT bearers ever land. | n/a |
| 2.3 | Parametric /api/v1/auth/{provider}/{start,callback} router. New /whoami (returns User + external_identities[]) and /logout (revokes calling bearer’s api_key + invalidates auth_cache). _redirect_uri_for() uses Settings.oauth_redirect_uri for github (legacy env), request.base_url for others. | PR #195 + #203 |
| 2.4 | GitHubProvider contract tests via httpx.MockTransport seam. Symmetric coverage with WorkOSProvider (9 + 11 contract tests). Smoke harness scaffolds (workos / parallel_providers / rollback_rehearsal / recall_tier3). https-redirect_uri assertion on smoke harness (PR #204). | PR #195 + #196 + #199 + #204 |
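The redirect-URI rule from slice 2.3 and the https assertion from slice 2.4 can be sketched as two small functions. The signatures are assumptions, as is the fallback to base_url when the legacy GitHub setting is unset; the callback path follows the parametric route above.

```python
from typing import Optional
from urllib.parse import urlsplit


def redirect_uri_for(provider: str, base_url: str,
                     oauth_redirect_uri: Optional[str] = None) -> str:
    """Pick the redirect URI to send to the provider."""
    if provider == "github" and oauth_redirect_uri:
        # Legacy env: GitHub keeps using the explicitly configured URI.
        return oauth_redirect_uri
    # Everyone else derives the callback from the request's base URL.
    return f"{base_url.rstrip('/')}/api/v1/auth/{provider}/callback"


def is_https(uri: str) -> bool:
    """The check a smoke harness can run against an emitted redirect_uri."""
    return urlsplit(uri).scheme == "https"
```

Because the non-GitHub path depends on the request's base URL, it is only as correct as the scheme the server believes it is serving on.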
Cross-cutting Phase-2 fixes:
- PR #197 — Redis pattern-subscriber publish-gating (PUBSUB NUMSUB for direct count) — fixes false pushed_to_ws classification when only pattern subscribers are attached.
- PR #203 — uvicorn --proxy-headers --forwarded-allow-ips '*' so request.base_url returns https:// behind Render’s TLS edge. Fix for /workos/start emitting http:// redirect_uri (caught live during cutover).
- PR #205 — heartbeat 410 auto-register loop fix + WS stream close on session-hash gone (Amanda mini1 PeerJoined storm root cause).
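The PR #203 fix is easier to reason about with a simplified model of proxy-header handling. This is not uvicorn's code; effective_scheme and its parameters are hypothetical, and the model covers only the scheme decision: the forwarded protocol is honored when proxy headers are enabled and the connecting peer is trusted.

```python
from typing import Mapping


def effective_scheme(connection_scheme: str, headers: Mapping[str, str],
                     client_ip: str, proxy_headers: bool,
                     forwarded_allow_ips: str = "127.0.0.1") -> str:
    """Return the scheme the app should believe it is being served on."""
    if not proxy_headers:
        # Without proxy-header handling the app only sees the plain-HTTP hop
        # between the TLS edge and the container.
        return connection_scheme
    trusted_ips = {ip.strip() for ip in forwarded_allow_ips.split(",")}
    trusted = forwarded_allow_ips.strip() == "*" or client_ip in trusted_ips
    proto = headers.get("x-forwarded-proto", "").strip().lower()
    if trusted and proto in ("http", "https"):
        return proto
    return connection_scheme
```

With the edge sending x-forwarded-proto: https, the model returns http when proxy headers are off or the peer is untrusted, and https once both the flag and the allow-list are in place. That matches the http:// redirect_uri observed before the fix.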
Phase 3 — Cloud cutover
| Slice | Work | Vehicle |
|---|---|---|
| 3.1 | Stage WorkOS as parallel provider in cloud (prism_auth_provider=github initially). Effectively in place once Phase 2 deployed — both providers’ routes mount unconditionally. | live |
| 3.2 | End-to-end WorkOS-authed Tier-3 recall: project-create + journal write + POST /api/v1/memory/recall returns all 4 RRF legs (vec / lex / graph / temporal). Verified live 2026-05-07 00:09Z. | verified |
| 3.3 | Rollback rehearsal SLO measurement (env-flip → deploy.status='live' wall-clock; 360s on prism-server cutover, within 600s budget). Default flip to prism_auth_provider=workos (cutover deploy dep-d7tum2l7vvec73b9lung LIVE 2026-05-07 01:35Z). | live |
Cross-cutting Phase-3 fixes:
- PR #207 — smoke_auth_rollback_rehearsal.sh rewrite: per-key PUT + count-preservation guard + SERVICE_ID self-resolve (postmortem 6b351dc6 — destructive list-PUT wiped prism-server env at 00:27Z; full autonomous recovery).
- PR #208 — same script, deploy.status='live' as primary SLO + /start as post-live sanity (Samantha PR #207 review observation).
Phase 4 — DEFERRED to v0.2
- Tenant Console UI (customer-facing org/project/member admin).
- Account-linking flow (verified-email-gated auto-attach).
- SCIM directory sync (groups → tenant-scoped external group mappings).
- WorkOS SSO connections per enterprise customer (per-connection cost accounting).
§4 — Risks
- Provider downtime → login outage. AuthKit availability gates new logins. JWKS-style caching is not applicable to AuthKit’s JSON response; existing bearers continue to work via the local bcrypt cache.
- Render env mutation pattern footgun. PUT /v1/services/{id}/env-vars (list-level) is full-replace, not patch. Postmortem 6b351dc6 documents the wipe + recovery; the smoke harness now uses per-key PUT + count-preservation guard. Any future tooling that touches Render env MUST follow the safe pattern.
- Cross-provider hijack via email. Closed by ADR-54 + the user_external_identities child-table shape: same email across providers yields two distinct User rows. Texi rev1 surfaced this during PR #194 review.
- State-token + agent JWT key conflation. Closed by ADR-55: split signing material. State tokens use oauth_state_secret; never shared.
- Reverse-proxy scheme drift. Render terminates TLS at the edge; without --proxy-headers, request.base_url returned http:// and WorkOS exact-match redirect-URI whitelist failed. Closed by PR #203 + smoke harness assertion (PR #204).
- Account-linking deferral. Operators with the same email at multiple IdPs have two User rows in v0.1. Documented in user-facing release notes; v0.2 brings verified-email-gated linking.
§5 — Test Strategy
Per-phase gates (all closed):
Phase 1 gate
- ☑ pytest tests/test_user_provider_subject.py tests/test_config_auth_provider.py green.
- ☑ Migration 041 applies + rolls back cleanly on staging-clone DB.
- ☑ Existing GitHub OAuth flow unchanged in personal-mode + cloud staging.
Phase 2 gate
- ☑ WorkOSProvider contract tests + GitHubProvider contract tests pass (httpx.MockTransport seam).
- ☑ Provider-mock fixture parity across both providers.
- ☑ Cloud staging: GitHub flow still works alongside WorkOS (parallel-provider mode).
Phase 3 gate
- ☑ Tier-3 recall green via WorkOS-authed session (vec + lex + graph + temporal scores returned).
- ☑ Rollback rehearsal SLO measured = 360s within 600s budget.
- ☑ Tenant-isolation: WorkOS-authed user cannot recall a different tenant’s memories.
§6 — Operational Notes
§6.1 Env-var contract on prism-server (cloud)
Required (yaml-managed via render.yaml or dashboard):
PRISM_MODE=cloud
ENVIRONMENT=production
DATABASE_URL (fromDatabase prism-postgres)
PRISM_REDIS_URL (fromService prism-redis)
NEO4J_URI=bolt://prism-neo4j:7687, NEO4J_USERNAME, NEO4J_PASSWORD
GITHUB_CLIENT_ID, GITHUB_CLIENT_SECRET
OAUTH_REDIRECT_URI=https://<host>/api/v1/auth/github/callback
OAUTH_STATE_SECRET (32-byte random hex; never shared with agent JWT material)
WORKOS_API_KEY, WORKOS_CLIENT_ID, WORKOS_ORGANIZATION_ID, WORKOS_ISSUER
PRISM_AUTH_PROVIDER (canonical uppercase env form of the prism_auth_provider setting; selects the default provider; v0.1 cutover default = workos)
§6.2 Cutover & rollback
Default-flip is a single per-key PUT to PRISM_AUTH_PROVIDER. Render auto-redeploys; the new container picks up the new default at startup. Both provider routes remain mounted regardless, so existing bearers continue working uninterrupted. Rollback is the same flip in reverse; rehearsal SLO budget = 600s wall-clock from PUT to deploy.status='live'.
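The SLO measurement reduces to polling the deploy status and timing the wait against the budget. In the sketch below, get_status is a caller-supplied callable that returns the current deploy status string (how it is fetched is out of scope here). The 5-second poll interval and the name wait_until_live are assumptions; the 'live' status and the 600s budget come from this section.

```python
import time
from typing import Callable


def wait_until_live(get_status: Callable[[], str],
                    budget_s: float = 600.0,
                    poll_s: float = 5.0,
                    clock: Callable[[], float] = time.monotonic,
                    sleep: Callable[[float], None] = time.sleep) -> float:
    """Poll until the deploy reports 'live'; return elapsed wall-clock seconds.

    Start the timer right after the per-key PUT so the number matches the
    PUT-to-live definition of the SLO.
    """
    started = clock()
    while True:
        status = get_status()
        elapsed = clock() - started
        if status == "live":
            return elapsed
        if elapsed + poll_s > budget_s:
            # The next poll would land past the budget: fail the rehearsal.
            raise TimeoutError(
                f"deploy not live within {budget_s:.0f}s (last status: {status})"
            )
        sleep(poll_s)
```

Injecting clock and sleep keeps the loop testable without waiting on real deploys.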
§6.3 Render API safe-pattern (mandatory)
Per-key PUT is the mandatory default for any Render env mutation.
List-level PUT is destructive-by-design (full-replace) and is allowed
only under explicit full-snapshot intent with paired guards. Any
tooling that violates this is a postmortem-class defect — not a
review nit.
Default (use this for >99% of cases):
PUT /v1/services/{id}/env-vars/{KEY} with body {"value":"..."} — upserts a single key in place. Non-destructive; idempotent; the only safe shape for cutover scripts, rotation tooling, and ad-hoc operator fixes.
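A minimal sketch of the per-key shape using only the standard library. The path and body follow the line above; the base URL, the bearer-token header, and the build_env_put name are assumptions to check against the Render API reference before use. The function only builds the request, so it can be inspected before anything is sent.

```python
import json
import urllib.parse
import urllib.request

RENDER_API = "https://api.render.com"  # assumed base URL


def build_env_put(service_id: str, key: str, value: str,
                  api_token: str) -> urllib.request.Request:
    """Build a per-key env-var upsert; touches exactly one key."""
    sid = urllib.parse.quote(service_id, safe="")
    env_key = urllib.parse.quote(key, safe="")
    return urllib.request.Request(
        f"{RENDER_API}/v1/services/{sid}/env-vars/{env_key}",
        data=json.dumps({"value": value}).encode("utf-8"),
        method="PUT",
        headers={
            "Authorization": f"Bearer {api_token}",  # assumed auth scheme
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
    )
```

Sending is then a single call such as urllib.request.urlopen(req, timeout=30). Because the key is part of the URL path, there is no list body that could be mistaken for a full replacement.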
Allowed exception — full-snapshot intent:
PUT /v1/services/{id}/env-vars (list-level) is permitted only when ALL of the following hold:
- The caller has already GET-ed the full env-var list (snapshot taken in the SAME script run).
- The caller’s PUT body is the full snapshot with all keys preserved + the intended modifications applied in memory — never a single-element list.
- The caller has paired pre-call and post-call guards:
  - Count guard — assert post-call key count is greater than or equal to pre-call key count.
  - Key-set diff guard — assert the post-call key set is a superset of the pre-call key set, accounting for intentional adds/removes from the in-memory edits.
- The caller documents the full-snapshot intent in a code comment naming postmortem 6b351dc6 so the reader knows the safe-pattern context is acknowledged.
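The two guards can be sketched as one pure function over key sets. The names assert_env_preserved and EnvGuardError are hypothetical. Intentional removals are passed explicitly, so the count guard compares against the keys that were supposed to survive; with no removals it is exactly the rule stated above.

```python
from typing import Iterable


class EnvGuardError(RuntimeError):
    """Raised when an env mutation lost keys it was not supposed to touch."""


def assert_env_preserved(before: Iterable[str], after: Iterable[str],
                         intended_removes: Iterable[str] = ()) -> None:
    """Count guard + key-set diff guard for a full-snapshot PUT (see 6b351dc6)."""
    before_keys, after_keys = set(before), set(after)
    must_survive = before_keys - set(intended_removes)

    # Count guard: cheap first check that catches a wholesale wipe.
    if len(after_keys) < len(must_survive):
        raise EnvGuardError(
            f"key count dropped: {len(after_keys)} < {len(must_survive)}"
        )

    # Key-set diff guard: catches a swap that keeps the count but loses a key.
    lost = must_survive - after_keys
    if lost:
        raise EnvGuardError(f"keys lost: {sorted(lost)}")
```

The count guard alone would pass a mutation that drops one key and adds another, which is why the key-set diff is paired with it.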
Forbidden (always):
PUT /v1/services/{id}/env-vars with a single-element body ([{key, value}] or similar). This is the exact shape that wiped prism-server on 2026-05-07 00:27Z. Render API treats list-level PUT as full-replace, NOT patch.
The smoke harness smoke_auth_rollback_rehearsal.sh exemplifies the safe pattern: per-key PUT for the flip and the restore, plus a count-preservation guard at step 7 that catches the bug class even if a future edit reverts the safe pattern.
§7 — References
Implements: Lola TaskAssigned 1e9da272; engineering plan in journal #90 (81b9b993-105a-4328-a1a5-63b4d333e069).
Depends on:
- ADR-53 — Auth Authority Boundary
- ADR-54 — External Directory Mapping
- ADR-55 — Key Boundary
- ADR-018 — RRF hybrid recall (org_id namespace)
- ADR-021 — Two-bucket env resolution + cross-platform parity
- SPEC-087 — Prism Cloud Deployment to Render
Related:
- SPEC-085 v0.2 — Constitutional Governance Vocabulary (independent; namespace clarified)
- SPEC-088 — Agent Preflight Discipline + Reflexive Rule Gates (governance-side companion)
Postmortems:
- 6b351dc6 (2026-05-07) — smoke_auth_rollback_rehearsal.sh destructive PUT wiped prism-server env; autonomous recovery in 27 min; structural fix in PR #207 + #208.
Shipped PRs (chronological):
- #192 — _ensure_default_org per Tenant on OAuth provision (cloud-mode regression)
- #194 — Phase 1: UEI child table + auth-provider settings
- #195 — Phase 2: WorkOSProvider + parametric router + /whoami + /logout
- #196 — Phase 2 smoke harness scaffolds (Samantha)
- #197 — Redis pattern-subscriber publish-gating fix (Texi)
- #199 — Smoke harness payload-shape live-fix (Samantha)
- #203 — uvicorn --proxy-headers for https redirect_uri behind Render TLS edge
- #204 — Smoke harness https redirect_uri assertion (Samantha)
- #205 — Heartbeat 410 stale-session resurrection loop fix (Texi)
- #207 — Rollback rehearsal: per-key PUT + count-preservation guard + SERVICE_ID self-resolve
- #208 — Rollback rehearsal: deploy.status='live' SLO + ?limit=100 fix
Superseded:
- PR #193 — docs/research/auth-architecture-engineering-pov-2026-05-06.md. Substance absorbed into this SPEC, ADR-53/54/55, and the shipped PRs above. Closed without merging.
Last modified on June 11, 2026