
Per-agent daemon sketch — first-cut contract

First-cut architecture sketch for Donna-first review. This is a hand-authored draft, not the canonical exported SPEC row yet.
This note fixes the contract shape for Prism’s always-on listener tier. The problem is now empirically clear: session-stream push can fail to become agent-visible while the durable signal record remains correct. Tier-1 BIOS auto-drain reduces exposure, but it does not close the gap. The structural answer is a per-agent daemon that keeps a local delivery loop alive across surface idleness, sleep/wake, and session restarts.

Motivation: turn boundaries are the contract

Prism coordination is organized around turn boundaries. Signals drain at turns, memory recall happens between turns, deltas are captured per turn, and peer interrupts only become agent-visible at turn boundaries. That makes the boundary itself load-bearing. The batching finding and the daemon problem sit on the same axis:
  • batching is the voluntary failure mode, where an agent collapses N decision points into one long tool burst and elides coordination breakpoints that Prism depends on
  • suspended/backgrounded sessions are the involuntary failure mode, where the turn boundary does not happen at all for a while even though the project continues to move around the agent
Tier-1 BIOS auto-drain is already a visible tax we are paying because of this gap. As of PR #45 / commit 5df6679, every substantive turn is expected to call prism_signals_pending when the last drain is stale. That rule is useful, but it is still a behavioral polling workaround: if a surface is backgrounded for 30 minutes, nothing becomes visible until focus returns and the next substantive turn happens. The daemon is the structural answer. It preserves the same turn discipline the batching finding defended, but it keeps listening when the human-facing surface is idle, suspended, or waiting for input.

The five invariants

These are the non-negotiables that the implementation must preserve.
  1. Per-agent, not per-machine. One daemon process per persona. No machine-level fan-out router.
  2. Process boundary is the isolation wall. Multi-tenant separation comes from OS process boundaries, not in-process routing discipline.
  3. Push is a doorbell, not the payload. A notification wakes the agent/daemon path; the durable queue remains the source of truth.
  4. Delivery discipline stays one-per-turn. If N signals queue during idleness, the daemon may observe all N, but it must expose them to the agent surface one channel-push at a time, not as a batch.
  5. Routing-registry lifecycle is explicit. Peer lifecycle must invalidate stale bindings on PeerLeft, rebind on PeerJoined, and return structured failure when a target session is stale or wrapped.

Why per-agent wins

The daemon owns one persona and one local surface contract. That keeps the mental model aligned with the thing being supervised:
  • one daemon owns one persona’s queue, wake-ups, and local injection
  • crash scope is one persona, not the whole machine’s fleet
  • OS permissions become the security model for local IPC
  • no shared in-memory routing table across personas
  • local metrics and health state map cleanly to one dashboard card
The resource tradeoff is accepted. A few extra idle processes are cheaper than a machine-local multiplexer that recreates cross-tenant routing and failure coupling inside one process.

Topology

Prism FastAPI / Session Stream
        |
        | authenticated WS + heartbeat contract
        v
persona-daemon (one process per identity)
        |
        | local IPC only
        v
surface adapter (Codex, Claude Code, Desktop, ...)
        |
        | surface-native notification/injection
        v
agent thread/session
The daemon is a local receiver bridge and supervisor, not the global router. FastAPI remains the authoritative signal service and durable queue owner.

Responsibilities split

FastAPI / session manager

  • persist every signal durably before any volatile fan-out
  • own session registration, heartbeat acceptance, and project broadcast
  • publish lifecycle events (PeerJoined, PeerLeft, preemption, etc.)
  • expose a drain verb that returns the authoritative pending envelope
  • reject or mark stale targets with structured send-side failure

Persona daemon

  • maintain the long-lived local connection to Prism’s session stream
  • publish a distinct daemon heartbeat to Prism; this is separate from any agent-surface lifecycle heartbeat
  • preserve local delivery continuity across agent idle periods
  • translate a push event into exactly one agent-visible doorbell
  • maintain a small local pending index as a cache only for already-observed but not-yet-surfaced events
  • reconnect after sleep/wake or transient network loss
  • supervise the local surface shim process with restart-on-crash and bounded backoff where the surface architecture uses a shim
  • expose health/metrics for observability

Surface adapter

  • own surface-specific injection/notification mechanics
  • accept fire-and-forget doorbells without becoming the delivery ACK
  • never become the source of truth for pending signals
  • degrade to “agent must drain explicitly” without data loss

IPC contract

The daemon should expose a local-only control socket per persona, not a shared machine bus. Recommended shape:
  • macOS/Linux: Unix domain socket in a persona-scoped runtime dir with owner-only permissions
  • Windows: named pipe with equivalent single-user ACLs
The control surface only needs a small verb set:
  • status — liveness, connected/disconnected, last heartbeat age, last drain age, pending count, last error class
  • notify_surface — emit one doorbell to the local surface adapter
  • resume_surface — rebind when the agent UI/session restarts locally
  • shutdown — clean stop for wrap/remove flows
This is intentionally not a general messaging API. The daemon is not a second Prism backend.
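The four-verb surface is small enough to sketch as a single dispatch function. The sketch below is illustrative only: the JSON-lines framing, the `ControlState` fields, and the `bad_request` / `unknown_verb` error strings are assumptions made for the example, not part of the contract above. It leaves out the socket itself and shows only that anything outside the verb set is rejected.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class ControlState:
    """Hypothetical snapshot of the fields the status verb reports."""
    running: bool = True
    connected: bool = False
    heartbeat_age_s: float = 0.0
    drain_age_s: float = 0.0
    pending_count: int = 0
    last_error_class: Optional[str] = None
    doorbells_emitted: int = 0
    rebinds: int = 0


def _reply(ok: bool, **fields) -> str:
    """Encode one response as a JSON object (assumed wire format)."""
    return json.dumps({"ok": ok, **fields})


def handle_request(state: ControlState, line: str) -> str:
    """Dispatch one request line to one of the four control verbs."""
    try:
        verb = json.loads(line).get("verb")
    except (ValueError, AttributeError):
        # Not JSON, or JSON that is not an object.
        return _reply(False, error="bad_request")

    if verb == "status":
        return _reply(
            True,
            live=state.running,
            connected=state.connected,
            heartbeat_age_s=state.heartbeat_age_s,
            drain_age_s=state.drain_age_s,
            pending_count=state.pending_count,
            last_error_class=state.last_error_class,
        )
    if verb == "notify_surface":
        # A real daemon would hand one doorbell to the surface adapter here.
        state.doorbells_emitted += 1
        return _reply(True)
    if verb == "resume_surface":
        # A real daemon would rebind to the restarted local session here.
        state.rebinds += 1
        return _reply(True)
    if verb == "shutdown":
        state.running = False
        return _reply(True)
    # Deliberately no fallthrough into general messaging.
    return _reply(False, error="unknown_verb")
```

Keeping the dispatcher closed over a fixed verb set is one way to make "not a general messaging API" enforceable rather than aspirational.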

Plugin contract

The daemon should be one common binary with thin surface plugins. Plugin surface:
  • notify_surface(payload) — fire-and-forget doorbell enqueue into the surface-side path
  • resume_surface() — rebind hook when the local agent UI/session restarts
  • status() — surface liveness probe
Anything richer than this is a smell that surface protocol concerns are leaking into the daemon.
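As a rough illustration, the three-method plugin surface could be expressed as a structural interface. The names `SurfacePlugin` and `RecordingPlugin`, the dict-shaped payload, and the boolean return of `status()` are assumptions for this sketch; the note itself only fixes the three method names.

```python
from typing import Any, Dict, List, Protocol, runtime_checkable


@runtime_checkable
class SurfacePlugin(Protocol):
    """Assumed shape of the thin per-surface plugin interface."""

    def notify_surface(self, payload: Dict[str, Any]) -> None:
        """Fire-and-forget doorbell enqueue; returning is not a delivery ACK."""
        ...

    def resume_surface(self) -> None:
        """Rebind after the local agent UI/session restarts."""
        ...

    def status(self) -> bool:
        """Surface liveness probe."""
        ...


class RecordingPlugin:
    """Hypothetical stand-in plugin that records doorbells in memory."""

    def __init__(self) -> None:
        self.doorbells: List[Dict[str, Any]] = []
        self.bound = False

    def notify_surface(self, payload: Dict[str, Any]) -> None:
        # Accept into the surface-side queue; nothing is acknowledged upstream.
        self.doorbells.append(dict(payload))

    def resume_surface(self) -> None:
        self.bound = True

    def status(self) -> bool:
        return self.bound
```

A structural `Protocol` keeps the daemon binary free of imports from any specific surface, which fits the "thin plugin" intent.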

Delivery semantics

The daemon must preserve the “doorbell, then drain” model while keeping per-turn semantics intact.
  1. Server-side push arrives for signal S1.
  2. Daemon records S1 in its local pending index.
  3. Daemon emits one surface-visible prompt for S1.
  4. Agent drains through prism_signals_pending.
  5. If S2..S5 are also pending, they remain queued until subsequent turns; the daemon does not compress them into one mega-notification.
This is the critical lesson from the surface-comparison research: Prism needs deliberation boundaries between deliveries. The daemon may stay awake continuously, but the agent experience remains per-turn.
The local pending index is a cache only. It is useful for pacing and rate-limiting doorbells, but it is never authoritative. On uncertainty or restart, the daemon re-queries the durable queue and may drop the local index without correctness loss.
Surface adapter acceptance is also not delivery acknowledgment. A surface plugin returning success only means “I accepted the doorbell into my local surface-side queue.” The actual ACK is still the agent’s drain via prism_signals_pending.
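A minimal sketch of the pacing behaviour described above, under stated assumptions: the class name `DoorbellPacer`, its method names, and the use of plain string signal ids are hypothetical. It shows only the local cache side; the authoritative queue and the actual drain stay on the Prism side.

```python
from collections import deque
from typing import Callable, Deque, Iterable, Optional


class DoorbellPacer:
    """Releases at most one doorbell per turn opportunity.

    The internal index is a cache of observed-but-not-yet-surfaced
    signals; it can be rebuilt from the durable queue at any time.
    """

    def __init__(self, notify: Callable[[str], None]) -> None:
        self._notify = notify
        self._pending: Deque[str] = deque()

    @property
    def pending_count(self) -> int:
        return len(self._pending)

    def on_push(self, signal_id: str) -> None:
        # Dedupes only against entries still waiting in the index.
        if signal_id not in self._pending:
            self._pending.append(signal_id)

    def on_turn_opportunity(self) -> Optional[str]:
        """Surface exactly one pending signal, or nothing."""
        if not self._pending:
            return None
        signal_id = self._pending.popleft()
        self._notify(signal_id)  # doorbell only; not an ACK
        return signal_id

    def on_agent_drained(self, drained_ids: Iterable[str]) -> None:
        """Forget signals the agent has already drained upstream."""
        drained = set(drained_ids)
        self._pending = deque(s for s in self._pending if s not in drained)

    def reset_from_authoritative(self, pending_ids: Iterable[str]) -> None:
        """Drop the cache and rebuild it from the durable queue's view."""
        self._pending = deque(pending_ids)
```

For example, pushing S1 through S3 and then calling `on_turn_opportunity()` once rings the surface for S1 only and leaves S2 and S3 in the index for later turns.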

Idle and wake behavior

Two cases matter:

Agent idle, daemon healthy

The daemon can continue receiving push events while the agent is not actively taking turns. It should queue locally and release one doorbell per turn opportunity, never aggregate the whole backlog into one push.

Host sleep or daemon disconnect

On reconnect, the daemon must not assume local state is complete. It should reconcile by draining the authoritative pending queue and then resume one-per-turn notification from that source. Reconnect is a repair path, not a trust-the-cache path.
Reconnect does not inject immediately. The daemon waits for the next agent activity boundary, then emits one doorbell indicating pending work. If 12 signals accumulated during background time, the first turn back may drain all 12, but the daemon still emits one doorbell rather than storming the surface with 12 independent notifications.

Reconnect state machine

Doorbell emission must be gated by an explicit daemon state machine:
DISCONNECTED -> RECONCILING -> CONNECTED -> EMITTING
Rules:
  • reconnect enters RECONCILING
  • authoritative pending drain completes before EMITTING
  • no mid-turn injection while not in EMITTING
  • local doorbell release is paced only from EMITTING
This gate is load-bearing. Without it, reconnect races can cause duplicate or mistimed doorbells in the middle of active agent work.
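The gate can be sketched as an explicit transition table. The four state names come from this note; the `GateState` and `ReconnectGate` names, and the assumption that any connected state may fall back to DISCONNECTED on stream loss, are illustrative additions that the note does not specify.

```python
from enum import Enum


class GateState(Enum):
    DISCONNECTED = "DISCONNECTED"
    RECONCILING = "RECONCILING"
    CONNECTED = "CONNECTED"
    EMITTING = "EMITTING"


# Forward edges follow the note; the fallbacks to DISCONNECTED are assumed.
_ALLOWED = {
    GateState.DISCONNECTED: {GateState.RECONCILING},
    GateState.RECONCILING: {GateState.CONNECTED, GateState.DISCONNECTED},
    GateState.CONNECTED: {GateState.EMITTING, GateState.DISCONNECTED},
    GateState.EMITTING: {GateState.DISCONNECTED},
}


class ReconnectGate:
    """Refuses doorbell emission unless the daemon is in EMITTING."""

    def __init__(self) -> None:
        self.state = GateState.DISCONNECTED

    def transition(self, target: GateState) -> None:
        if target not in _ALLOWED[self.state]:
            raise ValueError(
                f"illegal transition {self.state.value} -> {target.value}"
            )
        self.state = target

    def may_emit(self) -> bool:
        # Reconcile must have completed to get here, since EMITTING is
        # only reachable through RECONCILING and CONNECTED.
        return self.state is GateState.EMITTING
```

Because EMITTING is reachable only through RECONCILING, a reconnect cannot skip the authoritative drain and start ringing the surface from stale cache.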

Routing-registry lifecycle

Postmortem 0660f88e turns this from an implementation detail into an architectural requirement. Rules:
  • PeerJoined creates or refreshes the active binding for (identity, surface).
  • PeerLeft(reason=wrap|expire|preempt) invalidates that binding immediately.
  • prism_signal must not silently route to a wrapped/stale session.
  • If resolution fails health checks, return structured failure so the sender knows the target was not live-routable.
The daemon depends on this contract but does not own it. Registry truth stays server-side.
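A server-side sketch of these lifecycle rules, for illustration only: `RoutingRegistry`, `RouteResult`, and the `target_not_live` failure code are hypothetical names, and real health checks are reduced here to "is there an active binding".

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

_LEAVE_REASONS = {"wrap", "expire", "preempt"}


@dataclass(frozen=True)
class RouteResult:
    """Structured outcome so a sender can tell a dead target from success."""
    ok: bool
    session_id: Optional[str] = None
    failure: Optional[str] = None


class RoutingRegistry:
    def __init__(self) -> None:
        # Keyed by (identity, surface), as in the rules above.
        self._bindings: Dict[Tuple[str, str], str] = {}

    def peer_joined(self, identity: str, surface: str, session_id: str) -> None:
        """Create or refresh the active binding."""
        self._bindings[(identity, surface)] = session_id

    def peer_left(self, identity: str, surface: str, reason: str) -> None:
        """Invalidate the binding immediately."""
        if reason not in _LEAVE_REASONS:
            raise ValueError(f"unknown PeerLeft reason: {reason}")
        self._bindings.pop((identity, surface), None)

    def resolve(self, identity: str, surface: str) -> RouteResult:
        """Never route silently to a stale session; fail with structure."""
        session_id = self._bindings.get((identity, surface))
        if session_id is None:
            return RouteResult(ok=False, failure="target_not_live")
        return RouteResult(ok=True, session_id=session_id)
```

After `peer_left(..., reason="wrap")`, a `resolve` call for the same key returns `ok=False` with a failure code instead of the old session id.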

Registration model

The daemon registers with Prism as a sibling runtime kind, not as a normal speaking agent session. Rules:
  • registration key shape extends to distinguish kind=daemon from kind=agent
  • daemon rows are counted separately from speaking agents in status and routing views
  • daemons are not master-eligible
  • daemons do not preempt or speak on behalf of a persona
  • daemon auth is scoped to the same operator/persona context as the surface it serves
This preserves observability without letting supervision processes leak into project control-plane semantics.

Failure model

The daemon is allowed to fail closed on local delivery while preserving durability upstream.
  • if local IPC to the surface fails, keep the signal pending and report a bounded local error
  • if the daemon dies, supervisor/launcher restarts only that persona’s daemon
  • if Prism WS disconnects, daemon reconnects and reconciles pending
  • if the agent surface is absent, the daemon remains healthy but reports degraded delivery state
The key rule is that local failure must not mutate the global queue into thinking the agent saw something it did not.
For Codex specifically, the daemon should supervise the existing MCP surface adapter rather than speaking app-server protocol directly. The same architectural shape applies to other surfaces: the daemon owns WS continuity and supervised-process lifecycle; the shim owns surface-specific protocol translation.

Metrics contract for Porsche’s panel

Four calls should be fixed in the sketch now so Porsche is not drawing against fog.

1. Error classes are bounded

Expose:
  • daemon_errors_total{tenant_id,identity,error_class}
  • daemon_last_error_timestamp_seconds{tenant_id,identity,error_class}
error_class must be a bounded enum. Unknown local failures collapse to unclassified, not a free-string label. Initial enum:
  • ws_connect_failed
  • ws_stream_stalled
  • pending_reconcile_failed
  • surface_unavailable
  • surface_inject_failed
  • ipc_bind_failed
  • ipc_auth_failed
  • heartbeat_publish_failed
  • config_invalid
  • unclassified
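One way to keep the label bounded is to model it as a closed enum and collapse everything else. The enum values below are the ten listed above; the `ErrorClass` type and the `classify` helper are hypothetical names for this sketch.

```python
from enum import Enum


class ErrorClass(str, Enum):
    WS_CONNECT_FAILED = "ws_connect_failed"
    WS_STREAM_STALLED = "ws_stream_stalled"
    PENDING_RECONCILE_FAILED = "pending_reconcile_failed"
    SURFACE_UNAVAILABLE = "surface_unavailable"
    SURFACE_INJECT_FAILED = "surface_inject_failed"
    IPC_BIND_FAILED = "ipc_bind_failed"
    IPC_AUTH_FAILED = "ipc_auth_failed"
    HEARTBEAT_PUBLISH_FAILED = "heartbeat_publish_failed"
    CONFIG_INVALID = "config_invalid"
    UNCLASSIFIED = "unclassified"


def classify(raw: str) -> ErrorClass:
    """Map an arbitrary failure string onto the bounded label set."""
    try:
        return ErrorClass(raw)
    except ValueError:
        # Free-form text never becomes a metric label.
        return ErrorClass.UNCLASSIFIED
```

With this shape, a message such as "disk full: /var" is counted under `unclassified` rather than minting a new series.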

2. Heartbeat and drain are separate signals

Expose distinct gauges:
  • daemon_heartbeat_age_seconds{tenant_id,identity}
  • daemon_drain_age_seconds{tenant_id,identity}
  • daemon_up{tenant_id,identity}
  • daemon_signal_queue_depth{tenant_id,identity}
They answer different questions.
  • heartbeat age = “is the process and WS loop alive?”
  • drain age = “has work actually been surfaced/drained recently?”
The 0660f88e class of failure is exactly why they cannot collapse into one metric.

3. Emit cadence is hybrid

Use event-driven internal state changes plus a periodic 10s export tick. Reasoning:
  • event-driven updates keep state accurate when something changes fast
  • 10s periodic export keeps Porsche’s existing scraper/panel cadence compatible and avoids inventing a bespoke viewer rhythm
So the daemon should update local state immediately, but publish the observable metric set on a 10s cadence with opportunistic immediate push allowed for severe state transitions if the telemetry path already supports it.
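The hybrid cadence can be sketched with an injected clock so the timing is easy to reason about. Everything named here (`HybridExporter`, `update`, `tick`, the `severe` flag, the dict-shaped snapshot) is an assumption for illustration; only the 10s interval comes from this note.

```python
from typing import Callable, Dict

EXPORT_INTERVAL_S = 10.0  # periodic export tick from the note


class HybridExporter:
    """Updates local state immediately, publishes on a periodic tick.

    Severe transitions may publish right away, assuming the telemetry
    path supports an out-of-band push.
    """

    def __init__(
        self, publish: Callable[[Dict[str, float]], None], now: float
    ) -> None:
        self._publish = publish
        self._state: Dict[str, float] = {}
        self._last_export = now

    def update(
        self, name: str, value: float, now: float, severe: bool = False
    ) -> None:
        self._state[name] = value  # event-driven: state is always current
        if severe:
            self._export(now)

    def tick(self, now: float) -> bool:
        """Call from the daemon loop; returns True if an export happened."""
        if now - self._last_export >= EXPORT_INTERVAL_S:
            self._export(now)
            return True
        return False

    def _export(self, now: float) -> None:
        self._publish(dict(self._state))  # publish a snapshot copy
        self._last_export = now
```

Passing `now` explicitly instead of reading a wall clock keeps the sketch deterministic; a real daemon would likely feed it a monotonic clock.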

4. Cardinality budget

At a 50-persona ceiling, the current bounded metric set is still safe.
  • daemon_errors_total: 10 error_class * 50 identity = 500 series
  • daemon_last_error_timestamp_seconds: 10 * 50 = 500 series
  • daemon_up, daemon_heartbeat_age_seconds, daemon_drain_age_seconds, daemon_signal_queue_depth: 4 * 50 = 200 series
Total: 1,200 active series at the v1 ceiling (500 + 500 + 200), comfortably under the 10k/min budget. This should be stated explicitly so the dashboard contract is honest about its actual shape.
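The budget arithmetic can be checked mechanically. This sketch assumes each identity belongs to exactly one tenant, so the tenant_id label does not multiply the series count; the function and constant names are illustrative.

```python
PERSONA_CEILING = 50
ERROR_CLASS_COUNT = 10

# Metrics labelled by (tenant_id, identity, error_class).
PER_ERROR_CLASS_METRICS = [
    "daemon_errors_total",
    "daemon_last_error_timestamp_seconds",
]

# Metrics labelled by (tenant_id, identity) only.
PER_IDENTITY_METRICS = [
    "daemon_up",
    "daemon_heartbeat_age_seconds",
    "daemon_drain_age_seconds",
    "daemon_signal_queue_depth",
]


def series_count(
    personas: int = PERSONA_CEILING,
    error_classes: int = ERROR_CLASS_COUNT,
) -> int:
    """Worst-case active series if every label combination is populated."""
    per_class = len(PER_ERROR_CLASS_METRICS) * error_classes * personas
    per_identity = len(PER_IDENTITY_METRICS) * personas
    return per_class + per_identity


print(series_count())  # 2 * 10 * 50 + 4 * 50 = 1200
```

Re-running the function with a larger persona count or a longer error enum is a quick way to see how close a proposed change would push the total toward the budget.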

Security model

Local IPC is per persona and single-user.
  • no LAN listener in v1
  • no machine-global unauthenticated daemon port
  • no cross-persona command channel
  • launcher/service manager owns daemon spawn with explicit identity
  • filesystem/socket ACLs are the first security boundary
If a future remote-control mode is desired, it should be treated as a new design, not an extension slipped into v1.

Cross-platform host shape

The daemon contract should stay stable while host supervision varies:
  • macOS: LaunchAgent
  • Linux: systemd user service or template unit
  • Windows: per-user service or equivalent managed process wrapper

Spawn topology

Spawn topology is per persona, not per machine and not per user.
  • one daemon process per persona registered to a project on a host
  • multiple personas on one host imply multiple daemons
  • each daemon is supervised independently
  • no daemon-per-machine multiplexer
  • no daemon-per-user umbrella process
Frank’s multi-persona host shape is the reference case here; the design must scale by repeating isolated units, not by centralizing them.

Lifecycle coupling

Daemon lifecycle is hard-coupled to persona/project lifecycle.
  • prism_wrap for the persona stops that persona’s daemon
  • persona archive/removal stops that persona’s daemon
  • project destroy stops all daemons attached to that project on the host
  • uninstall flows must expose clean daemon shutdown/removal
The shutdown IPC verb is therefore not just internal; operator-facing Prism lifecycle verbs must be able to trigger it deterministically.
Lafonda’s install lane can vary by platform, but each installer must preserve the same behavioral contract:
  • one daemon per persona
  • restart on crash
  • explicit start/stop on persona lifecycle verbs
  • logs and status inspectable locally

Non-goals

  • not a replacement for prism_signals_pending
  • not a general multi-persona machine bus
  • not a source of truth for queue durability
  • not a bypass around FastAPI/session-manager routing
  • not a batching layer for agent-visible signals

First-cut open questions

These are the seams worth Donna’s review before broader fan-out:
  1. Does the daemon report its own heartbeat to Prism directly, or does it only supervise the surface’s existing lifecycle heartbeat path?
  2. For Codex specifically, should the daemon talk to the app-server directly, or should it hand off to the existing MCP surface adapter so Codex-specific protocol logic stays in one place?
  3. On reconnect, should one recovered signal be surfaced immediately and the rest held for later turns, or should reconnect merely mark local pending and wait for the next agent activity boundary?
  4. Do we want one common daemon binary with surface plugins, or separate surface-specialized daemons behind a shared contract?
Donna reviews this note for invariant fidelity and implementation seams. If it holds, the next artifact should be the canonical daemon SPEC draft plus a thin contract table for Porsche/Lafonda:
  • control socket verbs
  • exported metrics
  • lifecycle state machine
  • reconnect and reconcile sequence
Last modified on June 7, 2026