Per-agent daemon sketch — first-cut contract

First-cut architecture sketch for Donna-first review. This is a hand-authored draft, not the canonical exported SPEC row yet.

This note fixes the contract shape for Prism’s always-on listener tier. The problem is now empirically clear: session-stream push can fail to become agent-visible while the durable signal record remains correct. Tier-1 BIOS auto-drain reduces exposure, but it does not close the gap. The structural answer is a per-agent daemon that keeps a local delivery loop alive across surface idleness, sleep/wake, and session restarts.

Motivation: turn boundaries are the contract

Prism coordination is organized around turn boundaries. Signals drain at turns, memory recall happens between turns, deltas are captured per turn, and peer interrupts only become agent-visible at turn boundaries. That makes the boundary itself load-bearing. The batching finding and the daemon problem sit on the same axis:

batching is the voluntary failure mode, where an agent collapses N decision points into one long tool burst and elides coordination breakpoints that Prism depends on
suspended/backgrounded sessions are the involuntary failure mode, where the turn boundary does not happen at all for a while even though the project continues to move around the agent

Tier-1 BIOS auto-drain is already a visible tax we are paying because of this gap. As of PR #45 / commit 5df6679, every substantive turn is expected to call prism_signals_pending when the last drain is stale. That rule is useful, but it is still a behavioral polling workaround: if a surface is backgrounded for 30 minutes, nothing becomes visible until focus returns and the next substantive turn happens. The daemon is the structural answer. It preserves the same turn discipline the batching finding defended, but it keeps listening when the human-facing surface is idle, suspended, or waiting for input.

The five invariants

These are the non-negotiables that the implementation must preserve.

Per-agent, not per-machine. One daemon process per persona. No machine-level fan-out router.
Process boundary is the isolation wall. Multi-tenant separation comes from OS process boundaries, not in-process routing discipline.
Push is a doorbell, not the payload. A notification wakes the agent/daemon path; the durable queue remains the source of truth.
Delivery discipline stays one-per-turn. If N signals queue during idleness, the daemon may observe all N, but it must expose them to the agent surface one channel-push at a time, not as a batch.
Routing-registry lifecycle is explicit. Peer lifecycle must invalidate stale bindings on PeerLeft, rebind on PeerJoined, and return structured failure when a target session is stale or wrapped.

Why per-agent wins

The daemon owns one persona and one local surface contract. That keeps the mental model aligned with the thing being supervised:

one daemon owns one persona’s queue, wake-ups, and local injection
crash scope is one persona, not the whole machine’s fleet
OS permissions become the security model for local IPC
no shared in-memory routing table across personas
local metrics and health state map cleanly to one dashboard card

The resource tradeoff is accepted. A few extra idle processes are cheaper than a machine-local multiplexer that recreates cross-tenant routing and failure coupling inside one process.

Topology

Prism FastAPI / Session Stream
        |
        | authenticated WS + heartbeat contract
        v
persona-daemon (one process per identity)
        |
        | local IPC only
        v
surface adapter (Codex, Claude Code, Desktop, ...)
        |
        | surface-native notification/injection
        v
agent thread/session

The daemon is a local receiver bridge and supervisor, not the global router. FastAPI remains the authoritative signal service and durable queue owner.

Responsibilities split

FastAPI / session manager

persist every signal durably before any volatile fan-out
own session registration, heartbeat acceptance, and project broadcast
publish lifecycle events (PeerJoined, PeerLeft, preemption, etc.)
expose a drain verb that returns the authoritative pending envelope
reject or mark stale targets with structured send-side failure

Persona daemon

maintain the long-lived local connection to Prism’s session stream
publish a distinct daemon heartbeat to Prism; this is separate from any agent-surface lifecycle heartbeat
preserve local delivery continuity across agent idle periods
translate a push event into exactly one agent-visible doorbell
maintain a small local pending index as a cache only for already-observed but not-yet-surfaced events
reconnect after sleep/wake or transient network loss
where the surface architecture uses a shim, supervise the local shim process with restart-on-crash and bounded backoff
expose health/metrics for observability
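
The restart-on-crash and reconnect duties above both rely on bounded backoff. A minimal sketch of one way to bound it, assuming exponential growth with a hard cap; the function name, base delay, cap, and jitter scheme are illustrative choices, not values fixed by this note:

```python
import random
from typing import Iterator, Optional


def backoff_delays(
    base: float = 0.5,
    cap: float = 30.0,
    attempts: int = 6,
    jitter: float = 0.0,
    rng: Optional[random.Random] = None,
) -> Iterator[float]:
    """Yield one restart delay (seconds) per attempt, never above `cap`."""
    rng = rng or random.Random()
    for attempt in range(attempts):
        # Exponential growth, clamped so the delay stays bounded.
        delay = min(cap, base * (2 ** attempt))
        # Optional downward jitter; it can only shrink the delay, so the cap holds.
        delay *= 1.0 - jitter * rng.random()
        yield delay


delays = list(backoff_delays(base=1.0, cap=8.0, attempts=5))
print(delays)  # [1.0, 2.0, 4.0, 8.0, 8.0]
```

The same generator could pace both shim restarts and WS reconnect attempts; whether they should share one policy is left open here.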

Surface adapter

own surface-specific injection/notification mechanics
accept fire-and-forget doorbells without becoming the delivery ACK
never become the source of truth for pending signals
degrade to “agent must drain explicitly” without data loss

IPC contract

The daemon should expose a local-only control socket per persona, not a shared machine bus. Recommended shape:

macOS/Linux: Unix domain socket in a persona-scoped runtime dir with owner-only permissions
Windows: named pipe with equivalent single-user ACLs

The control surface only needs a small verb set:

status — liveness, connected/disconnected, last heartbeat age, last drain age, pending count, last error class
notify_surface — emit one doorbell to the local surface adapter
resume_surface — rebind when the agent UI/session restarts locally
shutdown — clean stop for wrap/remove flows

This is intentionally not a general messaging API. The daemon is not a second Prism backend.
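
To make the size of that verb set concrete, here is a hedged sketch of a dispatcher that accepts only the four verbs and rejects everything else. The request/response dictionary shape, the `DaemonStatus` field names, and the `unknown_verb` error string are assumptions for illustration; the note fixes only the verb names and what `status` reports.

```python
from dataclasses import asdict, dataclass
from typing import Callable, Dict, Optional


@dataclass
class DaemonStatus:
    # Field names are hypothetical; the note lists only what status must cover.
    alive: bool
    connected: bool
    last_heartbeat_age_s: float
    last_drain_age_s: float
    pending_count: int
    last_error_class: Optional[str] = None


class ControlSurface:
    """Dispatches the four control verbs and nothing else."""

    def __init__(self, status: DaemonStatus) -> None:
        self.status = status
        self.doorbells = 0
        self.rebinds = 0
        self.stopping = False
        self._verbs: Dict[str, Callable[[], dict]] = {
            "status": lambda: asdict(self.status),
            "notify_surface": self._notify_surface,
            "resume_surface": self._resume_surface,
            "shutdown": self._shutdown,
        }

    def _notify_surface(self) -> dict:
        self.doorbells += 1  # exactly one doorbell per call
        return {}

    def _resume_surface(self) -> dict:
        self.rebinds += 1
        return {}

    def _shutdown(self) -> dict:
        self.stopping = True
        return {}

    def handle(self, request: dict) -> dict:
        handler = self._verbs.get(request.get("verb"))
        if handler is None:
            # Not a general messaging API: unknown verbs are refused.
            return {"ok": False, "error": "unknown_verb"}
        return {"ok": True, **handler()}


control = ControlSurface(DaemonStatus(True, True, 1.5, 42.0, 3))
print(control.handle({"verb": "status"})["pending_count"])
print(control.handle({"verb": "send_message"}))
```

In a real daemon this dispatcher would sit behind the persona-scoped socket or named pipe; transport and framing are deliberately left out of the sketch.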

Plugin contract

The daemon should be one common binary with thin surface plugins. Plugin surface:

notify_surface(payload) — fire-and-forget doorbell enqueue into the surface-side path
resume_surface() — rebind hook when the local agent UI/session restarts
status() — surface liveness probe

Anything richer than this is a smell that surface protocol concerns are leaking into the daemon.
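
One way to pin the three-method plugin surface down is a structural interface. The sketch below uses Python's `typing.Protocol`; the payload type, the boolean return of `status()`, and the `RecordingPlugin` test double are assumptions, since this note names the methods but not their exact signatures.

```python
from typing import List, Protocol, runtime_checkable


@runtime_checkable
class SurfacePlugin(Protocol):
    """The whole plugin surface: three methods, nothing richer."""

    def notify_surface(self, payload: dict) -> None: ...

    def resume_surface(self) -> None: ...

    def status(self) -> bool: ...


class RecordingPlugin:
    """Hypothetical in-memory plugin, useful as a test double."""

    def __init__(self) -> None:
        self.accepted: List[dict] = []
        self.bound = True

    def notify_surface(self, payload: dict) -> None:
        # Fire-and-forget: accepting the doorbell is not a delivery ACK.
        self.accepted.append(payload)

    def resume_surface(self) -> None:
        self.bound = True

    def status(self) -> bool:
        return self.bound


plugin = RecordingPlugin()
plugin.notify_surface({"hint": "pending"})
print(isinstance(plugin, SurfacePlugin), len(plugin.accepted))
```

Note that `isinstance` against a runtime-checkable protocol only verifies that the methods exist, not that their signatures match.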

Delivery semantics

The daemon must preserve the “doorbell, then drain” model while keeping per-turn semantics intact.

Server-side push arrives for signal S1.
Daemon records S1 in its local pending index.
Daemon emits one surface-visible prompt for S1.
Agent drains through prism_signals_pending.
If S2..S5 are also pending, they remain queued until subsequent turns; the daemon does not compress them into one mega-notification.

This is the critical lesson from the surface-comparison research: Prism needs deliberation boundaries between deliveries. The daemon may stay awake continuously, but the agent experience remains per-turn.

The local pending index is a cache only. It is useful for pacing and rate-limiting doorbells, but it is never authoritative. On uncertainty or restart, the daemon re-queries the durable queue and may drop the local index without correctness loss.

Surface adapter acceptance is also not delivery acknowledgment. A surface plugin returning success only means “I accepted the doorbell into my local surface-side queue.” The actual ACK is still the agent’s drain via prism_signals_pending.
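
The pacing rule can be sketched as a small local index that releases at most one doorbell per turn opportunity and forgets entries once the agent has drained them. This is an illustration under assumptions: the class and method names are hypothetical, and "turn opportunity" is modeled as an explicit call rather than a real surface event.

```python
from typing import Callable, Dict, Iterable, Optional


class DoorbellPacer:
    """Local pending index: a cache for pacing, never the source of truth."""

    def __init__(self, emit: Callable[[str], None]) -> None:
        self._emit = emit
        # signal_id -> whether a doorbell was already emitted for it
        self._pending: Dict[str, bool] = {}

    def on_push(self, signal_id: str) -> None:
        # Record the observed signal; duplicate pushes are ignored.
        self._pending.setdefault(signal_id, False)

    def on_turn_opportunity(self) -> Optional[str]:
        # Release at most one doorbell, oldest unsurfaced signal first.
        for signal_id, surfaced in self._pending.items():
            if not surfaced:
                self._pending[signal_id] = True
                self._emit(signal_id)
                return signal_id
        return None

    def on_agent_drain(self, drained_ids: Iterable[str]) -> None:
        # The agent's drain is the real ACK; only then do entries leave the cache.
        for signal_id in drained_ids:
            self._pending.pop(signal_id, None)

    def reset(self) -> None:
        # Safe on restart or uncertainty: the durable queue is re-queried instead.
        self._pending.clear()


rung = []
pacer = DoorbellPacer(rung.append)
for sid in ["S1", "S2", "S3", "S4", "S5"]:
    pacer.on_push(sid)
pacer.on_turn_opportunity()
pacer.on_turn_opportunity()
print(rung)  # ['S1', 'S2']
```

Five queued signals produce two doorbells after two turn opportunities, not one batch of five.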

Idle and wake behavior

Two cases matter:

Agent idle, daemon healthy

The daemon can continue receiving push events while the agent is not actively taking turns. It should queue locally and release one doorbell per turn opportunity, never aggregate the whole backlog into one push.

Host sleep or daemon disconnect

On reconnect, the daemon must not assume local state is complete. It should reconcile by draining the authoritative pending queue and then resume one-per-turn notification from that source. Reconnect is a repair path, not a trust-the-cache path.

Reconnect does not inject immediately. The daemon waits for the next agent activity boundary, then emits one doorbell indicating pending work. If 12 signals accumulated during background time, the first turn back may drain all 12, but the daemon still emits one doorbell rather than storming the surface with 12 independent notifications.

Reconnect state machine

Doorbell emission must be gated by an explicit daemon state machine:

DISCONNECTED -> RECONCILING -> CONNECTED -> EMITTING

Rules:

reconnect enters RECONCILING
authoritative pending drain completes before EMITTING
no mid-turn injection while not in EMITTING
local doorbell release is paced only from EMITTING

This gate is load-bearing. Without it, reconnect races can cause duplicate or mistimed doorbells in the middle of active agent work.
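
A minimal sketch of that gate follows. The four state names come from this note; the transition table is an assumption, in particular the edges back to DISCONNECTED and from EMITTING back to CONNECTED, which the note does not spell out.

```python
from enum import Enum


class State(Enum):
    DISCONNECTED = "DISCONNECTED"
    RECONCILING = "RECONCILING"
    CONNECTED = "CONNECTED"
    EMITTING = "EMITTING"


# Assumed transition table; only the forward chain is stated in the note.
ALLOWED = {
    State.DISCONNECTED: {State.RECONCILING},
    State.RECONCILING: {State.CONNECTED, State.DISCONNECTED},
    State.CONNECTED: {State.EMITTING, State.DISCONNECTED},
    State.EMITTING: {State.CONNECTED, State.DISCONNECTED},
}


class DaemonGate:
    def __init__(self) -> None:
        self.state = State.DISCONNECTED

    def transition(self, new_state: State) -> None:
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state.name} -> {new_state.name}")
        self.state = new_state

    def can_emit(self) -> bool:
        # Doorbells are released only from EMITTING.
        return self.state is State.EMITTING


gate = DaemonGate()
gate.transition(State.RECONCILING)  # reconnect enters RECONCILING
gate.transition(State.CONNECTED)    # authoritative pending drain completed
print(gate.can_emit())              # False
gate.transition(State.EMITTING)
print(gate.can_emit())              # True
```

Because DISCONNECTED can only move to RECONCILING, a reconnect cannot reach EMITTING without passing through reconciliation first.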

Routing-registry lifecycle

Postmortem 0660f88e turns this from an implementation detail into an architectural requirement. Rules:

PeerJoined creates or refreshes the active binding for (identity, surface).
PeerLeft(reason=wrap|expire|preempt) invalidates that binding immediately.
prism_signal must not silently route to a wrapped/stale session.
If resolution fails health checks, return structured failure so the sender knows the target was not live-routable.

The daemon depends on this contract but does not own it. Registry truth stays server-side.
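
The four rules can be illustrated with a small server-side registry sketch. Everything beyond the event names and the `wrap|expire|preempt` reasons is assumed: the `RouteResult` shape, the failure strings, and the pluggable health check are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

LEAVE_REASONS = {"wrap", "expire", "preempt"}


@dataclass(frozen=True)
class RouteResult:
    ok: bool
    session_id: Optional[str] = None
    failure: Optional[str] = None  # structured reason, never a silent drop


class RoutingRegistry:
    def __init__(self) -> None:
        self._bindings: Dict[Tuple[str, str], str] = {}

    def peer_joined(self, identity: str, surface: str, session_id: str) -> None:
        # Create or refresh the active binding for (identity, surface).
        self._bindings[(identity, surface)] = session_id

    def peer_left(self, identity: str, surface: str, reason: str) -> None:
        if reason not in LEAVE_REASONS:
            raise ValueError(f"unknown PeerLeft reason: {reason}")
        # Invalidate immediately; a later PeerJoined rebinds.
        self._bindings.pop((identity, surface), None)

    def resolve(
        self,
        identity: str,
        surface: str,
        is_healthy: Callable[[str], bool] = lambda session_id: True,
    ) -> RouteResult:
        session_id = self._bindings.get((identity, surface))
        if session_id is None:
            return RouteResult(ok=False, failure="target_not_bound")
        if not is_healthy(session_id):
            return RouteResult(ok=False, failure="target_stale")
        return RouteResult(ok=True, session_id=session_id)


registry = RoutingRegistry()
registry.peer_joined("donna", "codex", "sess-1")
print(registry.resolve("donna", "codex"))
registry.peer_left("donna", "codex", reason="wrap")
print(registry.resolve("donna", "codex"))
```

After the wrap, a send to the same target resolves to a structured failure instead of the old session.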

Registration model

The daemon registers with Prism as a sibling runtime kind, not as a normal speaking agent session. Rules:

registration key shape extends to distinguish kind=daemon from kind=agent
daemon rows are counted separately from speaking agents in status and routing views
daemons are not master-eligible
daemons do not preempt or speak on behalf of a persona
daemon auth is scoped to the same operator/persona context as the surface it serves

This preserves observability without letting supervision processes leak into project control-plane semantics.

Failure model

The daemon is allowed to fail closed on local delivery while preserving durability upstream.

if local IPC to the surface fails, keep the signal pending and report a bounded local error
if the daemon dies, supervisor/launcher restarts only that persona’s daemon
if Prism WS disconnects, daemon reconnects and reconciles pending
if the agent surface is absent, the daemon remains healthy but reports degraded delivery state

The key rule is that local failure must not mutate the global queue into thinking the agent saw something it did not. For Codex specifically, the daemon should supervise the existing MCP surface adapter rather than speaking app-server protocol directly. The same architectural shape applies to other surfaces: the daemon owns WS continuity and supervised-process lifecycle; the shim owns surface-specific protocol translation.

Metrics contract for Porsche’s panel

Four calls should be fixed in the sketch now so Porsche is not drawing against fog.

1. Error classes are bounded

Expose:

daemon_errors_total{tenant_id,identity,error_class}
daemon_last_error_timestamp_seconds{tenant_id,identity,error_class}

error_class must be a bounded enum. Unknown local failures collapse to unclassified, not a free-string label. Initial enum:

ws_connect_failed
ws_stream_stalled
pending_reconcile_failed
surface_unavailable
surface_inject_failed
ipc_bind_failed
ipc_auth_failed
heartbeat_publish_failed
config_invalid
unclassified
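
A short sketch of how the bounded label could be enforced in code: the ten values are taken from the list above, while the `to_error_class` helper name is an assumption.

```python
from enum import Enum


class ErrorClass(str, Enum):
    WS_CONNECT_FAILED = "ws_connect_failed"
    WS_STREAM_STALLED = "ws_stream_stalled"
    PENDING_RECONCILE_FAILED = "pending_reconcile_failed"
    SURFACE_UNAVAILABLE = "surface_unavailable"
    SURFACE_INJECT_FAILED = "surface_inject_failed"
    IPC_BIND_FAILED = "ipc_bind_failed"
    IPC_AUTH_FAILED = "ipc_auth_failed"
    HEARTBEAT_PUBLISH_FAILED = "heartbeat_publish_failed"
    CONFIG_INVALID = "config_invalid"
    UNCLASSIFIED = "unclassified"


def to_error_class(raw: str) -> ErrorClass:
    """Map any local failure name onto the bounded enum."""
    try:
        return ErrorClass(raw)
    except ValueError:
        # Unknown failures collapse to one label instead of minting a new series.
        return ErrorClass.UNCLASSIFIED


print(to_error_class("ws_stream_stalled").value)   # ws_stream_stalled
print(to_error_class("disk quota exceeded").value)  # unclassified
```

Routing every label through one function keeps free-form strings out of the metric labels.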

2. Heartbeat and drain are separate signals

Expose distinct gauges:

daemon_heartbeat_age_seconds{tenant_id,identity}
daemon_drain_age_seconds{tenant_id,identity}
daemon_up{tenant_id,identity}
daemon_signal_queue_depth{tenant_id,identity}

They answer different questions.

heartbeat age = “is the process and WS loop alive?”
drain age = “has work actually been surfaced/drained recently?”

The 0660f88e class of failure is exactly why they cannot collapse into one metric.
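
To illustrate why the two gauges stay separate, the sketch below reads them together with queue depth and reports a distinct state when the process is alive but nothing is being drained. The thresholds and the returned state names are assumptions for illustration, not values from this note.

```python
def diagnose(
    heartbeat_age_s: float,
    drain_age_s: float,
    queue_depth: int,
    heartbeat_max_s: float = 30.0,   # assumed threshold
    drain_max_s: float = 300.0,      # assumed threshold
) -> str:
    """Classify daemon health from the separate heartbeat and drain gauges."""
    if heartbeat_age_s > heartbeat_max_s:
        # The process or its WS loop is not reporting in.
        return "daemon_unhealthy"
    if queue_depth > 0 and drain_age_s > drain_max_s:
        # Alive, but queued work is not reaching the agent.
        return "alive_but_not_draining"
    return "healthy"


print(diagnose(heartbeat_age_s=5, drain_age_s=1800, queue_depth=4))
print(diagnose(heartbeat_age_s=5, drain_age_s=1800, queue_depth=0))
```

A single merged "last activity" metric would report both calls above identically; with separate gauges the first is flagged and the second is simply idle.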

3. Emit cadence is hybrid

Use event-driven internal state changes plus a periodic 10s export tick. Reasoning:

event-driven updates keep state accurate when something changes fast
10s periodic export keeps Porsche’s existing scraper/panel cadence compatible and avoids inventing a bespoke viewer rhythm

So the daemon should update local state immediately, but publish the observable metric set on a 10s cadence with opportunistic immediate push allowed for severe state transitions if the telemetry path already supports it.
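
A hedged sketch of that hybrid cadence: state updates land immediately, while publication happens on a fixed tick or, optionally, right away for a severe transition. The class name, the `severe` flag, and the caller-supplied clock are assumptions; a real exporter would hook into whatever telemetry path already exists.

```python
from typing import Callable, Dict, Optional


class HybridExporter:
    def __init__(self, publish: Callable[[Dict[str, float]], None], interval_s: float = 10.0) -> None:
        self._publish = publish
        self._interval_s = interval_s
        self._state: Dict[str, float] = {}
        self._last_export: Optional[float] = None

    def update(self, name: str, value: float, now: float, severe: bool = False) -> None:
        # Event-driven: local state is always current.
        self._state[name] = value
        if severe:
            # Opportunistic immediate push for severe transitions.
            self._export(now)

    def tick(self, now: float) -> bool:
        # Periodic: publish a snapshot once per interval.
        if self._last_export is None or now - self._last_export >= self._interval_s:
            self._export(now)
            return True
        return False

    def _export(self, now: float) -> None:
        self._last_export = now
        self._publish(dict(self._state))


published = []
exporter = HybridExporter(published.append)
exporter.update("daemon_signal_queue_depth", 3, now=0.0)
exporter.tick(0.0)    # first tick publishes
exporter.update("daemon_signal_queue_depth", 5, now=2.0)
exporter.tick(5.0)    # too early, nothing published
exporter.tick(10.0)   # interval elapsed, publishes the latest value
print(len(published), published[-1])
```

Passing `now` in explicitly keeps the sketch deterministic; production code would more likely read a monotonic clock.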

4. Cardinality budget

At a 50-persona ceiling, the current bounded metric set is still safe.

daemon_errors_total: 10 error_class * 50 identity = 500 series
daemon_last_error_timestamp_seconds: 10 * 50 = 500 series
daemon_up, daemon_heartbeat_age_seconds, daemon_drain_age_seconds, daemon_signal_queue_depth: 4 * 50 = 200 series

Total: 1,200 active series (500 + 500 + 200) at the v1 ceiling, comfortably under the 10k/min budget with substantial headroom. This should be stated explicitly so the dashboard contract is honest about its actual shape.
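
The budget arithmetic can be checked mechanically. The sketch assumes each persona appears under exactly one tenant_id, so the tenant label does not multiply the series count:

```python
PERSONA_CEILING = 50
ERROR_CLASSES = 10

# Metrics labelled by error_class: daemon_errors_total and
# daemon_last_error_timestamp_seconds.
per_error_class_metrics = 2
# Gauges labelled by identity only: daemon_up, daemon_heartbeat_age_seconds,
# daemon_drain_age_seconds, daemon_signal_queue_depth.
per_identity_gauges = 4

error_series = per_error_class_metrics * ERROR_CLASSES * PERSONA_CEILING
gauge_series = per_identity_gauges * PERSONA_CEILING
total_series = error_series + gauge_series

print(error_series, gauge_series, total_series)  # 1000 200 1200
```

Adding one more error_class value at this ceiling costs 100 further series, so the enum size is the main lever on cardinality.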

Security model

Local IPC is per persona and single-user.

no LAN listener in v1
no machine-global unauthenticated daemon port
no cross-persona command channel
launcher/service manager owns daemon spawn with explicit identity
filesystem/socket ACLs are the first security boundary

If a future remote-control mode is desired, it should be treated as a new design, not an extension slipped into v1.

Cross-platform host shape

The daemon contract should stay stable while host supervision varies:

macOS: LaunchAgent
Linux: systemd user service or template unit
Windows: per-user service or equivalent managed process wrapper

Spawn topology

Spawn topology is per persona, not per machine and not per user.

one daemon process per persona registered to a project on a host
multiple personas on one host imply multiple daemons
each daemon is supervised independently
no daemon-per-machine multiplexer
no daemon-per-user umbrella process

Frank’s multi-persona host shape is the reference case here; the design must scale by repeating isolated units, not by centralizing them.

Lifecycle coupling

Daemon lifecycle is hard-coupled to persona/project lifecycle.

prism_wrap for the persona stops that persona’s daemon
persona archive/removal stops that persona’s daemon
project destroy stops all daemons attached to that project on the host
uninstall flows must expose clean daemon shutdown/removal

The shutdown IPC verb is therefore not just internal; operator-facing Prism lifecycle verbs must be able to trigger it deterministically.

Lafonda’s install lane can vary by platform, but each installer must preserve the same behavioral contract:

one daemon per persona
restart on crash
explicit start/stop on persona lifecycle verbs
logs and status inspectable locally

Non-goals

not a replacement for prism_signals_pending
not a general multi-persona machine bus
not a source of truth for queue durability
not a bypass around FastAPI/session-manager routing
not a batching layer for agent-visible signals

First-cut open questions

These are the seams worth Donna’s review before broader fan-out:

Does the daemon report its own heartbeat to Prism directly, or does it only supervise the surface’s existing lifecycle heartbeat path?
For Codex specifically, should the daemon talk to the app-server directly, or should it hand off to the existing MCP surface adapter so Codex-specific protocol logic stays in one place?
On reconnect, should one recovered signal be surfaced immediately and the rest held for later turns, or should reconnect merely mark local pending and wait for the next agent activity boundary?
Do we want one common daemon binary with surface plugins, or separate surface-specialized daemons behind a shared contract?

Recommended next step

Donna reviews this note for invariant fidelity and implementation seams. If it holds, the next artifact should be the canonical daemon SPEC draft plus a thin contract table for Porsche/Lafonda:

control socket verbs
exported metrics
lifecycle state machine
reconnect and reconcile sequence
