Per-agent daemon sketch — first-cut contract
First-cut architecture sketch for Donna-first review.
This is a hand-authored draft, not the canonical exported SPEC row yet.
This note fixes the contract shape for Prism’s always-on listener tier.
The problem is now empirically clear: session-stream push can fail to
become agent-visible while the durable signal record remains correct.
Tier-1 BIOS auto-drain reduces exposure, but it does not close the gap.
The structural answer is a per-agent daemon that keeps a local delivery
loop alive across surface idleness, sleep/wake, and session restarts.
Motivation: turn boundaries are the contract
Prism coordination is organized around turn boundaries. Signals drain at
turns, memory recall happens between turns, deltas are captured per
turn, and peer interrupts only become agent-visible at turn boundaries.
That makes the boundary itself load-bearing.
The batching finding and the daemon problem sit on the same axis:
- batching is the voluntary failure mode, where an agent collapses N
decision points into one long tool burst and elides the coordination
breakpoints that Prism depends on
- suspended/backgrounded sessions are the involuntary failure mode,
where the turn boundary does not happen at all for a while even though
the project continues to move around the agent
Tier-1 BIOS auto-drain is already a visible tax we are paying because of
this gap. As of PR #45 / commit 5df6679, every substantive turn is
expected to call prism_signals_pending when the last drain is stale.
That rule is useful, but it is still a behavioral polling workaround:
if a surface is backgrounded for 30 minutes, nothing becomes visible
until focus returns and the next substantive turn happens.
The daemon is the structural answer. It preserves the same turn
discipline the batching finding defended, but it keeps listening when
the human-facing surface is idle, suspended, or waiting for input.
The five invariants
These are the non-negotiables that the implementation must preserve.
- Per-agent, not per-machine. One daemon process per persona. No
machine-level fan-out router.
- Process boundary is the isolation wall. Multi-tenant separation
comes from OS process boundaries, not in-process routing discipline.
- Push is a doorbell, not the payload. A notification wakes the
agent/daemon path; the durable queue remains the source of truth.
- Delivery discipline stays one-per-turn. If N signals queue during
idleness, the daemon may observe all N, but it must expose them to the
agent surface one channel-push at a time, not as a batch.
- Routing-registry lifecycle is explicit. Peer lifecycle must
invalidate stale bindings on PeerLeft, rebind on PeerJoined, and
return structured failure when a target session is stale or wrapped.
Why per-agent wins
The daemon owns one persona and one local surface contract. That keeps
the mental model aligned with the thing being supervised:
- one daemon owns one persona’s queue, wake-ups, and local injection
- crash scope is one persona, not the whole machine’s fleet
- OS permissions become the security model for local IPC
- no shared in-memory routing table across personas
- local metrics and health state map cleanly to one dashboard card
The resource tradeoff is accepted. A few extra idle processes are cheaper
than a machine-local multiplexer that recreates cross-tenant routing and
failure coupling inside one process.
Topology
Prism FastAPI / Session Stream
        |
        | authenticated WS + heartbeat contract
        v
persona-daemon (one process per identity)
        |
        | local IPC only
        v
surface adapter (Codex, Claude Code, Desktop, ...)
        |
        | surface-native notification/injection
        v
agent thread/session
The daemon is a local receiver bridge and supervisor, not the global
router. FastAPI remains the authoritative signal service and durable
queue owner.
Responsibilities split
FastAPI / session manager
- persist every signal durably before any volatile fan-out
- own session registration, heartbeat acceptance, and project broadcast
- publish lifecycle events (PeerJoined, PeerLeft, preemption, etc.)
- expose a drain verb that returns the authoritative pending envelope
- reject or mark stale targets with structured send-side failure
Persona daemon
- maintain the long-lived local connection to Prism’s session stream
- publish a distinct daemon heartbeat to Prism; this is separate from
any agent-surface lifecycle heartbeat
- preserve local delivery continuity across agent idle periods
- translate a push event into exactly one agent-visible doorbell
- maintain a small local pending index, as a cache only, for events
already observed but not yet surfaced
- reconnect after sleep/wake or transient network loss
- supervise the local surface shim process with restart-on-crash and
bounded backoff where the surface architecture uses a shim
- expose health/metrics for observability
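The "restart-on-crash with bounded backoff" responsibility could be sketched roughly as follows. The function name `restart_delay` and the base and cap values are illustrative assumptions; this note does not fix them.

```python
def restart_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Delay in seconds before restart number `attempt` (0-based).

    Exponential growth clamped at `cap`, so a crash loop never waits
    unboundedly. `base` and `cap` are hypothetical defaults.
    """
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    # Clamp the exponent so very large attempt counts stay cheap to compute.
    return min(cap, base * (2 ** min(attempt, 32)))
```

With these assumed defaults the delays run 0.5s, 1s, 2s, 4s and so on, then hold at 30s. A supervisor would typically reset `attempt` to zero once the shim has stayed up for some grace period, which is also left open here.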
Surface adapter
- own surface-specific injection/notification mechanics
- accept fire-and-forget doorbells without becoming the delivery ACK
- never become the source of truth for pending signals
- degrade to “agent must drain explicitly” without data loss
IPC contract
The daemon should expose a local-only control socket per persona, not a
shared machine bus.
Recommended shape:
- macOS/Linux: Unix domain socket in a persona-scoped runtime dir with
owner-only permissions
- Windows: named pipe with equivalent single-user ACLs
The control surface only needs a small verb set:
- status: liveness, connected/disconnected, last heartbeat age, last
drain age, pending count, last error class
- notify_surface: emit one doorbell to the local surface adapter
- resume_surface: rebind when the agent UI/session restarts locally
- shutdown: clean stop for wrap/remove flows
This is intentionally not a general messaging API. The daemon is not a
second Prism backend.
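As a rough illustration of how small this control surface is, the sketch below models the status payload and a dispatcher that rejects anything outside the four verbs. The field names, the `handle_request` helper, and the response shape are assumptions made for the example, not part of the contract.

```python
from dataclasses import dataclass, asdict
from typing import Callable, Dict, Optional


@dataclass
class DaemonStatus:
    # Field names are hypothetical; they mirror the items listed for `status`.
    alive: bool
    connected: bool
    heartbeat_age_s: float
    drain_age_s: float
    pending_count: int
    last_error_class: Optional[str]


ALLOWED_VERBS = ("status", "notify_surface", "resume_surface", "shutdown")


def handle_request(verb: str, handlers: Dict[str, Callable[[], dict]]) -> dict:
    """Dispatch one control request; verbs outside the fixed set are rejected."""
    if verb not in ALLOWED_VERBS:
        return {"ok": False, "error": "unknown_verb", "verb": verb}
    handler = handlers.get(verb)
    if handler is None:
        return {"ok": False, "error": "not_implemented", "verb": verb}
    return {"ok": True, "result": handler()}


if __name__ == "__main__":
    status = DaemonStatus(True, True, 1.5, 40.0, 2, None)
    print(handle_request("status", {"status": lambda: asdict(status)}))
```

Keeping the verb set as a closed tuple makes it harder for the socket to drift into a general messaging API.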
Plugin contract
The daemon should be one common binary with thin surface plugins.
Plugin surface:
- notify_surface(payload): fire-and-forget doorbell enqueue into the
surface-side path
- resume_surface(): rebind hook when the local agent UI/session
restarts
- status(): surface liveness probe
Anything richer than this is a smell that surface protocol concerns are
leaking into the daemon.
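One way to pin this three-method surface down is a structural interface. The sketch below uses Python's `typing.Protocol`; the class names `SurfacePlugin` and `InMemorySurface`, the payload type, and the boolean return of `status()` are assumptions for illustration only.

```python
from typing import Any, List, Mapping, Protocol, runtime_checkable


@runtime_checkable
class SurfacePlugin(Protocol):
    """Hypothetical plugin interface with exactly the three hooks named above."""

    def notify_surface(self, payload: Mapping[str, Any]) -> None: ...

    def resume_surface(self) -> None: ...

    def status(self) -> bool: ...


class InMemorySurface:
    """Toy plugin used only for illustration; it records doorbells in a list."""

    def __init__(self) -> None:
        self.doorbells: List[Mapping[str, Any]] = []
        self.bound = True

    def notify_surface(self, payload: Mapping[str, Any]) -> None:
        # Fire-and-forget: accepting the doorbell is not a delivery ACK.
        self.doorbells.append(payload)

    def resume_surface(self) -> None:
        self.bound = True

    def status(self) -> bool:
        return self.bound
```

A real plugin would wrap a surface-specific mechanism behind the same three methods, so the daemon core never sees surface protocol details.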
Delivery semantics
The daemon must preserve the “doorbell, then drain” model while keeping
per-turn semantics intact.
- Server-side push arrives for signal S1.
- Daemon records S1 in its local pending index.
- Daemon emits one surface-visible prompt for S1.
- Agent drains through prism_signals_pending.
- If S2..S5 are also pending, they remain queued until subsequent
turns; the daemon does not compress them into one mega-notification.
This is the critical lesson from the surface-comparison research: Prism
needs deliberation boundaries between deliveries. The daemon may stay
awake continuously, but the agent experience remains per-turn.
The local pending index is a cache only. It is useful for pacing and
rate-limiting doorbells, but it is never authoritative. On uncertainty
or restart, the daemon re-queries the durable queue and may drop the
local index without correctness loss.
Surface adapter acceptance is also not delivery acknowledgment. A
surface plugin returning success only means “I accepted the doorbell
into my local surface-side queue.” The actual ACK is still the agent’s
drain via prism_signals_pending.
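The pacing behavior described in this section can be sketched as a small cache that releases at most one doorbell per turn opportunity. `DoorbellPacer` and its method names are hypothetical, and the notify callback stands in for whatever the surface plugin exposes.

```python
from collections import deque
from typing import Callable, Deque, Iterable, Optional


class DoorbellPacer:
    """Local pending index (cache only) that releases one doorbell per turn."""

    def __init__(self, notify: Callable[[str], None]) -> None:
        self._notify = notify
        self._pending: Deque[str] = deque()

    def on_push(self, signal_id: str) -> None:
        # Record the push; skip ids that are already waiting in the index.
        if signal_id not in self._pending:
            self._pending.append(signal_id)

    def on_turn_boundary(self) -> Optional[str]:
        # Emit at most one doorbell per turn opportunity, never a batch.
        if not self._pending:
            return None
        signal_id = self._pending[0]
        self._notify(signal_id)  # if this raises, the entry stays queued
        self._pending.popleft()
        return signal_id

    def reset_from_authoritative(self, signal_ids: Iterable[str]) -> None:
        # On uncertainty or restart, drop the cache and rebuild it from
        # whatever the durable queue reports.
        self._pending = deque(signal_ids)
```

Nothing here acknowledges delivery: removing an entry from the local index only means a doorbell was handed to the surface, and the drain through prism_signals_pending remains the ACK.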
Idle and wake behavior
Two cases matter:
Agent idle, daemon healthy
The daemon can continue receiving push events while the agent is not
actively taking turns. It should queue locally and release one doorbell
per turn opportunity, never aggregate the whole backlog into one push.
Host sleep or daemon disconnect
On reconnect, the daemon must not assume local state is complete. It
should reconcile by draining the authoritative pending queue and then
resume one-per-turn notification from that source. Reconnect is a repair
path, not a trust-the-cache path.
Reconnect does not inject immediately. The daemon waits for the next
agent activity boundary, then emits one doorbell indicating pending
work. If 12 signals accumulated during background time, the first turn
back may drain all 12, but the daemon still emits one doorbell rather
than storming the surface with 12 independent notifications.
Reconnect state machine
Doorbell emission must be gated by an explicit daemon state machine:
DISCONNECTED -> RECONCILING -> CONNECTED -> EMITTING
Rules:
- reconnect enters RECONCILING
- authoritative pending drain completes before EMITTING
- no mid-turn injection while not in EMITTING
- local doorbell release is paced only from EMITTING
This gate is load-bearing. Without it, reconnect races can cause
duplicate or mistimed doorbells in the middle of active agent work.
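A minimal sketch of that gate is shown below. The four state names come from this note; the fall-back to DISCONNECTED on connection loss and the return from EMITTING to CONNECTED are assumptions added so the example is complete.

```python
from enum import Enum


class DaemonState(Enum):
    DISCONNECTED = "DISCONNECTED"
    RECONCILING = "RECONCILING"
    CONNECTED = "CONNECTED"
    EMITTING = "EMITTING"


# Forward path is from the note. Dropping back to DISCONNECTED and leaving
# EMITTING for CONNECTED are assumptions of this sketch.
ALLOWED = {
    DaemonState.DISCONNECTED: {DaemonState.RECONCILING},
    DaemonState.RECONCILING: {DaemonState.CONNECTED, DaemonState.DISCONNECTED},
    DaemonState.CONNECTED: {DaemonState.EMITTING, DaemonState.DISCONNECTED},
    DaemonState.EMITTING: {DaemonState.CONNECTED, DaemonState.DISCONNECTED},
}


class DaemonStateMachine:
    def __init__(self) -> None:
        self.state = DaemonState.DISCONNECTED

    def transition(self, target: DaemonState) -> None:
        # Reject any jump that skips reconciliation or is otherwise unlisted.
        if target not in ALLOWED[self.state]:
            raise ValueError(
                f"illegal transition {self.state.name} -> {target.name}"
            )
        self.state = target

    def may_emit_doorbell(self) -> bool:
        # Doorbells are released only from EMITTING.
        return self.state is DaemonState.EMITTING
```

Because RECONCILING cannot jump straight to EMITTING, a reconnect cannot release a doorbell before the authoritative drain has completed.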
Routing-registry lifecycle
Postmortem 0660f88e turns this from an implementation detail into an
architectural requirement.
Rules:
- PeerJoined creates or refreshes the active binding for
(identity, surface).
- PeerLeft(reason=wrap|expire|preempt) invalidates that binding
immediately.
- prism_signal must not silently route to a wrapped/stale session.
- If resolution fails health checks, return structured failure so the
sender knows the target was not live-routable.
The daemon depends on this contract but does not own it. Registry truth
stays server-side.
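The server-side rules could be modeled along these lines. This is an in-memory sketch only: `RoutingRegistry`, `RouteResult`, and the failure strings are hypothetical names, and the health check is reduced to "is there a live binding".

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

Key = Tuple[str, str]  # (identity, surface)


@dataclass(frozen=True)
class RouteResult:
    ok: bool
    session_id: Optional[str] = None
    failure: Optional[str] = None  # hypothetical structured failure code


class RoutingRegistry:
    def __init__(self) -> None:
        self._bindings: Dict[Key, str] = {}
        self._invalidated: Dict[Key, str] = {}

    def peer_joined(self, identity: str, surface: str, session_id: str) -> None:
        # Create or refresh the active binding and clear any stale marker.
        key = (identity, surface)
        self._bindings[key] = session_id
        self._invalidated.pop(key, None)

    def peer_left(self, identity: str, surface: str, reason: str) -> None:
        if reason not in ("wrap", "expire", "preempt"):
            raise ValueError(f"unknown reason: {reason}")
        key = (identity, surface)
        # Invalidate immediately and remember why, for send-side failures.
        if self._bindings.pop(key, None) is not None:
            self._invalidated[key] = reason

    def resolve(self, identity: str, surface: str) -> RouteResult:
        key = (identity, surface)
        session_id = self._bindings.get(key)
        if session_id is not None:
            return RouteResult(ok=True, session_id=session_id)
        reason = self._invalidated.get(key)
        if reason is not None:
            return RouteResult(ok=False, failure=f"stale_binding:{reason}")
        return RouteResult(ok=False, failure="no_binding")
```

The point of the structured result is that a sender can tell "never bound" apart from "was bound, then wrapped", rather than having the signal silently routed to a dead session.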
Registration model
The daemon registers with Prism as a sibling runtime kind, not as a
normal speaking agent session.
Rules:
- registration key shape extends to distinguish kind=daemon from
kind=agent
- daemon rows are counted separately from speaking agents in status and
routing views
- daemons are not master-eligible
- daemons do not preempt or speak on behalf of a persona
- daemon auth is scoped to the same operator/persona context as the
surface it serves
This preserves observability without letting supervision processes leak
into project control-plane semantics.
Failure model
The daemon is allowed to fail closed on local delivery while preserving
durability upstream.
- if local IPC to the surface fails, keep the signal pending and report
a bounded local error
- if the daemon dies, supervisor/launcher restarts only that persona’s
daemon
- if Prism WS disconnects, daemon reconnects and reconciles pending
- if the agent surface is absent, the daemon remains healthy but reports
degraded delivery state
The key rule is that local failure must not mutate the global queue into
thinking the agent saw something it did not.
For Codex specifically, the daemon should supervise the existing MCP
surface adapter rather than speaking app-server protocol directly. The
same architectural shape applies to other surfaces: the daemon owns WS
continuity and supervised-process lifecycle; the shim owns
surface-specific protocol translation.
Metrics contract for Porsche’s panel
Four calls should be fixed in the sketch now so Porsche is not drawing
against fog.
1. Error classes are bounded
Expose:
daemon_errors_total{tenant_id,identity,error_class}
daemon_last_error_timestamp_seconds{tenant_id,identity,error_class}
error_class must be a bounded enum. Unknown local failures collapse to
unclassified, not a free-string label.
Initial enum:
ws_connect_failed
ws_stream_stalled
pending_reconcile_failed
surface_unavailable
surface_inject_failed
ipc_bind_failed
ipc_auth_failed
heartbeat_publish_failed
config_invalid
unclassified
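A bounded label set like this maps naturally onto an enum with a catch-all. The sketch below uses the ten values listed above; the helper name `to_error_class` is an assumption.

```python
from enum import Enum


class ErrorClass(str, Enum):
    WS_CONNECT_FAILED = "ws_connect_failed"
    WS_STREAM_STALLED = "ws_stream_stalled"
    PENDING_RECONCILE_FAILED = "pending_reconcile_failed"
    SURFACE_UNAVAILABLE = "surface_unavailable"
    SURFACE_INJECT_FAILED = "surface_inject_failed"
    IPC_BIND_FAILED = "ipc_bind_failed"
    IPC_AUTH_FAILED = "ipc_auth_failed"
    HEARTBEAT_PUBLISH_FAILED = "heartbeat_publish_failed"
    CONFIG_INVALID = "config_invalid"
    UNCLASSIFIED = "unclassified"


def to_error_class(raw: str) -> ErrorClass:
    """Map any raw failure name onto the bounded enum.

    Unknown names collapse to UNCLASSIFIED so the metric label can never
    grow into a free-string dimension.
    """
    try:
        return ErrorClass(raw)
    except ValueError:
        return ErrorClass.UNCLASSIFIED
```

Routing every label through a helper like this keeps the series count fixed even when new local failure modes appear before the enum is updated.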
2. Heartbeat and drain are separate signals
Expose distinct gauges:
daemon_heartbeat_age_seconds{tenant_id,identity}
daemon_drain_age_seconds{tenant_id,identity}
daemon_up{tenant_id,identity}
daemon_signal_queue_depth{tenant_id,identity}
They answer different questions.
- heartbeat age = “is the process and WS loop alive?”
- drain age = “has work actually been surfaced/drained recently?”
The 0660f88e class of failure is exactly why they cannot collapse into
one metric.
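To make the distinction concrete, here is a rough sketch of how a panel might read the two ages together. The thresholds, the function name `diagnose`, and the returned labels are all illustrative assumptions, not values this note commits to.

```python
def diagnose(
    heartbeat_age_s: float,
    drain_age_s: float,
    pending: int,
    heartbeat_limit_s: float = 30.0,   # hypothetical threshold
    drain_limit_s: float = 300.0,      # hypothetical threshold
) -> str:
    """Classify daemon health from the two independent gauges."""
    if heartbeat_age_s > heartbeat_limit_s:
        # Process or WS loop is not reporting at all.
        return "daemon_unresponsive"
    if pending > 0 and drain_age_s > drain_limit_s:
        # Heartbeat is fresh, yet queued work is not reaching the agent.
        return "alive_but_not_draining"
    return "healthy"
```

The middle case is the one a single combined metric would hide: the heartbeat looks fine while signals sit undelivered.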
3. Emit cadence is hybrid
Use event-driven internal state changes plus a periodic 10s export tick.
Reasoning:
- event-driven updates keep state accurate when something changes fast
- 10s periodic export keeps Porsche’s existing scraper/panel cadence
compatible and avoids inventing a bespoke viewer rhythm
So the daemon should update local state immediately, but publish the
observable metric set on a 10s cadence with opportunistic immediate push
allowed for severe state transitions if the telemetry path already
supports it.
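The hybrid cadence can be sketched as an exporter that updates state on every event but publishes on a fixed tick, with an optional immediate push for severe transitions. `MetricsExporter`, the `severe` flag, and the injectable clock are assumptions made so the example is testable.

```python
import time
from typing import Callable, Dict


class MetricsExporter:
    """Event-driven state updates with a periodic export tick."""

    def __init__(
        self,
        publish: Callable[[Dict[str, float]], None],
        interval_s: float = 10.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._publish = publish
        self._interval_s = interval_s
        self._clock = clock
        self._state: Dict[str, float] = {}
        self._last_export = clock()

    def update(self, name: str, value: float, severe: bool = False) -> None:
        # Local state changes immediately; export normally waits for the tick.
        self._state[name] = value
        if severe:
            self._export()

    def tick(self) -> bool:
        # Called from the daemon loop; exports once per interval.
        if self._clock() - self._last_export >= self._interval_s:
            self._export()
            return True
        return False

    def _export(self) -> None:
        self._publish(dict(self._state))
        self._last_export = self._clock()
```

A severe push also resets the tick timer in this sketch, so the next periodic export lands a full interval later; whether that is desirable depends on the scraper and is left open.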
4. Cardinality budget
At a 50-persona ceiling, the current bounded metric set is still safe.
- daemon_errors_total: 10 error_class * 50 identity = 500 series
- daemon_last_error_timestamp_seconds: 10 * 50 = 500 series
- daemon_up, daemon_heartbeat_age_seconds, daemon_drain_age_seconds,
daemon_signal_queue_depth: 4 * 50 = 200 series
Total: roughly 1200 active series at the v1 ceiling, comfortably under
the 10k/min budget with substantial headroom. This should be stated
explicitly so the dashboard contract is honest about its actual shape.
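The arithmetic behind that total can be written out as a small check. The constants restate the figures above; the function name is hypothetical, and the calculation assumes a single tenant_id value per deployment, which is this sketch's assumption rather than something the note states.

```python
ERROR_CLASSES = 10          # size of the bounded error_class enum
ERROR_LABELLED_METRICS = 2  # daemon_errors_total, daemon_last_error_timestamp_seconds
PER_IDENTITY_GAUGES = 4     # up, heartbeat age, drain age, queue depth


def active_series(personas: int) -> int:
    """Upper bound on active series for a given persona count."""
    error_series = ERROR_LABELLED_METRICS * ERROR_CLASSES * personas
    gauge_series = PER_IDENTITY_GAUGES * personas
    return error_series + gauge_series


if __name__ == "__main__":
    print(active_series(50))  # 2*10*50 + 4*50 = 1200
```

Each additional persona adds 24 series under these assumptions, which gives a quick way to re-check the budget if the ceiling or the enum grows.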
Security model
Local IPC is per persona and single-user.
- no LAN listener in v1
- no machine-global unauthenticated daemon port
- no cross-persona command channel
- launcher/service manager owns daemon spawn with explicit identity
- filesystem/socket ACLs are the first security boundary
If a future remote-control mode is desired, it should be treated as a
new design, not an extension slipped into v1.
The daemon contract should stay stable while host supervision varies:
- macOS: LaunchAgent
- Linux: systemd user service or template unit
- Windows: per-user service or equivalent managed process wrapper
Spawn topology
Spawn topology is per persona, not per machine and not per user.
- one daemon process per persona registered to a project on a host
- multiple personas on one host imply multiple daemons
- each daemon is supervised independently
- no daemon-per-machine multiplexer
- no daemon-per-user umbrella process
Frank’s multi-persona host shape is the reference case here; the design
must scale by repeating isolated units, not by centralizing them.
Lifecycle coupling
Daemon lifecycle is hard-coupled to persona/project lifecycle.
- prism_wrap for the persona stops that persona’s daemon
- persona archive/removal stops that persona’s daemon
- project destroy stops all daemons attached to that project on the host
- uninstall flows must expose clean daemon shutdown/removal
The shutdown IPC verb is therefore not just internal; operator-facing
Prism lifecycle verbs must be able to trigger it deterministically.
Lafonda’s install lane can vary by platform, but each installer must
preserve the same behavioral contract:
- one daemon per persona
- restart on crash
- explicit start/stop on persona lifecycle verbs
- logs and status inspectable locally
Non-goals
- not a replacement for prism_signals_pending
- not a general multi-persona machine bus
- not a source of truth for queue durability
- not a bypass around FastAPI/session-manager routing
- not a batching layer for agent-visible signals
First-cut open questions
These are the seams worth Donna review before broader fan-out:
- Does the daemon report its own heartbeat to Prism directly, or does
it only supervise the surface’s existing lifecycle heartbeat path?
- For Codex specifically, should the daemon talk to the app-server
directly, or should it hand off to the existing MCP surface adapter
so Codex-specific protocol logic stays in one place?
- On reconnect, should one recovered signal be surfaced immediately and
the rest held for later turns, or should reconnect merely mark local
pending and wait for the next agent activity boundary?
- Do we want one common daemon binary with surface plugins, or separate
surface-specialized daemons behind a shared contract?
Recommended next step
Donna reviews this note for invariant fidelity and implementation seams.
If it holds, the next artifact should be the canonical daemon SPEC draft
plus a thin contract table for Porsche/Lafonda:
- control socket verbs
- exported metrics
- lifecycle state machine
- reconnect and reconcile sequence
Last modified on June 7, 2026