Coach + meshi-agent-runtime + MCP — deploy runbook

Runbook for deploying the coach chat with its MCP back-channel to staging or production. The agent runtime source lives at agent-runtime/ (git submodule → github.com/myascendai/meshi-agent-runtime).

The MCP back-channel lets the runtime call platform tools (list_goals, get_brief, etc.) with per-user identity. Three secrets must be in lockstep across two Fly apps.

Apps involved

Fly app	Role	Important env
`meshi-api-staging`	Platform API at `api.staging.meshi.io`	`BETTER_AUTH_SECRET`, `MESHI_RUNTIME_*` (see `docs/runtime-swap-to-ts.md`)
`meshi-runtime-ts-staging`	Agent runtime at `meshi-runtime-ts-staging.internal:42617`	`GATEWAY_API_KEY`, `BETTER_AUTH_SECRET`, `MCP_CONFIG_PATH` or `MCP_CONFIG_JSON`

Step 1 — Mint the platform-side service API key

The runtime authenticates to /mcp with a meshi:service-scoped API key. The key’s auth_user_id is the synthetic __mcp_service__ row (already exists on staging Neon). The actual user identity per call comes from the x-meshi-runtime-user-id JWT, which the platform signs and the runtime forwards. (x-zeroclaw-user-id is accepted as a legacy fallback.)

-- Run against the staging Neon DB.
-- Generate a key and HASH it with sha256 BEFORE inserting.
-- The CLI flow:
--   RAW="mk_$(openssl rand -hex 32)"
--   HASH=$(printf '%s' "$RAW" | shasum -a 256 | awk '{print $1}')
--   echo "$RAW"  ← this is what goes into the runtime's MCP_CONFIG (Bearer)
INSERT INTO api_key (auth_user_id, name, key_hash, key_prefix, scopes)
VALUES (
  '__mcp_service__',
  'meshi-agent-runtime-staging',
  '<sha256 hex of the raw key>',
  'mk_',
  ARRAY['meshi:service']
);

Save the raw mk_… token to a password manager. The database stores only its SHA-256 hash, so the token cannot be recovered later.
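The mint-and-hash flow from the SQL comments can also be sketched in TypeScript. This is a minimal sketch, assuming the platform compares the SHA-256 hex digest of the presented bearer token against api_key.key_hash. The function names hashKey, mintServiceKey, and matchesStoredHash are hypothetical and do not come from the platform code.

```typescript
import { createHash, randomBytes, timingSafeEqual } from "node:crypto";

// SHA-256 hex digest of the raw key, equivalent to `shasum -a 256`.
function hashKey(raw: string): string {
  return createHash("sha256").update(raw, "utf8").digest("hex");
}

// Mirrors RAW="mk_$(openssl rand -hex 32)": 32 random bytes, hex-encoded.
function mintServiceKey(): { raw: string; keyHash: string; keyPrefix: string } {
  const raw = `mk_${randomBytes(32).toString("hex")}`;
  return { raw, keyHash: hashKey(raw), keyPrefix: "mk_" };
}

// Assumption: verification is a constant-time comparison of the two digests.
function matchesStoredHash(presented: string, storedHash: string): boolean {
  const a = Buffer.from(hashKey(presented), "hex");
  const b = Buffer.from(storedHash, "hex");
  return a.length === b.length && timingSafeEqual(a, b);
}
```

Only keyHash and keyPrefix belong in the INSERT; raw goes to the password manager and, later, into the runtime's MCP config.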

Step 2 — Sync `BETTER_AUTH_SECRET` across both apps

The platform signs x-meshi-runtime-user-id JWTs with BETTER_AUTH_SECRET; the runtime verifies them with the same value (it reads JWT_SECRET first, then falls back to BETTER_AUTH_SECRET, so set either). If the two values drift, every MCP call fails with a 401.

# Read platform's existing secret (DO NOT print to terminal in shared sessions)
PLAT_SECRET=$(fly ssh console -a meshi-api-staging -C 'printenv BETTER_AUTH_SECRET')

# Set the same value on the runtime
fly secrets set -a meshi-runtime-ts-staging BETTER_AUTH_SECRET="$PLAT_SECRET"
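To illustrate why secret drift rejects every call, here is a minimal sketch of shared-secret JWT signing and verification. It assumes HS256 (HMAC-SHA256); the runbook only states that both sides share one secret, so the actual algorithm may differ. The helpers resolveJwtSecret, signJwt, and verifyJwt are hypothetical names, not the runtime's API.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

type Env = Record<string, string | undefined>;

// Lookup order described in this step: JWT_SECRET first, then BETTER_AUTH_SECRET.
function resolveJwtSecret(env: Env): string | undefined {
  return env["JWT_SECRET"] || env["BETTER_AUTH_SECRET"];
}

const b64url = (s: string): string => Buffer.from(s, "utf8").toString("base64url");

const hmac = (data: string, secret: string): Buffer =>
  createHmac("sha256", secret).update(data).digest();

// Assumption: HS256. Platform side: sign the per-user token.
function signJwt(payload: Record<string, unknown>, secret: string): string {
  const head = b64url(JSON.stringify({ alg: "HS256", typ: "JWT" }));
  const body = b64url(JSON.stringify(payload));
  const sig = hmac(`${head}.${body}`, secret).toString("base64url");
  return `${head}.${body}.${sig}`;
}

// Runtime side: returns the payload, or null when the signature does not match.
// Expiry and audience checks are omitted to keep the sketch short.
function verifyJwt(token: string, secret: string): Record<string, unknown> | null {
  const parts = token.split(".");
  if (parts.length !== 3) return null;
  const [head = "", body = "", sig = ""] = parts;
  const expected = hmac(`${head}.${body}`, secret);
  const actual = Buffer.from(sig, "base64url");
  if (actual.length !== expected.length || !timingSafeEqual(actual, expected)) return null;
  return JSON.parse(Buffer.from(body, "base64url").toString("utf8"));
}
```

A token signed with one secret and verified with another returns null here, which corresponds to the 401 seen when the two apps disagree.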

Step 3 — Build the runtime’s MCP_CONFIG and deploy it

Set the entire JSON as a Fly secret:

fly secrets set -a meshi-runtime-ts-staging \
  MCP_CONFIG_JSON='{"servers":[{"name":"meshi-platform","transport":"http","url":"https://api.staging.meshi.io/mcp","headers":{"Authorization":"Bearer mk_<from step 1>"}}]}'

Then ensure the Dockerfile entrypoint (or a start script) writes it to disk at boot:

# printf avoids the backslash-escape handling that echo applies in some shells
printf '%s\n' "$MCP_CONFIG_JSON" > /tmp/mcp-config.json
export MCP_CONFIG_PATH=/tmp/mcp-config.json
exec deno task start

Do NOT include a static x-meshi-runtime-user-id header in the config. The runtime injects a per-user JWT on every MCP call. A static one is an impersonation footgun.
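The config shape and the no-static-identity rule can be checked before the secret is set. The sketch below is illustrative only: the McpConfig interfaces are inferred from the JSON in this step, and buildMcpConfig and findStaticIdentityHeaders are hypothetical helpers, not part of the runtime.

```typescript
// Shape inferred from the MCP_CONFIG_JSON example in this step.
interface McpServerConfig {
  name: string;
  transport: "http";
  url: string;
  headers: Record<string, string>;
}

interface McpConfig {
  servers: McpServerConfig[];
}

// Identity headers the runtime injects per call; they must never be static.
const PER_USER_HEADERS = ["x-meshi-runtime-user-id", "x-zeroclaw-user-id"];

function buildMcpConfig(url: string, serviceKey: string): McpConfig {
  return {
    servers: [
      {
        name: "meshi-platform",
        transport: "http",
        url,
        headers: { Authorization: `Bearer ${serviceKey}` },
      },
    ],
  };
}

// Returns "server: header" for each static identity header (case-insensitive match).
function findStaticIdentityHeaders(config: McpConfig): string[] {
  const hits: string[] = [];
  for (const server of config.servers) {
    for (const header of Object.keys(server.headers)) {
      if (PER_USER_HEADERS.includes(header.toLowerCase())) {
        hits.push(`${server.name}: ${header}`);
      }
    }
  }
  return hits;
}
```

Running findStaticIdentityHeaders over any committed mcp-config.json in CI would be one way to catch an accidental impersonation token; an empty array means the config is clean.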

Step 4 — Deploy the runtime

cd agent-runtime
git pull origin main
fly deploy -c fly.staging.toml -a meshi-runtime-ts-staging

Watch the logs and confirm:

`meshi-ts-runtime listening on :::42617` appears
no `JWT_SECRET not configured` warnings appear
no `MCP server "meshi-platform" did not return Mcp-Session-Id` errors appear
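The log checks above can be scripted against captured log lines. This is a rough sketch that matches on the substrings quoted in this step; checkStartupLogs is a hypothetical helper, and the real log lines may carry timestamps or prefixes that this substring match simply ignores.

```typescript
interface LogCheck {
  listening: boolean; // the startup line was seen
  problems: string[]; // lines containing a known bad message
}

// Substrings taken from the checklist in this step.
const REQUIRED_LINE = "meshi-ts-runtime listening on";
const FORBIDDEN_FRAGMENTS = [
  "JWT_SECRET not configured",
  "did not return Mcp-Session-Id",
];

function checkStartupLogs(lines: string[]): LogCheck {
  const listening = lines.some((line) => line.includes(REQUIRED_LINE));
  const problems = lines.filter((line) =>
    FORBIDDEN_FRAGMENTS.some((fragment) => line.includes(fragment))
  );
  return { listening, problems };
}
```

A deploy looks healthy when listening is true and problems is empty.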

Step 5 — Smoke test on staging

# Pick any user with goals; get their auth_user_id from Neon.
USER_ID=<some staging auth_user_id with goals>

curl -sS -X POST \
  -H "Authorization: Bearer $MESHI_RUNTIME_API_KEY" \
  -H "x-user-id: $USER_ID" \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"system","content":"Always call mcp_call_tool: server=meshi-platform method=tools/call params={name:list_goals,arguments:{}}. Then list goal titles."},{"role":"user","content":"List my goals."}],"stream":false}' \
  https://meshi-runtime-ts-staging.fly.dev/v1/chat/completions \
  | jq -r '.choices[0].message.content'

Expected: a list of that user’s actual goals from staging Neon. If the response is anything else, recheck the secret sync (step 2) and the MCP_CONFIG (step 3).

Notes:

The platform bearer env var is MESHI_RUNTIME_API_KEY (primary); ZEROCLAW_API_KEY is a legacy fallback.

Persisted assistant message shape — `response_object.tool_events`

The coach /conversations/:id/messages POST handler tees the runtime’s SSE stream and accumulates a structured tool_events: ToolEvent[] array saved on the assistant message’s response_object JSON column. This lets reloads rebuild the tool-call chrome (cards + result bodies + pre-tool reasoning) without replaying the runtime.

Any consumer displaying assistant messages (admin viewers, exports, analytics, replay tools) must handle this shape:

type ToolEvent =
  | {
    type: "call";
    index: number; // monotonic across rounds, dedupe key
    id?: string; // model's tool_call_id (correlates with results)
    name?: string; // tool name (e.g. "mcp_call_tool", "list_goals")
    arguments_json: string; // JSON-encoded args (model-emitted)
  }
  | {
    type: "result";
    tool_call_id: string; // matches a `call`'s id
    content: string; // tool result body (JSON; for MCP, outer envelope wrapping inner text)
  }
  | {
    type: "pre_reasoning";
    content: string; // model's "thinking" before invoking a tool
  };

The array is in chronological order. A typical multi-round turn: [pre_reasoning, call, result, pre_reasoning, call, result] followed by the final prose in message.content.

tool_events is stored only on role='assistant' messages, and only when the runtime emitted at least one event. Older messages (pre-coach-migration) have response_object = null, so consumers must handle both cases.
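A consumer might fold the event array into one record per tool call. The following is a minimal sketch under the shape described above; ToolCallView and toToolCallViews are hypothetical names, and attaching each pre_reasoning entry to the next call is an assumption based on the chronological ordering.

```typescript
type ToolEvent =
  | { type: "call"; index: number; id?: string; name?: string; arguments_json: string }
  | { type: "result"; tool_call_id: string; content: string }
  | { type: "pre_reasoning"; content: string };

interface ToolCallView {
  index: number;
  name: string;
  argumentsJson: string;
  preReasoning: string | null; // reasoning emitted just before this call, if any
  result: string | null; // null when no matching result was recorded
}

function toToolCallViews(
  responseObject: { tool_events?: ToolEvent[] } | null | undefined,
): ToolCallView[] {
  // Older messages have response_object = null; treat them as having no events.
  const events = responseObject?.tool_events ?? [];
  const views: ToolCallView[] = [];
  const byCallId = new Map<string, ToolCallView>();
  const seenIndexes = new Set<number>();
  let pendingReasoning: string | null = null;

  for (const ev of events) {
    if (ev.type === "pre_reasoning") {
      pendingReasoning = ev.content;
    } else if (ev.type === "call") {
      if (seenIndexes.has(ev.index)) continue; // index is the dedupe key
      seenIndexes.add(ev.index);
      const view: ToolCallView = {
        index: ev.index,
        name: ev.name ?? "(unnamed tool)",
        argumentsJson: ev.arguments_json,
        preReasoning: pendingReasoning,
        result: null,
      };
      pendingReasoning = null;
      views.push(view);
      if (ev.id !== undefined) byCallId.set(ev.id, view);
    } else {
      // A result correlates with a call through tool_call_id.
      const view = byCallId.get(ev.tool_call_id);
      if (view) view.result = ev.content;
    }
  }
  return views;
}
```

Results whose tool_call_id matches no recorded call are dropped silently in this sketch; a stricter consumer may prefer to surface them.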

Open infrastructure debt

MCP session map is per-process, in-memory. Scaling the runtime beyond one Fly instance means sessions scatter across pods. Today min_machines_running = 2 in fly.staging.toml with round-robin DNS — MCP is stateful so sticky routing would be needed before scaling further.
agent_run table (migration 083 in the platform DB) records token counts, tool-call counts, model name, and status per turn — populate it for runtime-call observability.
Per-tool 15s MCP timeout in the runtime’s MCP client defends against a stuck platform tool exhausting MAX_TOOL_ROUNDS.
Static x-meshi-runtime-user-id in any committed mcp-config.json is a single-user impersonation token. Never commit one for staging or prod.

Rollback

fly releases -a meshi-runtime-ts-staging
fly releases rollback <prior-version> -a meshi-runtime-ts-staging

Platform rollback is not needed — coach endpoints are additive and the stateful MCP handler is backwards-compatible with stateless clients.
