Deep Research Pipeline — Architecture & Design Reference
Deep Research Pipeline — Architecture & Design Reference
Last updated: 2026-04-30 Status: Live on staging. Validated at PCMA-1000 (1,009 members) and PCMA-5000 (4,998 members) scale. Default execution path for the CSV-import pipeline’s web-research stage.
1. Purpose and Context
The deep research pipeline (also called evidence enrichment) is Meshi’s autonomous web-discovery layer. Its job is to locate publicly available web sources about a person already in the system, verify each source belongs to the correct individual, and distill the content into structured chunks that enrich that person’s profile.
It is not the primary import path. Organizer CSV imports and LinkedIn data are imported deterministically up front. The research pipeline is the second pass — an asynchronous background job that runs after a baseline person row exists, with three operational entrypoints:
- Per-entity manual / API trigger (single contact research).
- Auto-trigger on LinkedIn miss — entities that fail LinkedIn discovery flow into
web_researchingautomatically as part of the CSV import pipeline. - Bulk operational backfill — direct executor scripts (see §13) for batch runs against thousands of entities, bypassing the Inngest scheduler while reusing the same compute + persist path.
Why it exists
- LinkedIn profiles alone are sparse and self-reported. The research pipeline surfaces corroborating, contradicting, and complementary evidence from third-party sources — interviews, podcasts, news articles, company websites, association directories.
- It is the only path that can rescue an entity that the LinkedIn discovery stage failed to match. Discovered LinkedIn URLs are promoted back into the LinkedIn enrichment lane (see §10).
- It enables discovery of external identity handles (Twitter/X, GitHub, personal website, ORCID) that the user has not explicitly provided.
- It turns unstructured web content into structured
source_chunkrows, which feed the LLM trait inference pass.
Where it fits in the import pipeline
Organizer CSV import (or single-entity create) │ ▼ linkedin_searching ──── match found ──► enriching ──► inferring ──► synthesizing ──► complete │ ▲ │ no match │ web research surfaced a LinkedIn URL ▼ │ (migration 066: match_method='web_research_discovered') web_researching ◄──────────────────────────┤ │ ├─ accepted sources ──► inferring (Pass 2) ├─ candidate only ────► research_review ├─ LinkedIn evidence ─► enriching | inferring (promoted) └─ no useful sources ─► unresearchablePipeline state transitions live in reconcilePipelineAfterEvidenceEnrichment (see §10).
2. Trigger Mechanisms
2.1 HTTP API (production manual trigger)
POST /api/v0/me/contacts/:entityId/researchHandler: triggerResearch() in packages/core/src/services/research.service.ts.
Guards applied before job creation:
| Guard | Rule |
|---|---|
| Entity existence | Entity must exist and belong to the requesting user |
| Idempotency | Returns existing queued or running job ID if one exists |
| Staleness window | Blocks re-research within 30 days of last completed job (production only) |
| Monthly budget cap | Max 10 manual runs per user per calendar month |
On success: inserts a research_job row (status = 'queued') and writes an
EVIDENCE_ENRICHMENT_REQUESTED event to the domain_event_outbox table. Returns
{ jobId, entityId, status: "queued" } (HTTP 202).
Additional production endpoints in packages/api/src/routes/research.ts:
- Research status:
GET /api/v0/me/contacts/:entityId/research- Research detail (sources/identities/evaluations):
GET /api/v0/me/contacts/:entityId/research/detail- Source rejection (operator/user rejects a specific source):
DELETE /api/v0/me/contacts/:entityId/research/sources/:sourceId2.2 CSV-import auto-trigger
Entities that complete LinkedIn discovery without a confident match (or whose import-time
match-method requires web verification) are advanced to web_researching by the import-pipeline
state machine. The pipeline-sweep cron (see §11) re-emits EVIDENCE_ENRICHMENT_REQUESTED for any
row that has been stuck in web_researching past the stale threshold.
2.3 Bulk operational scripts
For backfill or recovery on large groups (PCMA-1000, PCMA-5000):
scripts/backfill-pcma-web-research.ts # selects + queues via Inngest eventsscripts/run-pcma-web-research-jobs.ts # direct executor — bypasses Inngest schedulerscripts/repair-pcma-deepinfra-artifacts.ts # idempotent embedding artifact repair after provider outagesrun-pcma-web-research-jobs.ts is the workhorse: it pulls batches of queued / stale
research_job rows and runs runEvidenceEnrichment directly with controlled concurrency, search
RPS, queries-per-person, and an optional --heuristic-eval mode. Each job still writes the same
research_job, source_record, source_chunk, and outbox rows as the Inngest path; only the
scheduler is bypassed.
2.4 Domain event → Inngest
The outbox relay picks up evidence-enrichment.requested and dispatches it to Inngest:
DomainEventType.EVIDENCE_ENRICHMENT_REQUESTED = "evidence-enrichment.requested"2.5 Developer dry-run API
POST /api/v0/dev/research/dry-run/:entityIdGET /api/v0/dev/research/dry-run/:dryRunJobIdGated by MESHI_ENABLE_DEV_ENDPOINTS=1. Calls computeResearch directly as a detached promise. No
DB writes, no research_job rows, no outbox events. Results stored in-memory with a 1-hour TTL.
Used for prompt iteration and eval harness work.
Auth nuance (matches packages/api/src/routes/dev-research.ts):
- POST performs the ownership/connection check for the target
entityId. - GET (polling) requires only a valid session; it does not re-check entity ownership and
polls by the returned
dryRunJobId.
3. Inngest Function Structure
File: packages/workers/src/functions/evidence-enrichment.ts
Function ID: "meshi/evidence-enrichment.run"Trigger: EVIDENCE_ENRICHMENT_REQUESTEDConcurrency: 5 simultaneous runs (env: MESHI_EVIDENCE_ENRICHMENT_CONCURRENCY, Inngest plan cap = 5)Throttle: 300 starts / minute (env: MESHI_EVIDENCE_ENRICHMENT_THROTTLE_LIMIT/_PERIOD)Retries: 3 attemptsTimeouts: start: 24h finish: 60mThe 5/300 defaults reflect Inngest free-plan limits hit during the PCMA-1000 finalization. Local
stack runs can override (MESHI_EVIDENCE_ENRICHMENT_CONCURRENCY=15, throttle 120/min) for batch
drains against staging Neon — committed code stays at the plan cap.
Implementation note: after compute + persist, evidence enrichment also writes readiness facts via
writeEvidenceEnrichmentReadiness(...) (e.g. external_evidence, activity_log_emitted) as
best-effort observability / downstream-routing witnesses.
Step layout
runEvidenceEnrichment (idempotency guard + startJob) │ ├─ buildPersonContextFromDb (person + timeline + identities + traits + seed_data CSV cols) ├─ loadPreviouslyFetchedUrls (Set<source_object_key> for the entity's web_research sources) │ ├─ computeResearch │ ├─ generateQueries (deterministic by default; LLM if MESHI_RESEARCH_QUERY_GENERATOR=llm) │ ├─ step: query-0 (search → fetch → eval → distill, durably memoized) │ ├─ step: query-1 │ │ ... │ └─ step: query-N (up to BUDGET.maxSearches) │ ├─ step: persist (all DB writes + terminal outbox event in one txn) │ └─ step: update-pipeline-stage (reconcilePipelineAfterEvidenceEnrichment)Each step: query-N is wrapped in Inngest step.run(...). runStep is plumbed through
ComputeResearchDeps (see packages/workers/src/research/compute.ts:139) so dry-run, tests, and
the bulk scripts can opt out by passing no runStep — all iterations then run inline. On Inngest
retry, completed query steps return their cached QueryIterationResult and
applyQueryIterationResult rebuilds in-memory ComputeState deterministically.
persistResearch is also wrapped in its own step.run("persist", ...) so a crash after queries
complete but before DB writes finish completes cleanly on retry without re-running any LLM call.
Progress reporting
After each evaluated URL, an onProgress callback fires and writes cumulative counter columns to
research_job using absolute SET semantics. Both the live-progress path (onProgress in the
Inngest handler) and the final persist path (persistResearch) converge idempotently via SET —
counters are always a consistent snapshot, never over-counted. The agent_log JSONB is also flushed
on every callback so the admin UI’s PipelineSheet can poll for live evaluation traces.
4. Core Compute: computeResearch
File: packages/workers/src/research/compute.ts
This function is pure compute with zero DB access. All side effects are injected as dependencies
(searchProvider, webFetch, generateObject, generateText, onProgress, runStep). This
makes it directly testable and reusable by the dry-run API, eval harness, and bulk scripts without
needing a real database.
Hard budget limits (per run)
| Limit | Value | Notes |
|---|---|---|
| Max search API calls | 16 | BUDGET.maxSearches (was 10 prior to 04-27) |
| Max URLs fetched | 32 | BUDGET.maxFetches |
| Max LLM calls (total) | 48 | BUDGET.maxLlmCalls |
| Max accepted sources | 10 | BUDGET.maxAccepted |
| Results per query | 8 | DEFAULT_RESULTS_PER_QUERY |
| Domain diversity cap | 2 | accepted/candidate sources per domain per run |
Any limit hit → stop_reason: "budget_exhausted".
The earlier “5 consecutive rejects → converged” stop reason has been removed. The only terminal stop reasons emitted by compute are
budget_exhaustedandcompleted. TheconsecutiveRejectscounter is still tracked onComputeState(and reported inQueryIterationResult.finalConsecutiveRejects) but no longer drives termination — eval failures deliberately do not advance it (seecompute.ts:812comment, finding A12). The convergence heuristic was dropped because it stranded too many entities at low budgets when judges legitimately rejected a noisy first page of search results.
Per-query iteration flow
1. generateQueries(personContext) └─ deterministic from name/role/company/email-domain/seed CSV context OR LLM when MESHI_RESEARCH_QUERY_GENERATOR=llm └─ returns up to 16 queries
2. For each query (wrapped in step.run("query-N", ...)): a. searchProvider.search(query, 8 results)
b. For each URL (within maxFetches budget): i. webFetch(url) └─ SSRF validation (string + DNS, IPv6 prefix blocks, IPv4-mapped guards) └─ Manual redirect follow (max 5 hops) └─ 512KB raw body cap, 15s timeout └─ extractText() → 10KB text cap
ii. if fetch failed OR text < 100 chars: └─ tryStoreSearchMetadataCandidate(): if title+snippet ≥ 80 chars, evaluate the metadata as a candidate (linkConfidence: "medium") so admin review can still surface useful URLs we couldn't crawl (e.g. LinkedIn 999, blocked HTTP 403). New since the 04-27 PCMA push — directly addresses the ~37% fetch_failed rate observed at scale.
iii. evaluateIdentity(person, url, content[0..6000]) └─ LLM structured output with 7-field evidence rubric └─ Returns ACCEPT | REVIEW | REJECT + discovered identifiers
iv. if ACCEPT (and domain cap not exceeded): └─ distillSource(person, url, content[0..10000]) └─ LLM → 1–3 summary chunks with topic lists
3. Return QueryIterationResult (JSON-serializable, memoized by Inngest)Mutable accumulator + deterministic replay
ComputeState is the running total across query iterations.
applyQueryIterationResult(state, delta) merges each QueryIterationResult into it. On Inngest
replay, completed step results are re-applied deterministically to reconstruct final state without
re-running any computation. consecutiveRejects is set absolutely (not as a delta) since a
successful accept resets it to 0.
5. Identity Evaluation
File: packages/workers/src/research/identity-eval.ts
The core quality gate. Content is truncated to 6,000 characters before the LLM call.
7-field evidence rubric
| Field | Values | Notes |
|---|---|---|
nameEvidence | { found: bool, quote: str } | counts toward positive evidence |
affiliationEvidence | { found: bool, quote: str } | counts toward positive evidence |
roleEvidence | { found: bool, quote: str } | counts toward positive evidence |
crossReference | { found: bool, quote: str } | counts toward positive evidence |
temporalConsistency | { found: bool, evidence: str } | counts toward positive evidence |
uniqueContext | { found: bool, quote: str } | counts toward positive evidence |
contradictoryEvidence | { found: bool, detail: str } | negative signal — does NOT count |
discoveredIdentifiers | { identityType, identityValue }[] | linkedin_url, github_username, twitter_handle, mastodon_handle, personal_website, nickname, orcid, email |
The schema (evidenceField Zod union) is tolerant: nulls, plain strings, and bare booleans are
coerced to { found, quote/evidence/detail }. This survives noisy or schema-drifting LLM output
across the model router fallback chain.
Post-processing rules
ACCEPTwith fewer than 2 positive evidence fields → downgrade to REJECTACCEPTwithcontradictoryEvidence.found = true→ downgrade to REVIEWREVIEWis stored ascandidate(linkConfidencemedium)ACCEPTis stored asaccepted(linkConfidencehigh)
Discovered LinkedIn URLs
shouldStoreDiscoveredIdentity validates linkedin_url discoveries through normalizeLinkedinUrl
before persistence. Bare usernames are normalized into full https://www.linkedin.com/in/<username>
URLs in storeIdentityAnchor. The linkedin_username identity type recurses through
storeIdentityAnchor to register the canonical linkedin_url row while preserving the
linkedin_username row as the provenance trail back to the originating source_record (see also
.beans/...-linkedin-pipeline-correctness*).
Design doctrine: identity integrity over recall
A false positive (wrong person accepted) is much more damaging than a missed valid source. The rubric is calibrated to prefer misses over false matches. Empirically this lands at ~97% precision / ~75% recall on the eval-harness fixtures.
6. Content Distillation
File: packages/workers/src/research/distill.ts
Runs only on ACCEPT-classified sources. Content truncated to 10,000 characters.
Output schema
{ chunks: Array<{ summary: string; // 2–4 sentence person-focused summary topics: string[]; // e.g. ["career", "education", "ventures"] }>; // min: 1, max: 3 chunks per source}Settings: temperature: 0.1, maxTokens: 4000. The 4000-token budget is required for reasoning
models (Modal GLM-FP8, Cerebras zai-glm) that emit a long <thinking> block before output —
cleanThinkingTags strips it before parsing. A lower budget silently produced empty chunks.
These chunks become source_chunk rows in the database (chunk_type: "research_summary",
chunker_version: "research_distill_v1"). They are the primary text corpus that trait inference
reads.
7. External APIs and Services
Web search providers
createSearchProvider() selects in this priority order, returning the first configured backend:
| Provider | Selection | Notes |
|---|---|---|
| Mock | MOCK_SEARCH=true | Deterministic results for stress tests without API credits. |
| RapidAPI Google (primary) | RAPIDAPI_SERP_KEY present | Google search via google-search116. ~$0.0003/query. ~91% recall for people discovery. |
| Jina (fallback) | JINA_API_KEY present | GET-based, free. Same recall as Google but ~4× slower; 422 no-results edge case. |
| Exa (last resort) | EXA_API_KEY present | Neural search, autoprompt. Fast but ~9% recall for entity lookup — wrong tool. |
| DuckDuckGo | n/a | Available via createAllProviders but excluded from createSearchProvider priority — bot detection blocks all automated requests in practice. |
createAllProviders() exposes every configured backend as a labeled map for benchmark / eval use
(e.g. tests/integration/eval/...). detectSearchProviderLabel() mirrors the priority order and is
used by the cost tracker to bill queries under the correct provider name.
LLM providers — model router
LLM calls flow through the weighted model router in packages/workers/src/llm/model-router.ts.
The router selects from a tier-aware candidate pool with per-provider weights, concurrency caps,
request-rate limits, and circuit breakers. Failure kinds (rate_limit, retryable, capability,
non_retryable) drive the circuit and disable incompatible operations (e.g. schemaObject for
Cerebras, which returns 400s on generateObject — the router falls back to text_json
automatically).
Tiers used by the research pipeline:
basic— query generation, identity evaluation, distillation. Default tier incomputeResearch.linkedin-identity— used by the upstream LinkedIn candidate-eval judge (separate path:packages/workers/src/linkedin/candidate-eval.ts). Configured via theLINKEDIN_IDENTITY_MODELSarray (formerlyJUDGE_MODELS).
Current routing weights (MODEL_ROUTING_CONFIGS):
| Label | basic | advanced | eloquent | linkedin-identity | maxConcurrency | RPM |
|---|---|---|---|---|---|---|
codex:gpt-5.4-mini | 35 | 45 | 50 | 5 | 8 | 120 |
cerebras:zai-glm-4.7 | 30 | 20 | 20 | 0 | 10 | 50 |
deepinfra:deepseek-ai/DeepSeek-V4-Flash | 15 | 15 | 10 | 3 | 8 | 120 |
xiaomi:mimo-v2.5 | 10 | — | 10 | 20 | 5 | 60 |
xiaomi:mimo-v2.5-pro | — | 10 | — | 10 | 3 | 30 |
minimax:MiniMax-M2.7 | 5 | 5 | 5 | 1 | 4 | 60 |
zai:glm-5.1 | 5 | 5 | 5 | 1 | 2 | 50 |
codex:gpt-5.5 | — | — | — | 45 | 4 | 60 |
google:gemini-3-flash-preview | 10 | 15 | 10 | 15 | 8 | 120 |
Concurrency / RPM limits for Cerebras and Google are tunable via env (MESHI_LLM_CEREBRAS_*,
MESHI_LLM_GOOGLE_*). The router observes x-ratelimit-* and Retry-After response headers,
opening the circuit until the dynamic reset window passes.
Typical cost
PCMA-5000 snapshot (2026-04-29): ~$35.30 total recorded cost across 4,816 jobs that ran ≈
0.7¢/person averaged, dominated by search API calls. Cost rounding to integer cents happens at
both the llmCostCents() accumulation point and the actual_cost_cents snapshot boundary
(research_job) — a fractional cents bug surfaced and was fixed in the PCMA-1000 push when
unrounded floats were rejected by the integer column.
Storage hardening
storeResearchSource strips \u0000 null bytes from url, title, and content before insert.
Scraped HTML/PDF noise containing literal nulls previously crashed the JSONB write with
unsupported Unicode escape sequence.
8. Persistence Model
File: packages/workers/src/research/persist.ts
Phase 1 — Per accepted source (idempotent)
For each AcceptedSource, four writes occur:
-
storeResearchSource()→INSERT INTO source_recordsource_type = 'web_fetch',source_origin = 'research_agent',source_namespace = 'web_research'- Idempotent on
(namespace, object_key)where key =webResearchSourceKey(url)
-
storeResearchLink()→INSERT INTO source_record_entity_linklink_basis = 'research_evaluation'link_confidence:highfor ACCEPT,mediumfor REVIEW / metadata-fallback candidatesaccepted_by = 'research_agent'(vs'human_review'for admin-confirmed candidates)- Stores
identity_evidenceJSONB (the full rubric output) - Idempotent on
(source_record_id, entity_id, link_basis, status='active')
-
storeResearchChunks()→INSERT INTO source_chunk- Distilled summaries with topic lists
- Skipped when
distilledChunks.length === 0(e.g. budget cut off before distill, or distill parse failure) - Via
sourceChunkRepo.insertChunkSet
-
storeIdentityAnchor()→INSERT INTO person_external_identity- Status:
'observed'for new discoveries - If an existing
'observed'row exists from a different source → upgrade to'verified' linkedin_usernamediscoveries recurse to register the canonicallinkedin_urlrowH2:identities_discoveredcounter only increments forwasNew || wasUpgradedrows
- Status:
Phase 2 — Job finalization (single transaction)
UPDATE research_job SET queries_executed, sources_fetched, sources_accepted, sources_candidate, sources_rejected, llm_calls, identities_discovered = <persist-layer tally>
-- terminal statusUPDATE research_job SET status = 'completed' | 'budget_exhausted', stop_reason, actual_cost_cents, agent_log, completed_at = NOW()
-- only if any accepted OR candidate sources existINSERT INTO domain_event_outbox ( event_type = 'evidence-enrichment.completed', idempotency_key = 'evidence-enrichment-completed:{jobId}')The terminal status update and the outbox event are in the same transaction (B4 invariant).
Idempotency short-circuit
persistResearch reads research_job.status at entry. If the job is already in a terminal state
(crash recovery after Phase 2 completed in a prior attempt), persist returns the persisted counters
from the row immediately — without this, completeJob / exhaustBudget would throw NoResultError
(their WHERE status = 'running' clause matches nothing) and bubble up as an Inngest retry failure.
When result.acceptedSources.length === 0 && c.sourcesCandidate === 0, the
evidence-enrichment.completed outbox event is not enqueued — there is no downstream work for
trait inference to do. Pipeline reconciliation handles the no-evidence case directly via the Inngest
function’s update-pipeline-stage step.
Database tables involved
| Table | Role |
|---|---|
research_job | Job lifecycle, counters, agent_log JSONB |
source_record | Raw fetched page content (50KB cap, null-byte stripped) |
source_record_entity_link | Person → source link with rubric evidence JSONB |
source_chunk | Distilled summary chunks (research_summary type) |
person_external_identity | Discovered handles/URLs (observed → verified) |
person_pipeline | State machine row (see §10 reconciliation) |
domain_event_outbox | Async event bus for triggering downstream work |
9. Downstream: Trait Inference
File: packages/workers/src/functions/trait-inference.ts
traitInferenceFunction (meshi/trait-inference.run) can be triggered by multiple domain events
with a 5-second debounce per entityId:
EVIDENCE_ENRICHMENT_COMPLETEDENRICHMENT_COMPLETEDPERSON_UPDATED
Retries: 8 (Inngest retries: 8).
What trait inference reads
SELECT * FROM source_chunk WHERE chunk_type = 'research_summary'These are exactly the chunks written by storeResearchChunks. They are combined with LinkedIn
timeline records as the text corpus for LLM inference.
Two-pass inference
- Pass 1 (after LinkedIn import): fast, deterministic, cheap (~$0.03/entity). Infers traits from structured LinkedIn data.
- Pass 2 (after evidence enrichment): slower, uses the full corpus including research chunks (~$0.10–0.30/entity additional). Re-infers all traits with richer evidence.
Users see results from Pass 1 within seconds of import. Pass 2 enriches those results asynchronously.
Claim replacement contract
traitClaimRepo.replaceMachineClaims(db, entityId, newClaims) — full idempotent replacement of all
machine-generated trait_claim rows for the entity. Same value with different evidence_refs →
supersede old, insert fresh. Same value with same evidence → keep (no churn).
10. Pipeline Reconciliation
File: packages/workers/src/functions/evidence-enrichment.ts →
reconcilePipelineAfterEvidenceEnrichment
After every research run, the Inngest function’s update-pipeline-stage step decides what stage the
entity’s person_pipeline row should advance to. The decision tree, in order:
- Stale-job guard — if
pipeline.research_job_iddoesn’t match the completingjobId, skip (a newer job has superseded this one). - Already-advanced guard — if the pipeline is at
inferringorsynthesizing, skip (a delayed duplicate completion must not move rows backward). - LinkedIn-discovered promotion — if a
linkedin_urlidentity anchor exists (observed, verified, or trusted_for_autolink) and the pipeline is in a promotable stage (linkedin_searching | linkedin_review | web_researching | research_review | unresearchable):- If a
linkedin_profilesource already exists → advance toinferringand emit a syntheticENRICHMENT_COMPLETEDevent. - Otherwise → advance to
enrichingwithmatch_method = 'web_research_discovered'andmatch_confidence = 0.9(if anchor verified) or0.75(if anchor observed). EmitPERSON_LINKEDIN_READY. - Migration 066 (
066_allow_web_research_discovered_match_method.ts) adds'web_research_discovered'to the allowedperson_pipeline.match_methodCHECK constraint values:auto_accepted | web_search_accepted | web_research_discovered | admin_confirmed | admin_provided | none.
- If a
- Accepted sources or human-confirmed evidence → advance to
inferring. If the human-confirmed path triggered withresult.sourcesAccepted === 0, emit a syntheticEVIDENCE_ENRICHMENT_COMPLETEDso trait inference runs. - LinkedIn evidence already present (LinkedIn anchor or
linkedin_profilesource) but no accepted web sources → advance toinferringand emitENRICHMENT_COMPLETED. The required research pass ran; a confident LinkedIn source is enough to infer. - Candidate-only outcome → advance to
research_review(unless the entity is alreadyunresearchablewith no active candidates). - No useful sources → advance to
unresearchable, seterror_detail, and invalidate any stale inferred pipeline artifacts viapipelineArtifactRepo.invalidateInferredPipelineArtifacts.
Implementation note: stage transitions are conflict-safe and use optimistic concurrency via CAS
(compare-and-swap) with personPipelineRepo.updateStageCAS(...) (expected stage + expected
row_version) to avoid clobbering concurrent advances.
This reconciliation is the only path that mutates person_pipeline based on research outcomes; the
import-side LinkedIn discovery has its own state machine.
11. Pipeline Sweep (Recovery Cron)
File: packages/workers/src/functions/pipeline-sweep.ts
Cron */10 * * * * (every 10 minutes). Reconciles persistent pipeline state with durable evidence
in the database — distinct from event replay. Recovery scenarios:
-
Stuck intermediate stages (>30 min, env:
MESHI_PIPELINE_SWEEP_STALE_MINUTES): re-emits the appropriate Inngest event for stageslinkedin_searching | enriching | web_researching | inferring | synthesizing. All downstream functions are idempotent (content hashes,replaceMachineClaims,run_keydedup), so re-triggering is safe. -
Missing import pipelines: group members with organizer-import sources but no
person_pipelinerow are re-emitted into LinkedIn discovery. -
Completed web research, stuck stage: terminal
research_jobrows still sitting atweb_researchingare advanced toinferring,research_review,enriching, orunresearchabledirectly from durable counters/evidence (no event replay needed). -
LinkedIn review without web research (>30 min): queues the required web-research pass so an open admin review queue does not block downstream inference.
-
False terminal
unresearchablerows: rows with existing LinkedIn evidence are promoted back into enrichment; rows with active candidate research evidence are restored toresearch_review. -
Stale
linkedin_review(>48 hours): auto-confirms the best candidate whenconfidence ≥ AUTO_CONFIRM_THRESHOLD = 0.85. Do not lower this floor — it matchesclassifyEvaluations’sauto_acceptfloor and anything lower silently confirms LinkedIn matches the LLM explicitly classifiedreview, poisoning downstream enrichment. Rows with no usable candidate are terminally markedunresearchable; rows that have other LinkedIn evidence are promoted back into enrichment.
MAX_PER_SWEEP = 500 (env: MESHI_PIPELINE_SWEEP_MAX_PER_RUN) caps work per cron tick.
12. Rate Limiting and Cost Controls
Inngest-level
Concurrency: 5 simultaneous runs (Inngest plan cap; env override possible up to plan limit)Throttle: 300 starts / minute (env: MESHI_EVIDENCE_ENRICHMENT_THROTTLE_LIMIT/_PERIOD)Local-stack drains against staging Neon override these to 15 / 120-per-minute and were ~8× faster
than cloud Inngest at PCMA-1000 scale (filed in meshi-platform-0vtp for the next 5K push).
Application-level (per user, manual trigger only)
Staleness window: 30 days — cannot re-research same entity within 30 days (production)Monthly cap: 10 runs — manual trigger type, per user per calendar monthBoth are skipped in development (MESHI_ENV !== "production").
Per-run hard budget (compute layer)
maxSearches: 16 (search API calls)maxFetches: 32 (URLs fetched)maxLlmCalls: 48 (total LLM calls across query gen, identity eval, distill)maxAccepted: 10 (accepted sources)results/query: 8domain cap: 2 (accepted/candidate per domain per run)Any limit hit → job marked budget_exhausted. Trait inference triggers in both completed and
budget_exhausted cases.
13. Operational Scripts
| Script | Role |
|---|---|
scripts/backfill-pcma-web-research.ts | Select group members lacking active web_research; create / reuse research_job rows; advance pipeline to web_researching; emit evidence-enrichment.requested events to local Inngest. |
scripts/run-pcma-web-research-jobs.ts | Direct executor — pulls queued/stale jobs and runs runEvidenceEnrichment directly with --concurrency, --search-rps, --queries-per-person, --results-per-query, --search-retries, --heuristic-eval, --include-existing-web-research flags. Bypasses Inngest scheduler; same compute + persist path. |
scripts/repair-pcma-deepinfra-artifacts.ts | Idempotent embedding-artifact repair after a transient DeepInfra outage. Recomputes deterministic embeddings only — does not re-run inference. |
These scripts require an inline DATABASE_URL=... and an --env-file=.env. Do not append staging
DATABASE_URL to .env — deno --watch reloads pick up the polluted env and the local API will
silently serve unauthorized 401s on the next code change. Keep the staging URL on the command line
only.
14. Key Assumptions
-
Open-web fidelity assumption. Reputable public sources (interview articles, podcast transcripts, company bios) are more reliable identity signals than social media noise. The identity rubric weights structured affiliation and name evidence above social-graph proximity.
-
Name uniqueness assumption. The rubric is calibrated for relatively unique names and roles. For common names or low-context individuals, precision will be lower. The 7-field / ≥2-positive-fields threshold is the main guard. The PCMA-5000 snapshot showed a 27% reject rate and 36.5% fetch_failed rate — both expected at population scale, both addressable by the metadata-fallback path (§4) and pipeline-sweep (§11).
-
Content truncation assumption. Truncating pages to 6,000 chars (identity eval) or 10,000 chars (distillation) captures the most signal-dense beginning of most articles. Known failure mode for pages with dense boilerplate headers (YouTube, Reddit).
-
Source count as cost primitive.
maxAccepted = 10is the primary cost-control lever, not token counting. Simpler and predictable. Validated at ~0.7¢/person at PCMA-5000 scale. -
Two-independent-source verification. Discovered identities are upgraded from
'observed'to'verified'when the same(identityType, identityValue)is corroborated by two independent sources. Heuristic — does not require human confirmation. -
Distillation is person-focused. Topic metadata may not capture the broader significance of a source — acceptable for trait inference, not for general knowledge retrieval.
-
Progressive enrichment. Re-running research on the same entity (after the 30-day window) accumulates new sources rather than replacing existing ones. Supersession happens only at the
trait_claimlayer (viareplaceMachineClaims), not atsource_record. -
LLM model selection. All three compute-layer LLM steps (query gen, identity eval, distillation) use tier
basic. The pipeline accepts lower semantic quality from these models in exchange for cost. The rubric’s post-processing rules compensate for model tendency to accept borderline sources. The LinkedIn judge uses tierlinkedin-identity(separate weights). -
Step checkpointing > naive idempotency.
runStepper query-iteration is the primary crash-resume mechanism. Storage-layer idempotency (lookup-or-create) is the belt-and-suspenders guard for partial-write retries; without step checkpointing, the entire query loop would replay on every retry.
15. Known Limitations and Open Work
| Issue | Priority | Notes |
|---|---|---|
| Prompt injection (untrusted web text in prompts) | P0 | Bean meshi-platform-24dn — not resolved |
| LinkedIn HTTP 999 / 403 / fetch_failed dominates rejection mix at 5K scale | P0 | Metadata-fallback path partially mitigates; provider rotation TBD |
| Stale running `research_jobs` at 5K scale (1,586 / >6h in 04-29 snapshot) | P0 | Pipeline-sweep advances them; root cause is plan-cap + LLM exhaust |
| Empty `canonical_brief.source_refs` for new 5K group | P1 | Brief-synthesis traceability gap; not a research-pipeline bug |
| Onboarding submissions miss deterministic Layer 2 extraction | P1 | Goals/offers from onboarding not imported as claims |
| Evidence provenance over-attributed (all source IDs, not chunk-specific) | P1 | Audit traceability unreliable for multi-source entities |
| `replaceMachineClaims` can insert intra-batch duplicate claims | P1 | Dedup checks DB but not the incoming batch |
| Temporal query bias (recency vs. historical coverage) | Eval | Tested in loqp branch; not promoted yet |
| Eval harness not merged to staging | Infra | Intentionally isolated in loqp worktree |
| Research quality signals not surfaced in UI | Future | PipelineSheet shows counters; per-source rubric breakdown TBD |
| Stronger snapshot creation idempotency | Future | Bean meshi-platform-tbv2 |
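The intra-batch duplicate issue above can be illustrated with a batch-level dedup pass applied before insert. The `TraitClaim` shape and key construction here are hypothetical — the real claim schema and `replaceMachineClaims` signature may differ:

```typescript
// Sketch: dedupe an incoming claim batch before insert. The existing
// dedup checks the DB for duplicates but not the batch itself, so two
// identical claims arriving in one batch would both insert.
interface TraitClaim {
  entityId: string;
  dimension: string;
  value: string;
}

function dedupeBatch(claims: TraitClaim[]): TraitClaim[] {
  const seen = new Set<string>();
  return claims.filter((c) => {
    const key = `${c.entityId}|${c.dimension}|${c.value}`;
    if (seen.has(key)) return false; // intra-batch duplicate: drop
    seen.add(key);
    return true;
  });
}

const batch: TraitClaim[] = [
  { entityId: "e1", dimension: "role", value: "engineer" },
  { entityId: "e1", dimension: "role", value: "engineer" }, // duplicate
  { entityId: "e1", dimension: "city", value: "berlin" },
];
console.log(dedupeBatch(batch).length); // prints 2
```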
16. Key Files Reference
| File | Purpose |
|---|---|
| `packages/workers/src/research/compute.ts` | Pure compute entrypoint — zero DB; per-query step checkpointing |
| `packages/workers/src/research/persist.ts` | All DB writes + terminal outbox event; idempotency short-circuit |
| `packages/workers/src/research/identity-eval.ts` | 7-field LLM identity rubric + post-processing |
| `packages/workers/src/research/distill.ts` | Content distillation to chunks |
| `packages/workers/src/research/cost.ts` | Token cost tracking; integer-cent rounding |
| `packages/workers/src/research/storage.ts` | Low-level DB storage helpers; null-byte sanitizer |
| `packages/workers/src/research/web-search.ts` | Search-provider factories + priority order |
| `packages/workers/src/research/web-fetch.ts` | SSRF-safe fetcher (IPv4 + IPv6 prefix blocks, redirect cap) |
| `packages/workers/src/functions/evidence-enrichment.ts` | Inngest function + step orchestration + reconciliation |
| `packages/workers/src/functions/pipeline-sweep.ts` | 10-min recovery cron |
| `packages/workers/src/functions/trait-inference.ts` | Downstream consumer of research output |
| `packages/workers/src/llm/model-router.ts` | Weighted router across providers + circuit breakers |
| `packages/core/src/services/research.service.ts` | API-layer guards and job creation |
| `packages/api/src/routes/research.ts` | HTTP trigger endpoint |
| `packages/api/src/routes/dev-research.ts` | Dry-run route factory |
| `packages/domain/src/research.ts` | All canonical types (ResearchResult, PersonContext, etc.) |
| `packages/domain/src/domain-events.ts` | Event type constants and payloads |
| `packages/db/migrations/066_allow_web_research_discovered_match_method.ts` | Adds `web_research_discovered` to `person_pipeline.match_method` CHECK |
| `scripts/run-pcma-web-research-jobs.ts` | Direct executor for bulk backfill |
| `scripts/backfill-pcma-web-research.ts` | Inngest-event backfill |
| `docs/evidence-enrichment-report.md` | Earlier quantitative eval results (3 runs, 6 subjects) |
| `docs/pcma-5000-nonmember-import-research-overview-2026-04-29.md` | Population-scale snapshot |
17. End-to-End Flow Summary
```
POST /api/v0/me/contacts/:entityId/research  (or auto-trigger from import / sweep)
 │
 ├─ Guards: exists? idempotent? stale? budget?
 ├─ INSERT research_job (status='queued')
 └─ INSERT outbox (EVIDENCE_ENRICHMENT_REQUESTED)
      │
      ▼
outbox relay → Inngest evidenceEnrichmentFunction
               (concurrency=5, throttle=300/min, retries=3, finish=60m)
 │
 ├─ runEvidenceEnrichment
 │   ├─ idempotency guard (terminal jobs return early)
 │   ├─ buildPersonContextFromDb (timeline + identities + traits + seed_data)
 │   ├─ loadPreviouslyFetchedUrls (skip set)
 │   │
 │   └─ computeResearch
 │       ├─ generateQueries (deterministic by default; LLM optional)
 │       ├─ step: query-0 [memoized via runStep]
 │       │   ├─ search → 8 results
 │       │   └─ for each URL:
 │       │       ├─ webFetch (SSRF-safe, 512KB cap)
 │       │       ├─ if fetch_failed/short → tryStoreSearchMetadataCandidate
 │       │       ├─ identity_eval (LLM, 6K chars, 7-field rubric)
 │       │       └─ if ACCEPT (domain cap OK): distill (LLM, 10K chars) → 1–3 chunks
 │       └─ step: query-1 ... query-N (up to BUDGET.maxSearches=16)
 │
 ├─ step: persist [single txn, idempotent]
 │   ├─ source_record (one per accepted/candidate URL, null bytes stripped)
 │   ├─ source_record_entity_link (with identity_evidence JSONB)
 │   ├─ source_chunk (research_summary, up to 3 per source)
 │   ├─ person_external_identity (observed → verified; linkedin_username → linkedin_url)
 │   └─ UPDATE research_job (terminal) + INSERT outbox (COMPLETED) ← only if any sources
 │
 └─ step: update-pipeline-stage
     └─ reconcilePipelineAfterEvidenceEnrichment:
         ├─ LinkedIn discovered? → enriching (match_method='web_research_discovered')
         ├─ accepted/human-confirmed? → inferring
         ├─ LinkedIn evidence already present? → inferring
         ├─ candidate only? → research_review
         └─ nothing useful? → unresearchable + invalidate inferred artifacts
      │
      ▼
outbox relay → Inngest (debounce 5s) traitInferenceFunction
 │
 ├─ Load source_chunk (research_summary) + timeline records
 ├─ extractFieldTraits() → deterministic traits from structured data
 ├─ extractLlmTraits() → LLM inference per dimension
 └─ replaceMachineClaims() → full idempotent replacement of trait_claim rows
```
```
┌───────────────────────────────────────────────────────────────┐
│ pipelineSweepFunction (cron */10 * * * *)                     │
│  - Re-emits stuck stages (>30 min)                            │
│  - Advances completed-but-stuck research jobs                 │
│  - Rescues false-terminal `unresearchable` rows               │
│  - Auto-confirms stale linkedin_review at confidence ≥ 0.85   │
└───────────────────────────────────────────────────────────────┘
```
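The reconciliation branch in the flow above reduces to a first-match decision ladder. A minimal sketch, assuming a simplified boolean outcome shape — this is not the actual `reconcilePipelineAfterEvidenceEnrichment` signature, and the `ResearchOutcome` interface is hypothetical:

```typescript
// Sketch of the post-research stage decision, evaluated top-down:
// the first matching condition wins and later branches are skipped.
interface ResearchOutcome {
  linkedinDiscovered: boolean;
  acceptedOrConfirmed: boolean;
  linkedinEvidencePresent: boolean;
  candidateOnly: boolean;
}

type Stage = "enriching" | "inferring" | "research_review" | "unresearchable";

function nextStage(o: ResearchOutcome): Stage {
  if (o.linkedinDiscovered) return "enriching";      // promote to LinkedIn lane
  if (o.acceptedOrConfirmed) return "inferring";     // good sources → trait inference
  if (o.linkedinEvidencePresent) return "inferring"; // prior LinkedIn evidence suffices
  if (o.candidateOnly) return "research_review";     // metadata-only: needs review
  return "unresearchable";                           // nothing useful found
}

console.log(nextStage({
  linkedinDiscovered: false,
  acceptedOrConfirmed: false,
  linkedinEvidencePresent: false,
  candidateOnly: true,
})); // prints research_review
```

The ordering matters: a discovered LinkedIn URL outranks accepted sources because it routes the entity back into the enrichment lane, which in turn supersedes the research result.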