Deep Research Pipeline — Architecture & Design Reference
Deep Research Pipeline — Architecture & Design Reference
Last updated: 2026-04-30 Status: Live on staging. Validated at PCMA-1000 (1,009 members) and PCMA-5000 (4,998 members) scale. Default execution path for the CSV-import pipeline’s web-research stage.
1. Purpose and Context
The deep research pipeline (also called evidence enrichment) is Meshi’s autonomous web-discovery layer. Its job is to locate publicly available web sources about a person already in the system, verify each source belongs to the correct individual, and distill the content into structured chunks that enrich that person’s profile.
It is not the primary import path. Organizer CSV imports and LinkedIn data are imported deterministically up front. The research pipeline is the second pass — an asynchronous background job that runs after a baseline person row exists, with three operational entrypoints:
- Per-entity manual / API trigger (single contact research).
- Auto-trigger on LinkedIn miss — entities that fail LinkedIn discovery flow into
web_researchingautomatically as part of the CSV import pipeline. - Bulk operational backfill — direct executor scripts (see §13) for batch runs against thousands of entities, bypassing the Inngest scheduler while reusing the same compute + persist path.
Why it exists
- LinkedIn profiles alone are sparse and self-reported. The research pipeline surfaces corroborating, contradicting, and complementary evidence from third-party sources — interviews, podcasts, news articles, company websites, association directories.
- It is the only path that can rescue an entity that the LinkedIn discovery stage failed to match. Discovered LinkedIn URLs are promoted back into the LinkedIn enrichment lane (see §10).
- It enables discovery of external identity handles (Twitter/X, GitHub, personal website, ORCID) that the user has not explicitly provided.
- It turns unstructured web content into structured
source_chunkrows, which feed the LLM trait inference pass.
Where it fits in the import pipeline
Organizer CSV import (or single-entity create) │ ▼ linkedin_searching ──── match found ──► enriching ──► inferring ──► synthesizing ──► complete │ ▲ │ no match │ web research surfaced a LinkedIn URL ▼ │ (migration 066: match_method='web_research_discovered') web_researching ◄──────────────────────────┤ │ ├─ accepted sources ──► inferring (Pass 2) ├─ candidate only ────► research_review ├─ LinkedIn evidence ─► enriching | inferring (promoted) └─ no useful sources ─► unresearchablePipeline state transitions live in reconcilePipelineAfterEvidenceEnrichment (see §10).
2. Trigger Mechanisms
2.1 HTTP API (production manual trigger)
POST /api/v0/me/contacts/:entityId/researchHandler: triggerResearch() in packages/core/src/services/research.service.ts.
Guards applied before job creation:
| Guard | Rule |
|---|---|
| Entity existence | Entity must exist and belong to the requesting user |
| Idempotency | Returns existing queued or running job ID if one exists |
| Staleness window | Blocks re-research within 30 days of last completed job (production only) |
| Monthly budget cap | Max 10 manual runs per user per calendar month |
On success: inserts a research_job row (status = 'queued') and writes an
EVIDENCE_ENRICHMENT_REQUESTED event to the domain_event_outbox table. Returns
{ jobId, entityId, status: "queued" } (HTTP 202).
Additional production endpoints in packages/api/src/routes/research.ts:
- Research status:
GET /api/v0/me/contacts/:entityId/research- Research detail (sources/identities/evaluations):
GET /api/v0/me/contacts/:entityId/research/detail- Source rejection (operator/user rejects a specific source):
DELETE /api/v0/me/contacts/:entityId/research/sources/:sourceId2.2 CSV-import auto-trigger
Entities that complete LinkedIn discovery without a confident match (or whose import-time
match-method requires web verification) are advanced to web_researching by the import-pipeline
state machine. The pipeline-sweep cron (see §11) re-emits EVIDENCE_ENRICHMENT_REQUESTED for any
row that has been stuck in web_researching past the stale threshold.
2.3 Bulk operational scripts
For backfill or recovery on large groups (PCMA-1000, PCMA-5000):
scripts/backfill-pcma-web-research.ts # selects + queues via Inngest eventsscripts/run-pcma-web-research-jobs.ts # direct executor — bypasses Inngest schedulerscripts/repair-pcma-deepinfra-artifacts.ts # idempotent embedding artifact repair after provider outagesrun-pcma-web-research-jobs.ts is the workhorse: it pulls batches of queued / stale
research_job rows and runs runEvidenceEnrichment directly with controlled concurrency, search
RPS, queries-per-person, and an optional --heuristic-eval mode. Each job still writes the same
research_job, source_record, source_chunk, and outbox rows as the Inngest path; only the
scheduler is bypassed.
2.4 Domain event → Inngest
The outbox relay picks up evidence-enrichment.requested and dispatches it to Inngest:
DomainEventType.EVIDENCE_ENRICHMENT_REQUESTED = "evidence-enrichment.requested"2.5 Developer dry-run API
POST /api/v0/dev/research/dry-run/:entityIdGET /api/v0/dev/research/dry-run/:dryRunJobIdGated by MESHI_ENABLE_DEV_ENDPOINTS=1. Calls computeResearch directly as a detached promise. No
DB writes, no research_job rows, no outbox events. Results stored in-memory with a 1-hour TTL.
Used for prompt iteration and eval harness work.
Auth nuance (matches packages/api/src/routes/dev-research.ts):
- POST performs the ownership/connection check for the target
entityId. - GET (polling) requires only a valid session; it does not re-check entity ownership and
polls by the returned
dryRunJobId.
3. Inngest Function Structure
File: packages/workers/src/functions/evidence-enrichment.ts
Function ID: "meshi/evidence-enrichment.run"Trigger: EVIDENCE_ENRICHMENT_REQUESTEDConcurrency: 5 simultaneous runs (env: MESHI_EVIDENCE_ENRICHMENT_CONCURRENCY, Inngest plan cap = 5)Throttle: 300 starts / minute (env: MESHI_EVIDENCE_ENRICHMENT_THROTTLE_LIMIT/_PERIOD)Retries: 3 attemptsTimeouts: start: 24h finish: 60mThe 5/300 defaults reflect Inngest free-plan limits hit during the PCMA-1000 finalization. Local
stack runs can override (MESHI_EVIDENCE_ENRICHMENT_CONCURRENCY=15, throttle 120/min) for batch
drains against staging Neon — committed code stays at the plan cap.
Implementation note: after compute + persist, evidence enrichment also writes readiness facts via
writeEvidenceEnrichmentReadiness(...) (e.g. external_evidence, activity_log_emitted) as
best-effort observability / downstream-routing witnesses.
Step layout
runEvidenceEnrichment (idempotency guard + startJob) │ ├─ buildPersonContextFromDb (person + timeline + identities + traits + seed_data CSV cols) ├─ loadPreviouslyFetchedUrls (Set<source_object_key> for the entity's web_research sources) │ ├─ computeResearch │ ├─ generateQueries (deterministic by default; LLM if MESHI_RESEARCH_QUERY_GENERATOR=llm) │ ├─ step: query-0 (search → fetch → eval → distill, durably memoized) │ ├─ step: query-1 │ │ ... │ └─ step: query-N (up to BUDGET.maxSearches) │ ├─ step: persist (all DB writes + terminal outbox event in one txn) │ └─ step: update-pipeline-stage (reconcilePipelineAfterEvidenceEnrichment)Each step: query-N is wrapped in Inngest step.run(...). runStep is plumbed through
ComputeResearchDeps (see packages/workers/src/research/compute.ts:139) so dry-run, tests, and
the bulk scripts can opt out by passing no runStep — all iterations then run inline. On Inngest
retry, completed query steps return their cached QueryIterationResult and
applyQueryIterationResult rebuilds in-memory ComputeState deterministically.
persistResearch is also wrapped in its own step.run("persist", ...) so a crash after queries
complete but before DB writes finish completes cleanly on retry without re-running any LLM call.
Progress reporting
After each evaluated URL, an onProgress callback fires and writes cumulative counter columns to
research_job using absolute SET semantics. Both the live-progress path (onProgress in the
Inngest handler) and the final persist path (persistResearch) converge idempotently via SET —
counters are always a consistent snapshot, never over-counted. The agent_log JSONB is also flushed
on every callback so the admin UI’s PipelineSheet can poll for live evaluation traces.
4. Core Compute: computeResearch
File: packages/workers/src/research/compute.ts
This function is pure compute with zero DB access. All side effects are injected as dependencies
(searchProvider, webFetch, generateObject, generateText, onProgress, runStep). This
makes it directly testable and reusable by the dry-run API, eval harness, and bulk scripts without
needing a real database.
Hard budget limits (per run)
| Limit | Value | Notes |
|---|---|---|
| Max search API calls | 16 | BUDGET.maxSearches (was 10 prior to 04-27) |
| Max URLs fetched | 32 | BUDGET.maxFetches |
| Max LLM calls (total) | 48 | BUDGET.maxLlmCalls |
| Max accepted sources | 10 | BUDGET.maxAccepted |
| Results per query | 8 | DEFAULT_RESULTS_PER_QUERY |
| Domain diversity cap | 2 | accepted/candidate sources per domain per run |
Any limit hit → stop_reason: "budget_exhausted".
The earlier “5 consecutive rejects → converged” stop reason has been removed. The only terminal stop reasons emitted by compute are
budget_exhaustedandcompleted. TheconsecutiveRejectscounter is still tracked onComputeState(and reported inQueryIterationResult.finalConsecutiveRejects) but no longer drives termination — eval failures deliberately do not advance it (seecompute.ts:812comment, finding A12). The convergence heuristic was dropped because it stranded too many entities at low budgets when judges legitimately rejected a noisy first page of search results.
Per-query iteration flow
1. generateQueries(personContext) └─ deterministic from name/role/company/email-domain/seed CSV context OR LLM when MESHI_RESEARCH_QUERY_GENERATOR=llm └─ returns up to 16 queries
2. For each query (wrapped in step.run("query-N", ...)): a. searchProvider.search(query, 8 results)
b. For each URL (within maxFetches budget): i. webFetch(url) └─ SSRF validation (string + DNS, IPv6 prefix blocks, IPv4-mapped guards) └─ Manual redirect follow (max 5 hops) └─ 512KB raw body cap, 15s timeout └─ extractText() → 10KB text cap
ii. if fetch failed OR text < 100 chars: └─ tryStoreSearchMetadataCandidate(): if title+snippet ≥ 80 chars, evaluate the metadata as a candidate (linkConfidence: "medium") so admin review can still surface useful URLs we couldn't crawl (e.g. LinkedIn 999, blocked HTTP 403). New since the 04-27 PCMA push — directly addresses the ~37% fetch_failed rate observed at scale.
iii. evaluateIdentity(person, url, content[0..6000]) └─ LLM structured output with 7-field evidence rubric └─ Returns ACCEPT | REVIEW | REJECT + discovered identifiers
iv. if ACCEPT (and domain cap not exceeded): └─ distillSource(person, url, content[0..10000]) └─ LLM → 1–3 summary chunks with topic lists
3. Return QueryIterationResult (JSON-serializable, memoized by Inngest)Mutable accumulator + deterministic replay
ComputeState is the running total across query iterations.
applyQueryIterationResult(state, delta) merges each QueryIterationResult into it. On Inngest
replay, completed step results are re-applied deterministically to reconstruct final state without
re-running any computation. consecutiveRejects is set absolutely (not as a delta) since a
successful accept resets it to 0.
5. Identity Evaluation
File: packages/workers/src/research/identity-eval.ts
The core quality gate. Content is truncated to 6,000 characters before the LLM call.
7-field evidence rubric
| Field | Values | Notes |
|---|---|---|
nameEvidence | { found: bool, quote: str } | counts toward positive evidence |
affiliationEvidence | { found: bool, quote: str } | counts toward positive evidence |
roleEvidence | { found: bool, quote: str } | counts toward positive evidence |
crossReference | { found: bool, quote: str } | counts toward positive evidence |
temporalConsistency | { found: bool, evidence: str } | counts toward positive evidence |
uniqueContext | { found: bool, quote: str } | counts toward positive evidence |
contradictoryEvidence | { found: bool, detail: str } | negative signal — does NOT count |
discoveredIdentifiers | { identityType, identityValue }[] | linkedin_url, github_username, twitter_handle, mastodon_handle, personal_website, nickname, orcid, email |
The schema (evidenceField Zod union) is tolerant: nulls, plain strings, and bare booleans are
coerced to { found, quote/evidence/detail }. This survives noisy or schema-drifting LLM output
across the model router fallback chain.
Post-processing rules
ACCEPTwith fewer than 2 positive evidence fields → downgrade to REJECTACCEPTwithcontradictoryEvidence.found = true→ downgrade to REVIEWREVIEWis stored ascandidate(linkConfidencemedium)ACCEPTis stored asaccepted(linkConfidencehigh)
Discovered LinkedIn URLs
shouldStoreDiscoveredIdentity validates linkedin_url discoveries through normalizeLinkedinUrl
before persistence. Bare usernames are normalized into full https://www.linkedin.com/in/<username>
URLs in storeIdentityAnchor. The linkedin_username identity type recurses through
storeIdentityAnchor to register the canonical linkedin_url row while preserving the
linkedin_username row as the provenance trail back to the originating source_record (see also
.beans/...-linkedin-pipeline-correctness*).
Design doctrine: identity integrity over recall
A false positive (wrong person accepted) is much more damaging than a missed valid source. The rubric is calibrated to prefer misses over false matches. Empirically this lands at ~97% precision / ~75% recall on the eval-harness fixtures.
6. Content Distillation
File: packages/workers/src/research/distill.ts
Runs only on ACCEPT-classified sources. Content truncated to 10,000 characters.
Output schema
{ chunks: Array<{ summary: string; // 2–4 sentence person-focused summary topics: string[]; // e.g. ["career", "education", "ventures"] }>; // min: 1, max: 3 chunks per source}Settings: temperature: 0.1, maxTokens: 4000. The 4000-token budget is required for reasoning
models (Modal GLM-FP8, Cerebras zai-glm) that emit a long <thinking> block before output —
cleanThinkingTags strips it before parsing. A lower budget silently produced empty chunks.
These chunks become source_chunk rows in the database (chunk_type: "research_summary",
chunker_version: "research_distill_v1"). They are the primary text corpus that trait inference
reads.
7. External APIs and Services
Web search providers
createSearchProvider() selects in this priority order, returning the first configured backend:
| Provider | Selection | Notes |
|---|---|---|
| Mock | MOCK_SEARCH=true | Deterministic results for stress tests without API credits. |
| RapidAPI Google (primary) | RAPIDAPI_SERP_KEY present | Google search via google-search116. ~$0.0003/query. ~91% recall for people discovery. |
| Jina (fallback) | JINA_API_KEY present | GET-based, free. Same recall as Google but ~4× slower; 422 no-results edge case. |
| Exa (last resort) | EXA_API_KEY present | Neural search, autoprompt. Fast but ~9% recall for entity lookup — wrong tool. |
| DuckDuckGo | n/a | Available via createAllProviders but excluded from createSearchProvider priority — bot detection blocks all automated requests in practice. |
createAllProviders() exposes every configured backend as a labeled map for benchmark / eval use
(e.g. tests/integration/eval/...). detectSearchProviderLabel() mirrors the priority order and is
used by the cost tracker to bill queries under the correct provider name.
LLM providers — model router
LLM calls flow through the weighted model router in packages/workers/src/llm/model-router.ts.
The router selects from a tier-aware candidate pool with per-provider weights, concurrency caps,
request-rate limits, and circuit breakers. Failure kinds (rate_limit, retryable, capability,
non_retryable) drive the circuit and disable incompatible operations (e.g. schemaObject for
Cerebras, which returns 400s on generateObject — the router falls back to text_json
automatically).
Tiers used by the research pipeline:
basic— query generation, identity evaluation, distillation. Default tier incomputeResearch.linkedin-identity— used by the upstream LinkedIn candidate-eval judge (separate path:packages/workers/src/linkedin/candidate-eval.ts). Configured via theLINKEDIN_IDENTITY_MODELSarray (formerlyJUDGE_MODELS).
Current routing weights (MODEL_ROUTING_CONFIGS):
| Label | basic | advanced | eloquent | linkedin-identity | maxConcurrency | RPM |
|---|---|---|---|---|---|---|
codex:gpt-5.4-mini | 35 | 45 | 50 | 5 | 8 | 120 |
cerebras:zai-glm-4.7 | 30 | 20 | 20 | 0 | 10 | 50 |
deepinfra:deepseek-ai/DeepSeek-V4-Flash | 15 | 15 | 10 | 3 | 8 | 120 |
xiaomi:mimo-v2.5 | 10 | — | 10 | 20 | 5 | 60 |
xiaomi:mimo-v2.5-pro | — | 10 | — | 10 | 3 | 30 |
minimax:MiniMax-M2.7 | 5 | 5 | 5 | 1 | 4 | 60 |
zai:glm-5.1 | 5 | 5 | 5 | 1 | 2 | 50 |
codex:gpt-5.5 | — | — | — | 45 | 4 | 60 |
google:gemini-3-flash-preview | 10 | 15 | 10 | 15 | 8 | 120 |
Concurrency / RPM limits for Cerebras and Google are tunable via env (MESHI_LLM_CEREBRAS_*,
MESHI_LLM_GOOGLE_*). The router observes x-ratelimit-* and Retry-After response headers,
opening the circuit until the dynamic reset window passes.
Typical cost
PCMA-5000 snapshot (2026-04-29): ~$35.30 total recorded cost across 4,816 jobs that ran ≈
0.7¢/person averaged, dominated by search API calls. Cost rounding to integer cents happens at
both the llmCostCents() accumulation point and the actual_cost_cents snapshot boundary
(research_job) — a fractional cents bug surfaced and was fixed in the PCMA-1000 push when
unrounded floats were rejected by the integer column.
Storage hardening
storeResearchSource strips \u0000 null bytes from url, title, and content before insert.
Scraped HTML/PDF noise containing literal nulls previously crashed the JSONB write with
unsupported Unicode escape sequence.
8. Persistence Model
File: packages/workers/src/research/persist.ts
Phase 1 — Per accepted source (idempotent)
For each AcceptedSource, four writes occur:
-
storeResearchSource()→INSERT INTO source_recordsource_type = 'web_fetch',source_origin = 'research_agent',source_namespace = 'web_research'- Idempotent on
(namespace, object_key)where key =webResearchSourceKey(url)
-
storeResearchLink()→INSERT INTO source_record_entity_linklink_basis = 'research_evaluation'link_confidence:highfor ACCEPT,mediumfor REVIEW / metadata-fallback candidatesaccepted_by = 'research_agent'(vs'human_review'for admin-confirmed candidates)- Stores
identity_evidenceJSONB (the full rubric output) - Idempotent on
(source_record_id, entity_id, link_basis, status='active')
-
storeResearchChunks()→INSERT INTO source_chunk- Distilled summaries with topic lists
- Skipped when
distilledChunks.length === 0(e.g. budget cut off before distill, or distill parse failure) - Via
sourceChunkRepo.insertChunkSet
-
storeIdentityAnchor()→INSERT INTO person_external_identity- Status:
'observed'for new discoveries - If an existing
'observed'row exists from a different source → upgrade to'verified' linkedin_usernamediscoveries recurse to register the canonicallinkedin_urlrowH2:identities_discoveredcounter only increments forwasNew || wasUpgradedrows
- Status:
Phase 2 — Job finalization (single transaction)
UPDATE research_job SET queries_executed, sources_fetched, sources_accepted, sources_candidate, sources_rejected, llm_calls, identities_discovered = <persist-layer tally>
-- terminal statusUPDATE research_job SET status = 'completed' | 'budget_exhausted', stop_reason, actual_cost_cents, agent_log, completed_at = NOW()
-- only if any accepted OR candidate sources existINSERT INTO domain_event_outbox ( event_type = 'evidence-enrichment.completed', idempotency_key = 'evidence-enrichment-completed:{jobId}')The terminal status update and the outbox event are in the same transaction (B4 invariant).
Idempotency short-circuit
persistResearch reads research_job.status at entry. If the job is already in a terminal state
(crash recovery after Phase 2 completed in a prior attempt), persist returns the persisted counters
from the row immediately — without this, completeJob / exhaustBudget would throw NoResultError
(their WHERE status = 'running' clause matches nothing) and bubble up as an Inngest retry failure.
When result.acceptedSources.length === 0 && c.sourcesCandidate === 0, the
evidence-enrichment.completed outbox event is not enqueued — there is no downstream work for
trait inference to do. Pipeline reconciliation handles the no-evidence case directly via the Inngest
function’s update-pipeline-stage step.
Database tables involved
| Table | Role |
|---|---|
research_job | Job lifecycle, counters, agent_log JSONB |
source_record | Raw fetched page content (50KB cap, null-byte stripped) |
source_record_entity_link | Person → source link with rubric evidence JSONB |
source_chunk | Distilled summary chunks (research_summary type) |
person_external_identity | Discovered handles/URLs (observed → verified) |
person_pipeline | State machine row (see §10 reconciliation) |
domain_event_outbox | Async event bus for triggering downstream work |
9. Downstream: Trait Inference
File: packages/workers/src/functions/trait-inference.ts
traitInferenceFunction (meshi/trait-inference.run) can be triggered by multiple domain events
with a 5-second debounce per entityId:
EVIDENCE_ENRICHMENT_COMPLETEDENRICHMENT_COMPLETEDPERSON_UPDATED
Retries: 8 (Inngest retries: 8).
What trait inference reads
SELECT * FROM source_chunk WHERE chunk_type = 'research_summary'These are exactly the chunks written by storeResearchChunks. They are combined with LinkedIn
timeline records as the text corpus for LLM inference.
Two-pass inference
- Pass 1 (after LinkedIn import): fast, deterministic, cheap (~$0.03/entity). Infers traits from structured LinkedIn data.
- Pass 2 (after evidence enrichment): slower, uses the full corpus including research chunks (~$0.10–0.30/entity additional). Re-infers all traits with richer evidence.
Users see results from Pass 1 within seconds of import. Pass 2 enriches those results asynchronously.
Claim replacement contract
traitClaimRepo.replaceMachineClaims(db, entityId, newClaims) — full idempotent replacement of all
machine-generated trait_claim rows for the entity. Same value with different evidence_refs →
supersede old, insert fresh. Same value with same evidence → keep (no churn).
10. Pipeline Reconciliation
File: packages/workers/src/functions/evidence-enrichment.ts →
reconcilePipelineAfterEvidenceEnrichment
After every research run, the Inngest function’s update-pipeline-stage step decides what stage the
entity’s person_pipeline row should advance to. The decision tree, in order:
- Stale-job guard — if
pipeline.research_job_iddoesn’t match the completingjobId, skip (a newer job has superseded this one). - Already-advanced guard — if the pipeline is at
inferringorsynthesizing, skip (a delayed duplicate completion must not move rows backward). - LinkedIn-discovered promotion — if a
linkedin_urlidentity anchor exists (observed, verified, or trusted_for_autolink) and the pipeline is in a promotable stage (linkedin_searching | linkedin_review | web_researching | research_review | unresearchable):- If a
linkedin_profilesource already exists → advance toinferringand emit a syntheticENRICHMENT_COMPLETEDevent. - Otherwise → advance to
enrichingwithmatch_method = 'web_research_discovered'andmatch_confidence = 0.9(if anchor verified) or0.75(if anchor observed). EmitPERSON_LINKEDIN_READY. - Migration 066 (
066_allow_web_research_discovered_match_method.ts) adds'web_research_discovered'to the allowedperson_pipeline.match_methodCHECK constraint values:auto_accepted | web_search_accepted | web_research_discovered | admin_confirmed | admin_provided | none.
- If a
- Accepted sources or human-confirmed evidence → advance to
inferring. If the human-confirmed path triggered withresult.sourcesAccepted === 0, emit a syntheticEVIDENCE_ENRICHMENT_COMPLETEDso trait inference runs. - LinkedIn evidence already present (LinkedIn anchor or
linkedin_profilesource) but no accepted web sources → advance toinferringand emitENRICHMENT_COMPLETED. The required research pass ran; a confident LinkedIn source is enough to infer. - Candidate-only outcome → advance to
research_review(unless the entity is alreadyunresearchablewith no active candidates). - No useful sources → advance to
unresearchable, seterror_detail, and invalidate any stale inferred pipeline artifacts viapipelineArtifactRepo.invalidateInferredPipelineArtifacts.
Implementation note: stage transitions are conflict-safe and use optimistic concurrency via CAS
(compare-and-swap) with personPipelineRepo.updateStageCAS(...) (expected stage + expected
row_version) to avoid clobbering concurrent advances.
This reconciliation is the only path that mutates person_pipeline based on research outcomes; the
import-side LinkedIn discovery has its own state machine.
11. Pipeline Sweep (Recovery Cron)
File: packages/workers/src/functions/pipeline-sweep.ts
Cron */10 * * * * (every 10 minutes). Reconciles persistent pipeline state with durable evidence
in the database — distinct from event replay. Recovery scenarios:
-
Stuck intermediate stages (>30 min, env:
MESHI_PIPELINE_SWEEP_STALE_MINUTES): re-emits the appropriate Inngest event for stageslinkedin_searching | enriching | web_researching | inferring | synthesizing. All downstream functions are idempotent (content hashes,replaceMachineClaims,run_keydedup), so re-triggering is safe. -
Missing import pipelines: group members with organizer-import sources but no
person_pipelinerow are re-emitted into LinkedIn discovery. -
Completed web research, stuck stage: terminal
research_jobrows still sitting atweb_researchingare advanced toinferring,research_review,enriching, orunresearchabledirectly from durable counters/evidence (no event replay needed). -
LinkedIn review without web research (>30 min): queues the required web-research pass so an open admin review queue does not block downstream inference.
-
False terminal
unresearchablerows: rows with existing LinkedIn evidence are promoted back into enrichment; rows with active candidate research evidence are restored toresearch_review. -
Stale
linkedin_review(>48 hours): auto-confirms the best candidate whenconfidence ≥ AUTO_CONFIRM_THRESHOLD = 0.85. Do not lower this floor — it matchesclassifyEvaluations’sauto_acceptfloor and anything lower silently confirms LinkedIn matches the LLM explicitly classifiedreview, poisoning downstream enrichment. Rows with no usable candidate are terminally markedunresearchable; rows that have other LinkedIn evidence are promoted back into enrichment.
MAX_PER_SWEEP = 500 (env: MESHI_PIPELINE_SWEEP_MAX_PER_RUN) caps work per cron tick.
12. Rate Limiting and Cost Controls
Inngest-level
Concurrency: 5 simultaneous runs (Inngest plan cap; env override possible up to plan limit)Throttle: 300 starts / minute (env: MESHI_EVIDENCE_ENRICHMENT_THROTTLE_LIMIT/_PERIOD)Local-stack drains against staging Neon override these to 15 / 120-per-minute and were ~8× faster
than cloud Inngest at PCMA-1000 scale (filed in meshi-platform-0vtp for the next 5K push).
Application-level (per user, manual trigger only)
Staleness window: 30 days — cannot re-research same entity within 30 days (production)Monthly cap: 10 runs — manual trigger type, per user per calendar monthBoth are skipped in development (MESHI_ENV !== "production").
Per-run hard budget (compute layer)
maxSearches: 16 (search API calls)maxFetches: 32 (URLs fetched)maxLlmCalls: 48 (total LLM calls across query gen, identity eval, distill)maxAccepted: 10 (accepted sources)results/query: 8domain cap: 2 (accepted/candidate per domain per run)Any limit hit → job marked budget_exhausted. Trait inference triggers in both completed and
budget_exhausted cases.
13. Operational Scripts
| Script | Role |
|---|---|
scripts/backfill-pcma-web-research.ts | Select group members lacking active web_research; create / reuse research_job rows; advance pipeline to web_researching; emit evidence-enrichment.requested events to local Inngest. |
scripts/run-pcma-web-research-jobs.ts | Direct executor — pulls queued/stale jobs and runs runEvidenceEnrichment directly with --concurrency, --search-rps, --queries-per-person, --results-per-query, --search-retries, --heuristic-eval, --include-existing-web-research flags. Bypasses Inngest scheduler; same compute + persist path. |
scripts/repair-pcma-deepinfra-artifacts.ts | Idempotent embedding-artifact repair after a transient DeepInfra outage. Recomputes deterministic embeddings only — does not re-run inference. |
These scripts require an inline DATABASE_URL=... and an --env-file=.env. Do not append staging
DATABASE_URL to .env — deno --watch reloads pick up the polluted env and the local API will
silently serve unauthorized 401s on the next code change. Keep the staging URL on the command line
only.
14. Key Assumptions
-
Open-web fidelity assumption. Reputable public sources (interview articles, podcast transcripts, company bios) are more reliable identity signals than social media noise. The identity rubric weights structured affiliation and name evidence above social-graph proximity.
-
Name uniqueness assumption. The rubric is calibrated for relatively unique names and roles. For common names or low-context individuals, precision will be lower. The 7-field / ≥2-positive-fields threshold is the main guard. The PCMA-5000 snapshot showed a 27% reject rate and 36.5% fetch_failed rate — both expected at population scale, both addressable by the metadata-fallback path (§4) and pipeline-sweep (§11).
-
Content truncation assumption. Truncating pages to 6,000 chars (identity eval) or 10,000 chars (distillation) captures the most signal-dense beginning of most articles. Known failure mode for pages with dense boilerplate headers (YouTube, Reddit).
-
Source count as cost primitive.
maxAccepted = 10is the primary cost-control lever, not token counting. Simpler and predictable. Validated at ~0.7¢/person at PCMA-5000 scale. -
Two-independent-source verification. Discovered identities are upgraded from
'observed'to'verified'when the same(identityType, identityValue)is corroborated by two independent sources. Heuristic — does not require human confirmation. -
Distillation is person-focused. Topic metadata may not capture the broader significance of a source — acceptable for trait inference, not for general knowledge retrieval.
-
Progressive enrichment. Re-running research on the same entity (after the 30-day window) accumulates new sources rather than replacing existing ones. Supersession happens only at the
trait_claimlayer (viareplaceMachineClaims), not atsource_record. -
LLM model selection. All three compute-layer LLM steps (query gen, identity eval, distillation) use tier
basic. The pipeline accepts lower semantic quality from these models in exchange for cost. The rubric’s post-processing rules compensate for model tendency to accept borderline sources. The LinkedIn judge uses tierlinkedin-identity(separate weights). -
Step checkpointing > naive idempotency.
runStepper query-iteration is the primary crash-resume mechanism. Storage-layer idempotency (lookup-or-create) is the belt-and-suspenders guard for partial-write retries; without step checkpointing, the entire query loop would replay on every retry.
15. Known Limitations and Open Work
| Issue | Priority | Notes |
|---|---|---|
| Prompt injection (untrusted web text in prompts) | P0 | Bean meshi-platform-24dn — not resolved |
| LinkedIn HTTP 999 / 403 / fetch_failed dominates rejection mix at 5K scale | P0 | Metadata-fallback path partially mitigates; provider rotation TBD |
| Stale running `research_jobs` at 5K scale (1,586 / >6h in 04-29 snapshot) | P0 | Pipeline-sweep advances them; root cause is plan-cap + LLM exhaust |
| Empty `canonical_brief.source_refs` for new 5K group | P1 | Brief-synthesis traceability gap; not a research-pipeline bug |
| Onboarding submissions miss deterministic Layer 2 extraction | P1 | Goals/offers from onboarding not imported as claims |
| Evidence provenance over-attributed (all source IDs, not chunk-specific) | P1 | Audit traceability unreliable for multi-source entities |
| `replaceMachineClaims` can insert intra-batch duplicate claims | P1 | Dedup checks DB but not the incoming batch |
| Temporal query bias (recency vs. historical coverage) | Eval | Tested in loqp branch; not promoted yet |
| Eval harness not merged to staging | Infra | Intentionally isolated in loqp worktree |
| Research quality signals not surfaced in UI | Future | PipelineSheet shows counters; per-source rubric breakdown TBD |
| Stronger snapshot creation idempotency | Future | Bean meshi-platform-tbv2 |
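The intra-batch duplicate issue above can be illustrated with a batch-level dedup pass applied before insert. The `TraitClaim` shape and key construction here are hypothetical — the real claim schema and `replaceMachineClaims` signature may differ:

```typescript
// Sketch: dedupe an incoming claim batch before insert. The existing
// dedup checks the DB for duplicates but not the batch itself, so two
// identical claims arriving in one batch would both insert.
interface TraitClaim {
  entityId: string;
  dimension: string;
  value: string;
}

function dedupeBatch(claims: TraitClaim[]): TraitClaim[] {
  const seen = new Set<string>();
  return claims.filter((c) => {
    const key = `${c.entityId}|${c.dimension}|${c.value}`;
    if (seen.has(key)) return false; // intra-batch duplicate: drop
    seen.add(key);
    return true;
  });
}

const batch: TraitClaim[] = [
  { entityId: "e1", dimension: "role", value: "engineer" },
  { entityId: "e1", dimension: "role", value: "engineer" }, // duplicate
  { entityId: "e1", dimension: "city", value: "berlin" },
];
console.log(dedupeBatch(batch).length); // prints 2
```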
16. Key Files Reference
| File | Purpose |
|---|---|
| `packages/workers/src/research/compute.ts` | Pure compute entrypoint — zero DB; per-query step checkpointing |
| `packages/workers/src/research/persist.ts` | All DB writes + terminal outbox event; idempotency short-circuit |
| `packages/workers/src/research/identity-eval.ts` | 7-field LLM identity rubric + post-processing |
| `packages/workers/src/research/distill.ts` | Content distillation to chunks |
| `packages/workers/src/research/cost.ts` | Token cost tracking; integer-cent rounding |
| `packages/workers/src/research/storage.ts` | Low-level DB storage helpers; null-byte sanitizer |
| `packages/workers/src/research/web-search.ts` | Search-provider factories + priority order |
| `packages/workers/src/research/web-fetch.ts` | SSRF-safe fetcher (IPv4 + IPv6 prefix blocks, redirect cap) |
| `packages/workers/src/functions/evidence-enrichment.ts` | Inngest function + step orchestration + reconciliation |
| `packages/workers/src/functions/pipeline-sweep.ts` | 10-min recovery cron |
| `packages/workers/src/functions/trait-inference.ts` | Downstream consumer of research output |
| `packages/workers/src/llm/model-router.ts` | Weighted router across providers + circuit breakers |
| `packages/core/src/services/research.service.ts` | API-layer guards and job creation |
| `packages/api/src/routes/research.ts` | HTTP trigger endpoint |
| `packages/api/src/routes/dev-research.ts` | Dry-run route factory |
| `packages/domain/src/research.ts` | All canonical types (ResearchResult, PersonContext, etc.) |
| `packages/domain/src/domain-events.ts` | Event type constants and payloads |
| `packages/db/migrations/066_allow_web_research_discovered_match_method.ts` | Adds `web_research_discovered` to `person_pipeline.match_method` CHECK |
| `scripts/run-pcma-web-research-jobs.ts` | Direct executor for bulk backfill |
| `scripts/backfill-pcma-web-research.ts` | Inngest-event backfill |
| `docs/evidence-enrichment-report.md` | Earlier quantitative eval results (3 runs, 6 subjects) |
| `docs/pcma-5000-nonmember-import-research-overview-2026-04-29.md` | Population-scale snapshot |
17. End-to-End Flow Summary
```
POST /api/v0/me/contacts/:entityId/research  (or auto-trigger from import / sweep)
 │
 ├─ Guards: exists? idempotent? stale? budget?
 ├─ INSERT research_job (status='queued')
 └─ INSERT outbox (EVIDENCE_ENRICHMENT_REQUESTED)
      │
      ▼
outbox relay → Inngest evidenceEnrichmentFunction
               (concurrency=5, throttle=300/min, retries=3, finish=60m)
 │
 ├─ runEvidenceEnrichment
 │   ├─ idempotency guard (terminal jobs return early)
 │   ├─ buildPersonContextFromDb (timeline + identities + traits + seed_data)
 │   ├─ loadPreviouslyFetchedUrls (skip set)
 │   │
 │   └─ computeResearch
 │       ├─ generateQueries (deterministic by default; LLM optional)
 │       ├─ step: query-0 [memoized via runStep]
 │       │   ├─ search → 8 results
 │       │   └─ for each URL:
 │       │       ├─ webFetch (SSRF-safe, 512KB cap)
 │       │       ├─ if fetch_failed/short → tryStoreSearchMetadataCandidate
 │       │       ├─ identity_eval (LLM, 6K chars, 7-field rubric)
 │       │       └─ if ACCEPT (domain cap OK): distill (LLM, 10K chars) → 1–3 chunks
 │       └─ step: query-1 ... query-N (up to BUDGET.maxSearches=16)
 │
 ├─ step: persist [single txn, idempotent]
 │   ├─ source_record (one per accepted/candidate URL, null bytes stripped)
 │   ├─ source_record_entity_link (with identity_evidence JSONB)
 │   ├─ source_chunk (research_summary, up to 3 per source)
 │   ├─ person_external_identity (observed → verified; linkedin_username → linkedin_url)
 │   └─ UPDATE research_job (terminal) + INSERT outbox (COMPLETED) ← only if any sources
 │
 └─ step: update-pipeline-stage
     └─ reconcilePipelineAfterEvidenceEnrichment:
         ├─ LinkedIn discovered? → enriching (match_method='web_research_discovered')
         ├─ accepted/human-confirmed? → inferring
         ├─ LinkedIn evidence already present? → inferring
         ├─ candidate only? → research_review
         └─ nothing useful? → unresearchable + invalidate inferred artifacts
      │
      ▼
outbox relay → Inngest (debounce 5s) traitInferenceFunction
 │
 ├─ Load source_chunk (research_summary) + timeline records
 ├─ extractFieldTraits() → deterministic traits from structured data
 ├─ extractLlmTraits() → LLM inference per dimension
 └─ replaceMachineClaims() → full idempotent replacement of trait_claim rows
```
```
┌───────────────────────────────────────────────────────────────┐
│ pipelineSweepFunction (cron */10 * * * *)                     │
│  - Re-emits stuck stages (>30 min)                            │
│  - Advances completed-but-stuck research jobs                 │
│  - Rescues false-terminal `unresearchable` rows               │
│  - Auto-confirms stale linkedin_review at confidence ≥ 0.85   │
└───────────────────────────────────────────────────────────────┘
```
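The reconciliation branch in the flow above reduces to a first-match decision ladder. A minimal sketch, assuming a simplified boolean outcome shape — this is not the actual `reconcilePipelineAfterEvidenceEnrichment` signature, and the `ResearchOutcome` interface is hypothetical:

```typescript
// Sketch of the post-research stage decision, evaluated top-down:
// the first matching condition wins and later branches are skipped.
interface ResearchOutcome {
  linkedinDiscovered: boolean;
  acceptedOrConfirmed: boolean;
  linkedinEvidencePresent: boolean;
  candidateOnly: boolean;
}

type Stage = "enriching" | "inferring" | "research_review" | "unresearchable";

function nextStage(o: ResearchOutcome): Stage {
  if (o.linkedinDiscovered) return "enriching";      // promote to LinkedIn lane
  if (o.acceptedOrConfirmed) return "inferring";     // good sources → trait inference
  if (o.linkedinEvidencePresent) return "inferring"; // prior LinkedIn evidence suffices
  if (o.candidateOnly) return "research_review";     // metadata-only: needs review
  return "unresearchable";                           // nothing useful found
}

console.log(nextStage({
  linkedinDiscovered: false,
  acceptedOrConfirmed: false,
  linkedinEvidencePresent: false,
  candidateOnly: true,
})); // prints research_review
```

The ordering matters: a discovered LinkedIn URL outranks accepted sources because it routes the entity back into the enrichment lane, which in turn supersedes the research result.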