Enrichment pipeline architecture: scale from 1k to 100k rows/month (2026)
Seven deep-dives into the engineering choices that separate a hobby script from a system your team can actually depend on — including the architecture patterns nobody documents.
Once you understand what enrichment is, the next question is how to build it so it scales without quietly breaking. This cluster covers the seven technical decisions that matter: overall pipeline design, batch vs real-time, sync vs async, waterfall logic, pipeline velocity, predictive enrichment, and the multimedia layer most teams overlook.
Why this cluster matters
Enrichment that works in a demo and enrichment that holds up at 50k rows/month are two different beasts. The gap is filled by 7 decisions you have to make about the pipeline shape — batch vs real-time, sync vs async, how the waterfall ranks providers. Get these right and the system runs unattended. Get them wrong and your CRM fills with garbage in 90 days.
Who it's for · RevOps engineers, technical sales ops, founders who'll touch the pipeline themselves.
What you'll learn
- How to chain 3-5 data providers in a waterfall and lift match rates from 50% to 80%+
- When to enrich on demand (real-time) vs overnight (batch) — and what each costs you
- How to design async API calls that double throughput past 5k rows
- The metrics that tell you a pipeline is healthy vs quietly dying
- Where AI (LLM) enrichment beats classic enrichment — and where it wastes credits
All 7 guides in this cluster
Build a reliable enrichment pipeline
From raw spreadsheet to enriched CRM record: blueprint of a production-grade pipeline.
Batch vs real-time
When to enrich on demand vs. overnight: latency, cost and CRM hygiene trade-offs.
Synchronous vs asynchronous calls
API design choice that doubles your throughput when batch volume crosses 5k rows.
Predictive enrichment with AI
Use an LLM to fill the fields no provider has, and when it pays off vs. classic enrichment.
Waterfall logic explained
Chain 3-5 providers to lift match rates from 50% to 80%+, with the cost-per-match math.
Measure pipeline velocity
The metrics that tell you whether your enrichment workflow is healthy or quietly dying.
Multimedia enrichment
Logos, screenshots, OG images: enriching the visual layer that powers personalized outreach.
5 pipeline architecture patterns nobody writes about (until they break in production)
Public guides will tell you to chain providers in a waterfall and call it a day. They won't tell you what happens when one of those providers introduces a 4-second latency spike at row 4,200, or when your sync HTTP client silently throttles itself at concurrency 12. The five patterns below come from watching a few dozen enrichment pipelines hit production scale — and the specific moments they almost died. Use them as a checklist before you ship.
1. Pick your default mode: batch-first or real-time-first (you can't optimize for both)
Most teams want both batch and real-time enrichment from day one. The trap is that the right architecture for each is genuinely different. Batch-first systems optimize for throughput and cost — they process tens of thousands of records in a single nightly pass, are idempotent by design, and can retry failed lookups overnight without anyone noticing. Real-time-first systems optimize for latency — they target sub-2-second response on a single record, can't afford to retry, and need degraded-mode fallbacks when a provider is slow.
You can layer real-time on top of a batch-first system (request → if cached, return; else queue + return placeholder). You can't easily go the other way without rewriting your queue logic. Pick the mode that matches 80% of your traffic, build for it, then layer the other on as needed.
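The "if cached, return; else queue + return placeholder" layering can be sketched as below. This is a minimal in-memory illustration, not a production design: the class name `BatchFirstEnricher`, the `pending` placeholder shape, and the `enrich` callable standing in for a provider lookup are all assumptions made for the example.

```python
from collections import deque


class BatchFirstEnricher:
    """Batch-first core with a thin real-time lookup layered on top (hypothetical names)."""

    def __init__(self):
        self.cache = {}        # key -> enriched record from a previous batch pass
        self.queue = deque()   # keys waiting for the next batch pass
        self._queued = set()   # prevents queueing the same key twice

    def lookup(self, key):
        """Real-time entry point: never calls a provider directly."""
        if key in self.cache:
            return {"status": "ok", **self.cache[key]}
        if key not in self._queued:
            self.queue.append(key)
            self._queued.add(key)
        # Placeholder the caller can show until the batch pass has run.
        return {"status": "pending"}

    def run_batch(self, enrich):
        """Batch pass: drain the queue through `enrich`, a stand-in for the provider call."""
        processed = 0
        while self.queue:
            key = self.queue.popleft()
            self._queued.discard(key)
            self.cache[key] = enrich(key)
            processed += 1
        return processed
```

A caller would hit `lookup()` on demand and run `run_batch()` on a schedule. The real-time path stays cheap because it only ever reads the cache or enqueues work.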
How Derrick handles it: our Sheet integration is batch-first by default (whole columns enriched at once), with on-demand real-time available via the API/MCP for single-row lookups. Same waterfall, two pipelines under the hood.
2. Synchronous HTTP dies at 5,000 rows. Switch to async earlier than you think.
The first version of every enrichment script uses a for-loop with `requests.get()`, and it works fine until it doesn't. The breaking point is typically around 5,000 records: at that volume, even with a fast provider, you're spending 30+ minutes on lookups that should take 5 minutes if you ran them in parallel. The script also eats memory if it holds every response in scope, pays for a fresh connection on each bare `requests.get()` call (no session is reused), and lets one slow provider block every request queued behind it.
The fix is async (asyncio + httpx in Python, or any worker queue). The gotcha is provider rate limits: most enrichment APIs allow 10-50 concurrent requests, and if you go past that you get 429s in waves. A pattern that holds up in practice: async with a semaphore set to 80% of the provider's documented concurrency limit, plus exponential backoff on 429. This gives you 5-10x throughput vs sync while staying in the provider's good graces.
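Here is a hedged sketch of the semaphore-plus-backoff pattern, using only the standard library. The HTTP call is injected as a `fetch` coroutine (in practice it might wrap an `httpx.AsyncClient` request), and `RateLimited`, `enrich_all`, `documented_limit`, and the delay values are assumptions for illustration, not any provider's API.

```python
import asyncio
import random


class RateLimited(Exception):
    """Raised by `fetch` when the provider answers HTTP 429 (hypothetical convention)."""


async def enrich_all(rows, fetch, documented_limit=20, max_retries=4, base_delay=0.5):
    # Cap concurrency at 80% of the provider's documented limit.
    semaphore = asyncio.Semaphore(max(1, documented_limit * 8 // 10))

    async def enrich_one(row):
        for attempt in range(max_retries + 1):
            try:
                async with semaphore:
                    return await fetch(row)
            except RateLimited:
                if attempt == max_retries:
                    return None  # give up; the caller can route this row to a retry queue
                # Exponential backoff with jitter. The sleep happens outside the
                # semaphore, so a waiting row does not hold a concurrency slot.
                delay = base_delay * 2 ** attempt + random.uniform(0, base_delay)
                await asyncio.sleep(delay)

    # gather() preserves input order, so results line up with rows.
    return await asyncio.gather(*(enrich_one(row) for row in rows))
```

Sleeping outside the `async with` block is a deliberate choice here: a row that is backing off frees its slot for rows that can still make progress.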
How Derrick handles it: we batch and parallelize requests per provider with adaptive concurrency — when 429s spike, we throttle automatically. The Sheet user sees one progress bar, not the dozen worker processes underneath.
3. Waterfall ordering: cheapest-first is wrong
The naive waterfall orders providers by cost: try the cheapest first, fall through to expensive ones only when needed. It sounds rational. It's actively bad.
The reason: enrichment cost isn't dominated by per-record price. It's driven by match rate, latency, and the cost of the calls that fail. A €0.02 provider with a 25% match rate costs you €0.08 per match and needs four calls on average to produce one. A €0.05 provider with a 75% match rate costs about €0.067 per match and usually finds the record on the first try. Cheap-first wastes both money and time.
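To see how the ordering plays out once both providers sit in the same waterfall, here is a small worked calculation. It rests on two simplifying assumptions that are not from the guide: every call is billed whether or not it matches, and the two providers' hits are statistically independent. The function name `waterfall_economics` is hypothetical.

```python
def waterfall_economics(providers):
    """providers: list of (price_per_call, match_rate) tuples, in call order.

    Assumes every call is billed and that providers match independently.
    """
    p_unmatched = 1.0  # probability a record is still unmatched at this step
    cost = 0.0         # expected spend per input record
    calls = 0.0        # expected provider calls per input record
    for price, match_rate in providers:
        cost += p_unmatched * price
        calls += p_unmatched
        p_unmatched *= 1 - match_rate
    matched = 1 - p_unmatched
    return {
        "cost_per_match": cost / matched,
        "calls_per_record": calls,
        "match_rate": matched,
    }


cheap = (0.02, 0.25)   # EUR 0.02 per call, 25% match rate
strong = (0.05, 0.75)  # EUR 0.05 per call, 75% match rate

cheap_first = waterfall_economics([cheap, strong])
strong_first = waterfall_economics([strong, cheap])
```

Under these assumptions both orders reach the same 81.25% combined match rate, but cheap-first lands at roughly €0.071 per match with 1.75 calls per record, against roughly €0.068 and 1.25 calls when the higher-match provider goes first. Those are blended waterfall figures, which is why they differ from each provider's standalone cost per match.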
The correct rule: order providers by expected match rate on YOUR data, with cost as a tiebreaker. Run a 500-record benchmark per provider on a representative sample before you decide the order. Most teams discover the order they would have guessed is wrong by at least one position.
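The ordering rule can be expressed in a few lines once the benchmark numbers are in. This is a sketch: `BenchmarkResult`, `waterfall_order`, and the provider names and figures in the example are placeholders, not real benchmark data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkResult:
    provider: str
    attempted: int         # records sent in the benchmark, e.g. 500
    matched: int           # records the provider returned a usable match for
    price_per_call: float

    @property
    def match_rate(self):
        return self.matched / self.attempted if self.attempted else 0.0


def waterfall_order(results):
    """Highest observed match rate first; lower price breaks ties."""
    ranked = sorted(results, key=lambda r: (-r.match_rate, r.price_per_call))
    return [r.provider for r in ranked]


# Made-up benchmark figures, for illustration only.
benchmark = [
    BenchmarkResult("provider_a", 500, 125, 0.02),
    BenchmarkResult("provider_b", 500, 375, 0.05),
    BenchmarkResult("provider_c", 500, 375, 0.04),
]
order = waterfall_order(benchmark)
```

With these placeholder numbers, `provider_c` and `provider_b` tie on match rate, so the cheaper `provider_c` goes first and `provider_a` goes last despite its lower price.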
How Derrick handles it: our default waterfall is ordered by observed match rate across our customer base, then re-optimized per ICP cluster behind the scenes. You don't touch the ordering — the system learns from outcomes.
4. Idempotency + retry: the two-line fix that saves your weekend
Every production pipeline eventually has a provider go down mid-run. Without idempotency, that means re-running the whole batch — paying again for the records that already succeeded. With idempotency, you re-run safely and only the failed records get re-attempted.
The implementation is simple: tag every enrichment request with a deterministic key (e.g. `sha256(email + provider + date_bucket)`) and have your enrichment client check a local cache before making the network call. Failed records get a separate retry queue with exponential backoff (1s, 4s, 16s, 64s, then dead-letter). On Monday morning, the dead-letter queue tells you exactly which records need manual attention.
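A minimal sketch of the key, cache check, and retry schedule described above. For brevity it retries inline rather than through a separate retry queue, which is a simplification of the design in the text. The names `idempotency_key`, `enrich_once`, `call_provider`, the in-memory `cache` dict, the lower-casing of emails, and treating `ConnectionError` as the retryable failure are all assumptions.

```python
import hashlib
import time

BACKOFF_SECONDS = (1, 4, 16, 64)  # schedule from the text; after that, dead-letter


def idempotency_key(email, provider, date_bucket):
    # A separator keeps ("ab", "c") and ("a", "bc") from producing the same hash input.
    raw = "\x1f".join((email.strip().lower(), provider, date_bucket))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def enrich_once(email, provider, date_bucket, call_provider, cache, dead_letter,
                sleep=time.sleep):
    key = idempotency_key(email, provider, date_bucket)
    if key in cache:
        return cache[key]  # already paid for: no network call, no re-billing
    # One initial attempt, then one retry after each backoff delay.
    for delay in (0, *BACKOFF_SECONDS):
        if delay:
            sleep(delay)
        try:
            cache[key] = call_provider(email)
            return cache[key]
        except ConnectionError:
            continue
    # All attempts failed: park the record for manual attention.
    dead_letter.append({"key": key, "email": email, "provider": provider})
    return None
```

Re-running a crashed batch with the same `cache` only re-attempts records whose keys are missing, so the successful rows are not paid for twice. A real deployment would back `cache` and `dead_letter` with persistent storage.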
Teams that skip this typically discover it after the third "oh no, the script crashed at row 8,400 — do we re-run the whole thing?" Saturday evening. Build it in week one, save yourself fifty hours over the year.
How Derrick handles it: all enrichments are cached by (input, provider, freshness window) — re-running an enrichment on an unchanged Sheet returns instantly without re-billing. Failed records are flagged in the Sheet and retryable in one click.
5. Observability: track these 3 metrics or fly blind
A pipeline you can't measure is a pipeline you can't fix. Three metrics, tracked over rolling 7-day windows, will catch 95% of the things that go wrong:
- Match rate per provider — if any provider drops more than 10 points week-over-week, they've changed their data sourcing or you've changed your ICP. Investigate before billing piles up.
- p95 latency per provider — the average hides the slow tail that's blowing up your real-time SLO. Track p95, alert on 2x baseline.
- Cost-per-verified-match — the true unit economics. If this is rising while match rate is flat, providers are getting more expensive or your data is getting harder. Either way, you need to know.
These three plug into any monitoring stack (Datadog, Grafana, even a Sheet refreshed nightly). The teams that fly blind eventually rediscover them after a quarter of unexplained cost overruns.
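As a rough illustration, the three metrics and the two alert thresholds mentioned above can be computed from a list of per-call events. The event shape (`matched`, `verified`, `latency_ms`, `cost`) is an assumption, as is what counts as "verified"; the percentile uses the nearest-rank method, which is one of several reasonable definitions.

```python
def p95(values):
    """Nearest-rank 95th percentile, using integer arithmetic for the rank."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def provider_metrics(events):
    """events: dicts with 'matched' (bool), 'verified' (bool), 'latency_ms', 'cost'."""
    if not events:
        raise ValueError("no events in the window")
    matched = sum(1 for e in events if e["matched"])
    verified = sum(1 for e in events if e["matched"] and e["verified"])
    spend = sum(e["cost"] for e in events)
    return {
        "match_rate_pct": 100 * matched / len(events),
        "p95_latency_ms": p95([e["latency_ms"] for e in events]),
        "cost_per_verified_match": spend / verified if verified else None,
    }


def alerts(current, previous):
    """Compare this window's metrics with the previous window's (the baseline)."""
    found = []
    if previous["match_rate_pct"] - current["match_rate_pct"] > 10:
        found.append("match rate dropped more than 10 points")
    if current["p95_latency_ms"] > 2 * previous["p95_latency_ms"]:
        found.append("p95 latency above 2x baseline")
    return found
```

Feeding it one provider's events for the current and the previous 7-day window yields the two alert conditions from the list. Cost-per-verified-match is returned as a raw number to trend over time rather than alert on.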
How Derrick handles it: these three live in the in-Sheet dashboard, refreshed in real time as enrichments run. No extra monitoring stack to set up, no extra subscription to pay.
The pattern across the five: production pipelines fail in mundane ways that no public tutorial covers. Mode mismatch, sync at scale, wrong waterfall order, missing idempotency, no observability: these are the failure modes you only learn about after they cost you a weekend. The 7 guides in this cluster cover each of these decisions in tactical depth.
If you'd rather not architect from scratch, install Derrick free — all five patterns are built in by default, and you keep your team focused on revenue instead of pipeline plumbing.
FAQs about this cluster
Do I need to know how to code to use this cluster?
No. The guides are written for non-developers — they cover the decisions, not the implementation. If you're choosing between batch vs real-time, you can apply the framework without writing a line of code.
What's the most important guide to read first?
Read 'Waterfall logic explained' first. It's the single change that lifts match rates the most. Then 'Build a reliable enrichment pipeline' to understand how it fits in the larger system.
Does Derrick handle these techniques natively?
Yes. Derrick chains a 10-source waterfall by default and supports both batch and real-time enrichment in Google Sheets and via API/MCP.
Explore the other clusters
Foundations
Six core guides that establish what enrichment is, why it matters, and how a process works end-to-end — plus the 5 misconceptions that kill most projects in their third month.
6 guides
Sources & use cases
Eight concrete plays (webform inbound, anonymous web traffic, business card OCR, LinkedIn URLs), each with the provider stack that actually works for it, and the costs spelled out.
8 guides
Pitfalls & deep-dives
Five honest reads on what goes wrong, plus the 12 anti-patterns that have torpedoed real enrichment projects, with the fix for each one.
5 guides
Start enriching your sheet in 30 seconds
Free for 100 credits/month. No credit card.
Install Derrick free →