Enrichment pipeline architecture: scale from 1k to 100k rows/month (2026)

Once you understand what enrichment is, the next question is how to build it so it scales without quietly breaking. This cluster covers the seven technical decisions that matter: overall pipeline design, batch vs real-time, sync vs async, waterfall logic, pipeline velocity, predictive enrichment, and the multimedia layer most teams overlook.

Why this cluster matters

Enrichment that works in a demo and enrichment that holds up at 50k rows/month are two different beasts. The gap is filled by 7 decisions you have to make about the pipeline shape - batch vs real-time, sync vs async, how the waterfall ranks providers. Get these right and the system runs unattended. Get them wrong and your CRM fills with garbage in 90 days.

Who it's for · RevOps engineers, technical sales ops, founders who'll touch the pipeline themselves.

What you'll learn

How to chain 3-5 data providers in a waterfall and lift match rates from 50% to 80%+
When to enrich on demand (real-time) vs overnight (batch) - and what each costs you
How to design async API calls that double throughput past 5k rows
The metrics that tell you a pipeline is healthy vs quietly dying
Where AI (LLM) enrichment beats classic enrichment - and where it wastes credits

All 7 guides in this cluster

Build a reliable enrichment pipeline

From raw spreadsheet to enriched CRM record: blueprint of a production-grade pipeline.

Batch vs real-time

When to enrich on demand vs. overnight - latency, cost and CRM hygiene trade-offs.

Synchronous vs asynchronous calls

API design choice that doubles your throughput when batch volume crosses 5k rows.

Predictive enrichment with AI

Use an LLM to fill the fields no provider has - and when it pays off vs. classic enrichment.

Waterfall logic explained

Chain 3-5 providers to lift match rates from 50% to 80%+ - with the cost-per-match math.

Measure pipeline velocity

The metrics that tell you whether your enrichment workflow is healthy or quietly dying.

Multimedia enrichment

Logos, screenshots, OG images - enriching the visual layer that powers personalized outreach.

5 pipeline architecture patterns nobody writes about (until they break in production)

Public guides will tell you to chain providers in a waterfall and call it a day. They won't tell you what happens when one of those providers introduces a 4-second latency spike at row 4,200, or when your sync HTTP client silently throttles itself at concurrency 12. The five patterns below come from watching a few dozen enrichment pipelines hit production scale - and the specific moments they almost died. Use them as a checklist before you ship.

1. Pick your default mode: batch-first or real-time-first (you can't have both)

Most teams want both batch and real-time enrichment from day one. The trap is that the right architecture for each is genuinely different. Batch-first systems optimize for throughput and cost - they process tens of thousands of records in a single nightly pass, are idempotent by design, and can retry failed lookups overnight without anyone noticing. Real-time-first systems optimize for latency - they target sub-2-second response on a single record, can't afford to retry, and need degraded-mode fallbacks when a provider is slow.

You can layer real-time on top of a batch-first system (request → if cached, return; else queue + return placeholder). You can't easily go the other way without rewriting your queue logic. Pick the mode that matches 80% of your traffic, build for it, then layer the other on as needed.
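
The layering described above (serve from cache if possible, otherwise queue the record for the next batch pass and return a placeholder) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a reference design: `cache`, `batch_queue`, `lookup`, and `run_batch` are hypothetical names, and the in-memory dict and deque stand in for whatever store and queue you actually run.

```python
from collections import deque

PENDING = {"status": "pending"}  # placeholder returned while the batch job catches up

cache = {}             # hypothetical: key -> enriched record (Redis, a DB table, ...)
batch_queue = deque()  # hypothetical: keys waiting for the next batch pass
queued = set()         # avoids queueing the same key twice


def lookup(key):
    """Real-time entry point layered on a batch-first pipeline."""
    if key in cache:
        return cache[key]          # cached: answer immediately
    if key not in queued:
        queued.add(key)
        batch_queue.append(key)    # not cached: defer to the batch pass
    return PENDING


def run_batch(enrich):
    """Batch pass: drain the queue and fill the cache. `enrich` is any callable."""
    while batch_queue:
        key = batch_queue.popleft()
        cache[key] = enrich(key)
        queued.discard(key)


first = lookup("ada@example.com")                   # queued, placeholder returned
run_batch(lambda k: {"status": "ok", "email": k})   # stand-in for the nightly job
second = lookup("ada@example.com")                  # now served from cache
```

The real-time path never calls a provider itself, which is what keeps it cheap to bolt onto an existing batch system; the cost is that a cold record only gets a placeholder until the next pass.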

How Derrick handles it: our Sheet integration is batch-first by default (whole columns enriched at once), with on-demand real-time available via the API/MCP for single-row lookups. Same multi-source lookup, two pipelines under the hood.

2. Synchronous HTTP dies at 5,000 rows. Switch to async earlier than you think.

The first version of every enrichment script uses a for-loop with `requests.get()` - and it works fine, until it doesn't. The breaking point is reliably around 5,000 records: at that volume, even with a fast provider, you're spending 30+ minutes on lookups that should take 5 minutes if you ran them in parallel. The Python script eats memory because it holds all the responses in scope, the connection pool exhausts, and one slow provider blocks all the others.

The fix is async (asyncio + httpx in Python, or any worker queue). The gotcha is provider rate limits: most enrichment APIs allow 10-50 concurrent requests, and once you go past that you get 429s in waves. A pattern that works in practice: async with a semaphore set to 80% of the provider's documented concurrency limit, plus exponential backoff on 429. That typically gives you 5-10x the throughput of a sync loop while staying inside the provider's good graces.
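
A minimal sketch of that pattern, using only the standard library so it runs as-is. Everything named here is an assumption for illustration: `fake_provider` stands in for a real HTTP call (with httpx you would await the client's request inside `fetch` and raise `RateLimited` when the status code is 429), and the delays are shortened so the demo finishes quickly.

```python
import asyncio
import random


class RateLimited(Exception):
    """Stands in for an HTTP 429 response."""


async def fetch_with_backoff(fetch, record, sem, max_retries=4, base_delay=0.01):
    """Call `fetch(record)` under the semaphore, backing off exponentially on 429."""
    for attempt in range(max_retries + 1):
        async with sem:
            try:
                return await fetch(record)
            except RateLimited:
                pass
        if attempt < max_retries:
            # Sleep outside the semaphore so another task can use the slot.
            jitter = random.uniform(0, base_delay)
            await asyncio.sleep(base_delay * 2 ** attempt + jitter)
    return None  # gave up; the caller can route this record to a retry queue


async def enrich_all(fetch, records, documented_limit=20):
    # Cap concurrency at 80% of the provider's documented limit.
    sem = asyncio.Semaphore(max(1, int(documented_limit * 0.8)))
    tasks = [fetch_with_backoff(fetch, r, sem) for r in records]
    return await asyncio.gather(*tasks)  # results come back in input order


# Fake provider (assumption): answers 429 on the first call for every record.
seen = set()


async def fake_provider(record):
    await asyncio.sleep(0.001)
    if record not in seen:
        seen.add(record)
        raise RateLimited()
    return {"id": record, "matched": True}


results = asyncio.run(enrich_all(fake_provider, list(range(50))))
```

The semaphore bounds in-flight requests per provider; in a multi-provider pipeline you would keep one semaphore per provider so a slow one cannot starve the others.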

How Derrick handles it: we batch and parallelize requests per provider with adaptive concurrency - when 429s spike, we throttle automatically. The Sheet user sees one progress bar, not the dozen worker processes underneath.

3. Waterfall ordering: cheapest-first is wrong

The naive waterfall orders providers by cost: try the cheapest first, fall through to expensive ones only when needed. It sounds rational. It's actively bad.

The reason: enrichment cost isn't dominated by per-record price - it's dominated by match rate × latency × the cost of the calls that fail. A €0.02 provider with a 25% match rate costs you €0.08 per match (€0.02 / 0.25), and three records out of four pay for a wasted round trip before the next provider is even tried. A €0.05 provider with a 75% match rate costs about €0.067 per match and resolves most records on the first call. Cheap-first wastes both money and time.
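
To make the arithmetic concrete, here is a small sketch that computes cost per match and the expected cost and call count per record for a two-provider waterfall. It assumes every call is billed whether or not it matches, and that one provider's misses are independent of the other's; both are simplifications, and the function names are hypothetical.

```python
def cost_per_match(price, match_rate):
    """Average spend to obtain one match when every call is billed."""
    return price / match_rate


def waterfall_cost_per_record(providers):
    """Expected spend and number of calls per record for an ordered waterfall.

    `providers` is a list of (price, match_rate). A record falls through to the
    next provider only when the previous one missed (misses assumed independent).
    """
    cost = calls = 0.0
    p_reached = 1.0  # probability that a record reaches this provider
    for price, match_rate in providers:
        cost += p_reached * price
        calls += p_reached
        p_reached *= 1 - match_rate
    return cost, calls


cheap = (0.02, 0.25)   # EUR 0.02 per call, 25% match rate
strong = (0.05, 0.75)  # EUR 0.05 per call, 75% match rate

cheap_first = waterfall_cost_per_record([cheap, strong])
strong_first = waterfall_cost_per_record([strong, cheap])
```

With these numbers, cheap-first comes out at roughly €0.0575 and 1.75 calls per record, against €0.055 and 1.25 calls with the higher-match-rate provider first, for the same overall coverage. Different prices or match rates can flip the result, which is why the benchmark matters.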

The correct rule: order providers by expected match rate on YOUR data, with cost as a tiebreaker. Run a 500-record benchmark per provider on a representative sample before you decide the order. Most teams discover the order they would have guessed is wrong by at least one position.
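
The ordering rule can be written down directly once you have benchmark counts. The sketch below uses made-up numbers for three hypothetical providers; `BenchmarkResult` and `waterfall_order` are illustrative names, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkResult:
    provider: str
    matched: int       # records matched in the sample
    sample_size: int   # records sent to the provider (e.g. 500)
    price: float       # price per call

    @property
    def match_rate(self):
        return self.matched / self.sample_size


def waterfall_order(results):
    """Highest observed match rate first; lower price breaks ties."""
    return sorted(results, key=lambda r: (-r.match_rate, r.price))


# Made-up benchmark numbers, for illustration only.
benchmark = [
    BenchmarkResult("provider_a", matched=125, sample_size=500, price=0.02),
    BenchmarkResult("provider_b", matched=375, sample_size=500, price=0.05),
    BenchmarkResult("provider_c", matched=375, sample_size=500, price=0.04),
]
order = [r.provider for r in waterfall_order(benchmark)]
```

Raw match rates ignore overlap between providers: two providers that match the same records add little when chained, so it is worth re-running the benchmark for positions two and beyond on the records the first provider missed.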

How Derrick handles it: our default multi-source lookup is ordered by observed match rate across our customer base, then re-optimized per ICP cluster behind the scenes. You don't touch the ordering - the system learns from outcomes.

4. Idempotency + retry: the two-line fix that saves your weekend

Every production pipeline eventually has a provider go down mid-run. Without idempotency, that means re-running the whole batch - paying again for the records that already succeeded. With idempotency, you re-run safely and only the failed records get re-attempted.

The implementation is simple: tag every enrichment request with a deterministic key (e.g. `sha256(email + provider + date_bucket)`) and have your enrichment client check a local cache before making the network call. Failed records get a separate retry queue with exponential backoff (1s, 4s, 16s, 64s, then dead-letter). On Monday morning, the dead-letter queue tells you exactly which records need manual attention.
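
A minimal sketch of the key, the cache check, and the retry schedule described above. The monthly date bucket, the email normalization, and the helper names (`idempotency_key`, `enrich`, `next_step`) are assumptions for illustration; the dict and list stand in for a persistent cache and a real queue.

```python
import hashlib
from datetime import date

RETRY_DELAYS = [1, 4, 16, 64]  # seconds; once exhausted, the record is dead-lettered


def idempotency_key(email, provider, day):
    """Deterministic key: same input, provider, and date bucket give the same key."""
    bucket = day.strftime("%Y-%m")  # assumption: monthly freshness bucket
    raw = f"{email.strip().lower()}|{provider}|{bucket}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def enrich(email, provider, day, call, cache, retry_queue):
    key = idempotency_key(email, provider, day)
    if key in cache:
        return cache[key]  # already paid for: no network call
    try:
        cache[key] = call(email)
        return cache[key]
    except Exception:
        # Failed lookups go to a separate queue instead of aborting the batch.
        retry_queue.append({"key": key, "email": email, "attempt": 0})
        return None


def next_step(item):
    """Delay before the next retry, or None when the item should be dead-lettered."""
    attempt = item["attempt"]
    return RETRY_DELAYS[attempt] if attempt < len(RETRY_DELAYS) else None


cache, retry_queue, calls = {}, [], []


def fake_call(email):  # stand-in for a billed provider request
    calls.append(email)
    return {"email": email, "company": "Example Ltd"}


today = date(2026, 1, 15)
enrich("Ada@Example.com", "provider_a", today, fake_call, cache, retry_queue)
enrich("ada@example.com ", "provider_a", today, fake_call, cache, retry_queue)  # cache hit
```

Re-running a crashed batch through `enrich` only bills the records that are not yet in the cache, which is the whole point of the key.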

Teams that skip this typically discover it after the third "oh no, the script crashed at row 8,400 - do we re-run the whole thing?" Saturday evening. Build it in week one, save yourself fifty hours over the year.

How Derrick handles it: all enrichments are cached by (input, provider, freshness window) - re-running an enrichment on an unchanged Sheet returns instantly without re-billing. Failed records are flagged in the Sheet and retryable in one click.

5. Observability: track these 3 metrics or fly blind

The pattern across the five: production pipelines fail in mundane ways that no public tutorial covers. Mode mismatch, sync at scale, wrong waterfall order, missing idempotency, no observability - these are the failure modes you only learn about after they cost you a weekend. The 7 guides in this cluster cover each of these decisions in tactical depth.

If you'd rather not architect from scratch, install Derrick free - all five patterns are built in by default, and you keep your team focused on revenue instead of pipeline plumbing.

FAQs about this cluster

Do I need to know how to code to use this cluster?

No. The guides are written for non-developers - they cover the decisions, not the implementation. If you're choosing between batch vs real-time, you can apply the framework without writing a line of code.

What's the most important guide to read first?

Read 'Waterfall logic explained' first. It's the single change that lifts match rates the most. Then 'Build a reliable enrichment pipeline' to understand how it fits in the larger system.

Does Derrick handle these techniques natively?

Yes. Derrick runs a multi-source lookup across 10 sources by default and supports both batch and real-time enrichment in Google Sheets and via API/MCP.

Start enriching your sheet in 30 seconds

Free for 100 credits/month. No credit card.

Install Derrick free →
