A data engineer said it would take an hour. Three weeks later, everyone knows why it didn't.
There's a moment every data team hits.
New records came in. Customers updated their details. Some old entries got deleted. The manager leans over and says: "Just re-run the matching. Should be quick, right?"
The engineer nods. Fires off the job. An hour later, prod is down. Every ID changed. The CRM is broken. The fraud model is broken. The dashboard is broken. Everything is on fire.
That moment — and the three weeks of reckoning that follow — is what separates teams that understand entity resolution from teams that think entity resolution is a string comparison.
We made a comic about it. But this post is about the real decision underneath: whether to build your own incremental entity resolution system, or use something purpose-built for the problem.
It is, bluntly, a harder problem than most teams expect when they start.
Most build-your-own approaches treat entity resolution as a batch process. Match all the records. Assign cluster IDs. Write to a table. Done.
This works fine the first time. The problems start the moment your data changes — which is continuously, in any real business.
Take two clusters — call them Cluster A and Cluster B — that have never matched each other:
A new record arrives: "John Smith" / jsmith@gmail.com / +1-423-4343.
This single record matches Cluster A via email and Cluster B via phone. It is the bridge between them. Transitivity applies — if A matches the bridge, and B matches the bridge, then A and B must be the same entity.
The two clusters must merge. One ID has to die.
This is not an edge case. It is the normal behavior of identity data at scale. People share email addresses across accounts. Phone numbers get reassigned. A person signs up via Google OAuth and later creates an account directly. The bridge record is how you find out they are the same person.
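The transitive-merge mechanics can be sketched with a standard union-find structure. This is a minimal illustration, not any particular product's implementation; the record names and match signals are invented for the example:

```python
# Sketch: a single bridge record forces two previously unrelated
# clusters to merge, via transitivity over pairwise matches.

class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

uf = UnionFind()
uf.union("a1", "a2")  # Cluster A: records linked by email
uf.union("b1", "b2")  # Cluster B: records linked by phone
assert uf.find("a1") != uf.find("b1")  # still separate entities

# The bridge record matches a2 via email AND b1 via phone.
uf.union("bridge", "a2")
uf.union("bridge", "b1")

# Transitivity: A and B now resolve to a single entity.
assert uf.find("a1") == uf.find("b2")
```

The hard part is not the union-find; it is deciding, after the merge, which of the two cluster IDs survives and how downstream systems learn about the one that died.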
What breaks when you re-run instead of running incrementally: Every cluster gets re-evaluated from scratch. Every ID changes. Your CRM, fraud model, loyalty platform, analytics dashboards — all of them hold references to the old IDs. Those references are now invalid. You have just invalidated every foreign key pointing to your identity table across every downstream system, simultaneously. This is what "prod is down" looks like in practice.
Now consider a cluster that was correctly formed — say, ID:3301, holding three records for "Timothy Chen" at the same address:
A record update comes in: "Timothy Chen, Apt 3, Wood St, Sector 48."
Same name. Different apartment. Different sector. This is a different person, or the same person who has moved to a genuinely different address that no longer matches the original cluster.
The cluster must split. ID:3301 stays with the original three records. A new ID — say, 7744 — is issued for the updated record.
Downstream systems holding ID:3301 are fine. But they need to know ID:7744 now exists and what it represents.
What a re-run does here: It sees the updated record and re-clusters everything from scratch. Maybe it gets the split right. But it assigns new IDs to everything in the process. You still have the invalidation problem — just with different cluster assignments.
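A split needs an ID-survivorship policy. One common choice — sketched below with invented names and an illustrative ID sequence — is that the largest surviving sub-cluster keeps the old ID, so most downstream references stay valid, and only the records that actually moved get a new one:

```python
from itertools import count

_new_ids = count(7744)  # illustrative ID sequence, not a real allocator

def split_cluster(old_id, subclusters):
    """Assign IDs after a split: the largest sub-cluster keeps the
    old ID, every other sub-cluster is minted a fresh one. Downstream
    references to old_id stay valid for the bulk of the records."""
    subclusters = sorted(subclusters, key=len, reverse=True)
    assignments = {old_id: subclusters[0]}
    for sub in subclusters[1:]:
        assignments[next(_new_ids)] = sub
    return assignments

result = split_cluster(3301, [["r1", "r2", "r3"], ["r4_updated"]])
# 3301 keeps the three original records; 7744 is issued for the update.
```

With this policy, the incremental system only has to notify consumers about the new ID and its membership — not rewrite every reference to the old one.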
These two scenarios — the bridge record merge and the correction split — represent the core of the incremental matching problem. There are more.
Deleted records can destabilize clusters. If a record that was the sole link between two sub-clusters gets deleted, those sub-clusters should split. Does your system track that dependency?
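Tracking that dependency amounts to knowing, for each cluster, which match edges hold it together — and re-deriving connected components when a record (and its edges) disappears. A minimal sketch, with invented record names:

```python
from collections import defaultdict

def connected_components(records, edges):
    """Re-derive components from pairwise match edges. Deleting a
    record removes its edges, which can split a cluster."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, components = set(), []
    for r in records:
        if r in seen:
            continue
        stack, comp = [r], set()
        while stack:
            cur = stack.pop()
            if cur in comp:
                continue
            comp.add(cur)
            stack.extend(adj[cur] - comp)
        seen |= comp
        components.append(comp)
    return components

edges = [("a", "link"), ("link", "b")]  # "link" is the sole bridge
assert len(connected_components(["a", "link", "b"], edges)) == 1

# Delete "link": its edges vanish, and the cluster must split in two.
pruned = [(x, y) for x, y in edges if "link" not in (x, y)]
assert len(connected_components(["a", "b"], pruned)) == 2
```

A batch re-run gets this for free by accident; an incremental system has to detect that the deleted record was load-bearing.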
Human feedback compounds the difficulty. A data steward reviews two records and marks them as definitely not the same entity. The model agrees, scores them below threshold, keeps them separate. Three weeks later, one of those records gets updated. The similarity score rises above threshold. Does your system re-merge them? If it does, you have silently overridden the steward's decision. Trust in the system collapses. If it doesn't, you need to have tracked that human approval as a hard constraint and propagated it through every subsequent incremental run.
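Propagating that approval means storing steward decisions as hard constraints that veto a merge regardless of what the model scores later. A hypothetical sketch — record names, threshold, and storage are all illustrative:

```python
# Steward decision stored as a hard "not the same entity" constraint.
not_same = {frozenset({"rec_17", "rec_42"})}

def may_merge(cluster_a, cluster_b):
    """A merge is allowed only if no record pair across the two
    clusters is covered by a steward 'not same' decision."""
    for a in cluster_a:
        for b in cluster_b:
            if frozenset({a, b}) in not_same:
                return False
    return True

# Three weeks later, an update pushes the model's score over threshold.
score = 0.93
allowed = score > 0.85 and may_merge({"rec_17"}, {"rec_42"})
assert allowed is False  # the steward decision holds; no silent override
```

The constraint check has to run on every subsequent incremental update, not just the run in which the steward clicked "not a match".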
Model upgrades break everything if you haven't planned for ID continuity. You retrain your matching model on better training data. The new model is more accurate. You run it — and every cluster shifts slightly. Some records that were together are now separate. Some that were separate are now together. Every ID changes again. Your data steward's approvals from the last six months are now pointing at stale cluster IDs.
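ID continuity across a model upgrade usually means remapping the new model's clusters onto the old ID space by membership overlap, and minting fresh IDs only where no old cluster plausibly corresponds. A simplified sketch (greedy maximum-overlap matching; a production system would be more careful):

```python
def carry_over_ids(old_clusters, new_clusters, next_id):
    """Map new clusters to old IDs by maximum record overlap, so most
    IDs survive a model upgrade. Clusters are {id: set_of_records}."""
    assigned, used = {}, set()
    for new_id, members in sorted(new_clusters.items(),
                                  key=lambda kv: -len(kv[1])):
        best, best_overlap = None, 0
        for old_id, old_members in old_clusters.items():
            overlap = len(members & old_members)
            if overlap > best_overlap and old_id not in used:
                best, best_overlap = old_id, overlap
        if best is not None:
            assigned[new_id] = best
            used.add(best)
        else:
            assigned[new_id] = next_id  # genuinely new entity
            next_id += 1
    return assigned

old = {3301: {"r1", "r2", "r3"}, 88: {"r9"}}
new = {"c0": {"r1", "r2"}, "c1": {"r3", "r9"}}
mapping = carry_over_ids(old, new, next_id=9000)
# c0 inherits 3301 via largest overlap; c1 falls back to 88.
```

The same mapping is what lets six months of steward approvals keep pointing at IDs that still mean something after the retrain.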
None of these problems are unsolvable. But each one requires deliberate engineering. Together, they constitute a system that most teams significantly underestimate before they start building.
Let's be honest about what "build" means here.
Building entity resolution from scratch means building:

- A matching engine — rules, ML scores, or both — plus a blocking strategy that keeps the comparison space tractable as volumes grow.
- An incremental merge-and-split engine that handles bridge records, corrections, and deletions without reprocessing the world.
- ID lifecycle management, so entity IDs survive merges, splits, re-runs, and model upgrades.
- A constraint store for human feedback, so steward decisions persist as ground truth across every subsequent run.
- Dependency tracking for deletions, so removing a record destabilizes exactly the clusters it was holding together.
- A migration path for model upgrades that preserves ID continuity.

Each of these is a real engineering project. Together, they are a product.
That is the build option.
The buy option is using a purpose-built system that has already solved these problems — one that runs natively on the infrastructure you already have, so there is no new operational surface to manage.
Build makes sense when your entity resolution problem is genuinely narrow and stable.
If you are matching records within a single system — say, deduplicating a product catalog with a predictable structure — a custom rule-based or lightweight ML approach may be entirely sufficient. If the data doesn't change much, and you can afford to re-run periodically without ID stability mattering, the batch approach works.
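In that narrow, stable setting, "custom rule-based" can be as simple as deduplicating on a normalized key. A toy sketch — the field names and normalization rules are invented for illustration:

```python
import re

def dedup_key(product):
    """Rule-based key for a stable catalog: lowercase the name and
    strip everything that isn't a letter or digit. Good enough when
    the structure is predictable and IDs don't need to be stable."""
    return (re.sub(r"[^a-z0-9]", "", product["name"].lower()),
            product["sku_prefix"])

catalog = [
    {"name": "USB-C Cable, 2m", "sku_prefix": "CB"},
    {"name": "usb c cable 2m", "sku_prefix": "CB"},
]
seen, deduped = set(), []
for p in catalog:
    k = dedup_key(p)
    if k not in seen:
        seen.add(k)
        deduped.append(p)
# Both rows normalize to the same key; one survives.
```

Nothing about this approach survives contact with continuously changing identity data — which is exactly the point of the build-vs-buy question.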
Build also makes sense when you have deep domain expertise and the problem genuinely requires custom logic that a general-purpose system cannot express. Some industries have specific identifiers — regulatory entity codes, national ID schemes, proprietary matching signals — that benefit from domain-specific rules layered on top of statistical matching.
Buy makes sense when any of the following are true:
Your data changes continuously. If new customers, supplier updates, and regulatory filings arrive daily, you need an incremental system. Building one correctly is a six- to twelve-month engineering project. That is time not spent on the downstream analytics and AI that the entity resolution is supposed to enable.
Downstream systems depend on stable IDs. If your CRM, fraud model, compliance platform, and data warehouse all hold references to entity IDs, those IDs must survive re-runs, model upgrades, and incremental updates. This requires deliberate ID lifecycle management. It is not a feature you add later.
Human review needs to be preserved. If data stewards are investing time in reviewing and approving entity assignments, those decisions must be treated as ground truth. A system that silently overrides them on the next run is worse than no system at all.
You are deploying AI on top of entity data. Agents making autonomous decisions about customers, suppliers, or transactions need to know who they are dealing with. An agent that acts on a fragmented entity view makes wrong decisions confidently and at machine speed. The cost of bad identity data is not linear in an agentic context — it is multiplicative.
You need to run at scale. Entity resolution on 10 million records with daily incremental updates is a different problem from entity resolution on 100,000 records run monthly. The comparison space, the infrastructure requirements, and the operational complexity are all different. Purpose-built systems have solved this already.
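The comparison-space problem is why every system at scale uses some form of blocking: only records sharing a cheap key are ever compared, turning an O(n²) all-pairs pass into a sum of small within-block passes. A minimal sketch, with an illustrative blocking key:

```python
from collections import defaultdict
from itertools import combinations

def blocked_pairs(records, key):
    """Yield candidate pairs only within blocks that share a cheap
    key (here: a prefix of the name), instead of comparing all pairs."""
    blocks = defaultdict(list)
    for r in records:
        blocks[key(r)].append(r)
    for block in blocks.values():
        yield from combinations(block, 2)

records = [{"name": "smith"}, {"name": "smyth"}, {"name": "jones"}]
pairs = list(blocked_pairs(records, key=lambda r: r["name"][:2]))
# "smith"/"smyth" share block "sm"; "jones" is never compared to them.
```

Choosing blocking keys that are cheap, high-recall, and stable under daily updates is itself a tuning problem — one more reason 10 million records with incremental updates is a different sport from 100,000 run monthly.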
The comic follows a manager who asks for something that sounds trivial — "just re-run it" — and an engineer who learns, over three weeks, why it is not.
The lesson is not that incremental matching is impossibly hard. The lesson is that it is a specific, well-defined problem that has specific, well-defined solutions — and that the gap between "we'll just re-run the batch job" and "we have a production-grade incremental entity resolution system" is not a gap you want to discover for the first time when prod is down.
It's not a re-run. It's a living graph mutation. Every change has a consequence.
That is true whether you build or buy. The difference is whether you spend your engineering time building the graph mutation engine, or spend it on the business problems the engine enables.