Entity Resolution: Build Vs Buy

Entity Resolution
March 24, 2026

A look at why one of data engineering's most "obvious" problems turns out to be one of its hardest.

There's a reason entity resolution keeps showing up on data team backlogs, half-finished, with a comment like


#TODO: handle edge cases.

The problem statement is deceptively simple: given two records, do they refer to the same real-world entity? Yes or no.

That simplicity is a trap.

The String-Matching Fallacy

The first instinct is exact string matching. It fails immediately.

Real-world data doesn't agree on how to spell things. Names get abbreviated, transposed, truncated, phonetically respelled, or just plain typo'd. The same person appears as "John Smith" in your CRM, "J. Smith" in billing, "jon_smith" in your support tickets, and "JOHNSMITH" in a legacy import from 2014. These are all the same human. Your == operator does not know this.
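Aggressive normalization is the usual first patch. A minimal sketch of the idea (the function name is illustrative):

```python
def normalize(name: str) -> str:
    # Lowercase and strip everything but letters and digits, so
    # "JOHNSMITH" and "John Smith" collapse to the same key.
    return "".join(ch for ch in name.lower() if ch.isalnum())
```

This catches casing and separator noise, but "jon_smith" still normalizes to a different string than "John Smith" — which is what pushes you past exact comparison entirely.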

So you reach for fuzzy matching.

The Metric Zoo

The moment you go beyond exact matching, you enter a zoo of similarity metrics, each with different strengths and failure modes.

Levenshtein distance counts the minimum number of character edits — insertions, deletions, substitutions — to turn one string into another. It's intuitive and well-understood, but it's character-blind: it treats swapping "a" for "b" the same as swapping "S" for "Z", and it has no concept of what a name actually is.
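As a reference point, the textbook dynamic-programming version fits in a few lines — a sketch for intuition; production systems use optimized C implementations:

```python
def levenshtein(a: str, b: str) -> int:
    # Row-by-row DP: prev[j] holds the edit distance between the
    # processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]
```

Note what the number hides: `levenshtein("smith", "smyth")` is 1, but so is the distance between plenty of genuinely different names.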

Jaro-Winkler was designed specifically for short strings and names. It gives extra weight to matching prefixes and is more forgiving of transpositions. It's better for names than Levenshtein, but still operates purely on surface-level character patterns. It cannot know that "Johnny" and "Jonathan" might be the same person, or that "Jon" and "John" are essentially equivalent in most Western name conventions.
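A compact, unoptimized rendering of the standard Jaro-Winkler formula — matches within a sliding window, a transposition penalty, and Winkler's prefix boost:

```python
def jaro_winkler(a: str, b: str, p: float = 0.1) -> float:
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    # Characters match if equal and within half the longer length.
    window = max(len(a), len(b)) // 2 - 1
    a_flags, b_flags = [False] * len(a), [False] * len(b)
    matches = 0
    for i, ca in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not b_flags[j] and b[j] == ca:
                a_flags[i] = b_flags[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that line up out of order.
    t, k = 0, 0
    for i, flagged in enumerate(a_flags):
        if flagged:
            while not b_flags[k]:
                k += 1
            if a[i] != b[k]:
                t += 1
            k += 1
    t //= 2
    jaro = (matches / len(a) + matches / len(b) + (matches - t) / matches) / 3
    # Winkler boost: reward a shared prefix, capped at 4 characters.
    prefix = 0
    for ca, cb in zip(a, b):
        if ca != cb or prefix == 4:
            break
        prefix += 1
    return jaro + prefix * p * (1 - jaro)
```

The classic worked example: "MARTHA" vs "MARHTA" scores about 0.961 — a single transposition barely dents the score, which is exactly the behavior you want for keyed-in names.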

Phonetic algorithms — Soundex, Metaphone, Double Metaphone — encode how a string sounds rather than how it's spelled. "Smith" and "Smyth" map to the same code. So do "Katherine" and "Kathryn" — though "Catherine" does not under Soundex, which preserves the first letter; the Metaphone family closes that gap. This helps with names significantly, but introduces its own false-positive surface: things that sound alike but aren't the same entity.
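A simplified American Soundex sketch (the full rules also treat "H" and "W" specially between same-coded consonants; this version skips that, and assumes a non-empty alphabetic input):

```python
def soundex(name: str) -> str:
    # Articulation classes: B/F/P/V -> 1, C/G/J/K/Q/S/X/Z -> 2, etc.
    codes = {ch: str(d)
             for d, group in enumerate(
                 ["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1)
             for ch in group}
    name = name.upper()
    digits = [codes.get(ch, "0") for ch in name]  # vowels/H/W/Y -> "0"
    # Collapse adjacent duplicates, then drop the zeros.
    collapsed = [digits[0]]
    for d in digits[1:]:
        if d != collapsed[-1]:
            collapsed.append(d)
    tail = "".join(d for d in collapsed[1:] if d != "0")
    # Keep the first letter, pad/truncate to a 4-character code.
    return (name[0] + tail + "000")[:4]
```

The first-letter rule is why "Catherine" (C365) and "Katherine" (K365) land in different buckets even though they sound identical.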

Token-based methods like TF-IDF treat strings as bags of tokens rather than character sequences. They're more useful for longer strings like company names or addresses than for short name fields.
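The intuition in a few lines — a toy TF-IDF with cosine similarity, which downweights ubiquitous tokens like "inc" so that the distinctive ones drive the score (a sketch; real systems use a proper vectorizer and tokenizer):

```python
import math
from collections import Counter

def tfidf_vectors(strings):
    # Whitespace tokenization, lowercase; weight = term count * idf.
    token_lists = [s.lower().split() for s in strings]
    n = len(token_lists)
    df = Counter(tok for toks in token_lists for tok in set(toks))
    idf = {t: math.log(n / d) + 1 for t, d in df.items()}
    return [{t: c * idf[t] for t, c in Counter(toks).items()}
            for toks in token_lists]

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Here "acme corporation inc" scores closer to "acme corp inc" than to "globex corporation", even though it shares a literal token with both.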

ML embeddings take a different approach entirely: train a model to place similar strings close together in a vector space, then measure distance. This can capture semantic similarity that character-level methods miss, but it requires labeled training data — and getting good labels means a human reviewing pairs of records, which brings its own costs.

No single metric is right for all fields, all domains, or all datasets. Most real-world entity resolution systems end up composing several of them.
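Composition is often just a weighted blend of per-field scores. A minimal sketch — the field names and weights below are placeholders; real systems tune or learn them from labeled pairs:

```python
def composite_score(rec_a, rec_b, scorers):
    # scorers: (field, similarity_fn, weight) triples.
    total = sum(w for _, _, w in scorers)
    return sum(w * fn(rec_a[f], rec_b[f]) for f, fn, w in scorers) / total

def exact(a, b):
    # Stand-in metric; swap in Jaro-Winkler for names,
    # a phonetic comparison, token overlap for addresses, etc.
    return 1.0 if a == b else 0.0

scorers = [("name", exact, 2.0), ("zip", exact, 1.0)]
```

The design question hiding here is the weights: how much should a name match count against a ZIP mismatch? That answer is domain-specific, which is one reason labeled pairs matter.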

The Threshold Problem

Once you have a similarity score, you need a decision threshold: above it, they match; below it, they don't.

This sounds like a detail. It isn't.

Set your threshold too high, and you have high precision but miss genuine matches — your recall suffers. Too low and you start linking records that shouldn't be linked — false positives accumulate. The relationship between these two is almost always a trade-off, and where you set the threshold depends on what's more costly for your use case: missing a true match or incorrectly merging two distinct entities.
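Measuring the trade-off is mechanical once you have a labeled sample of scored pairs — sweep candidate thresholds and watch the two numbers move against each other:

```python
def precision_recall(threshold, scored_pairs):
    # scored_pairs: (similarity, is_true_match) from a hand-labeled sample.
    tp = sum(1 for s, y in scored_pairs if s >= threshold and y)
    fp = sum(1 for s, y in scored_pairs if s >= threshold and not y)
    fn = sum(1 for s, y in scored_pairs if s < threshold and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

On a sample like `[(0.95, True), (0.90, True), (0.80, False), (0.70, True), (0.60, False)]`, lowering the threshold from 0.75 to 0.65 raises recall from 2/3 to 1.0 — at the cost of admitting more of whatever lives just below the line.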

Making it worse: the optimal threshold is usually different for different fields. The distribution of name similarity scores is not the same as the distribution of address similarity scores. A threshold tuned on one doesn't transfer cleanly to the other.

Tuning this correctly requires labeled examples, iterative testing, and a clear sense of what "good enough" means for your specific problem. That clarity is often harder to get than the engineering itself.

Scale Makes Everything Harder

Everything above assumes you're comparing a manageable number of pairs. You probably aren't.

Comparing every record against every other record is O(n²). At a million records, that's 500 billion comparisons. At 10 million, it's 50 trillion. This is not a pipeline you run before your morning standup.

The standard solution is blocking: before comparing records, group them into candidate buckets using some cheap heuristic — same first letter of last name, same ZIP code, same soundex code — and only compare within buckets. This cuts the comparison space dramatically but introduces a new failure mode: records that should match but don't share any blocking key never get compared at all. False negatives from blocking are silent and hard to detect.
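A minimal sketch of hash-based blocking — using several cheap keys gives each record more than one chance to land in a block with its true match (the key functions here are illustrative):

```python
from collections import defaultdict
from itertools import combinations

def blocked_pairs(records, key_fns):
    # Only records sharing at least one blocking key ever get compared.
    pairs = set()
    for key_fn in key_fns:
        blocks = defaultdict(list)
        for idx, rec in enumerate(records):
            blocks[key_fn(rec)].append(idx)
        for members in blocks.values():
            pairs.update(combinations(members, 2))
    return pairs
```

With a single key like "first two letters of the last token", "jane doe" never gets compared to anything in the "sm" block — which is the point, and also the silent failure mode if the key is wrong.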

Designing blocking strategies that are both selective enough to be fast and broad enough not to miss true matches is its own sub-problem, and it interacts with all the matching decisions above.

The Graph Problem No One Warned You About

Matching produces pairs. But what you usually want is clusters: all the records that refer to the same entity, grouped together.

Getting from pairs to clusters means solving a transitive closure problem. If A matches B, and B matches C, then A, B, and C are probably the same entity — even if A and C don't directly score as a match against each other. Propagating this correctly without creating false chains (where a sequence of plausible-but-wrong matches links unrelated entities into one cluster) is genuinely tricky.
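The pairs-to-clusters step is classically handled with a union-find (disjoint-set) structure; a minimal sketch:

```python
from collections import defaultdict

class UnionFind:
    """Disjoint-set forest with path halving."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def clusters(matched_pairs):
    # Transitive closure over accepted pairs: A~B and B~C put A, B, C
    # in one cluster even if (A, C) never scored as a direct match.
    uf = UnionFind()
    for a, b in matched_pairs:
        uf.union(a, b)
    groups = defaultdict(set)
    for record in uf.parent:
        groups[uf.find(record)].add(record)
    return list(groups.values())
```

Note that this merges greedily: a single bad pair is enough to chain two unrelated clusters into one, which is exactly the false-chain risk described above — one reason production systems layer review workflows on top rather than trusting closure blindly.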

And then there are the borderline cases. Two records with a score of 0.71 against a threshold of 0.70. Do you merge them? What if merging is irreversible in your downstream system? What if one of them is a high-value customer record? Suddenly you need a review workflow, confidence tiers, and a human-in-the-loop process — none of which were in the original ticket.

Why It Keeps Coming Back

Entity resolution is one of those problems that appears solved as soon as someone ships a first version, and then silently degrades as data accumulates and evolves. New sources get added with different conventions. Names change. Companies merge. Typos propagate. The model that worked eighteen months ago starts generating false positives no one notices until something downstream breaks.

This is why it shows up on so many backlogs. Not because it's unsolvable, but because it's never fully solved — it requires ongoing attention, labeled feedback, and a system that can learn from corrections over time.

What Actually Helps

Getting entity resolution right at scale requires thinking about it as a system, not a script. That means:

  • Composing multiple similarity signals rather than relying on one
  • Designing blocking strategies that balance speed with coverage
  • Building an active learning loop where human-reviewed pairs improve the model over time
  • Treating the threshold as a tunable parameter with explicit precision/recall trade-offs, not a fixed constant
  • Handling the cluster formation step explicitly, not as an afterthought

This is the design space that Zingg was built to address — an open-source framework that handles the blocking, the multi-signal matching, and the active learning loop, so your team can focus on the decisions that require domain knowledge rather than reinventing the infrastructure from scratch.

Entity resolution sits at the intersection of data quality, machine learning, and distributed systems. It's the kind of problem that reveals itself slowly: approachable at first glance, then increasingly complex the closer you look. The data engineers who've been through it tend to recognize each other by a certain look in their eyes.

If you've built something interesting in this space — or are currently staring at a threshold you've been tuning for three weeks — we'd love to hear from you.

Zingg is open source. Find us on GitHub or reach out at zingg.ai.
