Fuzzy Matching at Scale, Part 4: Thresholds, Scores, and Active Learning

Part 4 of 5. ← Part 3: Blocking · Part 5: Incremental Flow → · Series index

Parts 1–3 covered how to compare records well (multi-signal matching), how to clean the data before comparison (noise and stopwords), and how to make comparison tractable at scale (blocking). This part covers the decision layer: given a similarity score, when do you call two records a match — and how do you build the model that produces those scores in the first place without drowning in labeling work?

The Threshold Is Not a Number You Can Just Pick

Every fuzzy matching system produces a similarity score between 0 and 1. Every matching system needs a threshold: above this score, the records are a match; below it, they're not. It's tempting to pick something like 0.8 or 0.9 and be done with it.

Here's why that doesn't work.

Scores aren't universal. A score of 0.85 means something different for name similarity than for address similarity. The distribution of similarity scores varies by field type, data quality, and the specific patterns in your data. A threshold calibrated on one field is not transferable to another.

Scores aren't stable across sources. If you have two source systems with different data quality — one with clean, consistently formatted names and one with legacy data full of abbreviations and encoding artifacts — the same threshold will behave differently across them. Genuine matches from the cleaner source will score higher than genuine matches from the messier source, so a cut-off that is right for one will be wrong for the other.

The precision-recall tradeoff is asymmetric for your use case. Raise your threshold and you get fewer false positives (incorrectly merged entities) but more false negatives (missed true matches). Lower it and you get the reverse. Which direction is more costly depends on the application — for fraud detection, missing a match is catastrophic; for marketing deduplication, a few false positives are tolerable. There's no threshold that's right for everyone.
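To make the tradeoff concrete, here is a small self-contained sketch — scores and labels are invented — that sweeps a threshold over labeled pairs and reports precision and recall at each cut-off:

```python
# Illustrative only: invented (score, is_true_match) pairs.
scored_pairs = [
    (0.95, True), (0.91, True), (0.88, True), (0.84, False),
    (0.82, True), (0.79, False), (0.74, True), (0.66, False),
    (0.58, True), (0.41, False), (0.33, False), (0.12, False),
]

def precision_recall(pairs, threshold):
    predicted = [(score >= threshold, truth) for score, truth in pairs]
    tp = sum(1 for pred, truth in predicted if pred and truth)
    fp = sum(1 for pred, truth in predicted if pred and not truth)
    fn = sum(1 for pred, truth in predicted if not pred and truth)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

for t in (0.6, 0.8, 0.9):
    p, r = precision_recall(scored_pairs, t)
    print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}")
```

Running it shows precision rising and recall falling as the threshold climbs — the asymmetry above is about which end of that curve your application can afford to sit on.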

The score is not a linear function. As the Zingg scoring documentation explains: "The classifier finds the best curve to represent the data and gives a final score. This score is dependent on the individual features but is not a linear function but a curve fit." A score of 0.6 is not necessarily "60% likely to be a match" — it's a point on a learned curve shaped by your data, and what it means depends on where the mass of true matches and non-matches falls in your distribution.

How Zingg Scoring Works

Rather than applying a single global similarity function, Zingg computes multiple features per field and feeds them to a classifier. For a name field, this might include: string length difference, character-level edit distance, which specific characters differed, token overlap, phonetic similarity, and null/blank indicators. The same field generates multiple features, each capturing a different dimension of similarity.
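As an illustration of the idea — not Zingg's internal code — here is a sketch that derives several of those features from a single name comparison using only the standard library (phonetic similarity is omitted because it needs a non-stdlib codec):

```python
import difflib

def name_features(a: str, b: str) -> dict:
    """Turn one field comparison into multiple features (illustrative)."""
    a_norm, b_norm = a.lower().strip(), b.lower().strip()
    tokens_a, tokens_b = set(a_norm.split()), set(b_norm.split())
    union = tokens_a | tokens_b
    return {
        "length_diff": abs(len(a_norm) - len(b_norm)),
        # SequenceMatcher.ratio stands in for a character-level similarity
        "char_similarity": difflib.SequenceMatcher(None, a_norm, b_norm).ratio(),
        "token_overlap": len(tokens_a & tokens_b) / len(union) if union else 0.0,
        "first_char_match": a_norm[:1] == b_norm[:1],
        "either_blank": not a_norm or not b_norm,
    }

print(name_features("John Smith", "J. Smith"))
```

The point is that "John Smith" vs "J. Smith" is weak on character similarity but strong on token overlap and first-character agreement — exactly the kind of multi-dimensional signal a classifier can weigh, where a single string distance would have to average it away.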

The classifier — trained on your labeled pairs — learns which combination of these features predicts a true match in your specific data. This is why the model is specific to your data: the weight it assigns to, say, token overlap vs. character edit distance will reflect the actual patterns in your dataset, not a generic assumption.

A few practical notes from the scoring docs that matter in production:

Matching is transitive. If record A matches record B, and record B matches record C, Zingg puts A, B, and C in the same cluster — even if A and C don't directly match each other. This is the correct behavior for entity resolution (they all represent the same real-world entity), but it means your cluster can contain pairs with varying confidence levels. A cluster of five records might contain one high-confidence core match and a peripheral record that joined via transitivity.
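Transitive grouping is the classic union-find construction: every matched pair merges two sets, and the final sets are the clusters. A minimal sketch (record IDs invented):

```python
def cluster(pairs):
    """Merge matched pairs into transitive clusters via union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    for a, b in pairs:
        union(a, b)
    clusters = {}
    for node in parent:
        clusters.setdefault(find(node), set()).add(node)
    return list(clusters.values())

# A matches B and B matches C — no direct A–C comparison needed.
print(cluster([("A", "B"), ("B", "C"), ("D", "E")]))
```

A and C end up in the same cluster purely via B, which is the mechanism behind the "peripheral record that joined via transitivity" caveat above.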

Low-scoring clusters deserve attention. The docs recommend inspecting clusters with z_minScore of 0 — these are clusters where the lowest-scoring pair within the cluster scored at the bottom of the distribution. They're most likely to contain errors. For the same reason, very large clusters (above 4–5 members) warrant a review pass, especially early in your deployment.
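A review filter along those lines might look like the following sketch. The z_cluster and z_minScore field names follow Zingg's output conventions; the records and the size cut-off of 5 are invented for illustration:

```python
# Hypothetical cluster output rows (invented data).
clusters = [
    {"z_cluster": 1, "members": ["r1", "r2"], "z_minScore": 0.93},
    {"z_cluster": 2, "members": ["r3", "r4", "r5"], "z_minScore": 0.0},
    {"z_cluster": 3, "members": [f"r{i}" for i in range(6, 13)], "z_minScore": 0.71},
]

def needs_review(c, max_size=5):
    # Flag clusters whose weakest internal pair scored 0,
    # and clusters that grew suspiciously large.
    return c["z_minScore"] == 0 or len(c["members"]) > max_size

flagged = [c["z_cluster"] for c in clusters if needs_review(c)]
print(flagged)
```

Here cluster 2 is flagged for its zero minimum score and cluster 3 for its size; cluster 1 passes untouched.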

The threshold is automatically chosen. Zingg optimizes the threshold to balance accuracy and recall rather than requiring you to set it manually: "Unlike traditional systems, we do not want the user to worry too much about the cut-off." This doesn't mean you never examine it — but it means you're not expected to hand-tune it from first principles.

The Training Data Bottleneck

All of the above — the multi-signal comparison, the learned blocking, the calibrated scoring — depends on labeled training data: pairs of records that a human has confirmed as match or non-match. This is where most teams get stuck.

Getting labeled pairs is slow and expensive if done naively. If you ask a data analyst to manually review 5,000 record pairs selected at random, you'll spend a week getting labels that are mostly redundant (obvious matches and obvious non-matches) and miss the edge cases that actually matter for the model.

The specific pairs that most improve a model's performance are the ones near the decision boundary — records that are similar enough to be plausible matches but where the correct label is genuinely uncertain. Random sampling rarely surfaces these efficiently.

Active learning solves this by inverting the process. Instead of presenting you with random record pairs and asking you to label them, the system identifies which record pairs would be most informative for the model — specifically, the pairs it's currently most uncertain about — and presents those for labeling.

This is how Zingg's findTrainingData phase works. It samples candidate pairs from the blocking output, prioritizing pairs where the model's current confidence is lowest. The label phase then presents these to a human reviewer with a simple interface:

Record 1: John Smith, 12 Oak St, Manchester, M1 1AA
Record 2: J. Smith, 12 Oak Street, Manchester, M11AA

Are these the same person? [yes / no / can't say]

You run findTrainingData and label iteratively — typically three to five rounds — until the model's confidence stabilizes. The step-by-step guide suggests that 30–50 labeled matching pairs are usually sufficient to train to useful accuracy.

That's a small number. It's small because active learning is extremely efficient at selecting the pairs that maximize information gain. Each labeled pair teaches the model something it didn't know, rather than confirming what it already knew.
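The selection step can be pictured as uncertainty sampling: surface the pairs whose current score sits closest to the middle of the range. This is a schematic sketch, not Zingg's implementation — the scores here are invented, and a real system would recompute them after each labeling round:

```python
def most_uncertain(candidates, score_fn, batch_size=5):
    """Pick the pairs whose model score is closest to 0.5 (least certain)."""
    return sorted(candidates, key=lambda pair: abs(score_fn(pair) - 0.5))[:batch_size]

# Toy setup: pretend each candidate pair already has a model score.
scores = {("a1", "b1"): 0.97, ("a2", "b2"): 0.52, ("a3", "b3"): 0.49,
          ("a4", "b4"): 0.08, ("a5", "b5"): 0.61}

batch = most_uncertain(list(scores), scores.get, batch_size=2)
print(batch)
```

The near-certain matches (0.97) and near-certain non-matches (0.08) are skipped; the reviewer's time goes to the 0.49 and 0.52 pairs, where a label actually moves the model.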

What Good Labels Look Like

The quality of your training data determines the quality of your model. A few guidelines that matter in practice:

Include non-matches as well as matches. The model needs to learn both the positive and negative cases. If all your labels are "yes, these match", the model learns nothing about what non-matches look like. Zingg's findTrainingData selects pairs at various similarity levels specifically to surface both.

Include edge cases, not just easy cases. If you only label obvious matches and obvious non-matches, the model will have poor calibration near the decision boundary — exactly where the hard decisions are. The findTrainingData phase is designed to surface these edge cases, but it only works if you actually label them rather than skipping them.

Label consistently across your team. If multiple people label the same kinds of pairs differently, the model receives contradictory signal and its accuracy suffers. Establishing clear labeling guidelines before you start — especially for genuinely ambiguous cases — is worth the upfront investment.

Use "can't say" genuinely. The can't say option exists for cases where you genuinely cannot determine the correct label from the available fields. Using it correctly — rather than defaulting to "no" when uncertain — gives the model accurate information about its own uncertainty.

What This Looks Like in Production: Two Case Studies

Canadian Football League — Fan 360 Across Nine Teams

The CFL needed to unify fan data across nine separate teams, each with its own ticketing, email, and e-commerce systems. The same fan might appear in multiple systems with different name formats, different email addresses (one registered at a personal Gmail, another at a work account), and different IDs entirely.

The labeling challenge was that the CFL's data team needed a model that generalized across all nine teams' data — which had different data quality characteristics, different naming conventions, and different null rates. The active learning approach let them build a training set from pairs sampled across the full dataset, ensuring the model encountered the variation present in all nine systems. The result was a fan 360 view that consolidated 10–15% of records into unified profiles, directly enabling targeted marketing segmentation they previously couldn't attempt.

Fortnum & Mason — Omnichannel Customer Identity

Fortnum & Mason had customer data across restaurant bookings, email sign-ups, online orders, phone orders, and in-store transactions — five distinct systems with different field schemas and different data entry conventions. A previous third-party identity resolution vendor had produced non-persistent identifiers and limited control over the matching logic.

The active learning approach let them build a model on their specific data — data that included the particular ways Fortnum & Mason customers varied across channels — rather than a generic model trained on someone else's data. The case study notes that they were for the first time able to understand how individual customers were shopping across all channels: online, in-store, by phone, and in restaurants. That capability fed directly into a new membership program and a personalization strategy that required knowing whether a restaurant guest and an online buyer were the same person.

Both examples illustrate the same point: the value of active learning isn't just efficiency. It's that the model learns from your data, capturing the specific patterns of variation in your systems — patterns that a generic pre-trained model or a hand-tuned threshold would miss.

Interpreting Scores in Downstream Systems

One practical question that comes up in production: what do you do with the match output? Specifically, how do you handle records at different confidence levels differently?

A few patterns that work well:

High-confidence matches (well above the threshold, large z_minScore in the cluster) — route directly to the golden record or identity graph without human review. These are the easy cases.

Mid-confidence matches — route to a review queue. These are the cases where the model has less certainty, and a human reviewer can add signal the model couldn't infer from the data.

Low-confidence or unusual clusters — flag for inspection. Clusters with z_minScore of 0, very large clusters, or clusters that formed entirely via transitivity (no high-confidence anchor pair) deserve a look before being trusted.

This tiered approach lets you automate the high-volume easy cases while focusing human attention where it adds the most value. The score output Zingg provides per cluster — including z_minScore across the cluster's pairs — gives you the signal to implement this routing.
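One way to sketch that routing in code — the 0.9 and 0.6 cut points and the size limit are invented here; in practice they come from inspecting your own score distribution:

```python
def route(cluster, high=0.9, low=0.6, max_auto_size=5):
    """Tiered routing of a cluster by its weakest internal pair score."""
    score = cluster["z_minScore"]
    if score >= high and len(cluster["members"]) <= max_auto_size:
        return "auto_merge"    # high confidence: straight to the golden record
    if score >= low:
        return "review_queue"  # mid confidence: a human adds signal
    return "inspect"           # low confidence or unusual: hold before trusting

print(route({"z_minScore": 0.95, "members": ["r1", "r2"]}))
print(route({"z_minScore": 0.72, "members": ["r3", "r4"]}))
print(route({"z_minScore": 0.0,  "members": ["r5", "r6"]}))
```

Note that a high-scoring but oversized cluster still falls through to the review tier — combining the score signal with the cluster-size caution from earlier in this part.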

Up next: Part 5 — The Hardest Part: Incremental Flow and Living Clusters
