Part 5 of 5.
Parts 1–4 covered how to build a fuzzy matching system that works: multi-signal comparison, noise removal, learned blocking, and active learning to calibrate the model. This final part covers what happens after it works — what happens when data changes.
This is the problem most teams don't fully anticipate. You build a matching system, it runs successfully on your initial dataset, and you ship it. Then new records arrive. Then existing records get updated. Then you discover that two entities you thought were separate are actually the same person. Then someone from compliance asks what happened to a cluster that existed last month.
The incremental problem — keeping a live, production identity graph consistent as data evolves — is arguably harder than the matching problem itself. It's the part that separates a proof of concept from a system a business can rely on. And it's the part where most in-house builds eventually fail quietly.
The obvious approach to keeping a matching system current is to rerun the full match job on a schedule — nightly, weekly. On small datasets, this is fine. At scale, it has two serious problems.
The cost problem is the one people notice first. Reprocessing 80 million records every night — even with efficient blocking — is expensive and slow. At some dataset size, a nightly full rerun simply doesn't finish before the next one starts. The matching job becomes the bottleneck for the entire data pipeline.
The lineage problem is subtler and more damaging. When you rerun from scratch, the cluster IDs you assigned on Monday bear no relationship to the IDs you assign on Tuesday. Every downstream system that references those IDs — your CRM integration, your data warehouse's golden record table, your fraud models, your reporting dashboards — is now pointing at stale identifiers. There's no automatic way to reconcile "cluster 1001 from Monday" with "cluster 2734 from Tuesday" if the records shifted around.
The first time this breaks something in production, teams usually respond with one of two workarounds: making the cluster IDs a hash of the canonical record's primary key (which still breaks when clusters merge or split), or writing custom reconciliation logic after every run (which is maintenance debt that compounds indefinitely).
What you actually need is an incremental flow: process only the new and changed records, integrate them into the existing resolved state, update cluster membership accordingly, and preserve the identifiers that everything downstream depends on.
A new record arrives in your incremental batch. It matches an existing cluster — a new CRM entry for a customer already present in your billing system. The system should assign the new record the same cluster ID as the existing entity and update the resolved view.
This is the easy case. Most incremental systems handle it. You index the new record against the existing clusters, find the best match, assign the existing cluster ID if the match is above threshold, or create a new cluster if it isn't.
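The assign-or-create step can be sketched in a few lines of Python. This is a toy illustration, not Zingg's implementation: the Jaccard scorer, the threshold value, and the in-memory cluster structure are all stand-ins for the trained model and the calibrated threshold from Part 4.

```python
import uuid

MATCH_THRESHOLD = 0.5  # hypothetical; in practice this comes from calibration

def pair_score(a, b):
    """Toy stand-in for the trained model: Jaccard similarity over
    whitespace tokens of the concatenated field values."""
    ta = set(" ".join(a.values()).lower().split())
    tb = set(" ".join(b.values()).lower().split())
    return len(ta & tb) / len(ta | tb)

def assign_incremental(record, clusters):
    """Match a new record against existing clusters; reuse the existing
    cluster ID on a match, otherwise mint a new one."""
    def cluster_score(c):
        # Cluster score = best pairwise score against its members.
        return max(pair_score(record, m) for m in c["members"])
    best = max(clusters, key=cluster_score, default=None)
    if best is not None and cluster_score(best) >= MATCH_THRESHOLD:
        best["members"].append(record)
        return best["id"]            # existing entity keeps its ID
    fresh = {"id": str(uuid.uuid4()), "members": [record]}
    clusters.append(fresh)
    return fresh["id"]               # genuinely new entity gets a new ID

clusters = [{"id": "z-001", "members": [
    {"fname": "John", "lname": "Smith", "city": "Manchester"}]}]
incoming = {"fname": "John", "lname": "Smith", "city": "Manchester"}
print(assign_incremental(incoming, clusters))  # → z-001
```

The essential property is in the return values: a match reuses the identifier downstream systems already hold, and only a genuinely new entity mints a new one.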
A record that was previously matched gets updated in the source system. A corrected email address. A changed surname. A phone number added. The updated record now presents new signals. If those signals make it a better match for a different cluster than the one it was originally assigned to, the record needs to be reassigned.
This is where most append-only incremental systems break down. They add new records correctly but never revisit previous assignments. The result is stale cluster membership that silently accumulates over time: records that belong in cluster A but are still sitting in cluster B because they were initially matched before their data was corrected.
Zingg's incremental phase handles this explicitly. From the documentation: "If a record gets updated and Zingg Enterprise discovers that it is a more suitable match with another cluster, it will be reassigned." The system doesn't treat incremental processing as append-only — it re-evaluates updated records against the full existing cluster state.
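The re-evaluation step can be sketched the same way. Again a toy, with a hypothetical email-equality scorer: the point it illustrates is that an updated record is scored against every cluster, including the one it currently sits in, rather than left where it was.

```python
def best_cluster(record, clusters, score):
    """Score the record against every non-empty cluster; return (id, score)."""
    scored = [(c["id"], max(score(record, m) for m in c["members"]))
              for c in clusters if c["members"]]
    return max(scored, key=lambda t: t[1])

def reevaluate(updated, old_cluster_id, clusters, score, threshold=0.5):
    """Pull the updated record out of its old cluster, rescore it against
    all clusters, and move it to wherever it now fits best."""
    for c in clusters:
        if c["id"] == old_cluster_id:
            c["members"] = [m for m in c["members"]
                            if m["recId"] != updated["recId"]]
    target_id, s = best_cluster(updated, clusters, score)
    if s >= threshold:
        target = next(c for c in clusters if c["id"] == target_id)
        target["members"].append(updated)
        return target_id          # may differ from old_cluster_id: reassignment
    clusters.append({"id": "new-" + updated["recId"], "members": [updated]})
    return "new-" + updated["recId"]

# Toy scorer: exact email match is decisive (stand-in for the model).
score = lambda a, b: 1.0 if a["email"] == b["email"] else 0.0

clusters = [
    {"id": "A", "members": [{"recId": "1", "email": "jsmith@acme.com"}]},
    {"id": "B", "members": [{"recId": "2", "email": "js@old.com"},
                            {"recId": "3", "email": "js@old.com"}]},
]
# Record 3's email gets corrected; it now belongs with cluster A.
corrected = {"recId": "3", "email": "jsmith@acme.com"}
print(reevaluate(corrected, "B", clusters, score))  # → A
```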
The incremental phase is configured with a reference to the base configuration and the new data batch:
```json
{
  "config": "config.json",
  "incrementalData": [{
    "name": "customers_incr",
    "format": "csv",
    "props": { "path": "test-incr.csv", "delimiter": ",", "header": false },
    "schema": "recId string, fname string, lname string, stNo string, add1 string, add2 string, city string, state string, areacode string, dob string, ssn string"
  }],
  "outputTmp": {
    "name": "customers_incr_temp",
    "format": "csv",
    "props": { "location": "/tmp/zinggOutput_febrl_tmp", "delimiter": ",", "header": true }
  }
}
```
Run as:

```bash
./scripts/zingg.sh --phase runIncremental --conf incrementalConf.json
```

This is the most complex case — the one most teams don't anticipate until it bites them — and it requires careful handling of the downstream consequences.
Cluster merge happens when a new or updated record acts as a bridge between two clusters that were previously separate.
Concretely: cluster A contains records for "J. Smith, 12 Oak St, Manchester" and cluster B contains records for "John Smith, 12 Oak Avenue, Manchester". They didn't match each other directly — the address formats were too different to score above threshold. Then a new record arrives: "John Smith, j.smith@acme.com, 12 Oak St, Manchester". This record matches both A (same address, similar name) and B (same full name, same city). Transitivity now demands that A and B are the same entity. The two clusters must merge into one.
This is not just a relabeling operation. It means that every downstream system holding a reference to cluster A's ID or cluster B's ID now needs to converge on a single canonical ID. Depending on your architecture, this propagates through golden record tables, reporting aggregations, BI dashboards, CRM syncs, and operational systems. The merge must be handled atomically and its effects propagated correctly — or you end up with a split view of the same entity across different parts of your stack.
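Transitive merging is classically implemented with a union-find (disjoint-set) structure. The sketch below shows the mechanics of the bridge-record scenario, without claiming this is how Zingg implements it internally:

```python
class DSU:
    """Minimal union-find: realizes transitivity over pairwise matches."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            # Path halving keeps lookups near-constant time.
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

dsu = DSU()
dsu.union("a1", "a2")   # cluster A: "J. Smith, 12 Oak St"
dsu.union("b1", "b2")   # cluster B: "John Smith, 12 Oak Avenue"
assert dsu.find("a1") != dsu.find("b1")   # still separate entities

# The new record scores above threshold against BOTH clusters,
# so transitivity forces A and B into a single cluster.
dsu.union("bridge", "a1")
dsu.union("bridge", "b1")
assert dsu.find("a1") == dsu.find("b1")   # A and B have merged
```

Note that union-find only handles the merge direction; a split (unmerge) requires re-clustering the affected records, which is part of why the lifecycle is harder than it first appears.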
Cluster unmerge (or split) is the reverse. An update reveals that two records previously grouped together don't actually belong in the same cluster. Perhaps a name was corrected, revealing that two entries were different people with similar names at the same address. The cluster splits into two, and each sub-cluster needs its own stable identifier going forward.
Both merge and unmerge are automatic in Zingg's incremental flow. From the incremental flow product page: "Cluster assignment, merge, and unmerge happens automatically in the flow." Poorly populated records get resolved on limited signals, new information enriches those records, and Zingg updates the resolved identities accordingly — merging separate clusters when new signals connect them, splitting clusters when new data contradicts old assumptions.
In any production matching system, humans review cluster assignments. A data steward confirms that two records belong to the same entity. An analyst flags a cluster as incorrectly merged. These decisions represent ground truth, reached through context and signals the model couldn't see.
This creates a direct conflict with an incremental system that continuously re-evaluates cluster membership. If your model re-evaluates a record pair that a human previously marked as a non-match and the similarity score has increased (because the record got updated), a naive system will re-merge them. The human's decision is silently overridden, and trust in the system erodes once reviewers see their corrections undone.
Zingg's incremental flow preserves human decisions as hard constraints: the system "takes care of human feedback on previously matched data to ensure that it does not override the approved records." Manual approvals (confirmed matches) and explicit separations (confirmed non-matches) are treated as fixed points that incremental runs must respect, not soft preferences the model can override when it gains confidence.
This is the correct behavior for a system that humans are expected to trust and act on. The model's job is to make decisions where humans haven't yet weighed in — not to second-guess decisions where they have.
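A minimal sketch of the constraint logic, assuming a hypothetical store of human decisions keyed by record-ID pairs. The names `CANNOT_LINK` and `MUST_LINK` are illustrative, not Zingg API; the point is the precedence order, where human feedback is checked before the model score is even consulted.

```python
# Hypothetical store of human decisions, keyed by sorted record-ID pairs.
CANNOT_LINK = {("r17", "r42")}   # reviewer explicitly separated these
MUST_LINK = {("r03", "r11")}     # reviewer explicitly confirmed these

def pair_key(a, b):
    return tuple(sorted((a, b)))

def decide(a, b, model_score, threshold=0.5):
    """Human feedback wins over the model: confirmed non-matches stay
    apart no matter how high the score climbs, and confirmed matches
    stay together no matter how low it drops."""
    if pair_key(a, b) in CANNOT_LINK:
        return False
    if pair_key(a, b) in MUST_LINK:
        return True
    return model_score >= threshold

print(decide("r17", "r42", model_score=0.99))  # → False (human said no)
print(decide("r03", "r11", model_score=0.20))  # → True  (human said yes)
```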
All of the above only works if each resolved entity has a durable, stable identifier — one that survives incremental runs, cluster merges, record updates, and model improvements.
This is the role of the ZINGG_ID: a globally unique, persistent identifier assigned to each resolved cluster and carried through each of the incremental scenarios above: new records, updates, merges, and splits.
The key property is persistence: downstream systems anchor on a ZINGG_ID and trust that it will keep pointing to the same real-world entity even as the underlying records change. Without this guarantee, every incremental run risks breaking the foreign keys, dashboard filters, and integration mappings that depend on entity identity.
There is one more scenario that every production team eventually faces: you want to improve the matching model itself.
You've gathered more labeled data and want to retrain. You want to add nickname support. You've learned that a particular field has high null rates and want to adjust the blocking strategy. You're migrating from Spark to Snowflake. In each case, you rerun the full dataset through the new model — and the cluster assignments shift. Records that were separate are now together. Records that were together are now split differently. If this causes a wholesale reissuance of ZINGG_IDs, everything downstream breaks.
The reassignZinggId phase addresses this directly. It takes two inputs: the original matched output, carrying the ZINGG_IDs that downstream systems already reference, and the new output produced by the updated model. It then compares them by primary key overlap: for each cluster in the new output, it finds the cluster in the original output that shares the most records (by primary key), and assigns that cluster's original ZINGG_ID. Only clusters that have no counterpart in the original output receive new IDs.
Run as:

```bash
./scripts/zingg.sh --phase reassignZinggId \
  --conf examples/febrl/sparkIncremental/configReassign5M.json \
  --originalZinggId examples/febrl5M/config.json \
  --properties-file config/zingg.conf
```
The effect: a model upgrade causes minimal downstream disruption. Systems that already know about an entity continue to find it under the same identifier. Only genuinely new entities — clusters that have no counterpart in the original output — get new IDs. The ID space evolves incrementally rather than wholesale.
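The overlap-based mapping can be sketched as follows. This is a simplification: it greedily picks the best-overlapping original cluster for each new cluster and glosses over tie-breaking and one-to-one assignment, so treat it as the idea rather than the actual phase.

```python
from collections import Counter

def reassign_ids(old_clusters, new_clusters, fresh_ids):
    """For each cluster in the new output, keep the ZINGG_ID of the old
    cluster sharing the most primary keys; mint fresh IDs otherwise.
    `fresh_ids` is a hypothetical generator of new identifiers."""
    # Invert old clustering: primary key -> old ZINGG_ID.
    old_id_of = {rec: cid for cid, recs in old_clusters.items() for rec in recs}
    assigned = {}
    for new_cid, recs in new_clusters.items():
        overlap = Counter(old_id_of[r] for r in recs if r in old_id_of)
        assigned[new_cid] = (overlap.most_common(1)[0][0] if overlap
                             else next(fresh_ids))
    return assigned

old = {"z-100": {"r1", "r2", "r3"}, "z-200": {"r4", "r5"}}
new = {"c-a": {"r1", "r2", "r3", "r4"},   # mostly old z-100
       "c-b": {"r5"},                     # remnant of z-200
       "c-c": {"r9"}}                     # genuinely new entity
print(reassign_ids(old, new, iter(["z-300"])))
# → {'c-a': 'z-100', 'c-b': 'z-200', 'c-c': 'z-300'}
```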
The same mechanism covers any full reprocess: model retraining, configuration changes, or a platform migration. Rerun the full dataset through the new setup, then run reassignZinggId to preserve the IDs.

Putting it all together, a production incremental flow looks like this:
```text
Initial full match run
  → assigns ZINGG_IDs to all initial clusters
  → downstream systems begin using these IDs

Periodic incremental runs (daily / hourly / streaming)
  → new records matched against existing clusters
  → updated records re-evaluated, reassigned if needed
  → clusters merged where new signals connect previously separate groups
  → clusters split where new data contradicts old assumptions
  → human feedback preserved as hard constraints throughout
  → ZINGG_IDs updated minimally (new IDs only for new entities)

Occasional model improvement cycle
  → retrain on expanded labeled data
  → run full dataset through new model
  → reassignZinggId maps new clusters to original IDs
  → downstream systems see minimal disruption
```
This is what a living identity graph looks like. Not a static snapshot run weekly and reconciled manually, but a continuously maintained view of entity identity that reflects the current state of your data, respects human decisions, and provides stable identifiers that the rest of your stack can depend on.
Teams who build fuzzy matching in-house typically focus on the matching problem: the algorithms, the blocking, the threshold. They get the initial matching working and ship it. Then the data changes. They run a full reprocess. IDs change. Things break downstream. They add reconciliation logic. It grows. Eventually, the reconciliation code is harder to maintain than the matching code.
The incremental problem — and specifically the cluster lifecycle problem (merge, split, reassignment, human feedback, ID stability) — is the part that requires treating entity resolution as infrastructure, not just a batch job. It requires a system designed from the start to manage the full lifecycle of a cluster: its creation, its evolution as new data arrives, its merging with other clusters, its occasional splitting, and its stable identity as seen by downstream systems.
That's the difference between a proof of concept and a production system. The matching accuracy is the proof of concept. The incremental lifecycle is the production system.
This series has traced the arc that most production fuzzy matching systems follow: multi-signal comparison, noise removal, learned blocking, active learning to calibrate the model, and finally the incremental lifecycle that keeps the resolved graph alive.
Each layer compounds on the previous ones. Getting all five right is what separates a fuzzy matching system that works in a demo from one that works in production — accurately, at scale, and sustainably over time.