The What and Why of Entity Resolution

April 9, 2026

What Is Entity Resolution?

Entity resolution is the process of identifying records across one or more data sources that refer to the same real-world entity — and linking them together. The entity could be a customer, a patient, a supplier, a product, a company, a drug compound, or any other object your business needs to reason about.

The challenge is that real-world data is not clean. The same customer might appear as "Jon Smith" in your CRM, "Jonathan A. Smith" in your billing system, and "J. Smith" in your loyalty program. No shared identifier ties them together. A database join finds no match. But a human — or a well-designed entity resolution system — recognizes all three as the same person.

Here are two variant customer records that illustrate the problem:


Field     Record 1              Record 2
Name      Mrs. Kamala Harris    Harris Kamala
Address   350 Fifth Ave, NY     350 5th Avenue, New York
Phone     (212) 736-3100        212-736-3100
DOB       20/10/1964            Oct 20, 1964

Names are divided into first and last, or combined, or reversed. Salutations appear or disappear. Phone numbers are formatted differently. Addresses use abbreviations. Dates are represented in different formats. Both records describe the same person. A computer evaluating field equality would see two different people.
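To make the field-equality problem concrete, here is a minimal Python sketch that compares the two records above, first with plain equality and then after some simple normalization. The normalizer functions, the salutation list, and the two date formats are assumptions chosen for this one example; the address field is left out because it would need abbreviation tables ("Ave" versus "Avenue", "NY" versus "New York") that are beyond a short sketch.

```python
import re
from datetime import datetime

record_1 = {"name": "Mrs. Kamala Harris", "phone": "(212) 736-3100", "dob": "20/10/1964"}
record_2 = {"name": "Harris Kamala", "phone": "212-736-3100", "dob": "Oct 20, 1964"}

# Illustrative salutation list, not exhaustive.
SALUTATIONS = {"mr", "mrs", "ms", "dr"}


def normalize_name(name):
    """Lowercase, drop salutations, and ignore token order."""
    tokens = re.findall(r"[a-z]+", name.lower())
    return frozenset(t for t in tokens if t not in SALUTATIONS)


def normalize_phone(phone):
    """Keep digits only."""
    return re.sub(r"\D", "", phone)


def normalize_dob(dob):
    """Try the two date formats seen in the example records."""
    for fmt in ("%d/%m/%Y", "%b %d, %Y"):
        try:
            return datetime.strptime(dob, fmt).date()
        except ValueError:
            continue
    return None


NORMALIZERS = {"name": normalize_name, "phone": normalize_phone, "dob": normalize_dob}

# Plain equality: every field differs.
exact_equal = {f: record_1[f] == record_2[f] for f in NORMALIZERS}
# After normalization: every field agrees.
normalized_equal = {f: fn(record_1[f]) == fn(record_2[f]) for f, fn in NORMALIZERS.items()}

print(exact_equal)
print(normalized_equal)
```

Hand-written normalizers like these only cover the variations someone thought of in advance, which is the limitation the rest of this article returns to.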

Entity resolution is also known by several related terms: record linkage when connecting records across separate sources, deduplication when removing duplicates within a single source, fuzzy matching when referring to the matching technique itself, and identity resolution specifically when the entity is a person. When used in the context of corporate entities, it is often called company name matching or organization deduplication.

Why Data Gets Fragmented

The fragmentation is structural, not accidental. Every organization accumulates specialized systems built to serve specific operational purposes — a CRM, a billing platform, an e-commerce store, a physical POS, a support ticketing system. Each records entities in the way that makes sense for its own purpose and its own team.

A company grows. A business line builds its own systems. A regional office operates under different regulatory requirements. An acquisition brings an entirely different technology stack. Partners supply data in their own formats. The result: the same customer, supplier, or product exists differently in every system. There is no single authoritative record. There is no shared identifier that travels across all of them.

Companies use entity resolution to connect these disparate data sources, clean the data, see non-obvious relationships across silos, and build a unified view of their core business entities. In doing so, they build what is variously called a master data management system (MDM), a single source of truth, or — in the context of customers specifically — a Customer 360.

Industry Use Cases

Life Sciences and Healthcare

The life sciences and healthcare industries arguably have the highest stakes for entity resolution of any sector. The consequences of fragmented patient data are not just operational — they are clinical.

Healthcare systems require a connected, comprehensive picture of every patient's medical journey. They use a real-time 360° view of their patients to provide personalized, intelligent medical recommendations, improve the patient experience, and build better treatment protocols. When a patient's records are fragmented across hospitals, clinics, insurance providers, and pharmacies, a care provider making a treatment decision is working with incomplete information. An allergic reaction documented at one facility may not be visible to a provider at another. An AI care coordinator reasoning about an incomplete patient record may recommend the wrong intervention.

Pharmaceutical companies face the same challenge in research and analytics. They need to consolidate biological and medicinal data from studies conducted by scientists in global labs — compounds and reactions across thousands of spreadsheets with millions of rows, all with different column names, formats, and inconsistencies (insulin glargine versus insulin-glargine, for example). Fuzzy matching reconciles these variations so the underlying data can be compared and analyzed.

At the commercial level, pharmaceutical companies need a single view of all their purchasers — hospitals and clinics — to understand who needs new supplies, manage outstanding accounts, and analyze prescribing patterns across regions. When different systems record the same hospital with different name formats and address abbreviations, this analysis is impossible without entity resolution.

The life sciences and healthcare industries also have complex compliance requirements. Previously, organizations spent extraordinary effort aggregating data from multiple systems to prepare regulatory submissions. Entity resolution across hospitals, internal records, third-party sources, and partner data enables continuous regulatory readiness rather than periodic scrambles.

Redica Systems, a life sciences data company serving the pharmaceutical and MedTech industries, used Zingg to unify over 10 million records from global health agencies and regulatory bodies. Their global regulatory intelligence platform depends on accurate entity resolution to help clients make compliance and vendor risk decisions.

Insurance

Insurance companies operate multiple product lines — car, health, home, life — each with their own systems, each maintaining their own record of the customer. The result is that the same person exists differently across every product system in the insurer's portfolio.

Without a unified customer view, the insurer cannot reconcile a customer's demographics, preferences, past policies, and credit ratings. Personalized marketing is impossible. Cross-selling opportunities are missed. Risk exposure cannot be accurately calculated at the individual or household level.

The consequences extend to both revenue and compliance. If an insurance provider doesn't know that a new applicant for health insurance is already a policyholder under a different product, they cannot offer a combined scheme or appropriately price the risk. If a fraudster who committed insurance fraud under one name applies under a slightly different name, a fragmented system treats them as a new applicant.

Fraud detection is one of the largest entity resolution use cases in insurance. Matching records across claims, policies, and applicant histories — despite name variations, address changes, and partial identifiers — is what makes pattern detection possible.

Claims management, policy management, and regulatory compliance also depend on the ability to consolidate policy and regulatory data across systems. The 1-10-100 rule of data quality applies acutely here: it costs $1 to verify a record as it is entered, $10 to cleanse and deduplicate it later, and $100 or more if nothing is done and the errors propagate into decisions.

Manufacturing

Manufacturing companies deal with entity fragmentation at the supply chain level. Parts, materials, and suppliers are recorded differently across business units, regions, procurement systems, and ERP instances.

With entity resolution across supplier data, manufacturers can analyze expenditure across large geographies, identify the most cost-effective suppliers for each category, and remove duplicate supplier records that create pricing inconsistencies and procurement inefficiencies. If a key supplier's delivery is disrupted, a well-maintained supplier master enables rapid identification of alternatives. Without it, procurement teams may not even know they have alternative suppliers in other divisions.

Product catalog reconciliation is another major use case. Multiple entries for the same product — with different descriptions, units of measure, or pricing — cause confusion across procurement, inventory, and customer-facing systems. An organization that shows a different price for the same product in different channels has a master data problem at its root.

Entity resolution enables the merging of product catalogs and price lists across business units and geographies, driving pricing consistency and accurate inventory management.

Financial Services

Financial services organizations face entity resolution challenges in two directions: customer-facing and compliance-driven.

Customer-facing, the challenge is building a unified portfolio view. Client data entered by different relationship managers — with the inevitable typos, abbreviations, and formatting variations — must be reconciled so that a complete picture of each client's relationship with the firm is available. Without it, customer service suffers, cross-sell opportunities are missed, and the quality of advice declines because advisors cannot see the full picture.

Personal credit ratings and financial stability assessments depend on connecting KYC data with customer investments and transaction histories across multiple systems. When the same person appears differently in different systems, these assessments are incomplete.

Anti-Money Laundering (AML) is one of the most critical entity resolution use cases in financial services. AML compliance requires not just identifying who a customer is, but mapping how specific customers are connected to higher-risk individuals and legal entities. This is impossible when entity identity is fragmented. Fraudsters frequently open accounts with slight variations in their details — a transposed digit, a different spelling — specifically to defeat identity-based detection systems. Entity resolution is what makes these patterns visible.

Know Your Customer (KYC) requirements similarly depend on reliable entity identity: the ability to confirm that a new applicant is who they claim to be, and that they have not appeared previously under a different identity.

Energy, Oil, and Mining

Global energy, mining, and oil companies need to unify data from multiple sources, geographies, and engineering disciplines to manage operations efficiently. Parts, materials, and equipment are cataloged differently across facilities, regions, and enterprise systems.

Entity resolution enables these organizations to share inventory across divisions, identify the best supplier for a given part across all procurement systems, and drive pricing consistency. An organization that does not know it has the same part under three different names in three different systems cannot optimize its inventory or negotiate effectively with suppliers.

Regulatory compliance in energy is also data-intensive, with reporting requirements that span internal records, partner data, and third-party sources. Entity resolution provides the unified data foundation that makes efficient regulatory reporting possible.

Sales and Marketing

The sales and marketing use case for entity resolution is the one most organizations encounter first, because its absence is immediately visible in customer experience failures.

Recommendations and personalized marketing depend on a complete view of the customer. When customer data is fragmented across CRM, e-commerce, offline stores, loyalty programs, and marketing platforms, personalization is generic, campaign attribution is wrong, and lifetime value calculations are inaccurate.

The failure modes are concrete. A customer who is listed more than once across systems, because of a typo in their phone number or a different spelling of their name, will receive duplicate outreach. Duplicate emails are a missed sales opportunity. Duplicate direct mail is wasted spend. A customer who receives the same "welcome back" offer three times in one week because they appear as three separate records is experiencing the brand as broken.

Customer deduplication in CRMs is one of the most operationally impactful entity resolution use cases. When multiple salespersons have created separate records for the same lead, sales cycles are wasted as more than one person contacts the same individual with the same pitch. Deduplicated leads mean focused outreach and a better brand experience.

Fortnum & Mason, the 300-year-old British luxury retailer, had customer data fragmented across restaurant bookings, email signups, online orders, and in-store transactions. Before building their Single Customer View on Zingg, they could not understand how any individual customer was actually shopping with them. "For the first time, we're able to understand how customers are shopping with us — online, in-store, over the phone, or in restaurants. We never had that before." Personalization, lifetime value analysis, and cross-channel attribution all depended on solving this identity problem first.

By unifying customer data from different internal and external sources, businesses can build genuine 360-degree customer views — and take marketing from generic broadcast to personalized engagement. A customer who browses winter coats in-store can receive a relevant promotion online. A customer whose lifetime value qualifies them for loyalty treatment can receive it consistently across all channels. These outcomes require accurate customer identity as the foundation.

Supply Chain and Procurement

Supplier matching is one of the most persistently underserved entity resolution use cases in large organizations. Supplier data is siloed across procurement systems, ERP instances, regional databases, and contract management platforms. The same supplier may appear dozens of times across an organization's systems, with different spellings of their name, different address formats, and different identifiers in each system.

Without supplier entity resolution, organizations cannot get an accurate picture of how much they are spending with a given vendor across all business units. They cannot negotiate effectively because they do not know their true consolidated spend. They cannot manage risk associated with vendor geography or concentration because the same supplier appears as many different suppliers.

For example, a CPG company comparing two vendors of similar products needs to match product descriptions across catalogs where the same item may be called "tissue paper," "paper handkerchief," "tissue," or "sanitary paper." Without entity resolution across these descriptions, the comparison is incomplete.

Entity resolution also blocks fraudulent suppliers who attempt to re-enroll with slight variations in their details after being blocked or de-listed. A slight change in company name, address, or registration number is enough to defeat a pure string-matching approach.

Functional Use Cases Beyond Industry

Entity resolution is not industry-specific in its value — it is entity-specific. Any business that operates across multiple data systems and cares about any of the following is a candidate:

GDPR and CCPA compliance. Responding to a Subject Access Request requires locating every record about a specific individual across every system that holds data about them. Entity resolution is what makes comprehensive, accurate data subject discovery possible. Without it, organizations risk either missing records (incomplete response, compliance failure) or over-including records (privacy violation, compliance failure in the other direction). See Zingg's GDPR solution and CCPA solution.

Master Data Management. Entity resolution is the core matching capability of any MDM system. See Zingg's MDM solution.

Supplier 360. The same principles that apply to Customer 360 apply to suppliers, products, and any other business entity. See Zingg's Supplier 360 solution.

Knowledge graphs and Identity RAG. Resolved entities with persistent identifiers are the foundational layer of any enterprise knowledge graph. When LLMs query a knowledge graph for context, the quality of the reasoning depends on whether the graph correctly represents entity relationships. See the Zingg + LangChain Identity RAG guide.

Entity Resolution and Agentic AI: The New Urgency

Every AI system that reasons about entities in your business is only as good as the entity data it reasons about. This has always been true. It has become urgent now that AI systems take autonomous action.

When analytics runs on fragmented data, the cost is inaccurate reports. A person reviews the report and may catch errors. When an AI agent acts on fragmented entity data, the cost is automated mistakes at scale — executed repeatedly, without human review, until someone notices the damage.

A retail brand deployed an AI marketing agent that could segment customers, design campaigns, and send personalized emails without human intervention. Two weeks in, their top VIP customer had received three separate "Welcome to our brand!" offers with three different discounts. Their CRM had John A. Smith, J. Smith, and Jonathan Smith as three separate customers. The agent was not broken. It was doing exactly what it was told on data that did not reflect reality.

In healthcare, an AI care coordinator acting on fragmented patient data misses the allergy record from a different facility. In financial services, a fraud detection agent fails to connect accounts opened with slight name variations. In manufacturing, a procurement agent creates duplicate supplier records and contracts with the same vendor twice.

The pattern is identical in every case: the AI is not wrong. The entity data is fragmented. And because agentic AI amplifies actions rather than simply reporting findings, the cost of bad entity data scales with every deployment.

Entity resolution is the prerequisite for trustworthy agentic AI. The ZINGG_ID — a persistent identifier assigned to each resolved entity — becomes the stable anchor that AI agents, RAG systems, and LLM applications use to retrieve a complete, consistent view of any entity they need to reason about.

The Challenges That Make Entity Resolution Hard

Scale

The naive approach to matching is to compare every record to every other record. For n records, this produces n×(n-1)/2 unique pairs. For one million records, that is approximately 500 billion comparisons. At ten million records, the problem is one hundred times larger.

Even if each comparison takes a microsecond, one million records would need nearly six days of single-threaded compute, and ten million would need more than a year and a half. Entity resolution at enterprise scale therefore requires intelligent blocking: a mechanism that drastically reduces the comparison space while missing as few true matches as possible.
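The arithmetic behind these numbers is easy to check. The short Python sketch below computes the number of unique pairs for one million and ten million records and estimates the single-threaded runtime at one microsecond per comparison; the one-microsecond figure is the illustrative cost used above, not a measurement.

```python
def unique_pairs(n):
    """Number of unordered pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


SECONDS_PER_DAY = 86_400
MICROSECOND = 1e-6  # assumed cost of one comparison, in seconds

for n in (1_000_000, 10_000_000):
    pairs = unique_pairs(n)
    days = pairs * MICROSECOND / SECONDS_PER_DAY
    print(f"{n:,} records: {pairs:,} pairs, about {days:,.1f} days")
```

Because the pair count grows with the square of the record count, ten times more records means roughly one hundred times more comparisons.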

Zingg's learned blocking model reduces actual comparisons to typically 0.05–1% of all possible pairs, by learning from training data which attributes and combinations of values most effectively partition records into candidate sets where matches are likely to exist.
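The general idea of blocking can be sketched in a few lines of Python. Note the substitution: the sketch uses a hand-written blocking key (first letter of the last name token plus zip code), which is a hypothetical stand-in and not Zingg's learned blocking model. The records and zip codes are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Made-up records for illustration.
records = [
    {"id": 1, "name": "Jon Smith", "zip": "10001"},
    {"id": 2, "name": "Jonathan A. Smith", "zip": "10001"},
    {"id": 3, "name": "J. Smith", "zip": "10001"},
    {"id": 4, "name": "Maria Garcia", "zip": "94105"},
    {"id": 5, "name": "M. Garcia", "zip": "94105"},
    {"id": 6, "name": "Wei Chen", "zip": "60601"},
]


def blocking_key(record):
    """Hypothetical hand-written key: first letter of the last name token + zip."""
    last_token = record["name"].split()[-1].lower()
    return (last_token[0], record["zip"])


def candidate_pairs(records):
    """Group records by blocking key and pair up records only within a group."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record["id"])
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


all_pairs = len(records) * (len(records) - 1) // 2
candidates = candidate_pairs(records)
print(f"{len(candidates)} candidate pairs instead of {all_pairs}")
```

A fixed key like this one silently drops any true match whose records land in different blocks (a mistyped zip code, for instance), which is the motivation for learning the blocking function from training data instead.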

Data Variation

Real-world data varies in ways that cannot be fully anticipated or exhaustively enumerated.

Names vary by spelling, transposition, abbreviation, salutation, suffix, prefix, middle name presence or absence, hyphenation, nickname, and phonetic variant. "Bob" and "Robert" are the same name. "IBM" and "International Business Machines" are the same company. Addresses vary by street type abbreviation, unit number format, city name variant, and postal code format. Dates are formatted differently by system, region, and operator. Phone numbers include or omit country codes, area codes, parentheses, dashes, and spaces.

Any entity resolution system must handle this variation. Rule-based systems require a rule for every known variation — an impossible task, because unknown variations always outnumber known ones. ML-based systems learn the pattern of variation from training examples and generalize to new cases.
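A small Python sketch shows why no single technique covers every kind of variation. It uses the standard library's difflib.SequenceMatcher as a generic string similarity and a one-entry nickname table; both are illustrative choices for this example, not the comparison functions Zingg uses.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Character-level similarity between 0.0 and 1.0, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Illustrative, hypothetical nickname table with a single entry.
NICKNAMES = {"bob": "robert"}


def canonical_first_name(name):
    """Map a known nickname to its formal name; otherwise just lowercase."""
    key = name.lower()
    return NICKNAMES.get(key, key)


# Punctuation variants are close at the character level.
compound_score = similarity("insulin glargine", "insulin-glargine")
# Nicknames are not: character similarity is low even though the names are equivalent.
nickname_score = similarity("Bob", "Robert")
nickname_match = canonical_first_name("Bob") == canonical_first_name("Robert")

print(f"compound similarity: {compound_score:.2f}")
print(f"nickname similarity: {nickname_score:.2f}, table match: {nickname_match}")
```

String similarity catches the hyphenation variant but scores the nickname pair poorly, while the lookup table only helps for names someone has already listed. Each hand-built fix covers one class of variation and nothing else.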

Matching Definition

Even when similarity is computable for individual attributes, combining attribute-level similarities into an entity-level match decision requires judgment. Should two records be the same entity if first names are similar but phone numbers differ? The correct answer depends on the entity type, the data quality profile of each attribute in your specific data, and the business context of the matching use case.

Rule-based systems require these thresholds to be manually defined and tuned — a long and error-prone process. Zingg's active learning labeler allows a human reviewer to label record pairs as match or non-match, and the classifier learns the appropriate combination of attribute weights from those labels rather than requiring them to be specified manually.
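To see why hand-set weights are delicate, consider this minimal Python sketch. It scores one candidate pair (similar names, different phone numbers) under two hypothetical weightings. The similarity values, weights, and threshold are all made up for illustration; in a learned approach such as the one described above, the combination would come from labeled pairs rather than from constants like these.

```python
def match_score(similarities, weights):
    """Weighted average of per-attribute similarities."""
    total = sum(weights.values())
    return sum(weights[f] * similarities[f] for f in weights) / total


# One candidate pair: names agree closely, phone numbers do not (made-up values).
pair = {"first_name": 0.9, "last_name": 1.0, "phone": 0.0}

# Two hypothetical rule sets that disagree about how much the phone matters.
name_heavy = {"first_name": 2.0, "last_name": 2.0, "phone": 1.0}
phone_heavy = {"first_name": 1.0, "last_name": 1.0, "phone": 4.0}

THRESHOLD = 0.6  # assumed cut-off for calling a pair a match

decisions = {
    "name_heavy": match_score(pair, name_heavy) >= THRESHOLD,
    "phone_heavy": match_score(pair, phone_heavy) >= THRESHOLD,
}
print(decisions)
```

The same pair is a match under one weighting and a non-match under the other, so the outcome depends entirely on choices that are hard to justify attribute by attribute.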

Precision and Recall Trade-offs

Every entity resolution system faces a fundamental tension:

False positives (matching records that are actually different entities) cause downstream errors — merging two different customers, crediting a transaction to the wrong person, combining medical records for different patients.
False negatives (failing to match records that are actually the same entity) leave fragmentation in place — the very problem entity resolution is meant to solve.

The right operating point on this trade-off depends on use case. AML detection prefers higher recall (catch more potential fraud, accept more false positives for human review). Customer experience prefers higher precision (don't merge distinct customers, even if it means missing some true matches). Zingg's probabilistic output — a match score for each candidate pair rather than a binary decision — allows the operating threshold to be tuned per use case.
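The effect of moving the threshold can be illustrated with a few lines of Python. The scored pairs below are made-up toy data, and the two thresholds are arbitrary examples of a precision-leaning and a recall-leaning operating point.

```python
# (match score, is_true_match): made-up toy data for illustration.
scored_pairs = [
    (0.95, True), (0.90, True), (0.80, False), (0.70, True),
    (0.60, False), (0.40, True), (0.30, False), (0.10, False),
]


def precision_recall(scored_pairs, threshold):
    """Precision and recall when every pair scoring >= threshold is declared a match."""
    tp = sum(1 for score, truth in scored_pairs if score >= threshold and truth)
    fp = sum(1 for score, truth in scored_pairs if score >= threshold and not truth)
    fn = sum(1 for score, truth in scored_pairs if score < threshold and truth)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


for threshold in (0.85, 0.35):
    precision, recall = precision_recall(scored_pairs, threshold)
    print(f"threshold {threshold}: precision {precision:.2f}, recall {recall:.2f}")
```

On this toy data the high threshold merges nothing incorrectly but finds only half of the true matches, while the low threshold finds all of them at the cost of some false merges that would need review.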

Schema Variations

Entity representation differs across systems. The same attribute is named differently: lastName, lName, surname, family_name. The same value is represented differently: Male/Female in one system, M/F in another, 1/0 in a third. The address is one field in one system, split across Address 1 and Address 2 in another, with Street and City as separate columns in a third.

Before matching can begin, source schemas must be mapped to a common representation. This schema mapping is itself a significant effort in any real-world deployment, and it must be maintained as source systems change.
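A schema mapping step might look like the following Python sketch. The alias table and the canonical names (last_name, gender) are assumptions for illustration. Numeric codes such as 1/0 are deliberately left unmapped, because which value means what depends on the source system and has to be defined per source.

```python
# Hypothetical mapping from source column names to a canonical schema.
FIELD_ALIASES = {
    "lastName": "last_name",
    "lName": "last_name",
    "surname": "last_name",
    "family_name": "last_name",
    "sex": "gender",
}

# Text encodings only; numeric codes need a per-source mapping.
GENDER_VALUES = {"male": "M", "m": "M", "female": "F", "f": "F"}


def to_canonical(record):
    """Rename known columns and normalize known value encodings."""
    out = {}
    for field, value in record.items():
        canonical = FIELD_ALIASES.get(field, field)
        if canonical == "gender":
            value = GENDER_VALUES.get(str(value).strip().lower(), value)
        out[canonical] = value
    return out


crm_row = {"lastName": "Smith", "gender": "Male"}
billing_row = {"surname": "Smith", "sex": "M"}

print(to_canonical(crm_row))
print(to_canonical(billing_row))
```

Both rows end up with the same field names and value encodings, so a downstream matcher compares like with like. The mapping tables themselves are the part that has to be kept current as source systems change.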

Multiple Data Formats and Stores

Production entity resolution systems must ingest data from relational databases, NoSQL stores, cloud object storage, local filesystems, and SaaS platforms — in formats including CSV, JSON, Parquet, Avro, XML, and proprietary database formats. The entity resolution layer must be agnostic to these variations.

Zingg connects to any data source Spark supports and runs natively inside Snowflake, Databricks, BigQuery, Microsoft Fabric, and AWS Glue — meaning entity resolution can run inside your existing data platform without requiring data movement to a separate system.

Languages

Large multinational organizations record data in regional and local languages. Name matching across languages requires language-aware comparison — phonetic similarity in English is different from phonetic similarity in Chinese, where short strings share many characters across completely different names. Address matching in Japanese requires different normalization than address matching in German.

Zingg includes out-of-the-box support for English, Chinese, Thai, Japanese, Hindi, and other languages, with the ability to define custom matching functions for domain-specific cases.

Why Legacy MDM Is the Wrong Answer

Master data management has been a recognized enterprise need for decades. The established platforms — IBM, Informatica, and SAP (which acquired Reltio in 2024 to strengthen its MDM position) — have offered comprehensive solutions.

The results have been deeply mixed. Legacy MDM systems are rule-based: defining the matching rules for a single entity type across a large enterprise — covering every attribute variation, every data quality scenario, every edge case across dozens of source systems — takes months. Deployment cycles of two to three years are common. Implementation costs typically run four times the license cost. Many programs never fully deliver. The addition of each new source system requires a fresh round of rule definition and database tuning.

The root problem is that rule-based matching does not generalize. It requires explicit human specification of every matching condition. It suffers at scale due to the quadratic comparison problem. And it creates new data silos: a separate MDM system of record that your existing data platforms must feed and sync with.

Zingg is a different approach: open-source, ML-based entity resolution that runs natively inside your existing data platform. No separate system of record. No rule authoring. No data movement outside your environment. Entity resolution as a step in your data pipeline, producing a persistent ZINGG_ID that downstream systems can rely on as a stable entity reference.

The incremental flow in Zingg Enterprise keeps the identity graph current as data changes — new records, updated records, cluster merges and splits — without requiring a full re-match. The ZINGG_ID persists across runs, so every downstream system that holds a reference to an entity continues to resolve correctly even as the underlying records evolve.

How Zingg Works

A Zingg entity resolution pipeline has three core phases:

Blocking. A learned blocking model groups records into candidate sets where matches are likely to exist. This reduces the comparison space from billions of pairs to a manageable fraction, typically 0.05–1%, while missing as few true matches as possible.

Matching. For each candidate pair produced by blocking, a trained classifier scores the probability that the two records represent the same entity. The output is a probabilistic score, not a binary decision, allowing downstream thresholds to be tuned per use case.

Clustering. Pairwise match scores are aggregated into entity clusters — groups of records that all refer to the same real-world entity. In Zingg Enterprise, each cluster is assigned a persistent ZINGG_ID that remains stable across subsequent runs. The cross-reference table maps every source record to its ZINGG_ID, providing a permanent linkage between source systems and the resolved identity graph.
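One simple way to turn pairwise scores into clusters is to treat every pair above a threshold as an edge and take connected components with a union-find structure. The Python sketch below does that. It is a generic technique shown for illustration, not a description of Zingg's clustering internals, and the record IDs, scores, and threshold are made up. The cluster labels it produces are arbitrary and are not persistent identifiers like the ZINGG_ID.

```python
def cluster(record_ids, scored_pairs, threshold):
    """Connected components over pairs whose score is at or above the threshold."""
    parent = {r: r for r in record_ids}

    def find(x):
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, score in scored_pairs:
        if score >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                # Attach the larger root under the smaller one (deterministic tie-break).
                parent[max(root_a, root_b)] = min(root_a, root_b)

    clusters = {}
    for r in record_ids:
        clusters.setdefault(find(r), []).append(r)
    return sorted(sorted(members) for members in clusters.values())


# Made-up record IDs and scores for illustration.
ids = ["crm-1", "crm-2", "bill-7", "loy-4"]
scored = [("crm-1", "bill-7", 0.93), ("bill-7", "loy-4", 0.88), ("crm-1", "crm-2", 0.12)]

result = cluster(ids, scored, threshold=0.8)
print(result)
```

Note that crm-1 and loy-4 were never scored against each other, yet they end up in the same cluster through bill-7. That transitivity is what lets clustering recover entities whose records were split across different blocks, and it is also why a single false positive pair can merge two clusters that should stay apart.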

Zingg also supports deterministic matching — joining on trusted identifiers like email or SSN where they exist — woven into the same pipeline so that records with and without trusted identifiers all resolve to the same unified entity cluster.
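As a rough sketch of how deterministic and probabilistic evidence can feed the same clusters, the Python below builds pairs from an exact match on a normalized email and unions them with pairs that are assumed to have come from a classifier. The function name, the records, and the classifier output are hypothetical, and this is not Zingg's API.

```python
from collections import defaultdict
from itertools import combinations


def deterministic_pairs(records, key):
    """Pair up records that share the same non-empty, normalized value for key."""
    groups = defaultdict(list)
    for record in records:
        value = (record.get(key) or "").strip().lower()
        if value:  # missing or blank identifiers never match each other
            groups[value].append(record["id"])
    return {pair for ids in groups.values() for pair in combinations(sorted(ids), 2)}


# Made-up records: two share an email, two have no usable email.
records = [
    {"id": "a", "email": "jon@example.com"},
    {"id": "b", "email": "JON@example.com "},
    {"id": "c", "email": None},
    {"id": "d", "email": ""},
]

trusted = deterministic_pairs(records, "email")
probabilistic = {("b", "c")}  # assumed classifier output above the match threshold
edges = trusted | probabilistic
print(sorted(edges))
```

Because both kinds of pairs end up in one edge set, a clustering step over that set would place records a, b, and c in the same entity even though c has no email at all.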

Getting Started

Zingg is open source. The community version handles batch matching across any data source that Spark can read, and is freely available on GitHub with documentation and examples for most major platforms.

Explore and try:
- Zingg on GitHub: open source entity resolution for Spark, Databricks, Fabric, BigQuery
- Zingg Enterprise: persistent ZINGG_ID, incremental flow, native Snowflake execution
- Platform guides: step-by-step for Databricks, Snowflake, Fabric, BigQuery, Neo4j
- Case studies: Fortnum & Mason, Redica Systems, Canadian Football League and others
- Compare editions
- Contact us

Further reading on this blog:
- Deterministic vs. Probabilistic Matching: Why You Need Both
- The ZINGG_ID: A Persistent Identifier for Your Entity Graph
- Incremental Identity Resolution: Keeping Your Entity Graph Current
- Why Customer 360 Still Matters — And What It Actually Takes
- A Guide to Agile Data Mastering with AI
- Entity Resolution with Neo4j: Why Zingg and Graph Databases Belong Together