Real-world data is far from perfect. When we enter data into a system such as a Salesforce CRM, chances are that a copy of that record already exists. These duplicates need to be removed so that sales and marketing teams can work efficiently and effectively; more than one record per customer in the CRM hurts outreach. Identifying duplicate records despite their variations and retaining a single copy of each is known as deduplication.
Deduplication is a critical step in data cleansing and deals with the same entity being represented in slightly different ways. This entity could be anything an organization has a relationship with: a customer, location, supplier, or partner. There could be minor typographical errors in some fields, middle initials or names present or absent, different word ordering, or suffixes and prefixes. Whole fields could be missing. When records refer to the same real-world entity, they need to be identified as duplicates despite all these differences between their attributes.
Large customer lists are often the result of merging multiple files, each sourced from a different database. Some fields may be present in one file and missing in another. A typical customer record will have a name and address, and sometimes an email address or phone number. Customer names could be further separated into first name, last name, initial, title, etc., each of which could be in a field of its own or concatenated into a single field. Customer data cleansing relies heavily on deduplicating such records so that a single copy of each customer is accessed and used.
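To make the name-variation problem concrete, here is a minimal sketch of fuzzy name comparison using Python's standard library. The token-sorting trick and the example names are illustrative, not a prescription for any particular product:

```python
import difflib
import re

def name_similarity(a: str, b: str) -> float:
    """Score two customer names, tolerating reordered words, punctuation,
    and case differences by normalizing before comparison."""
    def norm(s: str) -> str:
        # Strip punctuation, lowercase, and sort tokens so that
        # "Smith, John" and "John Smith" normalize identically.
        return " ".join(sorted(re.sub(r"[^\w\s]", " ", s.lower()).split()))
    return difflib.SequenceMatcher(None, norm(a), norm(b)).ratio()

# High similarity despite reordering and an extra middle initial:
name_similarity("John A. Smith", "Smith, John")
```

A real pipeline would combine scores like this across several fields (name, address, email) rather than rely on a single string.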
Like names, customer addresses are generally not standardized, and deduplicating them is very complex. Free-form addresses show even more variation: information can be ordered differently, and the locality could precede the street name in one record or be missing entirely from another. Abbreviations like St or Av. abound, and Flat Number could be spelled out as Apartment Number. Commas, slashes, and hyphens proliferate in addresses, making address deduplication even more challenging. Without proper address matching, it is impossible to discover households or remove duplicates.
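One common first step is to normalize addresses before comparing them. The sketch below assumes a small hand-made abbreviation table; a production system would instead use a postal standard (such as USPS Publication 28 in the US) or a dedicated address-verification service:

```python
import re

# Illustrative abbreviation table only; real systems use postal standards.
ABBREV = {
    "st": "street", "av": "avenue", "ave": "avenue",
    "rd": "road", "apt": "apartment", "flat": "apartment",
}

def normalize_address(addr: str) -> str:
    """Lowercase, replace separators with spaces, and expand common
    abbreviations so superficially different addresses compare equal."""
    tokens = re.sub(r"[.,/\-]", " ", addr.lower()).split()
    return " ".join(ABBREV.get(t, t) for t in tokens)

normalize_address("12 Main St., Apt 4")  # -> "12 main street apartment 4"
```

After this step, "12 Main St., Apt 4" and "12 Main Street, Apartment 4" normalize to the same string and can be matched exactly or fuzzily.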
Similar challenges exist when deduplicating supplier names and addresses. Supplier names are generally organization names, with suffixes like LLP, LLC, Pvt, Ltd, or Corp and prefixes like Messrs or M/S. Typos and out-of-order words can easily confuse data matching and hamper deduplication. Phone numbers are often saved in different formats, with country codes and area codes represented differently or missing completely.
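Phone formats are one of the easier variations to tame. Here is a simplified normalizer; the default country code is an assumption for illustration, and real systems typically rely on a library such as Google's libphonenumber instead:

```python
import re

def normalize_phone(raw: str, default_country: str = "1") -> str:
    """Reduce a phone number to digits in a canonical +<digits> form.
    Assumes a 10-digit number is national and prepends a default
    country code -- a simplification, not a full E.164 parser."""
    digits = re.sub(r"\D", "", raw)       # drop spaces, dashes, parentheses
    if len(digits) == 10:                 # assume national number
        digits = default_country + digits
    elif digits.startswith("00"):         # international 00 dialing prefix
        digits = digits[2:]
    return "+" + digits

normalize_phone("(415) 555-0134")  # -> "+14155550134"
```

With numbers in one canonical form, "(415) 555-0134" and "+1 415-555-0134" stop looking like two different customers.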
A trained human can handle all these differences as long as the dataset is small, but as record counts grow, customer or address deduplication cannot be done manually. Because of the subtle differences between attributes and the lack of a common identifier across records, writing rules for customer matching or address matching is difficult.
What if we could apply machine learning to find patterns for duplicates across datasets? Patterns in variations, in extraneous characters, in missing fields? With machine learning for deduplication, there would be no need to define complex rules or algorithms for matching fields, and no similarity thresholds to tune by hand. Training on simple yes/no labels for matches and non-matches would do the trick. Once trained, the models could remove duplicates among customers, addresses, and suppliers, creating a clean and comprehensive list of customers and suppliers. One would never have to worry about messy, unclean data in the CRM and other systems again.
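A toy illustration of this learning-from-labels idea (not any particular product's method): score each record pair with a similarity feature, then learn the match/non-match decision boundary from labeled examples rather than setting it by hand. All names and training pairs below are made up:

```python
import difflib

def pair_score(a: str, b: str) -> float:
    """Similarity feature for a pair of records (here a single string field;
    a real model would use many features across many fields)."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

def learn_threshold(labeled_pairs):
    """Pick the score threshold that classifies the most labeled pairs
    correctly -- a one-feature stand-in for training a full classifier."""
    scored = sorted((pair_score(a, b), label) for a, b, label in labeled_pairs)
    best_t, best_correct = 0.5, -1
    for t, _ in scored:  # only candidate thresholds at observed scores matter
        correct = sum((s >= t) == (lab == "yes") for s, lab in scored)
        if correct > best_correct:
            best_correct, best_t = correct, t
    return best_t

training = [
    ("Acme Corp", "ACME Corporation", "yes"),
    ("Acme Corp", "Apex Industries", "no"),
    ("Jon Smith", "John Smith", "yes"),
    ("Jon Smith", "Mary Jones", "no"),
]
t = learn_threshold(training)  # decision boundary learned from yes/no labels
```

The point of the sketch is the workflow: humans supply only yes/no judgments, and the boundary between match and non-match is learned from them.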
What do you think?