Powering Sustainability Through Brand and Product Entity Resolution

Edward Bennigsen, Senior Software Engineer at Provenance.org

The Problem

To prevent brands from making claims that could be false or misleading, governments such as the UK's have released multiple guidelines. The European Union, which already has strong protections, is introducing further regulations, and several US states are following suit. Much as GDPR reshaped data management and privacy, this wave of regulation is reshaping how brands operate. Brands are being increasingly scrutinized for the claims they make about their products, and they are finding it difficult to work out what they can say or how to market things consistently. Meeting these rules imposes extremely stringent standards of data integrity, which is also what it takes to secure consumer trust.

Let us say we have a brand in the cosmetics and beauty industry that is selling a facial cream. The cream is vegan and cruelty-free, the packaging uses recycled plastic, and part of the proceeds from each sale goes to charity. The brand wants to communicate all of this to its customers. Unfortunately, customers who have experienced greenwashing in the past do not trust brands that make such claims.

The User

Provenance is a tech startup that powers sustainability claims shoppers can trust, helping them shop in line with their values. It protects shoppers from greenwash by connecting 'green' claims to evidence from the supply chain or third-party verification. A global leader in sustainability marketing technology, Provenance helps brands and retailers share credible, compelling and fact-checked social and environmental impact information at the point of sale.

Provenance has a specialized framework that helps brands make true, trustworthy claims and present them in a way their customers can easily understand. The payoff for brands is tangible: brand sentiment and trust go up, which in turn drives a demonstrable increase in sales revenue. Provenance's technology is already increasing conversion rates, brand value and market share for customers including Cult Beauty, Douglas, GANNI, Napolina, Arla and Unilever.

We talked with Edward Bennigsen, Senior Software Engineer at Provenance, to learn more about the company and how they are using Zingg.

Thanks for taking the time to talk with us, Ed. Can you please tell us more about yourself and your work at Provenance?

For Provenance, the customer is the brand, and the shopper is the brand's customer, whom we help make the right choices. We are a B2B2C company. We also work with retailers who sell the brands and products. We span quite a broad range of the market; we are not just a data provider.

I have been at Provenance for about 2½ years in various roles. The core of our app is a Ruby on Rails app running on Postgres, which enables our customers to manage and publish their product and claims information. This gets published into the customers' e-commerce stores as embeds. I have been heavily involved in building the core product, with features on both the administrative side and the customer embed side, working full stack. In the past six months I have transitioned to a Data Engineer/Project Lead role, and we have been running a large project to integrate Databricks as a data warehouse/lakehouse platform. We are really trying to reduce the effort at the front end of our data funnel: bringing products, brands, product claims and certifications into our app through a largely automated process. That is where I have spent the past six or seven months, and where I am going to spend at least the next year.

How big is your team right now, Ed?

Provenance as a company is about 30 employees. Engineering is about 10; on my particular team there are two to three engineers in any given sprint, along with one product manager, so my team is roughly 5 people. We have two software engineering squads: one focused on shopper concerns, that is, the people who buy products from the brands, and the one I manage, focused on the brands who sell those products. We have been focusing heavily on data at this time because the application is very mature. The idea is to save time and effort for everyone involved and to improve the brands' experience in a tangible way by providing data they do not have themselves. This customer-focused goal of the company is driven through data.

What is the problem you are solving with Zingg?

Sustainability information comes from a lot of disparate sources. First off, we have the brands. The list of brands comes from a biz dev team that is trying to find brands we want to sign. Then you have the products those brands sell, which could come from a scrape of their site. We have integrations with our retail partners, from whom we can pull in CSVs with all their brands and products. We have also been paying for a service that scrapes the internet to bring products in. That is the minimal set to get someone on our app: the brand and the product information. But the value we provide is the impact of the product, for example, that this product is vegan or that this brand is a female-owned business. The sources for impact information are even more complex. Oftentimes we just have access to a website that we have to scrape first; sometimes we search one by one and scrape each individual page, or even a list. And those are the easy ones. For something like a certification, things get complex pretty quickly. Leaping Bunny, for instance, is an organization that certifies products and brands as cruelty-free, and it publishes that as a very binary yes/no.

We are also looking at things that are a bit more vague, like packaging recyclability. Where is it recyclable? In what context? Is every piece of cardboard created the same as every other? It can get very messy there. This is very similar to the customer record mastering problem, where you have a bunch of customer databases and you want to combine them.

Our goal is to unify all these sources into a complete picture of the brand and its products. We apply Zingg to our data for this.  
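To make that unification step concrete, aligning every source to one shared schema before matching might look like the following PySpark sketch; the table and column names are hypothetical, not Provenance's actual pipeline:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical raw tables, each with its own column names.
retailer = spark.table("raw.retailer_brands").select(
    F.col("brand").alias("name"),
    F.col("url").alias("website"),
    F.lit("retailer").alias("source"),
)
bizdev = spark.table("raw.bizdev_brands").select(
    F.col("brand_name").alias("name"),
    F.col("site").alias("website"),
    F.lit("bizdev").alias("source"),
)
certifier = spark.table("raw.leaping_bunny").select(
    F.col("company").alias("name"),
    F.lit(None).cast("string").alias("website"),  # certifier lists often lack URLs
    F.lit("leaping_bunny").alias("source"),
)

# One big list with identical columns: the input the matcher runs over.
unified = retailer.unionByName(bizdev).unionByName(certifier)
unified.write.mode("overwrite").saveAsTable("staging.brands_unified")
```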

What is the complexity of matching brands and products in your datasets?

Let us say we have a list of brands coming from a retailer, another from a business development list, and another from certifying authorities like Leaping Bunny. We unify them into one big list and then try to match those brands: this brand with this name here looks a bit like that brand from another dataset, so they are probably the same. The reason I think Zingg is a really helpful tool for us is that we cannot do an exact name match. Some of our sources have formal business names. It might be a trading name versus a registered business name, or the name might have Inc. or Pvt Ltd at the end of it. There are many different ways in which the same brand may be spelled. Some sources also provide a website for the brand, and if the websites match, that is quite a good indication that the records match.

Without Zingg, we would have to define rules ourselves: if the websites match, that is a match; if the descriptions are similar enough by some distance metric, that is perhaps a match; and the same for the name. We would have to uppercase everything, strip out any links, clean punctuation, do all that processing ourselves, and then fine-tune the rules.
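For illustration, that hand-rolled alternative looks roughly like the sketch below, using only the Python standard library, with made-up cleaning rules and an arbitrary threshold of the kind that needs constant tuning:

```python
import re
from difflib import SequenceMatcher

def clean_name(name: str) -> str:
    # Uppercase, drop trailing legal suffixes, strip punctuation.
    name = name.upper()
    name = re.sub(r"\b(INC|LTD|PVT LTD|LLC)\.?$", "", name).strip()
    return re.sub(r"[^A-Z0-9 ]", "", name)

def clean_website(url: str) -> str:
    # Reduce a URL to its bare domain.
    url = url.lower().strip()
    url = re.sub(r"^https?://(www\.)?", "", url)
    return url.split("/")[0]

def is_match(a: dict, b: dict) -> bool:
    # Rule 1: identical domains are a strong signal.
    if a.get("website") and clean_website(a["website"]) == clean_website(b.get("website", "")):
        return True
    # Rule 2: fuzzy name similarity above a hand-tuned cutoff,
    # the part that never quite generalizes.
    score = SequenceMatcher(None, clean_name(a["name"]), clean_name(b["name"])).ratio()
    return score > 0.9

print(is_match({"name": "Acme Cosmetics Inc.", "website": "https://www.acme.com/about"},
               {"name": "ACME COSMETICS", "website": "http://acme.com"}))  # True
```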

With Zingg we get a fine-tuned model while doing very minimal preprocessing. The preprocessing we really do is making sure every source has the same columns; Zingg is able to learn a model that fits. Another thing Zingg does is identify training pairs for us, so we don't need to go through our data and build training sets ourselves. Zingg suggests cases where it is unsure whether two records match and asks us to label them, which has been really helpful. For some edge cases that we find, we might supply our own training data to nudge the model down a certain path.

On the whole it's been a very smooth flow. Obviously there is some effort to integrate things, but the flow itself is very smooth.

Thanks for that, Ed! What are the volumes in terms of the data that you are processing every day, and the cadence of the matching?

Brand data has been our pilot for matching. The volume is not particularly high: we've had about 15,000 rows matched down to 5,000 rows for the first quarter of this year. Our goal is to scale that as we go, bringing more certificates online over time. Obviously, every certificate source we bring online is going to add volume, an average of about 5,000 records per source. We also get a lot more data when we sign retail partners; some of those datasets are quite big.

I'm confident that Zingg will scale very easily. Obviously, compute times can increase and there are costs associated with that, but the Zingg models will be able to handle it, hopefully without too much extra training, and in any case training Zingg is a very smooth process.

The higher-volume piece is products, which is what I'm working on this week. Many certificates apply to only some of a brand's products. Let us say a vegan certificate applies to a lipstick, but one out of its 10 shades uses some animal-based product to achieve that colour; that particular shade isn't vegan. Certificates can be specified at the level of individual products, so we must take the list from the certificate vendor and try to match it against our products. The scale of that is far greater than for brands.

We're currently looking at about 200,000 rows of products, which I imagine will reduce to roughly 100,000. The level we must operate at is the product variant: the same product but in different sizes, color variations and flavors. The nice thing about product matching is that products have barcodes, which in theory gives us a key, a unique identifier for the product that we could simply match on. That was our thinking last year. The issue is that you have to preprocess barcodes into a standard format first, and there can still be quality issues: someone has mistyped them, they are off by a digit, or they've been padded with zeros on the wrong side. But the main issue is that most of these certificate datasets only give us a list of product names, without the barcodes they apply to.
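As an aside, the barcode preprocessing Ed describes typically means normalizing GTINs to one canonical width and validating the check digit, which catches many of the mistyped and off-by-a-digit codes. A minimal sketch in plain Python (not Provenance's actual code):

```python
def gtin_check_digit(body: str) -> int:
    """Check digit for a GTIN body (all digits except the last one).
    Digits are weighted 3 and 1 alternately, starting with 3 at the
    digit nearest the check digit."""
    total = sum(int(d) * (3 if i % 2 == 0 else 1)
                for i, d in enumerate(reversed(body)))
    return (10 - total % 10) % 10

def normalize_gtin(raw: str) -> str | None:
    """Return a canonical 14-digit GTIN, or None if it can't be trusted."""
    digits = "".join(c for c in raw if c.isdigit())
    if not 8 <= len(digits) <= 14:
        return None
    padded = digits.zfill(14)  # GTIN-8/12/13 are left-padded with zeros
    if gtin_check_digit(padded[:-1]) != int(padded[-1]):
        return None  # mistyped or off-by-a-digit codes fail here
    return padded

# A UPC-12 and the same code already left-padded agree after normalization:
assert normalize_gtin("036000291452") == normalize_gtin("00036000291452")
```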

Knowing which product carries which claims and certificates is a competitive advantage, and Zingg is helping us make that join.

There are not really that many off-the-shelf tools that do entity resolution. There are quite a lot of desktop tools where one person will sit down, do the data mastering themselves, and then upload it into some database like MSSQL.

That is obviously not what we wanted. We want a cloud-based tool that we can scale very easily, and to free our analysts from doing this work on their own machines, manually building pivots in Excel to attempt some of this matching.

Zingg was the first proper tool we tried that is specific to the use case, and it works so well that we didn't really need to evaluate anything else.

What happens to Zingg's results? Who uses them?

It is very broad in terms of users. The primary focus is users of the app. Then there are the people doing the import from Databricks into the app; at this point there is a manual review step for quality checks before data goes in. And the brands' customers, the shoppers, see the sustainability information the brand has on the product.

Very interestingly, the data is really playing into our business development as well, when we're pitching to a retailer. A lot of retailers find it hard to manage sustainability information; this is a very complex topic, and we handle all of that. We can do a scrape of their site, or they can give us a list of their brands, and we can compare that against the brands we have in our Databricks systems. We can tell them: we already have information on 90% of your brands, we can get to 100%, and we have sustainability and impact information on 50% of them. Zingg is helping us there as well.
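Once a retailer's list has been matched against the existing brand table, that coverage number is a simple aggregation. A hypothetical sketch, with made-up table and column names:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical tables: the retailer's brands after Zingg matching
# (z_cluster ties them to brands already known) and a flag table
# marking which known brands carry impact information.
retailer = spark.table("staging.retailer_brands_matched")
impact = spark.table("core.brand_impact").select("z_cluster", "has_impact_info")

coverage = (
    retailer.join(impact, on="z_cluster", how="left")
    .agg(
        F.count("*").alias("total_brands"),
        F.sum(F.when(F.col("has_impact_info"), 1).otherwise(0)).alias("with_impact"),
    )
    .withColumn("impact_pct",
                F.round(100 * F.col("with_impact") / F.col("total_brands"), 1))
)
coverage.show()  # total_brands | with_impact | impact_pct
```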

Data plays a big part in our pitch as a startup, especially to our investors. We have established the data system and we can scale it up, so Zingg is important there as well.

Zingg is really being used by a cross-functional team of internal and external stakeholders.

What are the things that you have built around Zingg to get into production?

We are running on the Databricks platform, so all the compute is done by Spark. We built a few Python notebooks using the Zingg Python API. The notebooks basically create the Zingg configuration, which is not particularly complicated, and we have one notebook for each of the different phases. There is a very basic UI we built to label data in the training phase. Then there is the end-to-end flow of importing the resolved identities into the app, and at the end an export flow for data analysts.
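To give a flavor of what such a notebook might look like, here is a minimal sketch based on Zingg's documented Python API. The field names, paths and parameter values are illustrative, not Provenance's actual configuration:

```python
from zingg.client import Arguments, ClientOptions, Zingg
from zingg.pipes import Pipe, FieldDefinition, MatchType

# Describe how each column should be compared.
name = FieldDefinition("name", "string", MatchType.FUZZY)
website = FieldDefinition("website", "string", MatchType.EXACT)
source = FieldDefinition("source", "string", MatchType.DONT_USE)

args = Arguments()
args.setFieldDefinition([name, website, source])
args.setModelId("100")            # one model per entity type
args.setZinggDir("/models")       # where Zingg persists the model
args.setNumPartitions(4)
args.setLabelDataSampleSize(0.5)

# Input: the unified brand list; output: matched clusters.
in_pipe = Pipe("brands_unified", "csv")
in_pipe.addProperty("location", "/staging/brands_unified.csv")
in_pipe.addProperty("header", "true")
args.setData(in_pipe)

out_pipe = Pipe("brands_matched", "csv")
out_pipe.addProperty("location", "/output/brands_matched")
args.setOutput(out_pipe)

# Each notebook runs one phase: findTrainingData -> label -> train -> match.
options = ClientOptions([ClientOptions.PHASE, "match"])
Zingg(args, options).initAndExecute()
```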

I am currently looking at the best way to handle model evolution over time, to try different models, and to figure out how to assess their performance. Our focus with Zingg right now is more on the operational side; we are potentially evaluating things like MLflow.

What is the matching accuracy you are getting at this stage?

We're running into very few operational issues with the output data, which is generally how I tend to measure project success. The issues we have come across stem from data sources that are not comprehensive enough or are out of date, rather than from the Zingg matching itself.

With products, we have products and product variants. Some sources treat product variants as distinct records; some just look at the product without the variants. So we may cluster many records together and then have a process to de-cluster them based on variants. That makes accuracy a slightly more complicated question.
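One way to picture that de-clustering step: Zingg's match output assigns a cluster ID (the z_cluster column), and a cluster can then be subdivided by whatever variant attributes survive in the data. A hypothetical PySpark sketch, not Provenance's actual process; the size and shade columns are assumptions:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Zingg's match output carries a z_cluster column; the variant columns
# (size, shade) are hypothetical and often only partially populated.
matched = spark.table("output.products_matched")

declustered = matched.withColumn(
    "variant_key",
    F.concat_ws("|",
                F.coalesce(F.col("size"), F.lit("?")),
                F.coalesce(F.col("shade"), F.lit("?"))),
).withColumn(
    # Sub-cluster ID: the Zingg cluster plus the variant key, so records
    # of the same product split apart per size/shade combination.
    "variant_cluster",
    F.concat_ws(":", F.col("z_cluster"), F.col("variant_key")),
)

declustered.write.mode("overwrite").saveAsTable("output.product_variants")
```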

Operationally, we are very happy with how Zingg has performed.

Zingg helps us connect data across many different formats and locations, and unify and deduplicate it. Starting with basic information on brands and products, we have been able to enrich that with where each product is sold, its impact information, and what variants it has.

Zingg allows us to have a very big dataset of information. It yields large gains in operational efficiency down the line, because our internal and external stakeholders do not have to enter all this information manually themselves.

The community is really great. The Slack is very active, and the releases are fairly regular.

While Zingg is pre-1.0, it is perfectly viable to use in production right away. It is being continuously improved, and it is great to see this momentum.