How We Built Native Snowflake Entity Resolution with Snowpark

Engineering
April 8, 2026

For years, Zingg supported Snowflake the way most tools do: through an Apache Spark connector. You'd run Spark on your infrastructure, point it at Snowflake, pull the data, process it, push results back. It worked. But we kept hearing the same thing from users: "We don't run Spark. We're warehouse-only. Can Zingg just run inside Snowflake?"

That question eventually became a product decision. This post explains how we built Zingg's native Snowpark execution — what the engineering challenges were, how we solved them, and what we learned in the process.

Why Snowpark Changed the Calculus

Snowpark lets you run Python, Java, and Scala code directly inside Snowflake, pushing computation to where the data lives rather than pulling data out to process it elsewhere. For a warehouse-native team, this is significant: no Spark cluster to spin up, no data leaving Snowflake, no additional infrastructure to manage.

For Zingg, it meant entity resolution could become a first-class warehouse operation — part of your dbt pipeline, your Snowflake task graph, your existing data stack.

But building on Snowpark is not a straightforward port from Spark. The APIs are different, the execution model is different, and some things you take for granted in Spark simply do not exist in Snowpark.

The ML Pipeline Challenge

The first major obstacle was our ML pipeline. Zingg uses machine learning for matching — a classification model that learns from labeled examples of matching and non-matching records. On Spark, we use MLlib. Snowpark has no built-in ML library; instead, you rely on third-party packages such as scikit-learn, running inside Snowpark's sandboxed Python environment.

The catch: scikit-learn cannot train on a Snowpark DataFrame directly. The data must first be converted to a Pandas DataFrame.
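A minimal sketch of that handoff. In Snowpark, the labeled feature table would be pulled down with something like `session.table("...").to_pandas()`; here a small hand-built Pandas frame stands in for the converted result, and the column names are hypothetical, not Zingg's actual schema:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Stand-in for: pdf = session.table("ZINGG_TRAINING_FEATURES").to_pandas()
# Each row is a labeled record pair with precomputed similarity features.
pdf = pd.DataFrame({
    "NAME_SIM":  [0.95, 0.20, 0.88, 0.10, 0.91, 0.05],
    "EMAIL_SIM": [1.00, 0.00, 0.75, 0.05, 0.90, 0.10],
    "IS_MATCH":  [1, 0, 1, 0, 1, 0],
})

X = pdf[["NAME_SIM", "EMAIL_SIM"]]
y = pdf["IS_MATCH"]

# scikit-learn trains on in-memory arrays, which is why the conversion
# out of the Snowpark DataFrame cannot be avoided.
clf = LogisticRegression().fit(X, y)

# Score two unseen pairs: one that looks like a match, one that doesn't.
candidates = pd.DataFrame({"NAME_SIM": [0.9, 0.1], "EMAIL_SIM": [0.8, 0.0]})
print(clf.predict(candidates))
```

Everything between the conversion and the trained model lives in a single Python process, which is what makes the memory question in the next paragraph matter.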

This creates two downstream problems. First, a Pandas DataFrame must fit entirely in the memory of a single process, which becomes a real constraint if training data grows large. Fortunately, Zingg uses active learning to build training sets efficiently, so our training data stays small by design. This kept the Pandas conversion manageable.

The second problem was more architectural. Our core codebase is Java and Scala. Zingg does substantial preprocessing and feature engineering before training — enriching input records with comparison signals before the classifier ever sees them. Introducing Python for the training step meant we could not pass our in-memory enriched DataFrame directly into the model training loop.

Our solution: persist the engineered features into a temporary Snowflake table, then invoke a stored procedure written in Python to read that table and train the classifier. The trained model is persisted back into Snowflake as a named Python UDF, which our Java/Scala code calls during inference.
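The flow above can be sketched end to end. This is a local simulation, not the actual Zingg implementation: a temp CSV file stands in for the temporary Snowflake table, a plain function stands in for the Python stored procedure, and another function stands in for the named UDF the Java/Scala side would call; all names are hypothetical.

```python
import os
import tempfile

import pandas as pd
from sklearn.linear_model import LogisticRegression

# Step 1: the Java/Scala side persists the engineered features.
# A CSV in a temp directory stands in for the temporary Snowflake table.
features = pd.DataFrame({
    "NAME_SIM":  [0.95, 0.20, 0.88, 0.10],
    "EMAIL_SIM": [1.00, 0.00, 0.75, 0.05],
    "IS_MATCH":  [1, 0, 1, 0],
})
tmp_table = os.path.join(tempfile.mkdtemp(), "zingg_features.csv")
features.to_csv(tmp_table, index=False)

# Step 2: the Python "stored procedure" reads that table and trains
# the classifier. In Snowflake this would run inside the warehouse.
def train_proc(table_path: str) -> LogisticRegression:
    pdf = pd.read_csv(table_path)
    return LogisticRegression().fit(pdf[["NAME_SIM", "EMAIL_SIM"]], pdf["IS_MATCH"])

model = train_proc(tmp_table)

# Step 3: a plain function stands in for the named Python UDF that the
# Java/Scala code invokes during inference, one record pair at a time.
def is_match_udf(name_sim: float, email_sim: float) -> int:
    row = pd.DataFrame({"NAME_SIM": [name_sim], "EMAIL_SIM": [email_sim]})
    return int(model.predict(row)[0])

print(is_match_udf(0.95, 1.0), is_match_udf(0.10, 0.0))
```

The key property the sketch preserves: the only interface between the JVM side and the Python side is a table and a callable, so no in-memory objects ever cross the language boundary.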

It's an unusual architecture, but it keeps the execution fully inside Snowflake and avoids any data movement outside the platform.

Rethinking the Codebase Architecture

The second major challenge was code architecture. Zingg's codebase was written with Apache Spark as the assumed execution engine. Spark dependencies were woven throughout — not just in the pipeline logic, but in how DataFrames were constructed, how jobs were orchestrated, how results were written.

Our initial instinct was to maintain two independent codebases: one for Spark, one for Snowpark. Common utilities would be shared; everything else would be independent.

We spent a day or two doing global search-and-replace operations to test how this would feel in practice. The answer: not good. The duplication would be enormous, and keeping two codebases in sync over time would compound every maintenance cost.

Instead, we invested in proper abstraction. We identified every Spark-specific interface, created platform-agnostic abstractions on top of them, and implemented those abstractions separately for Spark and Snowpark. The business logic — blocking, feature engineering, model training, inference, clustering — lives in platform-agnostic code. Spark and Snowpark are just execution backends.

This was more upfront work, but it meant Zingg Enterprise on Snowflake and Zingg on Spark share the same matching logic, the same model, the same accuracy guarantees. A bug fixed in one benefits the other. A feature added once works everywhere.
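The shape of that abstraction can be sketched in a few lines. Zingg's core is Java/Scala, so this Python version is illustrative only, and the interface and class names are invented for the example, not Zingg's actual API:

```python
from abc import ABC, abstractmethod

# Platform-agnostic interface the business logic depends on.
class DFReader(ABC):
    @abstractmethod
    def read(self, table: str) -> list[dict]:
        ...

# One implementation per engine; each would wrap the real engine calls.
class SparkReader(DFReader):
    def read(self, table: str) -> list[dict]:
        # Real version would call spark.read.table(table) and collect rows.
        return [{"engine": "spark", "table": table}]

class SnowparkReader(DFReader):
    def read(self, table: str) -> list[dict]:
        # Real version would call session.table(table).collect().
        return [{"engine": "snowpark", "table": table}]

# Business logic (blocking, feature engineering, clustering, ...) sees
# only DFReader, so the same code runs unchanged on either backend.
def run_matching(reader: DFReader, table: str) -> list[dict]:
    return reader.read(table)

print(run_matching(SparkReader(), "CUSTOMERS"))
print(run_matching(SnowparkReader(), "CUSTOMERS"))
```

Swapping engines then means constructing a different reader, not forking the pipeline.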

What This Means for Snowflake Users

Zingg Enterprise on Snowflake runs entity resolution natively within your Snowflake environment:

  • No external compute required. No Spark cluster, no EC2 instances, no separate infrastructure to manage or pay for.
  • Data never leaves Snowflake. All processing happens inside your Snowflake account, respecting your existing security perimeter, VPC configurations, and compliance requirements.
  • Warehouse-native scheduling. Zingg runs as part of your Snowflake task graph, integrated with your existing orchestration.
  • The same ML-based matching. The probabilistic matching, active learning, and blocking model are identical to the Spark version — just executing on a different engine.

Performance

Running entity resolution inside Snowflake naturally raised performance questions. Snowpark's execution model and query planning are fundamentally different from Spark's. We did significant performance tuning work after the initial build — and have a separate post on the specific optimizations that took a 12-hour job on an X-Small warehouse down to under 30 minutes.

Try It

If your team is warehouse-native on Snowflake and entity resolution is on your roadmap, we'd like to show you what Zingg Enterprise looks like in your environment. You can explore the open source version to understand how Zingg works, and contact us to discuss the Enterprise Snowflake deployment.
