After we built native Snowflake entity resolution using Snowpark, we had a working system. What we didn't yet have was a fast one. Our baseline on a 500,000-record dataset, running on an X-Small Snowflake warehouse, was roughly 12 hours. That's not a number you can put in front of customers.
This post walks through the specific performance bottlenecks we found and how we addressed them. If you're building on Snowpark, some of these lessons apply well beyond entity resolution.
Before getting into Snowpark specifics, it helps to understand why entity resolution is hard to scale.
The naive approach — comparing every record to every other record — grows quadratically. Double your dataset and you quadruple your comparisons. For a million records, you're looking at roughly 500 billion pairs to evaluate. That's not a compute problem; it's a physics problem.
Zingg solves this with a learned blocking model that indexes near-similar records, reducing the comparison space to typically 0.05–1% of all possible pairs. But even 1% of a large dataset is still a lot of work, and how efficiently you execute that work matters enormously.
As a rough rule: a 5× increase in input records with the same schema leads to roughly a 25× increase in comparison complexity. Performance isn't linear.
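To make that concrete, here's the back-of-the-envelope arithmetic in plain Python, using the 1% figure above as the optimistic end of blocking:

```python
def candidate_pairs(n: int) -> int:
    """Unique record pairs for a brute-force comparison: n choose 2."""
    return n * (n - 1) // 2

for n in (500_000, 1_000_000, 2_500_000):
    total = candidate_pairs(n)
    blocked = total // 100  # blocking at the 1% end of the 0.05-1% range
    print(f"{n:>9,} records: {total:>17,} pairs ({blocked:,} even after 1% blocking)")

# Scaling from 500K to 2.5M records (5x the data):
print(candidate_pairs(2_500_000) / candidate_pairs(500_000))  # ~25x the comparisons
```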
We started with 500,000 records on an X-Small warehouse: approximately 12 hours. Not great, but the job completed without errors. We used this as our baseline.
The next test was a 2.5M-record dataset, a 5× increase. Given the scaling behavior described above, we expected a much longer run, somewhere north of 18 hours. What we got instead was a session timeout after roughly 8 hours.
Snowpark's client process, if run interactively, can time out during long-running jobs. The fix is simple: run the Snowpark client as a background process or with nohup. Once we did that, the timeouts went away, but the job was still running after more than 24 hours. Obviously not acceptable.
So we killed it after identifying the first set of bottlenecks and started optimizing.
The most impactful optimization we found was around intermediate DataFrame caching.
During indexing and pair generation, we build a blocking index and then self-join the indexed DataFrame to generate candidate pairs. The expensive part is building the index — computing the blocking keys from the enriched record data.
What we discovered: if you compute a Snowpark DataFrame and then use that same DataFrame instance in a self-join, Snowpark does not automatically deduplicate the computation. The index computations are repeated for both sides of the join. Cloning the DataFrame, as the documentation suggested, didn't help.
The fix was explicit caching. When we cached the intermediate DataFrame before the self-join, Snowpark materialized it once into a temporary table, and both sides of the join read from that materialized result. One expensive computation instead of two.
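Here's a minimal sketch of the pattern in Snowpark Python. It assumes an existing `session`, and the table, column, and blocking-key expressions are illustrative stand-ins rather than Zingg's actual schema:

```python
from copy import copy

from snowflake.snowpark.functions import col, substring, upper

# Assumes an existing Snowpark `session`; table and column names are illustrative.
enriched = session.table("ENRICHED_RECORDS")

# Expensive step: compute the blocking keys. The real keys come from Zingg's
# learned blocking model; a crude prefix key stands in for it here.
indexed = enriched.with_column("BLOCK_KEY", upper(substring(col("LNAME"), 1, 3)))

# Materialize the indexed DataFrame once into a temporary table. Without this,
# each side of the self-join below re-runs the blocking computation.
indexed = indexed.cache_result()

# Snowpark self-joins need a separate DataFrame object for the right side;
# the copy alone doesn't avoid recomputation, but both sides now read the temp table.
right = copy(indexed)

pairs = indexed.join(
    right,
    (indexed["BLOCK_KEY"] == right["BLOCK_KEY"]) & (indexed["ID"] < right["ID"]),
)
```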
The lesson is generalizable: in Snowpark, if a costly computation result is used more than once — especially in joins — cache it explicitly. The overhead of creating a temporary table is almost always smaller than recomputing the result twice.
Zingg does custom feature engineering before comparison — normalizing strings, computing phonetic codes, handling domain-specific transformations. These run as Python UDFs inside Snowpark.
Python UDFs in Snowpark have per-row invocation overhead that adds up at scale. We profiled which transformations were being called on every candidate pair versus every source record, and restructured the pipeline to apply row-level transformations once during preprocessing rather than once per pair during comparison.
For a dataset where a single record might appear in thousands of candidate pairs, this made a significant difference.
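As a simplified illustration of that restructuring (the UDF, table, and column names here are hypothetical, and an active Snowpark session is assumed):

```python
import re

from snowflake.snowpark.functions import col, udf
from snowflake.snowpark.types import StringType

# Hypothetical row-level transformation; the real pipeline has several such UDFs
# (normalization, phonetic codes, domain-specific cleanup).
@udf(return_type=StringType(), input_types=[StringType()])
def normalize_name(s: str) -> str:
    return re.sub(r"[^a-z ]", "", s.lower()).strip() if s else ""

# Apply it once per source record during preprocessing, so pair comparison only
# reads the precomputed NAME_NORM column. A record that appears in thousands of
# candidate pairs pays the Python invocation cost once, not thousands of times.
records = session.table("SOURCE_RECORDS").with_column("NAME_NORM", normalize_name(col("NAME")))
```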
Snowpark's automatic partitioning behavior doesn't always match what entity resolution workloads need. We were seeing skew: partitions holding heavily populated blocking buckets got far more work than others, while partitions with small buckets finished quickly and sat idle.
We added explicit repartitioning steps at key points in the pipeline — after blocking, before the comparison phase — to redistribute work more evenly across Snowflake's parallel execution engine. The right partition count depends on your data's blocking distribution, but monitoring query profiles in Snowflake's UI made the skew visible and the improvement measurable.
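One lightweight way to quantify that skew from the Snowpark side, assuming a blocking-key column like the illustrative `BLOCK_KEY` above, is to look at the bucket-size distribution directly:

```python
from snowflake.snowpark.functions import col

# How many records fall into each blocking bucket? The largest buckets are
# where the pairwise comparison work concentrates.
bucket_sizes = (
    indexed.group_by("BLOCK_KEY")  # BLOCK_KEY as in the illustrative sketch above
    .count()                       # grouped count comes back as a COUNT column
    .sort(col("COUNT").desc())
)
bucket_sizes.show(20)
```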
In our initial architecture, we were more conservative than necessary about persisting intermediate results — partly as a debugging aid, partly because we weren't sure which intermediate states we'd need to inspect.
In production, many of those intermediate writes were unnecessary. Reducing the number of table writes cut both execution time and storage overhead. We kept the writes that were genuinely needed (the cached DataFrames, the trained model) and eliminated the rest.
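In practice that came down to a simple split, sketched here with illustrative names: intermediates that are only reused within the run stay in session-scoped temporary tables via `cache_result()`, and only outputs needed after the run get a permanent write.

```python
# Reused within the run only: materialize to a session-scoped temp table.
candidate_pairs = candidate_pairs.cache_result()

# Needed after the run (illustrative table name): an explicit, permanent write.
scored_matches.write.save_as_table("ZINGG_MATCH_OUTPUT", mode="overwrite")
```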
After applying these optimizations, our 2.5M-record test on the same X-Small warehouse went from 24+ hours (before we killed the job) to under 30 minutes.
For the original 500,000-record baseline: from 12 hours to under 5 minutes.
These numbers are on a deliberately constrained warehouse size. Scaling up to a larger warehouse provides additional gains — but our philosophy at Zingg is that the customer shouldn't pay a compute bill for inefficient code. The optimizations above are about doing the work right, not throwing hardware at it.
One thing this work reinforced: Snowpark and Spark have genuinely different performance characteristics, not just different APIs. Optimization intuitions from Spark don't always transfer. Caching behavior, join execution, UDF overhead, partition management — all of these work differently.
If you're porting a Spark workload to Snowpark, plan for a performance tuning phase. The logical structure of your computation may translate cleanly, but the performance profile will not.