Zingg 0.6.0: Databricks Unity Catalog Support, Spark 3.5.5, and a Lighter Python Wheel

Engineering
May 8, 2026

Zingg 0.6.0 is out. This is a focused release that reflects where the Spark ecosystem is heading and where Zingg users actually get stuck. Let's go through what changed, what it means for your pipelines, and how to take advantage of it.

Databricks Unity Catalog Support via UCPipe

This is the headline feature for anyone running Zingg on Databricks.

Databricks has been deprecating DBFS (Databricks File System) in favour of Unity Catalog (UC). If your Zingg pipelines read and write data via DBFS paths, that path is closing. Unity Catalog is the new governance layer. It handles access control, lineage, and data discovery across your Databricks workspace, and data lives in Delta tables registered under a three-level namespace: catalog.schema.table.
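
For reference, any UC-registered Delta table resolves by its full three-level name wherever Spark accepts a table identifier. A minimal sketch (the catalog, schema, and table names are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Resolve a Delta table registered in Unity Catalog by catalog.schema.table
df = spark.table("my_catalog.my_schema.customers")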

In Zingg 0.6.0, we've introduced UCPipe — a new Python API wrapper that adds first-class Unity Catalog and Delta table support. Under the hood, this adds a SparkReadStrategyFactory and a corresponding write strategy, so Zingg can resolve and operate on Delta tables registered in Unity Catalog the same way it has always worked with Parquet, CSV, or JDBC sources.

How to use it

from zingg.client import Arguments, ZinggWithSpark
from zingg.pipes import UCPipe

args = Arguments()

inputPipe = UCPipe("my_catalog.my_schema.customers")
outputPipe = UCPipe("my_catalog.my_schema.customers_resolved")

args.setData(inputPipe)
args.setOutput(outputPipe)

UCPipe handles the Delta read/write path internally. No need to set spark.databricks.delta.* configs manually — that's abstracted away.
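
From there, running a phase looks the same as with any other pipe. Continuing the snippet above, a minimal sketch that kicks off a match run, assuming the rest of your Arguments (field definitions, model ID) are already configured:

from zingg.client import ClientOptions

# Pick the phase to run; "findTrainingData", "label", "train", and
# "trainMatch" are selected the same way
options = ClientOptions([ClientOptions.PHASE, "match"])

zingg = ZinggWithSpark(args, options)
zingg.initAndExecute()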

Why this matters

Unity Catalog is not just a storage change. It's a governance change. By reading and writing through UC-registered tables, your Zingg outputs automatically inherit UC's access controls, lineage tracking, and audit logging. Your entity-resolved golden records become first-class citizens in your data catalog — not opaque Parquet dumps in a mount point.

Spark 3.5.5 Upgrade

Zingg 0.6.0 upgrades its Spark dependency to Spark 3.5.5 — a maintenance release bringing stability fixes and performance improvements around shuffle, vectorized query execution, and Parquet I/O.

A few things to check when upgrading (a quick check script follows the list):

  • Cluster version: On Databricks, DBR 15.x maps to Spark 3.5.
  • Python version: Spark 3.5.x requires Python 3.8+.
  • Existing models: Models trained on earlier Spark versions should load fine, but retraining on 3.5.5 is the safe path if you hit issues.
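
A minimal sketch covering the first two checks, runnable in a notebook or at the top of a job:

import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Spark 3.5.x is the supported line for Zingg 0.6.0
print(spark.version)  # expect 3.5.x
assert sys.version_info >= (3, 8), "Spark 3.5.x requires Python 3.8+"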

GraphFrames 0.10.0 Upgrade

Zingg uses GraphFrames for connected component analysis, the step where pairwise match decisions get resolved into entity clusters. The upgrade to GraphFrames 0.10.0 brings Spark 3.5.x compatibility and performance improvements for graph operations. No code changes are needed; the upgrade is transparent.
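
For the curious, the operation Zingg drives underneath looks roughly like this. This is an illustrative sketch of the GraphFrames API, not Zingg's internal code, and the vertex and edge data are made up:

from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.getOrCreate()
# connectedComponents() requires a checkpoint directory
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")

# Vertices are records; edges are pairwise match decisions
v = spark.createDataFrame([(1,), (2,), (3,), (4,)], ["id"])
e = spark.createDataFrame([(1, 2), (2, 3)], ["src", "dst"])

g = GraphFrame(v, e)
clusters = g.connectedComponents()  # assigns one component id per record
clusters.show()  # records 1-3 share a component; record 4 stands alone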

Improved Error Handling

Anyone who has debugged a Zingg job failure will appreciate this one.

Previously, exceptions thrown deep in the pipeline could get wrapped in a generic ZinggException, losing the original message. You'd see ZinggException: An error occurred with nothing useful underneath.

In 0.6.0, the exception handling retains the original error message through the wrapping chain. The root cause surfaces as-is. Zingg also now exits with code 1 on job failure — so Airflow, Databricks Workflows, and CI/CD pipelines can detect failures correctly. Previously a failed run could exit with code 0, silently breaking your pipeline.
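
If you drive Zingg as a shell step yourself, the nonzero exit code is now enough to fail the task. A minimal sketch using Python's subprocess; the script path and config file are placeholders:

import subprocess

# check=True raises CalledProcessError when zingg.sh exits nonzero,
# which 0.6.0 now does on job failure
subprocess.run(
    ["./scripts/zingg.sh", "--phase", "match", "--conf", "config.json"],
    check=True,
)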

There's also a new argument validator that catches misconfigured Arguments objects early — before Spark session startup — so you're not waiting minutes to discover a field name typo.

Lighter Python Wheel

The Zingg Python wheel is now significantly smaller, with unnecessary files removed from both the wheel and the release tarball. Faster cluster cold starts on Databricks, faster CI/CD downloads, less friction in bandwidth-constrained environments.

pip install zingg==0.6.0

No API changes — just less to download.

ClickHouse as a Data Source

0.6.0 adds official documentation for using ClickHouse as a Zingg data source. The integration uses Zingg's existing JDBC pipe mechanism. The new docs cover connection string format, required JAR dependencies, and settings for reading large ClickHouse result sets into Spark efficiently.
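
The shape of the integration, as a hedged sketch using Zingg's generic JDBC pipe (the host, credentials, and table name are placeholders, and the ClickHouse JDBC driver JAR must be installed on the cluster):

from zingg.pipes import Pipe

chPipe = Pipe("customers_ch", "jdbc")
chPipe.addProperty("url", "jdbc:clickhouse://clickhouse-host:8123/default")
chPipe.addProperty("driver", "com.clickhouse.jdbc.ClickHouseDriver")
chPipe.addProperty("dbtable", "customers")
chPipe.addProperty("user", "default")
chPipe.addProperty("password", "")

args.setData(chPipe)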

macOS Setup Improvements

0.6.0 adds a proper macOS directory structure to the release packaging and a dedicated macOS setup guide. It also fixes ZINGG_HOME environment variable paths that were being set incorrectly on macOS, which caused failures when running zingg.sh.

Under the Hood

Several internal refactors shipped in 0.6.0 that don't change user-facing APIs:

  • StandardisePostprocessor replaced by Transform: The post-processing abstraction has been renamed. If you're using StandardisePostprocessor directly in custom code, check the migration notes.
  • Pipe code refactor: PipeUtil and the pipe creation path are cleaner, making custom data source extensions easier.
  • Linker refactored: Improved caching in the Data layer reduces redundant Spark actions during matching.
  • Default argument loader: Easier programmatic construction of Arguments objects without specifying every field from scratch.

Upgrading

pip install zingg==0.6.0

For Databricks users: update your cluster library to 0.6.0 and switch DBFS-based pipes to UCPipe if you're on a UC-enabled workspace. The Databricks UC setup docs cover the full configuration.
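
As a before-and-after sketch of that switch (the DBFS path and table name are illustrative):

from zingg.client import Arguments
from zingg.pipes import CsvPipe, UCPipe

args = Arguments()

# Before: a CSV pipe reading from a DBFS mount
# inputPipe = CsvPipe("customers", "dbfs:/mnt/raw/customers.csv")

# After: the same data as a UC-registered Delta table
inputPipe = UCPipe("my_catalog.my_schema.customers")
args.setData(inputPipe)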

Questions or feedback? Open a discussion on GitHub or find us in the community.


Zingg is open-source ML-based entity resolution, built on Apache Spark. Star us on GitHub if this is useful to you.
