Zingg 0.6.0 is out. This is a focused release that reflects where the Spark ecosystem is heading and where Zingg users actually get stuck. Let's go through what changed, what it means for your pipelines, and how to take advantage of it.
This is the headline feature for anyone running Zingg on Databricks.
Databricks has been deprecating DBFS (Databricks File System) in favour of Unity Catalog (UC). If your Zingg pipelines read and write data via DBFS paths, that path is closing. Unity Catalog is the new governance layer — it handles access control, lineage, and data discovery across your Databricks workspace, and it reads/writes via Delta tables registered in a three-level namespace: catalog.schema.table.
In Zingg 0.6.0, we've introduced UCPipe — a new Python API wrapper that adds first-class Unity Catalog and Delta table support. Under the hood, this adds a SparkReadStrategyFactory and a corresponding write strategy, so Zingg can resolve and operate on Delta tables registered in Unity Catalog the same way it has always worked with Parquet, CSV, or JDBC sources.
```python
from zingg.client import ZinggWithSpark
from zingg.pipes import UCPipe

# Input and output both resolve through Unity Catalog's
# three-level namespace: catalog.schema.table
inputPipe = UCPipe("my_catalog.my_schema.customers")
outputPipe = UCPipe("my_catalog.my_schema.customers_resolved")

args.setData(inputPipe)
args.setOutput(outputPipe)
```

UCPipe handles the Delta read/write path internally. No need to set spark.databricks.delta.* configs manually — that's abstracted away.
Unity Catalog is not just a storage change. It's a governance change. By reading and writing through UC-registered tables, your Zingg outputs automatically inherit UC's access controls, lineage tracking, and audit logging. Your entity-resolved golden records become first-class citizens in your data catalog — not opaque Parquet dumps in a mount point.
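Once the resolved table lives in Unity Catalog, it can be governed like any other UC asset. An illustrative example (the catalog, schema, and group names here are placeholders, and a UC-enabled workspace is assumed):

```sql
-- Grant analysts read access to the entity-resolved output table
GRANT SELECT ON TABLE my_catalog.my_schema.customers_resolved TO `analysts`;
```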
Zingg 0.6.0 upgrades its Spark dependency to Spark 3.5.5 — a maintenance release bringing stability fixes and performance improvements around shuffle, vectorized query execution, and Parquet I/O.
A few things to check when upgrading: confirm that your cluster runtime or local Spark installation is on 3.5.x, and that any custom JARs on your classpath are built against Spark 3.5.
Zingg uses GraphFrames for connected component analysis — the step where pairwise match decisions get resolved into entity clusters. The upgrade to Graphframes 0.10.0 brings Spark 3.5.x compatibility and performance improvements for graph operations. No code changes needed — this is transparent.
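Conceptually, connected components takes pairwise match decisions and transitively groups them into entity clusters. GraphFrames does this distributed on Spark; the single-machine sketch below (a plain union-find, not Zingg's actual implementation) shows the same idea:

```python
def connected_components(pairs):
    """Group record ids into clusters given pairwise match decisions (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b in pairs:
        union(a, b)

    clusters = {}
    for node in parent:
        clusters.setdefault(find(node), set()).add(node)
    return list(clusters.values())

# Matches (1,2) and (2,3) chain into one cluster; (4,5) forms another.
matches = [(1, 2), (2, 3), (4, 5)]
print(connected_components(matches))  # [{1, 2, 3}, {4, 5}]
```

Note how records 1 and 3 end up in the same cluster even though they were never directly compared — that transitivity is exactly what the connected components step contributes.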
Anyone who has debugged a Zingg job failure will appreciate this one.
Previously, exceptions thrown deep in the pipeline could get wrapped in generic ZinggException wrappers, losing the original message. You'd see ZinggException: An error occurred with nothing useful underneath.
In 0.6.0, the exception handling retains the original error message through the wrapping chain. The root cause surfaces as-is. Zingg also now exits with code 1 on job failure — so Airflow, Databricks Workflows, and CI/CD pipelines can detect failures correctly. Previously a failed run could exit with code 0, silently breaking your pipeline.
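Because a failed run now reliably returns a non-zero exit code, orchestration wrappers can trust the return code. A minimal sketch of the pattern, using a stand-in command rather than a real zingg.sh invocation:

```python
import subprocess

# Stand-in for a real run, e.g.:
# subprocess.run(["zingg.sh", "--phase", "match", "--conf", "config.json"])
result = subprocess.run(["bash", "-c", "exit 1"])

# Prior to 0.6.0 a failed job could return 0 here; now non-zero means failure.
if result.returncode != 0:
    print(f"Zingg job failed with exit code {result.returncode}")
```

Airflow's BashOperator and Databricks Workflows apply the same check automatically, which is why the exit-code fix matters for scheduled pipelines.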
There's also a new argument validator that catches misconfigured Arguments objects early — before Spark session startup — so you're not waiting minutes to discover a field name typo.
The Zingg Python wheel is now significantly smaller, with unnecessary files removed from both the wheel and the release tarball. Faster cluster cold starts on Databricks, faster CI/CD downloads, less friction in bandwidth-constrained environments.
```shell
pip install zingg==0.6.0
```

No API changes — just less to download.
0.6.0 adds official documentation for using ClickHouse as a Zingg data source. The integration uses Zingg's existing JDBC pipe mechanism. The new docs cover connection string format, required JAR dependencies, and settings for reading large ClickHouse result sets into Spark efficiently.
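The docs cover the details; as a rough sketch, a ClickHouse source wired through the existing JDBC pipe might look like this in a Zingg JSON config (hostname, table name, and fetch size below are placeholders, and the official ClickHouse JDBC driver class is assumed):

```json
"data": [{
    "name": "customers_ch",
    "format": "jdbc",
    "props": {
        "url": "jdbc:clickhouse://clickhouse-host:8123/default",
        "dbtable": "customers",
        "driver": "com.clickhouse.jdbc.ClickHouseDriver",
        "fetchsize": "100000"
    }
}]
```

The `fetchsize` property is the standard Spark JDBC option for batching large result sets; see the new docs for the values that work well with ClickHouse.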
0.6.0 adds proper macOS directory structure to the release packaging and a dedicated macOS setup guide. Environment variable paths for ZINGG_HOME that were being set incorrectly on macOS — causing failures when running zingg.sh — are now fixed.
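With the corrected packaging, a typical macOS setup amounts to pointing ZINGG_HOME at the unpacked release (the path below is illustrative — use wherever you extracted the tarball):

```shell
# Point ZINGG_HOME at the unpacked 0.6.0 release (illustrative path)
export ZINGG_HOME="$HOME/zingg-0.6.0"
export PATH="$ZINGG_HOME/scripts:$PATH"
```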
Several internal refactors shipped in 0.6.0 that don't change user-facing APIs:
- StandardisePostprocessor has been refactored — if you use StandardisePostprocessor directly in custom code, check the migration notes.
- PipeUtil and the pipe creation path are cleaner, making custom data source extensions easier.
- The data layer reduces redundant Spark actions during matching.
- It's now easier to build Arguments objects without specifying every field from scratch.

Upgrading is a one-liner:

```shell
pip install zingg==0.6.0
```

For Databricks users: update your cluster library to 0.6.0 and switch DBFS-based pipes to UCPipe if you're on a UC-enabled workspace. The Databricks UC setup docs cover the full configuration.
Questions or feedback? Open a discussion on GitHub or find us in the community.
Zingg is open-source ML-based entity resolution, built on Apache Spark. Star us on GitHub if this is useful to you.