0.12.0 (2024-09-03)
Breaking Changes
- Change connection URL used for generating HWM names of S3 and Samba sources:
  - `smb://host:port` → `smb://host:port/share`
  - `s3://host:port` → `s3://host:port/bucket` (#304)
- Update DB connectors/drivers to latest versions:
  - Clickhouse `0.6.0-patch5` → `0.6.5`
  - MongoDB `10.3.0` → `10.4.0`
  - MSSQL `12.6.2` → `12.8.1`
  - MySQL `8.4.0` → `9.0.0`
  - Oracle `23.4.0.24.05` → `23.5.0.24.07`
  - Postgres `42.7.3` → `42.7.4`
- Update `Excel` package from `0.20.3` to `0.20.4`, to include Spark 3.5.1 support. (#306)
Features
- Add support for specifying file formats (`ORC`, `Parquet`, `CSV`, etc.) in `Hive.WriteOptions(format=...)` (#292):
  `Hive.WriteOptions(format=ORC(compression="snappy"))`
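  In context, the new option might be used as in the sketch below. This is a minimal illustration, not a definitive recipe: it assumes onETL's import paths `onetl.connection.Hive` and `onetl.file.format.ORC`, an already-created `SparkSession` named `spark` with Hive support, and a hypothetical cluster name and table.

  ```python
  from onetl.connection import Hive
  from onetl.db import DBWriter
  from onetl.file.format import ORC

  # `spark` is assumed to be an existing SparkSession built with Hive support;
  # the cluster name "rnd-dwh" and table name below are hypothetical.
  hive = Hive(cluster="rnd-dwh", spark=spark)

  # New in 0.12.0: pass a file format object (with format-specific options,
  # e.g. compression) instead of a plain format string.
  writer = DBWriter(
      connection=hive,
      target="schema.table",
      options=Hive.WriteOptions(format=ORC(compression="snappy")),
  )
  writer.run(df)  # df is an existing Spark DataFrame
  ```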
- Collect Spark execution metrics in the following methods, and log them in DEBUG mode:
  - `DBWriter.run()`
  - `FileDFWriter.run()`
  - `Hive.sql()`
  - `Hive.execute()`

  This is implemented using a custom `SparkListener` which wraps the entire method call and then reports the collected metrics. Due to Spark's architecture these metrics may sometimes be missing, so they are not a reliable source of information. That's why they are only printed to logs in DEBUG mode, and are not returned as the method call result. (#303)
- Generate a default `jobDescription` based on the currently executed method. Examples:
  - `DBWriter.run(schema.table) -> Postgres[host:5432/database]`
  - `MongoDB[localhost:27017/admin] -> DBReader.has_data(mycollection)`
  - `Hive[cluster].execute()`

  If the user already set a custom `jobDescription`, it is left intact. (#304)
- Add `log.info` about JDBC dialect usage (#305):
  `|MySQL| Detected dialect: 'org.apache.spark.sql.jdbc.MySQLDialect'`
- Log estimated size of the in-memory dataframe created by `JDBC.fetch` and `JDBC.execute` methods. (#303)
Bug Fixes
- Fix passing `Greenplum(extra={"options": ...})` during read/write operations. (#308)
- Do not raise an exception if a yield-based hook has something after its first (and only) `yield`.