0.12.0 (2024-09-03)
Breaking Changes
- Change connection URL used for generating HWM names of S3 and Samba sources:
  - `smb://host:port` → `smb://host:port/share`
  - `s3://host:port` → `s3://host:port/bucket` (#304)
- Update DB connectors/drivers to latest versions:
  - Clickhouse `0.6.0-patch5` → `0.6.5`
  - MongoDB `10.3.0` → `10.4.0`
  - MSSQL `12.6.2` → `12.8.1`
  - MySQL `8.4.0` → `9.0.0`
  - Oracle `23.4.0.24.05` → `23.5.0.24.07`
  - Postgres `42.7.3` → `42.7.4`
- Update `Excel` package from `0.20.3` to `0.20.4`, to include Spark 3.5.1 support. (#306)
Features
- Add support for specifying file formats (`ORC`, `Parquet`, `CSV`, etc.) in `Hive.WriteOptions(format=...)` (#292):
  `Hive.WriteOptions(format=ORC(compression="snappy"))`
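  In context, the new option might be used as in the sketch below. This is a minimal illustration, not a definitive recipe: it assumes onETL's import paths `onetl.connection.Hive` and `onetl.file.format.ORC`, an already-created `SparkSession` named `spark` with Hive support, and a hypothetical cluster name and table.

  ```python
  from onetl.connection import Hive
  from onetl.db import DBWriter
  from onetl.file.format import ORC

  # `spark` is assumed to be an existing SparkSession built with Hive support;
  # the cluster name "rnd-dwh" and table name below are hypothetical.
  hive = Hive(cluster="rnd-dwh", spark=spark)

  # New in 0.12.0: pass a file format object (with format-specific options,
  # e.g. compression) instead of a plain format string.
  writer = DBWriter(
      connection=hive,
      target="schema.table",
      options=Hive.WriteOptions(format=ORC(compression="snappy")),
  )
  writer.run(df)  # df is an existing Spark DataFrame
  ```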
- Collect Spark execution metrics in the following methods, and log them in DEBUG mode:
  - `DBWriter.run()`
  - `FileDFWriter.run()`
  - `Hive.sql()`
  - `Hive.execute()`

  This is implemented using a custom `SparkListener` which wraps the entire method call and then reports the collected metrics. Due to Spark's architecture these metrics may sometimes be missing, so they are not a reliable source of information. That's why they are only printed to logs in DEBUG mode, and are not returned as the method call result. (#303)
- Generate a default `jobDescription` based on the currently executed method. Examples:
  - `DBWriter.run(schema.table) -> Postgres[host:5432/database]`
  - `MongoDB[localhost:27017/admin] -> DBReader.has_data(mycollection)`
  - `Hive[cluster].execute()`

  If the user already set a custom `jobDescription`, it is left intact. (#304)
- Add `log.info` about JDBC dialect usage (#305):
  `|MySQL| Detected dialect: 'org.apache.spark.sql.jdbc.MySQLDialect'`
- Log estimated size of the in-memory dataframe created by `JDBC.fetch` and `JDBC.execute` methods. (#303)
Bug Fixes
- Fix passing `Greenplum(extra={"options": ...})` during read/write operations. (#308)
- Do not raise an exception if a yield-based hook has something after its first (and only) `yield`.