0.15.0 (2025-12-08)
Removals
Drop the Teradata connector. It is no longer used in our company and never had proper integration tests.
Breaking Changes
Add mandatory `Iceberg(catalog=..., warehouse=...)` options (#391, #393, #394, #397, #399, #413).
In 0.14.0 we implemented a very basic Iceberg connector configured via a dictionary:
```python
iceberg = Iceberg(
    catalog_name="mycatalog",
    extra={
        "type": "rest",
        "uri": "https://catalog.company.com/rest",
        "rest.auth.type": "oauth2",
        "token": "jwt_token",
        "warehouse": "s3a://mybucket/",
        "io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        "s3.endpoint": "http://localhost:9010",
        "s3.access-key-id": "access_key",
        "s3.secret-access-key": "secret_key",
        "s3.path-style-access": "true",
        "client.region": "us-east-1",
    },
    spark=spark,
)
```
Now there are wrapper classes for configuring various Iceberg catalogs:
```python
iceberg = Iceberg(
    catalog_name="mycatalog",
    catalog=Iceberg.RESTCatalog(
        url="https://catalog.company.com/rest",
        auth=Iceberg.RESTCatalog.BearerAuth(
            access_token="jwt_token",
        ),
    ),
    warehouse=...,
)

iceberg = Iceberg(
    catalog_name="mycatalog",
    catalog=Iceberg.RESTCatalog(
        url="https://catalog.company.com/rest",
        auth=Iceberg.RESTCatalog.OAuth2ClientCredentials(
            client_id="my_client",
            client_secret="my_secret",
            oauth2_token_endpoint="http://keycloak.company.com/realms/my-realm/protocol/openid-connect/token",
            scopes=["catalog"],
        ),
    ),
    warehouse=...,
    spark=spark,
)
```
And also a set of classes for configuring warehouses:
```python
iceberg = Iceberg(
    catalog_name="mycatalog",
    catalog=...,
    # using Iceberg AWS integration
    warehouse=Iceberg.S3Warehouse(
        path="/",
        bucket="mybucket",
        host="localhost",
        port=9010,
        protocol="http",
        path_style_access=True,
        access_key="access_key",
        secret_key="secret_key",
        region="us-east-1",
    ),
    spark=spark,
)

iceberg = Iceberg(
    catalog_name="mycatalog",
    catalog=...,
    # Delegate warehouse config to REST Catalog
    warehouse=Iceberg.DelegatedWarehouse(
        warehouse="some-warehouse",
        access_delegation="vended-credentials",
    ),
    spark=spark,
)

iceberg = Iceberg(
    catalog_name="mycatalog",
    # store both data and metadata on HadoopFilesystem
    catalog=Iceberg.FilesystemCatalog(),
    warehouse=Iceberg.FilesystemWarehouse(
        path="/some/warehouse",
        connection=SparkHDFS(cluster="dwh"),
    ),
    spark=spark,
)
```
Using classes instead of dicts enables IDE autocompletion and allows reusing the same catalog connection options with multiple warehouses.
Features
- Added support for `Iceberg.WriteOptions(table_properties={})` (#401). In particular, the table's `"location": "/some/warehouse/mytable"` can now be set.
- Added support for `Hive.WriteOptions(table_properties={})` (#412). In particular, the table's `"auto.purge": "true"` can now be set.
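For example, table properties could be passed while writing a table like this (a sketch only; the `DBWriter` usage and the target table name are illustrative, not taken from the changelog):

```python
# Sketch: passing table_properties when writing a table.
# "location" support for Iceberg tables was added in #401.
from onetl.db import DBWriter

writer = DBWriter(
    connection=iceberg,  # an Iceberg connection configured as shown above
    target="myschema.mytable",  # hypothetical table name
    options=Iceberg.WriteOptions(
        table_properties={"location": "/some/warehouse/mytable"},
    ),
)
writer.run(df)
```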
Improvements
- Allow setting `SparkS3(path_style_access=True)` instead of `SparkS3(extra={"path.style.access": True})` (#392). This improves IDE autocompletion and makes it more explicit that the parameter is important for the connector's functionality.
- Add a runtime warning about missing `S3(region=...)` and `SparkS3(region=...)` params (#418). It is recommended to pass this parameter explicitly to avoid potential access errors. Thanks to @yabel.
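Both improvements combined could look like this (a sketch; hostnames and credentials are placeholders, and the remaining connection parameters follow the examples elsewhere in this release):

```python
# Sketch: the new keyword arguments replacing dict-based "extra" options.
spark_s3 = SparkS3(
    host="localhost",
    port=9010,
    protocol="http",
    bucket="mybucket",
    access_key="access_key",
    secret_key="secret_key",
    path_style_access=True,  # previously extra={"path.style.access": True}
    region="us-east-1",      # pass explicitly to avoid the new runtime warning
    spark=spark,
)
```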
Dependencies
- Update JDBC connectors:
  - MySQL 9.4.0 → 9.5.0
  - MSSQL 13.2.0 → 13.2.1
  - Oracle 23.9.0.25.07 → 23.26.0.0.0
  - Postgres 42.7.7 → 42.7.8
- Added support for `Clickhouse.get_packages(package_version="0.9.3")` (#407). Versions in the range 0.8.0-0.9.2 are not supported due to issue #2625. Versions 0.9.3+ are still not the default because of various compatibility and performance issues; use them at your own risk.
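Opting into the newer driver could look like this (a sketch; the Spark session setup is illustrative, and `get_packages` is assumed to return a list of Maven coordinates to be joined into `spark.jars.packages`):

```python
# Sketch: requesting the newer Clickhouse JDBC driver when building
# the Spark session.
from pyspark.sql import SparkSession
from onetl.connection import Clickhouse

packages = Clickhouse.get_packages(package_version="0.9.3")

spark = (
    SparkSession.builder
    .config("spark.jars.packages", ",".join(packages))
    .getOrCreate()
)
```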
Documentation
- Document using the Greenplum connector with Spark on `master=k8s`.