
0.15.0 (2025-12-08)

Removals

Drop the Teradata connector. It is no longer used in our company and never had proper integration tests.

Breaking Changes

Add Iceberg(catalog=..., warehouse=...) mandatory options (#391, #393, #394, #397, #399, #413).

In 0.14.0 we implemented a very basic Iceberg connector configured via a dictionary:

iceberg = Iceberg(
    catalog_name="mycatalog",
    extra={
        "type": "rest",
        "uri": "https://catalog.company.com/rest",
        "rest.auth.type": "oauth2",
        "token": "jwt_token",
        "warehouse": "s3a://mybucket/",
        "io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        "s3.endpoint": "http://localhost:9010",
        "s3.access-key-id": "access_key",
        "s3.secret-access-key": "secret_key",
        "s3.path-style-access": "true",
        "client.region": "us-east-1",
    },
    spark=spark,
)

Now we've implemented wrapper classes that allow configuring various Iceberg catalogs:

REST Catalog with Bearer token auth
iceberg = Iceberg(
    catalog_name="mycatalog",
    catalog=Iceberg.RESTCatalog(
        url="https://catalog.company.com/rest",
        auth=Iceberg.RESTCatalog.BearerAuth(
            access_token="jwt_token",
        ),
    ),
    warehouse=...,
)
REST Catalog with OAuth2 ClientCredentials auth
iceberg = Iceberg(
    catalog_name="mycatalog",
    catalog=Iceberg.RESTCatalog(
        url="https://catalog.company.com/rest",
        auth=Iceberg.RESTCatalog.OAuth2ClientCredentials(
            client_id="my_client",
            client_secret="my_secret",
            oauth2_token_endpoint="http://keycloak.company.com/realms/my-realm/protocol/openid-connect/token",
            scopes=["catalog"],
        ),
    ),
    warehouse=...,
    spark=spark,
)

And also a set of classes for configuring warehouses:

S3 warehouse
iceberg = Iceberg(
    catalog_name="mycatalog",
    catalog=...,
    # using Iceberg AWS integration
    warehouse=Iceberg.S3Warehouse(
        path="/",
        bucket="mybucket",
        host="localhost",
        port=9010,
        protocol="http",
        path_style_access=True,
        access_key="access_key",
        secret_key="secret_key",
        region="us-east-1",
    ),
    spark=spark,
)
For Lakekeeper, Polaris, Gravitino
iceberg = Iceberg(
    catalog_name="mycatalog",
    catalog=...,
    # Delegate warehouse config to REST Catalog
    warehouse=Iceberg.DelegatedWarehouse(
        warehouse="some-warehouse",
        access_delegation="vended-credentials",
    ),
    spark=spark,
)
HDFS warehouse
iceberg = Iceberg(
    catalog_name="mycatalog",
    # store both data and metadata on HadoopFilesystem
    catalog=Iceberg.FilesystemCatalog(),
    warehouse=Iceberg.FilesystemWarehouse(
        path="/some/warehouse",
        connection=SparkHDFS(cluster="dwh"),
    ),
    spark=spark,
)

Using classes instead of dicts brings IDE autocompletion, and allows reusing the same catalog connection options with multiple warehouses.
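For example, one catalog object can be shared between Iceberg connections pointing at different warehouses (a sketch based on the classes above; the warehouse configs are elided):

```python
# one catalog definition, reused by several Iceberg connections
catalog = Iceberg.RESTCatalog(
    url="https://catalog.company.com/rest",
    auth=Iceberg.RESTCatalog.BearerAuth(access_token="jwt_token"),
)

iceberg_s3 = Iceberg(
    catalog_name="mycatalog",
    catalog=catalog,
    warehouse=...,  # e.g. Iceberg.S3Warehouse(...)
    spark=spark,
)
iceberg_delegated = Iceberg(
    catalog_name="mycatalog",
    catalog=catalog,
    warehouse=...,  # e.g. Iceberg.DelegatedWarehouse(...)
    spark=spark,
)
```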

Features

  • Added support for Iceberg.WriteOptions(table_properties={}) (#401).

    In particular, the table's "location": "/some/warehouse/mytable" can now be set.
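    A sketch of how this might look with onETL's DBWriter (the connection object, target table, and DataFrame here are hypothetical):

```python
from onetl.db import DBWriter

# hypothetical usage: pin the storage location of the target table
writer = DBWriter(
    connection=iceberg,  # an Iceberg connection configured as in the examples above
    target="mydb.mytable",
    options=Iceberg.WriteOptions(
        table_properties={"location": "/some/warehouse/mytable"},
    ),
)
writer.run(df)  # df is an existing Spark DataFrame
```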

  • Added support for Hive.WriteOptions(table_properties={}) (#412).

    In particular, the table's "auto.purge": "true" can now be set.
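    The Hive variant looks analogous (again a sketch; the connection object, target table, and DataFrame are hypothetical):

```python
from onetl.db import DBWriter

# hypothetical usage: skip moving dropped data to the HDFS trash
writer = DBWriter(
    connection=hive,  # a configured Hive connection
    target="mydb.mytable",
    options=Hive.WriteOptions(
        table_properties={"auto.purge": "true"},
    ),
)
writer.run(df)  # df is an existing Spark DataFrame
```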

Improvements

  • Allow setting SparkS3(path_style_access=True) instead of SparkS3(extra={"path.style.access": True}) (#392).

    This change improves IDE autocompletion and makes it more explicit that the parameter is important for the connector's functionality.
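    A sketch of the new style (host, port, and credentials are placeholders):

```python
from onetl.connection import SparkS3

spark_s3 = SparkS3(
    host="localhost",
    port=9010,
    protocol="http",
    bucket="mybucket",
    access_key="access_key",
    secret_key="secret_key",
    path_style_access=True,  # instead of extra={"path.style.access": True}
    spark=spark,
)
```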

  • Add a runtime warning about missing S3(region=...) and SparkS3(region=...) params (#418).

    It is recommended to explicitly pass this parameter to avoid potential access errors.
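    For example, with the S3 file connection (host, bucket, and credentials are placeholders):

```python
from onetl.connection import S3

s3 = S3(
    host="s3.company.com",
    bucket="mybucket",
    access_key="access_key",
    secret_key="secret_key",
    region="us-east-1",  # pass explicitly to avoid the warning and potential access errors
)
```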

Thanks to @yabel

Dependencies

  • Update JDBC connectors:

    • MySQL 9.4.0 → 9.5.0
    • MSSQL 13.2.0 → 13.2.1
    • Oracle 23.9.0.25.07 → 23.26.0.0.0
    • Postgres 42.7.7 → 42.7.8
  • Added support for Clickhouse.get_packages(package_version="0.9.3") (#407).

    Versions in the range 0.8.0–0.9.2 are not supported due to issue #2625.

    Version 0.9.3+ is still not the default because of various compatibility and performance issues. Use it at your own risk.
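    A sketch of opting in to the newer driver (the exact Maven coordinates returned depend on your Spark/Scala versions):

```python
from pyspark.sql import SparkSession

from onetl.connection import Clickhouse

# opt in to the newer driver explicitly; omit package_version to keep the default
maven_packages = Clickhouse.get_packages(package_version="0.9.3")

spark = (
    SparkSession.builder
    .config("spark.jars.packages", ",".join(maven_packages))
    .getOrCreate()
)
```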

Documentation

  • Document using the Greenplum connector with Spark on master=k8s.