Prerequisites

Version Compatibility

  • Spark versions: 3.2.x - 3.5.x
  • Java versions: 8 - 20
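
The supported Spark range above can be checked programmatically before starting; a minimal sketch (the helper name and the sample version strings are illustrative, not part of the connector API):

```python
# Illustrative check of a Spark version string against the supported
# 3.2.x - 3.5.x range; helper name and sample versions are made up.
def spark_version_supported(version: str) -> bool:
    major, minor = (int(part) for part in version.split(".")[:2])
    return (3, 2) <= (major, minor) <= (3, 5)

print(spark_version_supported("3.5.1"))  # → True
print(spark_version_supported("3.1.3"))  # → False
```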

Installing PySpark

To use the SparkS3 connector, PySpark must be installed (or injected into sys.path) BEFORE creating the connector instance.

See the installation instructions for more details.
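
A quick way to verify that PySpark is importable before constructing the connector (a sketch; the helper name is illustrative):

```python
# Check that the "pyspark" package can be found on sys.path before
# creating the connector instance; the helper name is illustrative.
import importlib.util

def pyspark_available() -> bool:
    return importlib.util.find_spec("pyspark") is not None

if not pyspark_available():
    print("PySpark not found - install it (e.g. `pip install pyspark`) first")
```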

Connecting to S3

Bucket access style

AWS and some other S3 cloud providers allow bucket access in the domain (virtual-hosted) style only, e.g. https://mybucket.s3provider.com.

Other implementations, like MinIO, by default allow path-style access only, e.g. https://s3provider.com/mybucket (see MINIO_DOMAIN).

Set path_style_access to True or False to choose the style your provider supports.
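
To illustrate the difference, here is how the same bucket resolves under each style (the bucket and endpoint names are examples only):

```python
# Example-only bucket/endpoint names showing the two URL layouts.
bucket = "mybucket"
endpoint = "s3provider.com"

# path_style_access=False: the bucket name is part of the hostname
domain_style = f"https://{bucket}.{endpoint}"
# path_style_access=True: the bucket name is the first path segment
path_style = f"https://{endpoint}/{bucket}"

print(domain_style)  # → https://mybucket.s3provider.com
print(path_style)    # → https://s3provider.com/mybucket
```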

Authentication

Different S3 instances can use different authentication methods, such as:

  • access_key + secret_key (or username + password)
  • access_key + secret_key + session_token

Usually these are passed directly to the SparkS3 constructor:

SparkS3(
    access_key=...,
    secret_key=...,
    session_token=...,
)

But some S3 cloud providers, like AWS, may require a custom credential provider. It can be passed via extra:

SparkS3(
    extra={
        # provider class
        "aws.credentials.provider": "org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider",
        # other options, if needed
        "assumed.role.arn": "arn:aws:iam::90066806600238:role/s3-restricted",
    },
)
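
Options in extra correspond to Hadoop S3A settings; below is a hedged sketch of the presumed mapping to fully-qualified Spark/Hadoop configuration keys (the exact prefixing applied by the connector, and the helper name, are assumptions here):

```python
# Presumed mapping from connector-level extra options to fully-qualified
# Spark/Hadoop S3A keys; the "spark.hadoop.fs.s3a." prefix is an assumption.
def to_spark_conf(extra: dict) -> dict:
    return {f"spark.hadoop.fs.s3a.{key}": value for key, value in extra.items()}

conf = to_spark_conf({
    "aws.credentials.provider": "org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider",
})
print(conf)
```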

See the Hadoop-AWS documentation for the full list of supported options.

Troubleshooting

See the troubleshooting guide.