Prerequisites¶
Version Compatibility¶
- Spark versions: 3.2.x - 3.5.x
- Java versions: 8 - 20
Installing PySpark¶
To use the SparkS3 connector, you should have PySpark installed (or injected into sys.path)
BEFORE creating the connector instance.
See the installation instructions for more details.
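If PySpark may be absent, checking availability up front produces a clearer error than a late import failure inside the connector. A minimal sketch (the pyspark_available helper is illustrative, not part of the connector API):

```python
import importlib.util


def pyspark_available() -> bool:
    # True if PySpark is importable, whether installed via pip
    # or injected into sys.path beforehand.
    return importlib.util.find_spec("pyspark") is not None


# Example: fail fast with a clear message before creating the connector
# if not pyspark_available():
#     raise ImportError("Install PySpark (or add it to sys.path) before creating SparkS3")
```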
Connecting to S3¶
Bucket access style¶
AWS and some other S3 cloud providers allow bucket access using domain style only, e.g. https://mybucket.s3provider.com.
Other implementations, like MinIO, by default allow path style access only, e.g. https://s3provider.com/mybucket
(see MINIO_DOMAIN).
Set path_style_access to True or False to choose the preferred style.
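Under the hood, connectors built on Hadoop-AWS translate this choice into the S3A property fs.s3a.path.style.access. A minimal sketch of that translation (the s3a_path_style_option helper is hypothetical; the property name is from the Hadoop-AWS documentation):

```python
def s3a_path_style_option(path_style_access: bool) -> dict:
    # Hadoop-AWS S3A property controlling bucket access style:
    #   "true"  -> path style,   e.g. https://s3provider.com/mybucket
    #   "false" -> domain style, e.g. https://mybucket.s3provider.com
    return {"fs.s3a.path.style.access": str(path_style_access).lower()}


# MinIO-style deployment: prefer path style access
print(s3a_path_style_option(True))  # {'fs.s3a.path.style.access': 'true'}
```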
Authentication¶
Different S3 instances can use different authentication methods, like:
- access_key + secret_key (or username + password)
- access_key + secret_key + session_token
Usually these are just passed to the SparkS3 constructor:
SparkS3(
    access_key=...,
    secret_key=...,
    session_token=...,
)
But some S3 cloud providers, like AWS, may require custom credential providers. You can pass them via extra:
SparkS3(
    extra={
        # provider class
        "aws.credentials.provider": "org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider",
        # other options, if needed
        "assumed.role.arn": "arn:aws:iam::90066806600238:role/s3-restricted",
    },
)
See the Hadoop-AWS documentation for details.