Prerequisites

Version Compatibility

  • Hadoop versions: 2.x, 3.x
  • Spark versions: 3.2.x - 3.5.x
  • Java versions: 8 - 20

Installing PySpark

To use the SparkHDFS connector, PySpark must be installed (or injected into sys.path) BEFORE creating the connector instance.
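If PySpark is not installed as a package, it can be injected into sys.path from an existing Spark distribution. A minimal sketch, assuming Spark is unpacked at a path like /opt/spark (a placeholder; adjust to your environment):

```python
import glob
import os
import sys

def inject_pyspark(spark_home):
    """Make PySpark importable from an existing Spark distribution.

    A sketch only: spark_home is a placeholder for your installation path.
    Spark distributions ship the Python bindings under <spark_home>/python,
    with a versioned py4j zip under <spark_home>/python/lib.
    """
    python_dir = os.path.join(spark_home, "python")
    sys.path.insert(0, python_dir)
    # The bundled py4j archive is versioned, e.g. py4j-0.10.9.7-src.zip
    for zip_path in glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip")):
        sys.path.insert(0, zip_path)
    return python_dir

inject_pyspark("/opt/spark")  # call this before creating the connector
```

After this, `import pyspark` resolves against the injected distribution instead of a pip-installed package.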

See the installation instructions for more details.

Using Kerberos

Some managed Hadoop clusters use Kerberos authentication. In this case, run the kinit command BEFORE starting the Spark session to generate a Kerberos ticket. See the Kerberos installation instructions.
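A minimal sketch of the kinit step; the principal, realm, and keytab path below are placeholders for your environment:

```shell
# Obtain a Kerberos ticket from a keytab before starting the Spark session.
# "user@DOMAIN.COM" and the keytab path are assumptions, not real values.
kinit -kt /path/to/user.keytab user@DOMAIN.COM

# Verify the ticket cache contains a valid ticket
klist
```

Without a valid ticket, connections to a Kerberized HDFS will fail with authentication errors.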

Sometimes it is also required to pass a keytab file to the Spark config, allowing Spark executors to generate their own Kerberos tickets:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.kerberos.access.hadoopFileSystems", "hdfs://namenode1.domain.com:9820,hdfs://namenode2.domain.com:9820")
    .config("spark.kerberos.principal", "user")
    .config("spark.kerberos.keytab", "/path/to/keytab")
    .getOrCreate()
)

See Spark security documentation for more details.