Prerequisites¶

Note

onETL's Hive connection is actually a SparkSession with access to the Hive Thrift Metastore and HDFS/S3. All data movement is performed by Spark. The Hive Metastore is used only to store table and partition metadata.

This connector does NOT require a Hive server. It also does NOT use the Hive JDBC connector.
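To make the note above concrete, here is a minimal sketch of how such a connection might be created. It assumes PySpark and onETL are installed and that the `Hive` class accepts `cluster` and `spark` arguments; the function name `create_hive_connection`, the application name, and the cluster name `rnd-dwh` are placeholders, not values taken from this page.

```python
def create_hive_connection(cluster: str = "rnd-dwh"):
    """Build a Hive-enabled SparkSession and wrap it in an onETL Hive connection.

    Imports are local so this module can be loaded even without PySpark installed.
    """
    from pyspark.sql import SparkSession
    from onetl.connection import Hive

    # enableHiveSupport() lets Spark talk to the Hive Metastore directly;
    # no Hive server or JDBC driver is involved.
    spark = (
        SparkSession.builder
        .appName("hive-prerequisites-demo")  # placeholder application name
        .enableHiveSupport()
        .getOrCreate()
    )
    # Assumed constructor arguments: a cluster label and the Spark session.
    return Hive(cluster=cluster, spark=spark)
```

The session picks up the Metastore and warehouse settings from the files in `$SPARK_CONF_DIR`, so the function itself carries no connection URL.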

Version Compatibility¶

- Hive Metastore version:
  - Officially declared: 0.12 - 3.1.3 (may require adding the proper .jar file explicitly)
  - Actually tested: 1.2.100, 2.3.10, 3.1.3
- Spark versions: 3.2.x - 4.1.x
- Java versions: 8 - 22

See official documentation.
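The ranges above can be checked in code before a job starts. The following is a small sketch, not part of onETL: the function names `spark_supported` and `java_supported` are hypothetical, and the bounds are simply copied from the list above.

```python
import re

# Bounds copied from the compatibility list above; both ends are inclusive.
SPARK_RANGE = ((3, 2), (4, 1))
JAVA_RANGE = (8, 22)


def spark_supported(version: str) -> bool:
    """Return True if a version string such as '3.5.1' falls within 3.2.x - 4.1.x."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        return False
    major_minor = (int(match.group(1)), int(match.group(2)))
    # Tuples compare element by element, so (3, 5) sits between (3, 2) and (4, 1).
    return SPARK_RANGE[0] <= major_minor <= SPARK_RANGE[1]


def java_supported(major: int) -> bool:
    """Return True if a Java major version (e.g. 17) falls within 8 - 22."""
    return JAVA_RANGE[0] <= major <= JAVA_RANGE[1]
```

With PySpark installed, `spark_supported(pyspark.__version__)` would be one way to call it.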

Installing PySpark¶

To use the Hive connector, PySpark must be installed (or injected into sys.path) BEFORE the connector instance is created.

See the installation instructions for more details.
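As an illustration of both options, the sketch below checks whether PySpark is importable and, if not, tries to inject it into `sys.path` from a Spark distribution. It assumes the usual distribution layout (`$SPARK_HOME/python` plus a bundled `py4j-*-src.zip` under `python/lib`); the function names are hypothetical.

```python
import importlib.util
import os
import sys
from pathlib import Path


def pyspark_available() -> bool:
    """True if `import pyspark` would succeed in this interpreter."""
    return importlib.util.find_spec("pyspark") is not None


def inject_pyspark_from_spark_home() -> bool:
    """Add $SPARK_HOME/python (and the bundled py4j archive) to sys.path.

    Returns False when SPARK_HOME is unset or lacks a `python` directory.
    """
    spark_home = os.environ.get("SPARK_HOME")
    if not spark_home:
        return False
    python_dir = Path(spark_home) / "python"
    if not python_dir.is_dir():
        return False
    # Assumed layout: py4j ships as a zip archive next to the PySpark sources.
    candidates = [python_dir, *sorted(python_dir.glob("lib/py4j-*-src.zip"))]
    for path in candidates:
        if str(path) not in sys.path:
            sys.path.insert(0, str(path))
    return True
```

A script could call `pyspark_available() or inject_pyspark_from_spark_home()` before constructing the connector and fail early with a clear message if both return False.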

Connecting to Hive Metastore¶

Note

If you're using a managed Hadoop cluster, skip this step, as all Spark configs should already be present on the host.

Create $SPARK_CONF_DIR/hive-site.xml with Hive Metastore URL:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>hive.metastore.uris</name>
        <value>thrift://metastore.host.name:9083</value>
    </property>
</configuration>
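When the file has to be produced by a provisioning script rather than by hand, it can be generated from a dictionary. This is a sketch using only the standard library; `render_hadoop_config` and `write_hive_site` are hypothetical helper names, and the output omits the optional stylesheet line.

```python
import os
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Dict, Optional


def render_hadoop_config(properties: Dict[str, str]) -> str:
    """Render name/value pairs as a Hadoop-style <configuration> document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    if hasattr(ET, "indent"):  # pretty-printing is available in Python 3.9+
        ET.indent(root, space="    ")
    return '<?xml version="1.0"?>\n' + ET.tostring(root, encoding="unicode") + "\n"


def write_hive_site(metastore_uri: str, conf_dir: Optional[str] = None) -> Path:
    """Write hive-site.xml into conf_dir, defaulting to $SPARK_CONF_DIR."""
    target_dir = Path(conf_dir or os.environ["SPARK_CONF_DIR"])
    target = target_dir / "hive-site.xml"
    content = render_hadoop_config({"hive.metastore.uris": metastore_uri})
    target.write_text(content, encoding="utf-8")
    return target
```

For example, `write_hive_site("thrift://metastore.host.name:9083")` would write a file equivalent to the one shown above.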

Create $SPARK_CONF_DIR/core-site.xml with the warehouse location, e.g. the HDFS IPC port of the Hadoop namenode, or the S3 bucket address and credentials:

HDFS:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://myhadoopcluster:9820</value>
    </property>
</configuration>

S3:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <!-- See https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html#General_S3A_Client_configuration -->
    <property>
        <name>fs.defaultFS</name>
        <value>s3a://mybucket/</value>
    </property>
    <property>
        <name>fs.s3a.bucket.mybucket.endpoint</name>
        <value>http://s3.somain</value>
    </property>
    <property>
        <name>fs.s3a.bucket.mybucket.connection.ssl.enabled</name>
        <value>false</value>
    </property>
    <property>
        <name>fs.s3a.bucket.mybucket.path.style.access</name>
        <value>true</value>
    </property>
    <property>
        <name>fs.s3a.bucket.mybucket.aws.credentials.provider</name>
        <value>org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider</value>
    </property>
    <property>
        <name>fs.s3a.bucket.mybucket.access.key</name>
        <value>some-user</value>
    </property>
    <property>
        <name>fs.s3a.bucket.mybucket.secret.key</name>
        <value>mysecrettoken</value>
    </property>
</configuration>
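Hadoop applies `fs.s3a.bucket.<name>.*` options only to the bucket called `<name>`, so the bucket in `fs.defaultFS` has to match the name used in the per-bucket keys. The sketch below parses a core-site.xml document and reports per-bucket names that do not match; the helper names are hypothetical, and it assumes bucket names without dots.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Set
from urllib.parse import urlparse


def load_hadoop_config(xml_text: str) -> Dict[str, str]:
    """Parse a Hadoop <configuration> document into a name -> value mapping."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name", "").strip(): prop.findtext("value", "").strip()
        for prop in root.iter("property")
    }


def unmatched_bucket_settings(config: Dict[str, str]) -> Set[str]:
    """Bucket names in fs.s3a.bucket.<name>.* keys that differ from fs.defaultFS."""
    default_fs = urlparse(config.get("fs.defaultFS", ""))
    default_bucket = default_fs.netloc if default_fs.scheme == "s3a" else None
    prefix = "fs.s3a.bucket."
    # Simplification: the bucket name is taken as the text up to the next dot.
    buckets = {
        key[len(prefix):].split(".", 1)[0]
        for key in config
        if key.startswith(prefix)
    }
    return {bucket for bucket in buckets if bucket != default_bucket}
```

An empty result means every per-bucket option targets the default bucket; any name returned points at settings that the default bucket will not pick up.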

Using Kerberos¶

Some managed Hadoop clusters use Kerberos authentication. In this case, run the kinit command BEFORE starting the Spark session to obtain a Kerberos ticket. See Kerberos installation.
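The kinit step can also be scripted so that a job obtains a ticket only when none is valid. This is a sketch that assumes the MIT Kerberos client tools (`kinit`, `klist`) are on `PATH` and that a keytab is available; the function names, principal, and keytab path are placeholders.

```python
import shutil
import subprocess
from typing import List


def kinit_command(principal: str, keytab: str) -> List[str]:
    """Command line for a non-interactive kinit that reads the key from a keytab."""
    return ["kinit", "-kt", keytab, principal]


def has_valid_ticket() -> bool:
    """True if `klist -s` exits with status 0, i.e. a usable ticket cache exists."""
    if shutil.which("klist") is None:
        return False  # Kerberos client tools are not installed
    return subprocess.run(["klist", "-s"], check=False).returncode == 0


def ensure_ticket(principal: str, keytab: str) -> None:
    """Run kinit only when there is no valid ticket; raise if kinit fails."""
    if not has_valid_ticket():
        subprocess.run(kinit_command(principal, keytab), check=True)
```

Calling `ensure_ticket("user", "/path/to/keytab")` at the top of a script, before the Spark session is created, keeps the ordering described above.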

Sometimes the keytab file must also be passed in the Spark config, so that Spark executors can generate their own Kerberos tickets:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # HDFS namenodes that Spark should obtain delegation tokens for
    .config("spark.kerberos.access.hadoopFileSystems", "hdfs://namenode1.domain.com:9820,hdfs://namenode2.domain.com:9820")
    .config("spark.kerberos.principal", "user")
    .config("spark.kerberos.keytab", "/path/to/keytab")
    .getOrCreate()
)

See Spark security documentation for more details.