Skip to content

Reading from Hive using Hive.sql

Hive.sql allows passing custom SQL query, but does not support incremental strategies.

Syntax support

Only queries with the following syntax are supported:

  • ✅︎ SELECT ... FROM ...
  • ✅︎ WITH alias AS (...) SELECT ...
  • SET ...; SELECT ...; - multiple statements not supported

Warning

Actually, query should be written using SparkSQL syntax, not HiveQL.

Examples

from onetl.connection import Hive

hive = Hive(...)
df = hive.sql(
    """
    SELECT
        id,
        key,
        CAST(value AS text) value,
        updated_at
    FROM
        some.mytable
    WHERE
        key = 'something'
    """
)

Recommendations

Use column-based write formats

Prefer these write formats:

For colum-based write formats, each file contains separated sections there column data is stored. The file footer contains location of each column section/group. Spark can use this information to load only sections required by specific query, e.g. only selected columns, to drastically speed up the query.

Another advantage is high compression ratio, e.g. 10x-100x in comparison to JSON or CSV.

Select only required columns

Instead of passing SELECT * FROM ... prefer passing exact column names SELECT col1, col2, .... This drastically reduces the amount of data read by Spark, if column-based file formats are used.

Use partition columns in where clause

Queries should include WHERE clause with filters on Hive partitioning columns. This allows Spark to read only small set of files (partition pruning) instead of scanning the entire table, so this drastically increases performance.

Supported operators are: =, >, < and BETWEEN, and only against some static value.

Details

Lazily execute SELECT statement and return DataFrame. support hooks

Same as spark.sql(query).

Added in 0.2.0

Parameters:

  • query (str) –

    SQL query to be executed.

Returns:

  • df ( DataFrame ) –

    Spark dataframe