
Parquet

Bases: ReadWriteFileFormat

Parquet file format (columnar).

Based on the Spark Parquet Files data source.

Supports reading/writing files with .parquet extension.

Added in 0.9.0

Examples:

Note

You can pass any option mentioned in the official Spark documentation. Option names should be in camelCase!

The set of supported options depends on Spark version.

You may also set options mentioned in the parquet-hadoop documentation. They are prefixed with parquet. and contain dots in their names, so instead of calling the constructor as Parquet(parquet.option=True) (invalid Python syntax) you should call the method Parquet.parse({"parquet.option": True}).

from onetl.file.format import Parquet

parquet = Parquet(mergeSchema=True)

from onetl.file.format import Parquet

parquet = Parquet.parse(
    {
        "compression": "snappy",
        # Enable Bloom filter for columns 'id' and 'name'
        "parquet.bloom.filter.enabled#id": True,
        "parquet.bloom.filter.enabled#name": True,
        # Set expected number of distinct values for column 'id'
        "parquet.bloom.filter.expected.ndv#id": 10_000_000,
        # other options
    }
)
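Because the parquet-hadoop keys contain dots and a #<column> suffix, they cannot be written as Python keyword arguments; building the dict programmatically and passing it to Parquet.parse is a common pattern. A minimal sketch (the column names here are illustrative):

```python
# Build per-column parquet-hadoop option keys programmatically.
# The '#<column>' suffix and the dots in the names make these keys
# invalid as Python keyword arguments, so they go through a dict.
columns = ["id", "name"]
options = {"compression": "snappy"}
for column in columns:
    # Enable a Bloom filter for each listed column
    options[f"parquet.bloom.filter.enabled#{column}"] = True

print(options)
```

The resulting dict can then be passed to Parquet.parse(options) exactly as in the example above.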

mergeSchema = None

Merge schemas of all Parquet files being read into a single schema. By default, the value of the Spark config option spark.sql.parquet.mergeSchema is used (false).

Note

Used only for reading files.
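To illustrate what schema merging means, here is a pure-Python sketch (not Spark) of combining the column sets of several Parquet files into one superset schema, which is roughly what mergeSchema=True asks Spark to do at read time:

```python
# Pure-Python illustration of schema merging: each dict stands in for
# the schema of one Parquet file; the merged result is the union of
# all columns seen across the files.
file_schemas = [
    {"id": "long", "name": "string"},
    {"id": "long", "ts": "timestamp"},
]
merged = {}
for schema in file_schemas:
    merged.update(schema)

print(sorted(merged))
```

Without merging, Spark infers the schema from a subset of the files, so columns present only in some files may be missing from the resulting DataFrame.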

compression = None

Compression codec of the Parquet files. By default, the value of the Spark config option spark.sql.parquet.compression.codec is used (snappy).

Note

Used only for writing files.
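A small illustrative helper (not part of onetl) that normalizes and validates a codec name before passing it as Parquet(compression=...). The codec set below follows the Spark Parquet documentation; actual availability depends on the codecs compiled into your Spark build:

```python
# Codec names accepted by Spark's Parquet writer, per the Spark docs.
# Availability of lzo/brotli/zstd etc. depends on the Spark build.
KNOWN_CODECS = {"none", "uncompressed", "snappy", "gzip", "lzo", "brotli", "lz4", "zstd"}


def check_codec(name: str) -> str:
    """Lowercase a codec name and reject ones Spark does not document."""
    codec = name.lower()
    if codec not in KNOWN_CODECS:
        raise ValueError(f"unknown Parquet compression codec: {name!r}")
    return codec


print(check_codec("Snappy"))
```

Such a check fails fast in your own code instead of surfacing as a Spark error at write time.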