Parquet
Bases: ReadWriteFileFormat
Parquet file format (columnar).
Based on the Spark Parquet Files format.
Supports reading and writing files with the .parquet extension.
Added in 0.9.0
Examples:
Note
You can pass any option mentioned in the official Spark documentation.
Option names should be in camelCase!
The set of supported options depends on the Spark version.
You may also set options mentioned in the parquet-hadoop documentation.
Their names start with the parquet. prefix and contain dots,
so instead of calling the constructor as Parquet(parquet.option=True) (invalid Python syntax),
call the method Parquet.parse({"parquet.option": True}).
from onetl.file.format import Parquet
parquet = Parquet(mergeSchema=True)
from onetl.file.format import Parquet
parquet = Parquet.parse(
    {
        "compression": "snappy",
        # Enable Bloom filter for columns 'id' and 'name'
        "parquet.bloom.filter.enabled#id": True,
        "parquet.bloom.filter.enabled#name": True,
        # Set expected number of distinct values for column 'id'
        "parquet.bloom.filter.expected.ndv#id": 10_000_000,
        # other options
    }
)
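The note about dotted option names can be checked with plain Python, without onETL or Spark installed. The sketch below is only an illustration: `parse_options` is a hypothetical stand-in that mirrors the shape of a dict-accepting method such as Parquet.parse, and it is not part of onETL.

```python
# Hypothetical stand-in (NOT part of onETL): accepts a dict of options,
# in the same spirit as a dict-accepting parse method, and returns a copy.
def parse_options(options: dict) -> dict:
    return dict(options)


# A dotted name is not a valid Python identifier, so a call that uses it
# as a keyword argument cannot even be compiled.
try:
    compile("Parquet(parquet.option=True)", "<example>", "eval")
    dotted_keyword_is_valid = True
except SyntaxError:
    dotted_keyword_is_valid = False

# The same option name is perfectly fine as a dictionary key.
parsed = parse_options({"parquet.option": True})

print(dotted_keyword_is_valid)  # False
print(parsed)  # {'parquet.option': True}
```

This is why options with dots or `#` in their names are passed as dictionary keys rather than as keyword arguments.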
mergeSchema = None
Merge schemas of all Parquet files being read into a single schema.
If not set, the value of the Spark config option spark.sql.parquet.mergeSchema is used, which defaults to false.
Note
Used only for reading files.
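To make the idea of schema merging concrete, here is a deliberately simplified model in plain Python. It is not Spark's implementation: the `merge_schemas` helper, the file schemas, and the type names are all assumptions made for illustration, and real merging also deals with nested types and other details that this sketch ignores.

```python
def merge_schemas(*schemas):
    """Union the columns of several schemas, keeping first-seen column order.

    Simplified model: each schema is a dict of column name -> type name,
    and any type mismatch for the same column is treated as an error.
    """
    merged = {}
    for schema in schemas:
        for column, dtype in schema.items():
            if column in merged and merged[column] != dtype:
                raise ValueError(
                    f"Conflicting types for column {column!r}: "
                    f"{merged[column]} vs {dtype}"
                )
            merged[column] = dtype
    return merged


# Schemas of two hypothetical Parquet files written at different times.
file_a_schema = {"id": "bigint", "name": "string"}
file_b_schema = {"id": "bigint", "age": "int"}

merged_schema = merge_schemas(file_a_schema, file_b_schema)
print(merged_schema)  # {'id': 'bigint', 'name': 'string', 'age': 'int'}
```

With merging enabled, the resulting dataframe exposes the union of the columns, and rows from files that lack a column contain nulls in it.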
compression = None
Compression codec of the Parquet files.
If not set, the value of the Spark config option spark.sql.parquet.compression.codec is used, which defaults to snappy.
Note
Used only for writing files.
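The fallback described above (an explicit value wins, otherwise the session-level setting applies) can be sketched in plain Python. This is only a model of the described behavior: the `effective_codec` helper is hypothetical, and a plain dict stands in for the Spark session configuration.

```python
# Plain dict standing in for the Spark session configuration (an assumption
# for illustration; a real session is configured through Spark itself).
session_conf = {"spark.sql.parquet.compression.codec": "snappy"}


def effective_codec(compression, conf):
    """Return the explicit codec if one is given, else the session-level value."""
    if compression is not None:
        return compression
    return conf["spark.sql.parquet.compression.codec"]


print(effective_codec(None, session_conf))    # snappy
print(effective_codec("zstd", session_conf))  # zstd
```

Leaving compression as None therefore means "use whatever the Spark session is configured with", not "write uncompressed files".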