ORC
Bases: ReadWriteFileFormat
Based on the Spark ORC file format.
Supports reading/writing files with .orc extension.
Added in 0.9.0
Examples:
Note
You can pass any option mentioned in the official Spark ORC documentation.
Option names should be in camelCase!
The set of supported options depends on Spark version.
You may also set options mentioned in the orc-java documentation.
These options are prefixed with orc. and contain dots in their names,
so instead of calling the constructor as ORC(orc.option=True) (invalid Python syntax)
you should call the method ORC.parse({"orc.option": True}).
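The reason for the ORC.parse method is that a Python identifier cannot contain dots, so dotted option names can never be keyword arguments. A minimal sketch (plain Python, independent of onetl; the option names are illustrative) showing which names are constructor-safe:

```python
# Option names with dots are not valid Python identifiers, so they cannot
# be passed as keyword arguments to ORC(...); they must go through ORC.parse.
# These option names are examples only, not an exhaustive list.
options = {
    "compression": "snappy",       # valid identifier -> ORC(compression="snappy")
    "orc.bloom.filter.fpp": 0.01,  # contains dots    -> only via ORC.parse({...})
}

constructor_kwargs = {k: v for k, v in options.items() if k.isidentifier()}
parse_only = {k: v for k, v in options.items() if not k.isidentifier()}

print(constructor_kwargs)  # {'compression': 'snappy'}
print(parse_only)          # {'orc.bloom.filter.fpp': 0.01}
```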
from onetl.file.format import ORC
orc = ORC(mergeSchema=True)
from onetl.file.format import ORC

orc = ORC.parse(
    {
        "compression": "snappy",
        # Enable Bloom filter for columns 'id' and 'name'
        "orc.bloom.filter.columns": "id,name",
        # Set Bloom filter false positive probability
        "orc.bloom.filter.fpp": 0.01,
        # Do not use dictionary encoding for 'highly_selective_column'
        "orc.column.encoding.direct": "highly_selective_column",
        # other options
    }
)
mergeSchema = None
Merge schemas of all ORC files being read into a single schema.
By default, the value of the Spark config option spark.sql.orc.mergeSchema is used (False).
Note
Used only for reading files.
compression = None
Compression codec of the ORC files.
By default, the value of the Spark config option spark.sql.orc.compression.codec is used (snappy).
Note
Used only for writing files.