ORC

Bases: ReadWriteFileFormat

ORC file format (columnar).

Based on Spark ORC Files file format.

Supports reading/writing files with .orc extension.

Added in 0.9.0

Examples:

Note

You can pass any option mentioned in the official documentation. Option names should be in camelCase!

The set of supported options depends on Spark version.

You may also set options mentioned in the orc-java documentation. They are prefixed with orc. and contain dots in their names, so instead of calling the constructor as ORC(orc.option=True) (invalid Python syntax) you should call the method ORC.parse({"orc.option": True}).

from onetl.file.format import ORC

orc = ORC(mergeSchema=True)

from onetl.file.format import ORC

orc = ORC.parse(
    {
        "compression": "snappy",
        # Enable Bloom filter for columns 'id' and 'name'
        "orc.bloom.filter.columns": "id,name",
        # Set Bloom filter false positive probability
        "orc.bloom.filter.fpp": 0.01,
        # Do not use dictionary for 'highly_selective_column'
        "orc.column.encoding.direct": "highly_selective_column",
        # other options
    }
)
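The two option families above can be illustrated with a small, Spark-free sketch: camelCase options go to Spark itself, while "orc."-prefixed dotted options are forwarded to the ORC library. Note that split_options is a hypothetical helper written for this example, not part of onetl:

```python
def split_options(options):
    """Partition options into Spark-level (camelCase) and ORC-library-level
    ("orc."-prefixed) groups. Hypothetical helper, for illustration only."""
    spark_opts = {k: v for k, v in options.items() if not k.startswith("orc.")}
    orc_opts = {k: v for k, v in options.items() if k.startswith("orc.")}
    return spark_opts, orc_opts


spark_opts, orc_opts = split_options(
    {
        "compression": "snappy",
        "orc.bloom.filter.columns": "id,name",
        "orc.bloom.filter.fpp": 0.01,
    }
)
# spark_opts -> {"compression": "snappy"}
# orc_opts   -> {"orc.bloom.filter.columns": "id,name", "orc.bloom.filter.fpp": 0.01}
```

This is also why dotted option names cannot be passed as constructor keyword arguments: "orc.bloom.filter.fpp" is not a valid Python identifier, so the dict-based ORC.parse form is required.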

mergeSchema = None

Merge schemas of all ORC files being read into a single schema. By default, the value of the Spark config option spark.sql.orc.mergeSchema is used (False).

Note

Used only for reading files.
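Conceptually, schema merging takes the union of columns across all files being read. A plain-Python sketch of that idea (not onetl or Spark code; it assumes compatible column types, whereas Spark's actual merging also reconciles types):

```python
def merge_schemas(schemas):
    """Union of column names across files, preserving first-seen order.
    Conceptual sketch of mergeSchema=True behaviour only."""
    merged = []
    for schema in schemas:
        for column in schema:
            if column not in merged:
                merged.append(column)
    return merged


# File 1 has (id, name); file 2 has (id, age); the merged schema covers all three.
merge_schemas([["id", "name"], ["id", "age"]])  # -> ["id", "name", "age"]
```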

compression = None

Compression codec of the ORC files. By default, the value of the Spark config option spark.sql.orc.compression.codec is used (snappy).

Note

Used only for writing files.
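The fallback order described above (an explicit compression option wins, otherwise the Spark config value applies, which itself defaults to snappy) can be sketched as follows. This is a hypothetical helper for illustration, not onetl code:

```python
def effective_compression(explicit=None, spark_conf=None):
    """Resolve the ORC compression codec: an explicit `compression` option
    takes precedence; otherwise spark.sql.orc.compression.codec applies,
    whose built-in default is "snappy". Illustrative only."""
    if explicit is not None:
        return explicit
    if spark_conf is not None:
        return spark_conf
    return "snappy"


effective_compression()        # -> "snappy" (Spark default)
effective_compression("zlib")  # -> "zlib" (explicit option wins)
```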