FileDF Reader

Bases: FrozenModel

Allows you to read files from a source path using the specified file connection and parameters, and return a Spark DataFrame. Supports hooks.

Warning

This class does not support read strategies.

Added in 0.9.0

Parameters:

  • connection (BaseFileDFConnection) –

    File DataFrame connection. See File DF Connections section.

  • format (BaseReadableFileFormat) –

    File format to read.

  • source_path (PathLike | str, default: None ) –

    Directory path to read data from.

Can be None, but only if you pass file paths directly to the run method.

  • df_schema (StructType, default: None ) –

    Spark DataFrame schema.

  • options (FileDFReaderOptions) –

    Common reading options.

Examples:

from onetl.connection import SparkLocalFS
from onetl.file import FileDFReader
from onetl.file.format import CSV

csv = CSV(delimiter=",")
local_fs = SparkLocalFS(spark=spark)

reader = FileDFReader(
    connection=local_fs,
    format=csv,
    source_path="/path/to/directory",
)

Same as above, but with explicit reading options:

from onetl.connection import SparkLocalFS
from onetl.file import FileDFReader
from onetl.file.format import CSV

csv = CSV(delimiter=",")
local_fs = SparkLocalFS(spark=spark)

reader = FileDFReader(
    connection=local_fs,
    format=csv,
    source_path="/path/to/directory",
    options=FileDFReader.Options(recursive=False),
)

run(files=None)

Method for reading files as a DataFrame. Supports hooks.

Added in 0.9.0

Parameters:

  • files (Iterator[str | PathLike] | None, default: None ) –

    File list to read.

If not passed, all files from source_path are read.

Returns:

  • df ( DataFrame ) –

    Spark DataFrame

Examples:

Read CSV files from directory /path:

from onetl.connection import SparkLocalFS
from onetl.file import FileDFReader
from onetl.file.format import CSV

csv = CSV(delimiter=",")
local_fs = SparkLocalFS(spark=spark)

reader = FileDFReader(
    connection=local_fs,
    format=csv,
    source_path="/path",
)
df = reader.run()

Read some CSV files using explicit file paths:

from onetl.connection import SparkLocalFS
from onetl.file import FileDFReader
from onetl.file.format import CSV

csv = CSV(delimiter=",")
local_fs = SparkLocalFS(spark=spark)

reader = FileDFReader(
    connection=local_fs,
    format=csv,
)

df = reader.run(
    [
        "/path/file1.csv",
        "/path/nested/file2.csv",
    ]
)

Read only specific CSV files in a directory:

from onetl.connection import SparkLocalFS
from onetl.file import FileDFReader
from onetl.file.format import CSV

csv = CSV(delimiter=",")
local_fs = SparkLocalFS(spark=spark)

reader = FileDFReader(
    connection=local_fs,
    format=csv,
    source_path="/path",
)

df = reader.run(
    [
        # file paths can be relative to source_path or absolute
        "/path/file1.csv",
        "/path/nested/file2.csv",
    ]
)