FileDFReader
Bases: FrozenModel
Reads files from a source path using the specified file connection, format,
and parameters, and returns a Spark DataFrame.
Warning
This class does not support read strategies.
Added in 0.9.0
Parameters:

- connection (BaseFileDFConnection) – File DataFrame connection. See File DF Connections section.
- format (BaseReadableFileFormat) – File format to read.
- source_path (PathLike | str, default: None) – Directory path to read data from. Could be None, but only if you pass file paths directly to the run method.
- df_schema (StructType, default: None) – Spark DataFrame schema.
- options (FileDFReaderOptions) – Common reading options.
Examples:

Create a reader for CSV files in a directory:

```python
from onetl.connection import SparkLocalFS
from onetl.file import FileDFReader
from onetl.file.format import CSV

csv = CSV(delimiter=",")
local_fs = SparkLocalFS(spark=spark)

reader = FileDFReader(
    connection=local_fs,
    format=csv,
    source_path="/path/to/directory",
)
```

Create a reader with custom reading options:

```python
from onetl.connection import SparkLocalFS
from onetl.file import FileDFReader
from onetl.file.format import CSV

csv = CSV(delimiter=",")
local_fs = SparkLocalFS(spark=spark)

reader = FileDFReader(
    connection=local_fs,
    format=csv,
    source_path="/path/to/directory",
    options=FileDFReader.Options(recursive=False),
)
```
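The recursive option shown above controls whether nested directories under source_path are scanned. The difference can be sketched with plain pathlib; this illustrates the semantics only, not onetl's implementation:

```python
# Illustration of recursive vs non-recursive file listing (not onetl code).
import tempfile
from pathlib import Path

# Build a small directory tree: one CSV at the top level, one nested.
root = Path(tempfile.mkdtemp())
(root / "a.csv").touch()
(root / "nested").mkdir()
(root / "nested" / "b.csv").touch()

# glob scans only the top level; rglob descends into subdirectories.
non_recursive = sorted(p.name for p in root.glob("*.csv"))
recursive = sorted(p.name for p in root.rglob("*.csv"))

print(non_recursive)  # ['a.csv']
print(recursive)      # ['a.csv', 'b.csv']
```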
run(files=None)
Read files and return a Spark DataFrame.
Added in 0.9.0
Parameters:

- files (Iterator[str | PathLike] | None, default: None) – File list to read. If empty, read files from source_path.

Returns:

- df (DataFrame) – Spark DataFrame.
Examples:

Read CSV files from directory /path:

```python
from onetl.connection import SparkLocalFS
from onetl.file import FileDFReader
from onetl.file.format import CSV

csv = CSV(delimiter=",")
local_fs = SparkLocalFS(spark=spark)

reader = FileDFReader(
    connection=local_fs,
    format=csv,
    source_path="/path",
)
df = reader.run()
```

Read specific CSV files by passing them to run, without setting source_path:

```python
from onetl.connection import SparkLocalFS
from onetl.file import FileDFReader
from onetl.file.format import CSV

csv = CSV(delimiter=",")
local_fs = SparkLocalFS(spark=spark)

reader = FileDFReader(
    connection=local_fs,
    format=csv,
)
df = reader.run(
    [
        "/path/file1.csv",
        "/path/nested/file2.csv",
    ]
)
```

Read specific CSV files within source_path:

```python
from onetl.connection import SparkLocalFS
from onetl.file import FileDFReader
from onetl.file.format import CSV

csv = CSV(delimiter=",")
local_fs = SparkLocalFS(spark=spark)

reader = FileDFReader(
    connection=local_fs,
    format=csv,
    source_path="/path",
)
df = reader.run(
    [
        # with source_path set, file paths may also be given relative to it
        "/path/file1.csv",
        "/path/nested/file2.csv",
    ]
)
```
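The relative-path behaviour mentioned in the last example can be illustrated with plain pathlib; this is a sketch of the documented semantics, and the helper name is made up for illustration, not part of onetl:

```python
# Illustrative helper (not onetl code): resolve a file path against source_path.
from pathlib import PurePosixPath


def resolve_against_source(source_path: str, file_path: str) -> str:
    """Keep absolute paths as-is; join relative paths onto source_path."""
    path = PurePosixPath(file_path)
    if path.is_absolute():
        return str(path)
    return str(PurePosixPath(source_path) / path)


print(resolve_against_source("/path", "nested/file2.csv"))  # /path/nested/file2.csv
print(resolve_against_source("/path", "/path/file1.csv"))   # /path/file1.csv
```

Either spelling identifies the same file, which is why the example above works with absolute paths even though source_path is set.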