JSONLine
Bases: ReadWriteFileFormat
JSONLine file format (each line of file contains a JSON object).
Based on Spark JSON file format.
Supports reading/writing files with the .json extension, with content like:
{"key": "value1"}
{"key": "value2"}
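Conceptually, each line of such a file is an independent JSON document. A minimal stdlib sketch illustrating the format itself (plain Python, not the onetl/Spark API):

```python
import io
import json

# Each line of a JSONLine file holds one standalone JSON object.
raw = io.StringIO('{"key": "value1"}\n{"key": "value2"}\n')

# Parse line by line: one json.loads() call per record.
records = [json.loads(line) for line in raw if line.strip()]
```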
Added in 0.9.0
Examples:
Note
You can pass any option mentioned in the official Spark documentation.
Option names should be in camelCase!
The set of supported options depends on Spark version.
from onetl.file.format import JSONLine
jsonline = JSONLine(encoding="UTF-8", mode="PERMISSIVE")
Warning
Written files have extension .json, not .jsonl or .jsonline.
from onetl.file.format import JSONLine
jsonline = JSONLine(encoding="UTF-8", compression="gzip")
allowBackslashEscapingAnyCharacter = None
If True, the backslash prefix (\) can escape any character.
Default False.
Note
Used only for reading files.
allowComments = None
If True, add support for C/C++/Java style comments (//, /* */).
Default False, meaning that JSONLine files should not contain comments.
Note
Used only for reading files.
allowNonNumericNumbers = None
If True, allow numbers to contain non-numeric tokens, such as:
- scientific notation (e.g. 12e10)
- positive infinity floating-point values (Infinity, +Infinity, +INF)
- negative infinity floating-point values (-Infinity, -INF)
- the Not-a-Number floating-point value (NaN)
Default True.
Note
Used only for reading files.
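As an aside, Python's own json module happens to accept some of the same non-standard tokens by default, which makes it a convenient stdlib illustration of this idea (this is unrelated to Spark's parser; the document and values below are made up):

```python
import json
import math

# Python's json module, like Spark with allowNonNumericNumbers=True,
# accepts NaN and Infinity tokens that strict JSON forbids.
doc = '{"a": 12e10, "b": Infinity, "c": -Infinity, "d": NaN}'
parsed = json.loads(doc)
```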
allowNumericLeadingZeros = None
If True, allow leading zeros in numbers (e.g. 00012).
Default False.
Note
Used only for reading files.
allowSingleQuotes = None
If True, allow JSON object field names to be wrapped with single quotes (').
Default True.
Note
Used only for reading files.
allowUnquotedControlChars = None
If True, allow unquoted control characters (ASCII values 0-31) in strings without escaping them with \.
Default False.
Note
Used only for reading files.
allowUnquotedFieldNames = None
If True, allow JSON object field names without quotes (JavaScript-style).
Default False.
Note
Used only for reading files.
columnNameOfCorruptRecord = Field(default=None, min_length=1)
Name of column to put corrupt records in.
Default is _corrupt_record.
Warning
If a DataFrame schema is provided, this column should be added to the schema explicitly:
from onetl.connection import SparkLocalFS
from onetl.file import FileDFReader
from onetl.file.format import JSONLine
from pyspark.sql.types import StructType, StructField, TimestampType, StringType
spark = ...
connection = SparkLocalFS(spark=spark)
schema = StructType(
    [
        StructField("my_field", TimestampType()),
        StructField("_corrupt_record", StringType()),  # <-- important
    ]
)
jsonline = JSONLine(mode="PERMISSIVE", columnNameOfCorruptRecord="_corrupt_record")
reader = FileDFReader(
    connection=connection,
    format=jsonline,
    df_schema=schema,  # <-- schema includes the corrupt record column
)
df = reader.run(["/some/file.jsonl"])
Note
Used only for reading files.
compression = None
Compression codec of the JSONLine file.
Default none.
Note
Used only for writing files.
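With compression="gzip", each written file is a gzip-compressed text file of JSON lines. A stdlib sketch of what such output looks like (the file name and records here are made up for illustration; this is not Spark's writer):

```python
import gzip
import json
import os
import tempfile

records = [{"id": 1}, {"id": 2}]
path = os.path.join(tempfile.mkdtemp(), "part-0000.json.gz")

# Write gzip-compressed JSON lines, mirroring compression="gzip" output.
with gzip.open(path, "wt", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Reading it back only requires decompressing and parsing line by line.
with gzip.open(path, "rt", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
```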
dateFormat = Field(default=None, min_length=1)
String format for DateType() representation.
Default is yyyy-MM-dd.
dropFieldIfAllNull = None
If True and an inferred column is always null or an empty array, exclude it from the DataFrame schema.
Default False.
Note
Used only for reading files.
encoding = None
Encoding of the JSONLine files.
Default UTF-8.
ignoreNullFields = None
If True and a field value is null, do not add the field to the resulting object.
Default is value of spark.sql.jsonGenerator.ignoreNullFields (True).
Note
Used only for writing files.
lineSep = None
Character used to separate lines in the JSONLine files.
Defaults:
- for reading: detected automatically (\r\n, \r or \n)
- for writing: \n
locale = Field(default=None, min_length=1)
Locale name used to parse dates and timestamps.
Default is en-US.
Note
Used only for reading files.
mode = None
How to handle parsing errors:
- PERMISSIVE - set malformed field values to null and move the raw data to the columnNameOfCorruptRecord column.
- DROPMALFORMED - skip the malformed row.
- FAILFAST - throw an error immediately.
Default is PERMISSIVE.
Note
Used only for reading files.
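The three modes can be pictured with a toy line-by-line parser (plain Python, not Spark's implementation; function names and the "key" field are illustrative):

```python
import json

lines = ['{"key": "ok"}', '{broken']

def parse_permissive(lines, corrupt_col="_corrupt_record"):
    # PERMISSIVE: keep the row, null out its fields, store raw text aside.
    rows = []
    for line in lines:
        try:
            rows.append(json.loads(line))
        except json.JSONDecodeError:
            rows.append({"key": None, corrupt_col: line})
    return rows

def parse_dropmalformed(lines):
    # DROPMALFORMED: silently skip rows that fail to parse.
    rows = []
    for line in lines:
        try:
            rows.append(json.loads(line))
        except json.JSONDecodeError:
            continue
    return rows

def parse_failfast(lines):
    # FAILFAST: raise on the first malformed row.
    return [json.loads(line) for line in lines]
```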
prefersDecimal = None
If True, infer all floating-point values as Decimal.
Default False.
Note
Used only for reading files.
primitivesAsString = None
If True, infer all primitive types (string, integer, float, boolean) as strings.
Default False.
Note
Used only for reading files.
samplingRatio = Field(default=None, ge=0, le=1)
While inferring schema, read the specified fraction of file rows.
Default 1.
Note
Used only for reading files.
timestampFormat = Field(default=None, min_length=1)
String format for TimestampType() representation.
Default is yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX].
timestampNTZFormat = Field(default=None, min_length=1)
String format for TimestampNTZType() representation.
Default is yyyy-MM-dd'T'HH:mm:ss[.SSS].
Note
Added in Spark 3.2.0
timezone = Field(default=None, min_length=1, alias='timeZone')
Allows overriding the timezone used for parsing or serializing date and timestamp values.
By default, spark.sql.session.timeZone is used.