JSONLine

Bases: ReadWriteFileFormat

JSONLine file format (each line of the file contains a JSON object). Supports hooks.

Based on Spark JSON file format.

Supports reading/writing files with .json extension with content like:

example.json
{"key": "value1"}
{"key": "value2"}
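The format itself is plain JSON Lines: every line parses independently. A minimal stdlib sketch of that parsing logic (no Spark or onetl involved, purely for illustration):

```python
import json

# Two lines of a JSONLine file, matching example.json above
content = '{"key": "value1"}\n{"key": "value2"}\n'

# Each non-empty line is parsed as a standalone JSON object
rows = [json.loads(line) for line in content.splitlines() if line.strip()]
print(rows)  # [{'key': 'value1'}, {'key': 'value2'}]
```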

Added in 0.9.0

Examples:

Note

You can pass any option mentioned in official documentation. Option names should be in camelCase!

The set of supported options depends on Spark version.

from onetl.file.format import JSONLine

jsonline = JSONLine(encoding="UTF-8", mode="PERMISSIVE")

Warning

Written files have extension .json, not .jsonl or .jsonline.

from onetl.file.format import JSONLine

jsonline = JSONLine(encoding="UTF-8", compression="gzip")

allowBackslashEscapingAnyCharacter = None class-attribute instance-attribute

If True, a backslash (\) can escape any character. Default False.

Note

Used only for reading files.

allowComments = None class-attribute instance-attribute

If True, add support for C/C++/Java style comments (//, /* */). Default False, meaning that JSONLine files should not contain comments.

Note

Used only for reading files.
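For context, comments are not valid strict JSON: Python's stdlib parser rejects them, which is the restriction allowComments=True relaxes on the Spark side. A stdlib-only illustration:

```python
import json

# A line with a trailing C++-style comment is rejected by a strict JSON parser
line = '{"key": "value1"}  // a comment'
try:
    json.loads(line)
    parsed = True
except json.JSONDecodeError:
    parsed = False

print(parsed)  # False: strict JSON has no comment syntax
```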

allowNonNumericNumbers = None class-attribute instance-attribute

If True, allow numbers to contain non-numeric characters, like:

  • scientific notation (e.g. 12e10).
  • positive infinity floating point value (Infinity, +Infinity, +INF).
  • negative infinity floating point value (-Infinity, -INF).
  • Not-a-Number floating point value (NaN).

Default True.

Note

Used only for reading files.
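For illustration, Python's stdlib json module happens to accept some of these non-standard tokens by default (a property of CPython's parser, not of Spark or onetl):

```python
import json
import math

# NaN and Infinity are not valid strict JSON, but the stdlib parser accepts them
nan_value = json.loads("NaN")
pos_inf = json.loads("Infinity")
neg_inf = json.loads("-Infinity")

assert math.isnan(nan_value)
assert pos_inf == math.inf
assert neg_inf == -math.inf
```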

allowNumericLeadingZeros = None class-attribute instance-attribute

If True, allow leading zeros in numbers (e.g. 00012). Default False.

Note

Used only for reading files.

allowSingleQuotes = None class-attribute instance-attribute

If True, allow JSON object field names to be wrapped with single quotes ('). Default True.

Note

Used only for reading files.

allowUnquotedControlChars = None class-attribute instance-attribute

If True, allow unquoted control characters (ASCII values 0-31) in strings without escaping them with \. Default False.

Note

Used only for reading files.

allowUnquotedFieldNames = None class-attribute instance-attribute

If True, allow JSON object field names without quotes (JavaScript-style). Default False.

Note

Used only for reading files.

columnNameOfCorruptRecord = Field(default=None, min_length=1) class-attribute instance-attribute

Name of column to put corrupt records in. Default is _corrupt_record.

Warning

If a DataFrame schema is provided, this column should be added to the schema explicitly:

from onetl.connection import SparkLocalFS
from onetl.file import FileDFReader
from onetl.file.format import JSONLine

from pyspark.sql.types import StructType, StructField, TimestampType, StringType

spark = ...
connection = SparkLocalFS(spark=spark)

schema = StructType(
    [
        StructField("my_field", TimestampType()),
        StructField("_corrupt_record", StringType()),  # <-- important
    ]
)

jsonline = JSONLine(mode="PERMISSIVE", columnNameOfCorruptRecord="_corrupt_record")

reader = FileDFReader(
    connection=connection,
    format=jsonline,
    df_schema=schema,  # <-- schema including the corrupt-record column
)
df = reader.run(["/some/file.jsonl"])

Note

Used only for reading files.

compression = None class-attribute instance-attribute

Compression codec of the JSONLine file. Default none.

Note

Used only for writing files.

dateFormat = Field(default=None, min_length=1) class-attribute instance-attribute

String format for DateType() representation. Default is yyyy-MM-dd.
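The pattern uses Java DateTimeFormatter syntax. As a rough illustration only, the default yyyy-MM-dd corresponds to %Y-%m-%d in Python's strptime (the two pattern languages are not generally interchangeable):

```python
from datetime import date, datetime

# Java pattern yyyy-MM-dd roughly corresponds to Python pattern %Y-%m-%d
parsed = datetime.strptime("2024-01-31", "%Y-%m-%d").date()
print(parsed)  # 2024-01-31
```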

dropFieldIfAllNull = None class-attribute instance-attribute

If True, exclude a column from the DataFrame schema if all its inferred values are null or empty arrays. Default False.

Note

Used only for reading files.

encoding = None class-attribute instance-attribute

Encoding of the JSONLine files. Default UTF-8.

ignoreNullFields = None class-attribute instance-attribute

If True and a field value is null, do not add the field to the resulting JSON object. Default is the value of spark.sql.jsonGenerator.ignoreNullFields (True).

Note

Used only for writing files.

lineSep = None class-attribute instance-attribute

Character sequence used to separate lines in the JSONLine files.

Defaults:

  • Try to detect for reading (\r\n, \r, \n)
  • \n for writing.
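The read-side auto-detection can be illustrated with stdlib splitlines(), which handles \r\n, \r and \n uniformly (a conceptual sketch, not Spark's actual implementation):

```python
import json

# The same records written with three different line separators
for sep in ("\r\n", "\r", "\n"):
    content = sep.join(['{"n": 1}', '{"n": 2}'])
    # splitlines() recognizes all three separators, similar to read-side detection
    rows = [json.loads(line) for line in content.splitlines()]
    assert rows == [{"n": 1}, {"n": 2}]
```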

locale = Field(default=None, min_length=1) class-attribute instance-attribute

Locale name used to parse dates and timestamps. Default is en-US.

Note

Used only for reading files.

mode = None class-attribute instance-attribute

How to handle parsing errors:

  • PERMISSIVE - set field value as null, move raw data to columnNameOfCorruptRecord column.
  • DROPMALFORMED - skip the malformed row.
  • FAILFAST - throw an error immediately.

Default is PERMISSIVE.

Note

Used only for reading files.
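The three modes can be sketched in plain Python. This is illustrative only; parse_lines is a hypothetical helper, not part of onetl or Spark:

```python
import json

def parse_lines(lines, mode="PERMISSIVE", corrupt_column="_corrupt_record"):
    """Illustrative sketch of the three JSON parse modes (not actual Spark code)."""
    rows = []
    for line in lines:
        try:
            rows.append(json.loads(line))
        except json.JSONDecodeError:
            if mode == "PERMISSIVE":
                # keep the row, store the raw data in the corrupt-record column
                rows.append({corrupt_column: line})
            elif mode == "DROPMALFORMED":
                continue  # silently skip the malformed row
            elif mode == "FAILFAST":
                raise  # fail on the first malformed row
    return rows

lines = ['{"key": "ok"}', "{broken"]
assert parse_lines(lines, "PERMISSIVE") == [{"key": "ok"}, {"_corrupt_record": "{broken"}]
assert parse_lines(lines, "DROPMALFORMED") == [{"key": "ok"}]
```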

prefersDecimal = None class-attribute instance-attribute

If True, infer all floating-point values as Decimal. Default False.

Note

Used only for reading files.

primitivesAsString = None class-attribute instance-attribute

If True, infer all primitive types (string, integer, float, boolean) as strings. Default False.

Note

Used only for reading files.

samplingRatio = Field(default=None, ge=0, le=1) class-attribute instance-attribute

While inferring schema, read the specified fraction of file rows. Default 1.

Note

Used only for reading files.
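Reading a fraction of rows for inference can be sketched with stdlib random (conceptual only; Spark's sampling works differently internally):

```python
import random

lines = [f'{{"n": {i}}}' for i in range(100)]

# Keep roughly samplingRatio of the lines for schema inference
rng = random.Random(42)  # fixed seed for reproducibility
ratio = 0.3
sample = [line for line in lines if rng.random() < ratio]

assert set(sample) <= set(lines)      # the sample is drawn from the input
assert 0 < len(sample) < len(lines)   # and is a strict, non-empty subset
```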

timestampFormat = Field(default=None, min_length=1) class-attribute instance-attribute

String format for TimestampType() representation. Default is yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX].

timestampNTZFormat = Field(default=None, min_length=1) class-attribute instance-attribute

String format for TimestampNTZType() representation. Default is yyyy-MM-dd'T'HH:mm:ss[.SSS].

Note

Added in Spark 3.2.0

timezone = Field(default=None, min_length=1, alias='timeZone') class-attribute instance-attribute

Overrides the timezone used for parsing or serializing date and timestamp values. By default, spark.sql.session.timeZone is used.