JSON

Bases: ReadOnlyFileFormat

JSON file format, with hooks support.

Based on Spark JSON file format.

Supports reading (but NOT writing) files with the .json extension and content like:

example.json
[
    {"key": "value1"},
    {"key": "value2"}
]

Added in 0.9.0

Examples:

Note

You can pass any option mentioned in official documentation. Option names should be in camelCase!

The set of supported options depends on Spark version.

Reading files:

from onetl.file.format import JSON

json = JSON(encoding="UTF-8")
Writing files:

Warning

Not supported. Use JSONLine.
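
To sketch why, here is the on-disk difference between the two layouts using only the stdlib json module (illustrative only; the real writing goes through the JSONLine format class):

```python
import json

# A .json file holds one top-level array; a JSON Lines file (what
# JSONLine writes) holds one standalone JSON object per line, which
# can be appended to and split between workers.
records = [{"key": "value1"}, {"key": "value2"}]

as_json = json.dumps(records, indent=4)  # whole-array .json layout
as_jsonlines = "\n".join(json.dumps(r) for r in records)  # .jsonl layout
```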

allowBackslashEscapingAnyCharacter = None class-attribute instance-attribute

If True, a backslash (\) can escape any character. Default False.

Note

Used only for reading files and parse_column method.
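
To illustrate the default strictness (using Python's stdlib json parser as a stand-in for Spark's):

```python
import json

# Strict JSON permits only a fixed escape set (\" \\ \/ \b \f \n \r \t \uXXXX).
# An arbitrary escape like \x is rejected, mirroring the default
# allowBackslashEscapingAnyCharacter=False.
try:
    json.loads(r'"\x41"')
    escaped_ok = True
except json.JSONDecodeError:
    escaped_ok = False
```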

allowComments = None class-attribute instance-attribute

If True, add support for C/C++/Java style comments (//, /* */). Default False, meaning that JSON files should not contain comments.

Note

Used only for reading files and parse_column method.
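
For illustration, strict JSON has no comment syntax at all; Python's stdlib parser rejects such input, matching the default allowComments=False:

```python
import json

# Comments are a C/C++/Java extension, not part of the JSON standard,
# so a strict parser fails on them.
doc = """
[
    {"key": "value1"}  // inline comment
]
"""
try:
    json.loads(doc)
    has_comments_ok = True
except json.JSONDecodeError:
    has_comments_ok = False
```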

allowNonNumericNumbers = None class-attribute instance-attribute

If True, allow numbers to contain non-numeric characters, like:

  • scientific notation (e.g. 12e10).
  • positive infinity floating point value (Infinity, +Infinity, +INF).
  • negative infinity floating point value (-Infinity, -INF).
  • Not-a-Number floating point value (NaN).

Default True.

Note

Used only for reading files and parse_column method.
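
For comparison, Python's stdlib parser also accepts most of these tokens out of the box (it does not accept the +Infinity/+INF/-INF spellings that Spark does):

```python
import json

# Infinity, -Infinity and NaN are non-standard JSON tokens, but both
# Spark (allowNonNumericNumbers=True by default) and the stdlib accept them.
values = json.loads('[12e10, Infinity, -Infinity, NaN]')
```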

allowNumericLeadingZeros = None class-attribute instance-attribute

If True, allow leading zeros in numbers (e.g. 00012). Default False.

Note

Used only for reading files and parse_column method.
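
As an illustration with the stdlib parser (a stand-in for Spark's), standard JSON forbids leading zeros in numbers:

```python
import json

# 00012 is not a valid JSON number literal, matching the default
# allowNumericLeadingZeros=False.
try:
    json.loads('{"n": 00012}')
    leading_zeros_ok = True
except json.JSONDecodeError:
    leading_zeros_ok = False
```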

allowSingleQuotes = None class-attribute instance-attribute

If True, allow JSON object field names to be wrapped with single quotes ('). Default True.

Note

Used only for reading files and parse_column method.
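
Note that Spark is more lenient here than strict parsers: the stdlib json module, for instance, rejects single quotes outright, while Spark accepts them because allowSingleQuotes defaults to True:

```python
import json

# Standard JSON requires double quotes around field names and strings.
try:
    json.loads("{'key': 'value'}")
    single_quotes_ok = True
except json.JSONDecodeError:
    single_quotes_ok = False
```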

allowUnquotedControlChars = None class-attribute instance-attribute

If True, allow unquoted control characters (ASCII values 0-31) in strings without escaping them with \. Default False.

Note

Used only for reading files and parse_column method.
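
The stdlib json module exposes the same toggle as its strict flag, which makes the behavior easy to demonstrate:

```python
import json

# A literal newline (ASCII 10) inside a quoted string is an unescaped
# control character: rejected in strict mode, accepted in lenient mode,
# the same switch Spark exposes as allowUnquotedControlChars.
doc = '"line1\nline2"'
try:
    json.loads(doc)
    control_ok = True
except json.JSONDecodeError:
    control_ok = False

lenient = json.loads(doc, strict=False)
```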

allowUnquotedFieldNames = None class-attribute instance-attribute

If True, allow JSON object field names without quotes (JavaScript-style). Default False.

Note

Used only for reading files and parse_column method.
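
For illustration (stdlib parser as a stand-in): {key: 1} is valid JavaScript but not valid JSON, which is why the option defaults to False:

```python
import json

# Unquoted field names are rejected by strict JSON parsers,
# matching the default allowUnquotedFieldNames=False.
try:
    json.loads("{key: 1}")
    unquoted_ok = True
except json.JSONDecodeError:
    unquoted_ok = False
```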

columnNameOfCorruptRecord = Field(default=None, min_length=1) class-attribute instance-attribute

Name of column to put corrupt records in. Default is _corrupt_record.

Warning

If DataFrame schema is provided, this column should be added to schema explicitly:

from onetl.connection import SparkLocalFS
from onetl.file import FileDFReader
from onetl.file.format import JSON

from pyspark.sql.types import StructType, StructField, TimestampType, StringType

spark = ...

schema = StructType(
    [
        StructField("my_field", TimestampType()),
        StructField("_corrupt_record", StringType()),  # <-- important
    ]
)

json = JSON(mode="PERMISSIVE", columnNameOfCorruptRecord="_corrupt_record")

reader = FileDFReader(
    connection=connection,
    format=json,
    df_schema=schema,  # <-- important
)
df = reader.run(["/some/file.json"])

Note

Used only for reading files and parse_column method.

dateFormat = Field(default=None, min_length=1) class-attribute instance-attribute

String format for DateType() representation. Default is yyyy-MM-dd.

dropFieldIfAllNull = None class-attribute instance-attribute

If True and an inferred column contains only nulls or empty arrays, exclude it from the DataFrame schema. Default False.

Note

Used only for reading files. Ignored by parse_column method.

encoding = None class-attribute instance-attribute

Encoding of the JSON file. Default UTF-8.

Note

Used only for reading and writing files.

Ignored by parse_column and serialize_column methods.

lineSep = None class-attribute instance-attribute

Character used to separate lines in the JSON file.

Defaults:

  • Try to detect for reading (\r\n, \r, \n)
  • \n for writing

Note

Used only for reading and writing files.

Ignored by parse_column and serialize_column methods, as they handle each DataFrame row separately.

locale = Field(default=None, min_length=1) class-attribute instance-attribute

Locale name used to parse dates and timestamps. Default is en-US.

Note

Used only for reading files and parse_column method.

mode = None class-attribute instance-attribute

How to handle parsing errors:

  • PERMISSIVE - set malformed field values to null and put the raw record into the columnNameOfCorruptRecord column.
  • DROPMALFORMED - skip the malformed row.
  • FAILFAST - throw an error immediately.

Default is PERMISSIVE.

Note

Used only for reading files and parse_column method.
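
A rough pure-Python sketch of the three modes (illustrative only; the real logic lives inside Spark's JSON datasource, and corrupt_col here mirrors columnNameOfCorruptRecord):

```python
import json

def parse_row(line, mode="PERMISSIVE", corrupt_col="_corrupt_record"):
    """Mimic how each mode treats one malformed input row."""
    try:
        return {"data": json.loads(line), corrupt_col: None}
    except json.JSONDecodeError:
        if mode == "FAILFAST":
            raise  # fail the whole read immediately
        if mode == "DROPMALFORMED":
            return None  # row is silently skipped
        return {"data": None, corrupt_col: line}  # PERMISSIVE
```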

prefersDecimal = None class-attribute instance-attribute

If True, infer all floating-point values as Decimal. Default False.

Note

Used only for reading files. Ignored by parse_column method.
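
A stdlib analogue of the same idea, parsing floating-point literals as exact Decimal values instead of binary floats:

```python
import json
from decimal import Decimal

# 0.1 has no exact binary float representation; Decimal keeps it exact,
# which is what prefersDecimal=True asks Spark's schema inference to do.
value = json.loads('{"price": 0.1}', parse_float=Decimal)
```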

primitivesAsString = None class-attribute instance-attribute

If True, infer all primitive types (string, integer, float, boolean) as strings. Default False.

Note

Used only for reading files. Ignored by parse_column method.
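
A partial stdlib analogue (numbers only; unlike Spark's primitivesAsString, booleans are left untouched here):

```python
import json

# Coerce numeric primitives to strings at parse time instead of
# inferring integer/double column types.
value = json.loads('{"a": 1, "b": 2.5}', parse_int=str, parse_float=str)
```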

samplingRatio = Field(default=None, ge=0, le=1) class-attribute instance-attribute

While inferring schema, read the specified fraction of file rows. Default 1.

Note

Used only for reading files. Ignored by parse_column method.

timestampFormat = Field(default=None, min_length=1) class-attribute instance-attribute

String format for TimestampType() representation. Default is yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX].

timestampNTZFormat = Field(default=None, min_length=1) class-attribute instance-attribute

String format for TimestampNTZType() representation. Default is yyyy-MM-dd'T'HH:mm:ss[.SSS].

Note

Added in Spark 3.2.0

timezone = Field(default=None, min_length=1, alias='timeZone') class-attribute instance-attribute

Allows overriding the timezone used to parse or serialize date and timestamp values. By default, spark.sql.session.timeZone is used.

parse_column(column, schema)

Parses a JSON string column to a structured Spark SQL column using Spark's from_json function, based on the provided schema.

Added in 0.11.0

Parameters:

  • column (str | Column) –

    The name of the column or the column object containing JSON strings/bytes to parse.

  • schema (StructType | ArrayType | MapType) –

    The schema to apply when parsing the JSON data. This defines the structure of the output DataFrame column.

Returns:

  • Column

    Column with deserialized data, with the same structure as the provided schema. Column name is the same as input column.

Examples:

>>> from pyspark.sql.types import StructType, StructField, IntegerType, StringType
>>> from pyspark.sql.functions import decode
>>> from onetl.file.format import JSON
>>> df.show()
+----+--------------------+----------+---------+------+-----------------------+-------------+
|key |value               |topic     |partition|offset|timestamp              |timestampType|
+----+--------------------+----------+---------+------+-----------------------+-------------+
|[31]|[7B 22 6E 61 6D 6...|topicJSON |0        |0     |2024-04-24 16:51:11.739|0            |
|[32]|[7B 22 6E 61 6D 6...|topicJSON |0        |1     |2024-04-24 16:51:11.749|0            |
+----+--------------------+----------+---------+------+-----------------------+-------------+
>>> df.printSchema()
root
|-- key: binary (nullable = true)
|-- value: binary (nullable = true)
|-- topic: string (nullable = true)
|-- partition: integer (nullable = true)
|-- offset: integer (nullable = true)
|-- timestamp: timestamp (nullable = true)
|-- timestampType: integer (nullable = true)
>>> json = JSON()
>>> json_schema = StructType(
...     [
...         StructField("name", StringType(), nullable=True),
...         StructField("age", IntegerType(), nullable=True),
...     ],
... )
>>> parsed_df = df.select(decode("key", "UTF-8").alias("key"), json.parse_column("value", json_schema))
>>> parsed_df.show()
+---+-----------+
|key|value      |
+---+-----------+
|1  |{Alice, 20}|
|2  |  {Bob, 25}|
+---+-----------+
>>> parsed_df.printSchema()
root
|-- key: string (nullable = true)
|-- value: struct (nullable = true)
|    |-- name: string (nullable = true)
|    |-- age: integer (nullable = true)

serialize_column(column)

Serializes a structured Spark SQL column into a JSON string column using Spark's to_json function.

Added in 0.11.0

Parameters:

  • column (str | Column) –

    The name of the column or the column object containing the data to serialize to JSON format.

Returns:

  • Column

    Column with string JSON data. Column name is the same as input column.

Examples:

>>> from pyspark.sql.functions import decode
>>> from onetl.file.format import JSON
>>> df.show()
+---+-----------+
|key|value      |
+---+-----------+
|1  |{Alice, 20}|
|2  |  {Bob, 25}|
+---+-----------+
>>> df.printSchema()
root
|-- key: string (nullable = true)
|-- value: struct (nullable = true)
|    |-- name: string (nullable = true)
|    |-- age: integer (nullable = true)
>>> # serializing data into JSON format
>>> json = JSON()
>>> serialized_df = df.select("key", json.serialize_column("value"))
>>> serialized_df.show(truncate=False)
+---+-------------------------+
|key|value                    |
+---+-------------------------+
|  1|{"name":"Alice","age":20}|
|  2|{"name":"Bob","age":25}  |
+---+-------------------------+
>>> serialized_df.printSchema()
root
|-- key: string (nullable = true)
|-- value: string (nullable = true)