
PySpark - Read and Write JSON


Spark provides flexible DataFrameReader and DataFrameWriter APIs to support reading and writing JSON data.

Write as JSON format

Let's first look into an example of saving a DataFrame as JSON format.

from pyspark.sql import SparkSession

appName = "PySpark Example - Save as JSON"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# List
data = [{
    'col1': 'Category A',
    'col2': 100
}, {
    'col1': 'Category B',
    'col2': 200
}, {
    'col1': 'Category C',
    'col2': 300
}]

# Create DF and save as JSON
df = spark.createDataFrame(data)
df.show()
df.write.format('json').mode('overwrite').save(
    'file:///home/kontext/pyspark-examples/data/json-example')

Run the above script (saved as 'write-json.py') using the spark-submit command:

spark-submit write-json.py

This script creates a DataFrame with the following content:

+----------+----+
|      col1|col2|
+----------+----+
|Category A| 100|
|Category B| 200|
|Category C| 300|
+----------+----+

The data is saved as a JSON Lines file:

$ ls /home/kontext/pyspark-examples/data/json-example
_SUCCESS  part-00000-e489b900-bfe0-4204-b014-6600b93f99cb-c000.json 
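
Spark writes the output as a folder of part files (one per partition) plus a _SUCCESS marker. If a single output file is preferred for a small dataset, the DataFrame can be coalesced to one partition before writing. A minimal sketch, reusing df from the example above (the 'json-example-single' path is only illustrative):

# Coalescing to 1 partition produces a single part file.
# Only do this for small datasets - it removes write parallelism.
df.coalesce(1).write.format('json').mode('overwrite').save(
    'file:///home/kontext/pyspark-examples/data/json-example-single')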

Read JSON

Now let's read JSON file back as DataFrame using the following code:

from pyspark.sql import SparkSession

appName = "PySpark Example - Read JSON"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()


# Read JSON back into a DataFrame
df = spark.read.format('json').load(
    'file:///home/kontext/pyspark-examples/data/json-example')
df.show()

Run the script 'read-json.py' using the following command:

spark-submit read-json.py

The following output is printed:

+----------+----+
|      col1|col2|
+----------+----+
|Category A| 100|
|Category B| 200|
|Category C| 300|
+----------+----+
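
As a side note, spark.read.json(path) is a shorthand for the format('json').load(path) call above. You can also supply an explicit schema to skip schema inference. A minimal sketch matching the example columns:

from pyspark.sql.types import StructType, StructField, StringType, LongType

# An explicit schema avoids an extra pass over the data for inference
schema = StructType([
    StructField('col1', StringType(), True),
    StructField('col2', LongType(), True)
])
df = spark.read.schema(schema).json(
    'file:///home/kontext/pyspark-examples/data/json-example')
df.show()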

About read and write options

A number of options can be applied when reading and writing JSON files.

Refer to JSON Files - Spark 3.3.0 Documentation for more details. 
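
For example, the multiLine read option allows parsing JSON documents that span multiple lines (instead of one object per line), and the compression write option sets the output codec. A minimal sketch of both (the 'json-multiline' and 'json-gzip' paths are only illustrative):

# Read a file containing multi-line JSON documents (or an array of objects)
df = spark.read.option('multiLine', True).json(
    'file:///home/kontext/pyspark-examples/data/json-multiline')

# Write JSON Lines output compressed with gzip
df.write.mode('overwrite').option('compression', 'gzip').json(
    'file:///home/kontext/pyspark-examples/data/json-gzip')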

Read nested JSON data

The above examples deal with a very simple JSON schema. What if your input JSON has nested data? For example, change the input data to the following:

data = [{
    'col1': 'Category A',
    'col2': 100,
    'col3': {'a': 1, 'b': 2}
}, {
    'col1': 'Category B',
    'col2': 200,
    'col3': {'a': 10, 'b': 20}
}]

The script now generates a JSON file with the following content:

{"col1":"Category A","col2":100,"col3":{"a":1,"b":2}}
{"col1":"Category B","col2":200,"col3":{"a":10,"b":20}}

The DataFrame object is created with the following content and schema:

+----------+----+------------------+
|      col1|col2|              col3|
+----------+----+------------------+
|Category A| 100|  {a -> 1, b -> 2}|
|Category B| 200|{a -> 10, b -> 20}|
+----------+----+------------------+

StructType([StructField('col1', StringType(), True), StructField('col2', LongType(), True), StructField('col3', MapType(StringType(), LongType(), True), True)])
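
Because col3 is inferred as a MapType when the DataFrame is created from Python dictionaries, individual values can be selected with bracket or getItem syntax. A minimal sketch against the df created above:

from pyspark.sql.functions import col

# Pull map values out into top-level columns
df.select(
    'col1',
    col('col3')['a'].alias('a'),
    col('col3').getItem('b').alias('b')
).show()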

We can now read the data back using the previous read-json.py script. It creates a DataFrame like the following:

+----------+----+--------+
|      col1|col2|    col3|
+----------+----+--------+
|Category A| 100|  {1, 2}|
|Category B| 200|{10, 20}|
+----------+----+--------+

The schema is inferred automatically:

StructType([StructField('col1', StringType(), True), StructField('col2', LongType(), True), StructField('col3', StructType([StructField('a', LongType(), True), StructField('b', LongType(), True)]), True)])
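
Since col3 is now a struct, its fields can be referenced with dot notation, which makes it easy to flatten the nested data. A minimal sketch against the DataFrame read back above:

# Flatten the nested struct into top-level columns
df.select('col1', 'col2', 'col3.a', 'col3.b').show()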