PySpark - Read and Write JSON

2022-07-04 | Kontext | Spark & PySpark

Spark provides flexible DataFrameReader and DataFrameWriter APIs for reading and writing JSON data.

Write JSON

Let's first look at an example of saving a DataFrame in JSON format.

from pyspark.sql import SparkSession

appName = "PySpark Example - Save as JSON"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# Sample records as a list of dictionaries
data = [{
    'col1': 'Category A',
    'col2': 100
}, {
    'col1': 'Category B',
    'col2': 200
}, {
    'col1': 'Category C',
    'col2': 300
}]

# Create DF and save as JSON
df = spark.createDataFrame(data)
df.show()
df.write.format('json').mode('overwrite').save(
    'file:///home/kontext/pyspark-examples/data/json-example')

Save the above script as 'write-json.py' and run it using the spark-submit command:

spark-submit write-json.py

This script creates a DataFrame with the following content:

+----------+----+
|      col1|col2|
+----------+----+
|Category A| 100|
|Category B| 200|
|Category C| 300|
+----------+----+

The data is saved in JSON Lines format (one JSON object per line) as a part file in the output directory:

$ ls /home/kontext/pyspark-examples/data/json-example
_SUCCESS  part-00000-e489b900-bfe0-4204-b014-6600b93f99cb-c000.json 
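
Each part file holds one JSON object per line rather than a single JSON array. As a rough illustration of that layout, the following plain-Python sketch parses text of the same shape without Spark. The sample string is written by hand from the DataFrame above (an assumption about the file's content, not a copy of the actual part file).

```python
import json

# Hand-written sample in the JSON Lines layout: one JSON object per line.
# (Assumed to mirror the part file's content; not copied from it.)
sample = (
    '{"col1":"Category A","col2":100}\n'
    '{"col1":"Category B","col2":200}\n'
    '{"col1":"Category C","col2":300}\n'
)

# Parse each non-empty line as an independent JSON document.
records = [json.loads(line) for line in sample.splitlines() if line.strip()]

for record in records:
    print(record["col1"], record["col2"])
```

Because every line is a complete document, Spark can split and process such files line by line.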

Read JSON

Now let's read the JSON files back into a DataFrame using the following code:

from pyspark.sql import SparkSession

appName = "PySpark Example - Read JSON"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()


# Read the JSON files into a DataFrame
df = spark.read.format('json').load(
    'file:///home/kontext/pyspark-examples/data/json-example')
df.show()

Run the script 'read-json.py' using the following command:

spark-submit read-json.py

The script prints the following output:

+----------+----+
|      col1|col2|
+----------+----+
|Category A| 100|
|Category B| 200|
|Category C| 300|
+----------+----+
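
The format('json') calls used in both scripts also have shorthand equivalents, DataFrameWriter.json and DataFrameReader.json. Here is a minimal sketch, assuming `spark` and `df` are the session and DataFrame created in the earlier scripts; the helper names `save_as_json` and `load_json` are made up for this illustration.

```python
def save_as_json(df, path):
    """Write df as JSON Lines files, replacing any existing output at path."""
    # Shorthand for df.write.format('json').mode('overwrite').save(path)
    df.write.json(path, mode="overwrite")


def load_json(spark, path):
    """Read the JSON Lines files under path into a DataFrame."""
    # Shorthand for spark.read.format('json').load(path)
    return spark.read.json(path)
```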

About read and write options

There are a number of read and write options that can be applied when reading and writing JSON files. 

Refer to JSON Files - Spark 3.3.0 Documentation for more details. 
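
As a rough sketch of how such options are passed, the helpers below set the `compression` option on write and the `mode` option on read, both of which are listed in the Spark JSON data source documentation. The function names and the chosen values are illustrative assumptions, not part of the original scripts.

```python
# Illustrative option values (assumptions for this sketch).
WRITE_OPTIONS = {"compression": "gzip"}  # compress each part file with gzip
READ_OPTIONS = {"mode": "FAILFAST"}      # raise an error on malformed records


def write_json_with_options(df, path):
    """Save df as gzip-compressed JSON Lines files under path."""
    (df.write.format("json")
        .options(**WRITE_OPTIONS)
        .mode("overwrite")
        .save(path))


def read_json_with_options(spark, path):
    """Load the JSON files under path, failing fast on malformed input."""
    return (spark.read.format("json")
            .options(**READ_OPTIONS)
            .load(path))
```

With `spark` and `df` from the earlier scripts, calling `write_json_with_options(df, path)` and then `read_json_with_options(spark, path)` should return the same rows, since Spark decompresses gzip part files when reading.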

Read nested JSON data

The above examples deal with a very simple JSON schema. What if your input JSON has nested data? For example, change the input data in write-json.py to the following:

data = [{
    'col1': 'Category A',
    'col2': 100,
    'col3': {'a': 1, 'b': 2}
}, {
    'col1': 'Category B',
    'col2': 200,
    'col3': {'a': 10, 'b': 20}
}]

The script now generates a JSON file with the following content:

{"col1":"Category A","col2":100,"col3":{"a":1,"b":2}}
{"col1":"Category B","col2":200,"col3":{"a":10,"b":20}}

The DataFrame object is created with the following content and schema. Note that the nested dictionary in col3 is inferred as a map column:

+----------+----+------------------+
|      col1|col2|              col3|
+----------+----+------------------+
|Category A| 100|  {a -> 1, b -> 2}|
|Category B| 200|{a -> 10, b -> 20}|
+----------+----+------------------+

StructType([StructField('col1', StringType(), True), StructField('col2', LongType(), True), StructField('col3', MapType(StringType(), LongType(), True), True)])

We can now read the data back using the previous read-json.py script. It creates a DataFrame like the following:

+----------+----+--------+
|      col1|col2|    col3|
+----------+----+--------+
|Category A| 100|  {1, 2}|
|Category B| 200|{10, 20}|
+----------+----+--------+

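The schema is inferred automatically from the JSON files. This time col3 is inferred as a struct with fields a and b rather than as a map: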

StructType([StructField('col1', StringType(), True), StructField('col2', LongType(), True), StructField('col3', StructType([StructField('a', LongType(), True), StructField('b', LongType(), True)]), True)])
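
Because col3 is a map in the original DataFrame but a struct after reading, it can help to be explicit about which one you want. The sketch below shows two possibilities: passing a DDL schema string so that col3 is read as a map again, and flattening the inferred struct into top-level columns. The helper names and the DDL string are assumptions written for this example.

```python
# DDL schema string (assumed for this sketch) that keeps col3 as a map.
MAP_SCHEMA = "col1 STRING, col2 BIGINT, col3 MAP<STRING, BIGINT>"


def read_with_map_schema(spark, path):
    """Read the JSON files with an explicit schema instead of inference."""
    return spark.read.format("json").schema(MAP_SCHEMA).load(path)


def flatten_col3(df):
    """Promote the struct fields col3.a and col3.b to top-level columns."""
    return df.select("col1", "col2", "col3.a", "col3.b")
```

Calling `df.printSchema()` on either result is a quick way to check which type col3 ended up with.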