PySpark - Read and Write JSON
Spark provides flexible DataFrameReader and DataFrameWriter APIs to support reading and writing JSON data.
Write as JSON format
Let's first look at an example of saving a DataFrame in JSON format.
from pyspark.sql import SparkSession

appName = "PySpark Example - Save as JSON"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# Sample data as a list of dictionaries
data = [{'col1': 'Category A', 'col2': 100},
        {'col1': 'Category B', 'col2': 200},
        {'col1': 'Category C', 'col2': 300}]

# Create DataFrame and save it as JSON
df = spark.createDataFrame(data)
df.show()
df.write.format('json').mode('overwrite').save(
    'file:///home/kontext/pyspark-examples/data/json-example')
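The same write can also be expressed with the DataFrameWriter.json shorthand; a minimal equivalent sketch:

# Shorthand for the format('json')...save() chain above
df.write.json('file:///home/kontext/pyspark-examples/data/json-example',
              mode='overwrite')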
Run the above script file 'write-json.py' using the spark-submit command:
spark-submit write-json.py
This script creates a DataFrame with the following content:
+----------+----+
|      col1|col2|
+----------+----+
|Category A| 100|
|Category B| 200|
|Category C| 300|
+----------+----+
The data is saved as a JSON Lines file:
$ ls /home/kontext/pyspark-examples/data/json-example
_SUCCESS
part-00000-e489b900-bfe0-4204-b014-6600b93f99cb-c000.json
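Spark writes one part file per partition of the DataFrame; this example runs with a single local partition, so only one part file appears. On a larger dataset you can reduce the number of output files by coalescing before the write, as in this sketch:

# Merge all partitions into one before writing, producing a single part file
df.coalesce(1).write.format('json').mode('overwrite').save(
    'file:///home/kontext/pyspark-examples/data/json-example')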
Read JSON
Now let's read the JSON file back into a DataFrame using the following code:
from pyspark.sql import SparkSession

appName = "PySpark Example - Read JSON"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# Read the JSON files into a DataFrame
df = spark.read.format('json').load(
    'file:///home/kontext/pyspark-examples/data/json-example')
df.show()
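As with writing, the DataFrameReader.json shorthand does the same thing:

# Shorthand for the format('json')...load() chain above
df = spark.read.json('file:///home/kontext/pyspark-examples/data/json-example')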
Run the script file 'read-json.py' using the following command:

spark-submit read-json.py
The following text is printed out:
+----------+----+
|      col1|col2|
+----------+----+
|Category A| 100|
|Category B| 200|
|Category C| 300|
+----------+----+
About read and write options
There are a number of read and write options that can be applied when reading and writing JSON files.
Refer to JSON Files - Spark 3.3.0 Documentation for more details.
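For example, two commonly used options are multiLine for reading standard (non-JSON Lines) documents and compression for writing compressed output. The paths below are placeholders:

# Read a regular JSON document that spans multiple lines
# (by default Spark expects one JSON object per line)
df = spark.read.option('multiLine', True).json('/path/to/multiline.json')

# Write gzip-compressed JSON Lines files
df.write.option('compression', 'gzip').mode('overwrite').json('/path/to/output')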
Read nested JSON data
The above examples deal with a very simple JSON schema. What if your input JSON has nested data? For example, change the input data to the following:
data = [{'col1': 'Category A', 'col2': 100, 'col3': {'a': 1, 'b': 2}},
        {'col1': 'Category B', 'col2': 200, 'col3': {'a': 10, 'b': 20}}]
The script now generates a JSON file with the following content:
{"col1":"Category A","col2":100,"col3":{"a":1,"b":2}} {"col1":"Category B","col2":200,"col3":{"a":10,"b":20}}
The DataFrame object is created with the following content and schema:
+----------+----+------------------+
|      col1|col2|              col3|
+----------+----+------------------+
|Category A| 100|  {a -> 1, b -> 2}|
|Category B| 200|{a -> 10, b -> 20}|
+----------+----+------------------+

StructType([StructField('col1', StringType(), True), StructField('col2', LongType(), True), StructField('col3', MapType(StringType(), LongType(), True), True)])
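Note that nested Python dictionaries are inferred as MapType when the DataFrame is created, so values in col3 can be extracted by key with Column.getItem; a short sketch:

from pyspark.sql import functions as F

# Extract the value stored under key 'a' in the col3 map
df.select('col1', F.col('col3').getItem('a').alias('col3_a')).show()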
We can now read the data back using the previous read-json.py script. It creates a DataFrame like the following:
+----------+----+--------+
|      col1|col2|    col3|
+----------+----+--------+
|Category A| 100|  {1, 2}|
|Category B| 200|{10, 20}|
+----------+----+--------+
The schema is inferred automatically:
StructType([StructField('col1', StringType(), True), StructField('col2', LongType(), True), StructField('col3', StructType([StructField('a', LongType(), True), StructField('b', LongType(), True)]), True)])
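Since the read-back col3 is now a struct rather than a map, its fields can be selected directly with dot notation, for example df.select('col1', 'col3.a', 'col3.b'). If the structure is known in advance, schema inference can be skipped entirely by passing the schema to the reader; a sketch based on the schema above:

from pyspark.sql.types import StructType, StructField, StringType, LongType

schema = StructType([
    StructField('col1', StringType(), True),
    StructField('col2', LongType(), True),
    StructField('col3', StructType([
        StructField('a', LongType(), True),
        StructField('b', LongType(), True)]), True)])

# Supplying the schema avoids an extra pass over the data to infer it
df = spark.read.schema(schema).json(
    'file:///home/kontext/pyspark-examples/data/json-example')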