Read JSON file as Spark DataFrame in Python / Spark

visibility 18,407 event 2019-11-18 access_time 2 years ago language English
more_vert

Spark has easy fluent APIs that can be used to read data from JSON file as DataFrame object. 

In this code example,  JSON file named 'example.json' has the following content:

[ { "Category": "Category A", "Count": 100, "Description": "This is category A" }, { "Category": "Category B", "Count": 120, "Description": "This is category B" }, { "Category": "Category C", "Count": 150, "Description": "This is category C" } ]



The file is loaded as a Spark DataFrame using SparkSession.read.json function.

multiLine=True argument is important as the JSON file content is across multiple lines. 

Code snippet

from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, StructField, StructType, StringType, IntegerType

appName = "PySpark Example - JSON file to Spark Data Frame"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# Create a schema for the dataframe
schema = StructType([
    StructField('Category', StringType(), True),
    StructField('Count', IntegerType(), True),
    StructField('Description', StringType(), True)
])

# Create data frame
json_file_path = 'data/example.json'
df = spark.read.json(json_file_path, schema, multiLine=True)
print(df.schema)
df.show()
info Last modified by Raymond 2 years ago copyright This page is subject to Site terms.
Like this article?
Share on

Please log in or register to comment.

account_circle Log in person_add Register

Log in with external accounts