Spark provides simple, fluent APIs for reading data from a JSON file into a DataFrame object.
In this code example, the JSON file named 'example.json' has the following content:
[
  {
    "Category": "Category A",
    "Count": 100,
    "Description": "This is category A"
  },
  {
    "Category": "Category B",
    "Count": 120,
    "Description": "This is category B"
  },
  {
    "Category": "Category C",
    "Count": 150,
    "Description": "This is category C"
  }
]
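The file is loaded as a Spark DataFrame using the SparkSession.read.json function.
The multiLine=True argument is required here because each JSON record in the file spans multiple lines; by default, Spark expects one complete JSON object per line (the JSON Lines format).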
Code snippet
from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, StructField, StructType, StringType, IntegerType

appName = "PySpark Example - JSON file to Spark Data Frame"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# Create a schema for the dataframe
schema = StructType([
    StructField('Category', StringType(), True),
    StructField('Count', IntegerType(), True),
    StructField('Description', StringType(), True)
])

# Create data frame
json_file_path = 'data/example.json'
df = spark.read.json(json_file_path, schema, multiLine=True)
print(df.schema)
df.show()
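
As a side note on why multiLine=True matters: the alternative is to store the same records in JSON Lines layout, with one object per line, which is what Spark's JSON reader expects by default. The sketch below is only an illustration using the Python standard library; the helper name to_json_lines and the output file name example.jsonl are assumptions made for this example and are not part of the original code.

```python
import json
import os
import tempfile

# The same content as example.json: one JSON array spread over multiple lines.
multi_line_json = """
[
  {"Category": "Category A", "Count": 100, "Description": "This is category A"},
  {"Category": "Category B", "Count": 120, "Description": "This is category B"},
  {"Category": "Category C", "Count": 150, "Description": "This is category C"}
]
"""


def to_json_lines(json_array_text, out_path):
    """Hypothetical helper: rewrite a JSON array as one compact object per line."""
    records = json.loads(json_array_text)
    with open(out_path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    return len(records)


# Assumed output location (a temporary folder), chosen only for this sketch.
out_path = os.path.join(tempfile.mkdtemp(), "example.jsonl")
written = to_json_lines(multi_line_json, out_path)

# Read the file back to confirm that every line is a complete JSON object.
with open(out_path, encoding="utf-8") as f:
    lines = f.read().splitlines()

print(written, "records written to", out_path)
```

A file in this layout can be loaded with spark.read.json(path, schema) without the multiLine argument, since every line is already a complete JSON object.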