Spark provides a fluent API for reading data from a JSON file into a DataFrame object.
In this code example, a JSON file named 'example.json' has the following content:
[
  { "Category": "Category A", "Count": 100, "Description": "This is category A" },
  { "Category": "Category B", "Count": 120, "Description": "This is category B" },
  { "Category": "Category C", "Count": 150, "Description": "This is category C" }
]
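To try the example, the input file has to exist first. The following is a minimal sketch that writes the three records shown above to data/example.json using only the Python standard library. The helper name write_example_json and the use of a temporary directory are assumptions made for illustration; they are not part of the original example.

```python
import json
import tempfile
from pathlib import Path

# The three records from the example file content.
RECORDS = [
    {"Category": "Category A", "Count": 100, "Description": "This is category A"},
    {"Category": "Category B", "Count": 120, "Description": "This is category B"},
    {"Category": "Category C", "Count": 150, "Description": "This is category C"},
]


def write_example_json(base_dir):
    """Hypothetical helper: write RECORDS as a pretty-printed JSON array
    to <base_dir>/data/example.json and return the resulting path."""
    path = Path(base_dir) / "data" / "example.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    # indent=2 spreads the array over several lines, like the sample file.
    path.write_text(json.dumps(RECORDS, indent=2), encoding="utf-8")
    return path


# A temporary directory keeps this sketch side-effect free; pass "." instead
# to create data/example.json relative to the current working directory.
with tempfile.TemporaryDirectory() as tmp:
    path = write_example_json(tmp)
    text = path.read_text(encoding="utf-8")
    line_count = len(text.splitlines())
    loaded = json.loads(text)
    print(path.name, "spans", line_count, "lines and holds", len(loaded), "records")
```

Because the file is written with indentation, each record spans several lines, which is the layout the multiLine option is meant for.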
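The file is loaded as a Spark DataFrame using the SparkSession.read.json function.
The multiLine=True argument is required because the file is a single JSON array spread across multiple lines; by default, Spark expects JSON Lines input, with one complete JSON object per line.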
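The difference between the two layouts can be illustrated without Spark. The sketch below uses only Python's json module: it tries to parse each line on its own, which is roughly what a line-delimited reader assumes, and then parses the whole text as one document. The helper parse_per_line is hypothetical and only mimics that assumption; how Spark itself treats lines it cannot parse depends on its parse mode and is not shown here.

```python
import json

records = [
    {"Category": "Category A", "Count": 100, "Description": "This is category A"},
    {"Category": "Category B", "Count": 120, "Description": "This is category B"},
    {"Category": "Category C", "Count": 150, "Description": "This is category C"},
]

# Layout 1: a pretty-printed JSON array spanning many lines.
pretty_text = json.dumps(records, indent=2)

# Layout 2: JSON Lines, one complete object per line.
json_lines_text = "\n".join(json.dumps(record) for record in records)


def parse_per_line(text):
    """Hypothetical helper: treat every non-empty line as its own JSON document."""
    parsed, failed = [], []
    for line in text.splitlines():
        if not line.strip():
            continue
        try:
            parsed.append(json.loads(line))
        except json.JSONDecodeError:
            failed.append(line)
    return parsed, failed


pretty_ok, pretty_failed = parse_per_line(pretty_text)
lines_ok, lines_failed = parse_per_line(json_lines_text)

# Parsing the pretty-printed text as a whole recovers all records.
whole_document = json.loads(pretty_text)

print("pretty-printed, per line:", len(pretty_ok), "parsed,", len(pretty_failed), "failed")
print("JSON Lines, per line:", len(lines_ok), "parsed,", len(lines_failed), "failed")
print("pretty-printed, whole document:", len(whole_document), "records")
```

No single line of the pretty-printed array is a complete JSON document, so per-line parsing recovers nothing from it, while the JSON Lines layout parses cleanly line by line.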
Code snippet
from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, StringType, IntegerType
appName = "PySpark Example - JSON file to Spark Data Frame"
master = "local"
# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()
# Create a schema for the dataframe
schema = StructType([
    StructField('Category', StringType(), True),
    StructField('Count', IntegerType(), True),
    StructField('Description', StringType(), True)
])
# Create the data frame; multiLine=True lets Spark parse a JSON array that spans several lines
json_file_path = 'data/example.json'
df = spark.read.json(json_file_path, schema=schema, multiLine=True)
print(df.schema)
df.show()