Spark provides a fluent API for reading a JSON file into a DataFrame. In this example, a JSON file named 'example.json' has the following content:
[
  {
    "Category": "Category A",
    "Count": 100,
    "Description": "This is category A"
  },
  {
    "Category": "Category B",
    "Count": 120,
    "Description": "This is category B"
  },
  {
    "Category": "Category C",
    "Count": 150,
    "Description": "This is category C"
  }
]
In the code snippet, the following option is important: it tells Spark to parse JSON records that span multiple lines:
option("multiLine", true)
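To see why this option matters, here is a small sketch of what the default reader does with content like this (the sample is written to a temporary file so the snippet runs on its own). Without multiLine, spark.read.json expects JSON Lines, i.e. one complete object per line, so a pretty-printed array is flagged as malformed and, under the default PERMISSIVE parse mode, surfaces only as a _corrupt_record column:

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Recreate a pretty-printed JSON file so this sketch is self-contained.
// No single line is a complete JSON value on its own.
val tmp = Files.createTempFile("example", ".json")
Files.write(tmp, "[\n  {\n    \"Category\": \"Category A\",\n    \"Count\": 100\n  }\n]".getBytes("UTF-8"))

val spark = SparkSession.builder.appName("JSON without multiLine").master("local").getOrCreate()

// Without multiLine, the reader parses line by line; since no line parses
// as a full JSON record, Spark infers only a _corrupt_record column.
val broken = spark.read.json(tmp.toString)
broken.printSchema()
```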
Code snippet
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
val appName = "Scala Example - JSON file to Spark Data Frame"
val master = "local"
// Create a local Spark session.
val spark = SparkSession.builder.appName(appName).master(master).getOrCreate()

// Define the schema explicitly instead of relying on inference.
val schema = StructType(Seq(
  StructField("Category", StringType, true),
  StructField("Count", IntegerType, true),
  StructField("Description", StringType, true)
))

val jsonFilePath = "data/example.json"

// multiLine lets the reader parse JSON records that span multiple lines.
val df = spark.read.option("multiLine", true).schema(schema).json(jsonFilePath)

println(df.schema)
df.show()
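Once the DataFrame is loaded, the usual DataFrame API applies. As a self-contained follow-up sketch (it writes the sample JSON to a temporary file first, so the paths above are not assumed), here is a simple aggregation over the Count column:

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

// Write the sample JSON from above to a temp file so the example is self-contained.
def writeSample(): String = {
  val json =
    """[
      |  {"Category": "Category A", "Count": 100, "Description": "This is category A"},
      |  {"Category": "Category B", "Count": 120, "Description": "This is category B"},
      |  {"Category": "Category C", "Count": 150, "Description": "This is category C"}
      |]""".stripMargin
  val path = Files.createTempFile("example", ".json")
  Files.write(path, json.getBytes("UTF-8"))
  path.toString
}

val spark = SparkSession.builder.appName("MultiLine JSON demo").master("local").getOrCreate()
val df = spark.read.option("multiLine", true).json(writeSample())

// Aggregate over the loaded rows: 100 + 120 + 150.
val total = df.agg(sum("Count")).first().getLong(0)
```

Note that here the schema is inferred from the data rather than supplied explicitly; both approaches work, but an explicit schema (as in the main snippet) avoids an extra pass over the file and guards against type drift.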