Read JSON file as Spark DataFrame in Scala / Spark

2019-07-10 · scala · spark-2-x

Spark provides easy, fluent APIs for reading data from a JSON file into a DataFrame object.


In this code example, a JSON file named 'example.json' has the following content:
[
  {
    "Category": "Category A",
    "Count": 100,
    "Description": "This is category A"
  },
  {
    "Category": "Category B",
    "Count": 120,
    "Description": "This is category B"
  },
  {
    "Category": "Category C",
    "Count": 150,
    "Description": "This is category C"
  }
]

In the code snippet, the following option is important to let Spark handle multi-line JSON content:

option("multiLine", true)
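
For comparison, the default reader (multiLine = false) expects JSON Lines input: one complete JSON object per physical line. The same data in that layout would look like the following (a hypothetical file 'example_lines.json'; with this layout the multiLine option is not needed):

{"Category": "Category A", "Count": 100, "Description": "This is category A"}
{"Category": "Category B", "Count": 120, "Description": "This is category B"}
{"Category": "Category C", "Count": 150, "Description": "This is category C"}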

Code snippet

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val appName = "Scala Example - JSON file to Spark Data Frame"
val master = "local"

/* Create Spark session. */
val spark = SparkSession.builder.appName(appName).master(master).getOrCreate()

val schema = StructType(Seq(
  StructField("Category", StringType, true),
StructField("Count", IntegerType, true),
StructField("Description", StringType, true)
))

val jsonFilePath = "data/example.json"
val df = spark.read.option("multiLine", true).schema(schema).json(jsonFilePath)
println(df.schema)
df.show()
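
For reference, with the sample data above the last two statements should produce output along these lines (exact formatting can vary slightly between Spark versions):

StructType(StructField(Category,StringType,true), StructField(Count,IntegerType,true), StructField(Description,StringType,true))
+----------+-----+------------------+
|  Category|Count|       Description|
+----------+-----+------------------+
|Category A|  100|This is category A|
|Category B|  120|This is category B|
|Category C|  150|This is category C|
+----------+-----+------------------+

If defining the schema by hand is not important, it can also be inferred from the data. A minimal sketch (note that Spark infers whole numbers such as Count as LongType rather than IntegerType):

// Let Spark infer the schema from the JSON content.
val dfInferred = spark.read.option("multiLine", true).json(jsonFilePath)
dfInferred.printSchema()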