Scala: Parse JSON String as Spark DataFrame
This article shows how to convert a JSON string to a Spark DataFrame using Scala. The approach is suited to processing small, in-memory JSON strings.
Sample JSON string
The following sample JSON string will be used. It is a simple JSON array with three items. Each item has two attributes, ID and ATTR1, of type integer and string respectively.
[ {"ID":1,"ATTR1":"ABC"}, {"ID":2,"ATTR1":"DEF"}, {"ID":3,"ATTR1":"GHI"} ]
Read JSON string
In Spark, the json method of the DataFrameReader object can be used to read JSON:
def json(jsonDataset: Dataset[String]): DataFrame
Refer to the official Spark API documentation for DataFrameReader for more details about this function.
*Note - this overload, which accepts a Dataset[String], is available from Spark 2.2.0 onwards.
To create a DataFrame object, we first need to convert the JSON string to a Dataset[String].

import org.apache.spark.sql._
import org.apache.spark.sql.types._

val json = """[ {"ID":1,"ATTR1":"ABC"}, {"ID":2,"ATTR1":"DEF"}, {"ID":3,"ATTR1":"GHI"}]"""
// toDS() relies on spark.implicits._, which spark-shell imports automatically
val jsonDataset = Seq(json).toDS()

The output of jsonDataset is like the following:
jsonDataset: org.apache.spark.sql.Dataset[String] = [value: string]
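Outside spark-shell, the toDS() call above compiles only once a SparkSession exists and its implicits are imported. The following is a minimal sketch of a standalone application, assuming a local master; the object name, method name, and application name are made up for illustration and are not part of the original example.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object JsonStringToDataset {
  // Wraps a single JSON string in a one-element Dataset[String].
  def toDataset(spark: SparkSession, json: String): Dataset[String] = {
    // Brings the String encoder into scope so that Seq(...).toDS() compiles.
    import spark.implicits._
    Seq(json).toDS()
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; app name and master are assumptions.
    val spark = SparkSession.builder()
      .appName("json-string-example")
      .master("local[*]")
      .getOrCreate()

    val json = """[{"ID":1,"ATTR1":"ABC"},{"ID":2,"ATTR1":"DEF"},{"ID":3,"ATTR1":"GHI"}]"""
    val jsonDataset = toDataset(spark, json)

    // The dataset holds one element: the whole JSON array as a single string.
    println(jsonDataset.count())

    spark.stop()
  }
}
```

Note that the dataset contains the JSON text itself, not parsed rows; parsing happens only when the dataset is passed to spark.read.json.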
Now, we can use the read method of the SparkSession object to read directly from the above dataset:

val df = spark.read.json(jsonDataset)
df: org.apache.spark.sql.DataFrame = [ATTR1: string, ID: bigint]
Spark automatically infers the schema from the JSON and maps each attribute to a Spark data type. Note that the integer attribute ID is inferred as bigint (LongType), not as a 32-bit integer.
The content of the data frame looks like the following:
scala> df.show()
+-----+---+
|ATTR1| ID|
+-----+---+
|  ABC|  1|
|  DEF|  2|
|  GHI|  3|
+-----+---+
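When the inferred types are not the ones wanted, a schema can be supplied to the reader instead of relying on inference. The sketch below is one way to do that, assuming ID should be a 32-bit integer; the object name and the sampleSchema value are hypothetical and chosen only to match the sample data.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object JsonStringWithSchema {
  // Hypothetical schema for the sample data: ID is declared as an integer
  // instead of the bigint that schema inference would produce.
  val sampleSchema: StructType = StructType(Seq(
    StructField("ID", IntegerType, nullable = true),
    StructField("ATTR1", StringType, nullable = true)
  ))

  // Reads one JSON string using the explicit schema rather than inferring it.
  def read(spark: SparkSession, json: String): DataFrame = {
    import spark.implicits._
    spark.read.schema(sampleSchema).json(Seq(json).toDS())
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; app name and master are assumptions.
    val spark = SparkSession.builder()
      .appName("json-schema-example")
      .master("local[*]")
      .getOrCreate()

    val json = """[{"ID":1,"ATTR1":"ABC"},{"ID":2,"ATTR1":"DEF"},{"ID":3,"ATTR1":"GHI"}]"""
    val df = read(spark, json)
    df.printSchema()
    df.show()

    spark.stop()
  }
}
```

With an explicit schema the columns follow the order declared in the schema, and Spark skips the extra pass over the data that inference requires.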
Read from multiple JSON string variables
In the above example, we only read data from one JSON string object. To read several JSON strings at once, we can put them in a Seq and convert it to a single Dataset[String].
The following code snippet shows how to do that:
val json1 = """[ {"ID":4,"ATTR1":"123"}, {"ID":5,"ATTR1":"456"}, {"ID":6,"ATTR1":"789"}]"""
spark.read.json(Seq(json,json1).toDS()).show()
Output:
scala> spark.read.json(Seq(json,json1).toDS()).show()
+-----+---+
|ATTR1| ID|
+-----+---+
|  ABC|  1|
|  DEF|  2|
|  GHI|  3|
|  123|  4|
|  456|  5|
|  789|  6|
+-----+---+
The schema of the DataFrame contains two fields, ATTR1 and ID, of type StringType and LongType respectively:

scala> spark.read.json(Seq(json,json1).toDS()).schema
res5: org.apache.spark.sql.types.StructType = StructType(StructField(ATTR1,StringType,true), StructField(ID,LongType,true))
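The two strings above share the same attributes, so the inferred schema is unchanged. When the strings differ, schema inference produces the union of all attributes seen and fills the missing ones with null. The sketch below illustrates this under the assumption of a second, hypothetical attribute named ATTR2; the object name is also made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object JsonStringsMergedSchema {
  // Reads any number of JSON strings into one DataFrame with an inferred schema.
  def read(spark: SparkSession, jsonStrings: Seq[String]): DataFrame = {
    import spark.implicits._
    spark.read.json(jsonStrings.toDS())
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; app name and master are assumptions.
    val spark = SparkSession.builder()
      .appName("json-merged-schema-example")
      .master("local[*]")
      .getOrCreate()

    // ATTR2 is a hypothetical attribute that appears only in the second string.
    val first = """{"ID":1,"ATTR1":"ABC"}"""
    val second = """{"ID":2,"ATTR2":"XYZ"}"""

    val df = read(spark, Seq(first, second))
    df.printSchema() // lists ATTR1, ATTR2 and ID
    df.show()        // each row has null in the attribute it does not define

    spark.stop()
  }
}
```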
Summary
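A JSON string held in memory can be converted to a Spark DataFrame by wrapping it in a Dataset[String] and passing it to spark.read.json. When reading data directly from a database or from structured files, similar read functions can be used to easily convert the input dataset to a Spark DataFrame.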
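As an illustration of the file-based case, the sketch below writes the sample array to a temporary file and reads it back with spark.read.json. It assumes Spark 2.2 or later, where the multiLine option allows one JSON document to span several lines; the object name and the temporary file are hypothetical and used only to keep the example self-contained.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

object JsonFileToDataFrame {
  // multiLine lets a single JSON document (here an array) span several lines.
  def read(spark: SparkSession, path: String): DataFrame =
    spark.read.option("multiLine", "true").json(path)

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; app name and master are assumptions.
    val spark = SparkSession.builder()
      .appName("json-file-example")
      .master("local[*]")
      .getOrCreate()

    val content =
      """[
        |  {"ID":1,"ATTR1":"ABC"},
        |  {"ID":2,"ATTR1":"DEF"},
        |  {"ID":3,"ATTR1":"GHI"}
        |]""".stripMargin

    // Temporary file standing in for a real input file.
    val path = Files.createTempFile("sample", ".json")
    Files.write(path, content.getBytes(StandardCharsets.UTF_8))

    read(spark, path.toUri.toString).show()

    spark.stop()
  }
}
```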