Raymond Raymond

Scala: Parse JSON String as Spark DataFrame

2020-12-16

This article shows how to convert a JSON string to a Spark DataFrame using Scala. The approach is suited to processing small, in-memory JSON strings.

Sample JSON string

The following sample JSON string will be used. It is a simple JSON array with three items. Each item has two attributes, ID and ATTR1, whose values are an integer and a string respectively.

[
{"ID":1,"ATTR1":"ABC"},
{"ID":2,"ATTR1":"DEF"},
{"ID":3,"ATTR1":"GHI"}
]

Read JSON string

In Spark, the DataFrameReader object can be used to read JSON. One of its json overloads accepts a Dataset[String]:

def json(jsonDataset: Dataset[String]): DataFrame

Refer to the following official documentation for more details about this function. 

Spark 3.0.1 ScalaDoc - org.apache.spark.sql.DataFrameReader

*Note - this overload, which takes a Dataset[String], is available from Spark 2.2.0 onwards.

To create a DataFrame object, we first need to convert the JSON string to a Dataset[String].

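import org.apache.spark.sql._
import org.apache.spark.sql.types._
// Provides toDS(); spark-shell imports this automatically.
import spark.implicits._

val json = """[
{"ID":1,"ATTR1":"ABC"},
{"ID":2,"ATTR1":"DEF"},
{"ID":3,"ATTR1":"GHI"}]"""

// The whole JSON array is held as a single string record.
val jsonDataset = Seq(json).toDS()

The output of jsonDataset looks like the following: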

jsonDataset: org.apache.spark.sql.Dataset[String] = [value: string]
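Outside spark-shell, the `spark` session and the implicits that provide toDS() are not set up automatically. The following is a minimal sketch of the same conversion written as a script, assuming a local session; the app name and the `local[*]` master are illustrative choices, not part of the original example.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Assumption: a local session for experimentation.
// In spark-shell, getOrCreate() simply returns the existing session.
val spark = SparkSession.builder()
  .appName("json-string-to-dataset") // hypothetical app name
  .master("local[*]")
  .getOrCreate()

// Brings toDS() and the String encoder into scope.
import spark.implicits._

val json = """[
{"ID":1,"ATTR1":"ABC"},
{"ID":2,"ATTR1":"DEF"},
{"ID":3,"ATTR1":"GHI"}]"""

// Each element of the Seq becomes one record, so the whole
// JSON array is stored as a single string value.
val jsonDataset: Dataset[String] = Seq(json).toDS()

println(jsonDataset.count())
```

The dataset holds one record per string in the Seq; splitting the array into rows only happens later, when the JSON is parsed.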

Now we can use the read method of the SparkSession object to read directly from the above dataset:

val df = spark.read.json(jsonDataset)
df: org.apache.spark.sql.DataFrame = [ATTR1: string, ID: bigint]

Spark automatically inferred the schema from the JSON and mapped the values to Spark data types: ATTR1 as string and ID as bigint (LongType). Each item of the JSON array becomes one row.

The content of the data frame looks like the following:

scala> df.show()
+-----+---+
|ATTR1| ID|
+-----+---+
|  ABC|  1|
|  DEF|  2|
|  GHI|  3|
+-----+---+
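When the inferred types are not the ones wanted (for example, ID as an integer rather than bigint), a schema can be supplied to the reader instead of relying on inference. The following is a sketch under that assumption; the choice of IntegerType for ID and the app name are illustrative, not taken from the original example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Assumption: a local session; spark-shell already provides one.
val spark = SparkSession.builder()
  .appName("json-explicit-schema") // hypothetical app name
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val json = """[
{"ID":1,"ATTR1":"ABC"},
{"ID":2,"ATTR1":"DEF"},
{"ID":3,"ATTR1":"GHI"}]"""

// Declare the expected fields up front instead of letting Spark infer them.
val schema = StructType(Seq(
  StructField("ID", IntegerType, nullable = true),
  StructField("ATTR1", StringType, nullable = true)
))

// schema(...) is applied before json(...), so no inference pass is needed.
val typedDf = spark.read.schema(schema).json(Seq(json).toDS())
typedDf.printSchema()
```

With an explicit schema, the column order follows the declared fields rather than the alphabetical order produced by inference.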

Read from multiple JSON string variables

In the above example, we only read data from one JSON string. To read several JSON strings at once, put them all into the Seq before converting it to a dataset.

The following code snippet shows how to do that:

val json1 = """[
{"ID":4,"ATTR1":"123"},
{"ID":5,"ATTR1":"456"},
{"ID":6,"ATTR1":"789"}]"""

spark.read.json(Seq(json,json1).toDS()).show()

Output:

scala> spark.read.json(Seq(json,json1).toDS()).show()
+-----+---+
|ATTR1| ID|
+-----+---+
|  ABC|  1|
|  DEF|  2|
|  GHI|  3|
|  123|  4|
|  456|  5|
|  789|  6|
+-----+---+

The schema of the DataFrame contains two fields, ATTR1 and ID, with data types StringType and LongType respectively:

scala> spark.read.json(Seq(json,json1).toDS()).schema
res5: org.apache.spark.sql.types.StructType = StructType(StructField(ATTR1,StringType,true), StructField(ID,LongType,true))

Summary

When reading data directly from a database or from structured files, similar DataFrameReader functions (such as jdbc, csv and parquet) can be used to convert the input to a Spark DataFrame.
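As an illustration of the file-based case, the sketch below writes the sample JSON to a temporary file and reads it back. It assumes a local session and a local file system; the temporary file and the app name are hypothetical. By default the file reader expects one JSON document per line, so the multiLine option is set to let the array span several lines.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Assumption: a local session; spark-shell already provides one.
val spark = SparkSession.builder()
  .appName("json-file-read") // hypothetical app name
  .master("local[*]")
  .getOrCreate()

val json = """[
{"ID":1,"ATTR1":"ABC"},
{"ID":2,"ATTR1":"DEF"},
{"ID":3,"ATTR1":"GHI"}]"""

// Hypothetical temporary file used only for this illustration.
val path = Files.createTempFile("sample", ".json")
Files.write(path, json.getBytes(StandardCharsets.UTF_8))

// multiLine allows a single JSON document (here, the array) to span lines;
// toUri yields a file:// URI so the local file system is used.
val fileDf = spark.read
  .option("multiLine", true)
  .json(path.toUri.toString)

fileDf.show()
```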
