Scala: Parse JSON String as Spark DataFrame

This article shows how to convert a JSON string to a Spark DataFrame using Scala. The approach is suitable for processing small, in-memory JSON strings.

Sample JSON string

The following sample JSON string will be used. It is a simple JSON array with three items. Each item has two attributes, ID and ATTR1, with data types integer and string respectively.

[
{"ID":1,"ATTR1":"ABC"},
{"ID":2,"ATTR1":"DEF"},
{"ID":3,"ATTR1":"GHI"}
]

Read JSON string

In Spark, the DataFrameReader object can be used to read JSON. The following overload of its json method accepts a Dataset[String]:

def json(jsonDataset: Dataset[String]): DataFrame

Refer to the following official documentation for more details about this function. 

Spark 3.0.1 ScalaDoc - org.apache.spark.sql.DataFrameReader

Note: this overload, which takes a Dataset[String], is available from Spark 2.2.0 onwards.
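On older Spark versions that lack the Dataset[String] overload, the same reader also accepts an RDD[String]. The following is a minimal sketch of that alternative, not part of the original walkthrough. The application name and the single-line form of the sample JSON are illustrative choices, and the RDD overload has been deprecated since Spark 2.2.0.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell a session named `spark` already exists, so this line can be skipped there.
val spark = SparkSession.builder().master("local[*]").appName("json-from-rdd").getOrCreate()

val json = """[{"ID":1,"ATTR1":"ABC"},{"ID":2,"ATTR1":"DEF"},{"ID":3,"ATTR1":"GHI"}]"""

// Each RDD element is parsed as one JSON document.
val jsonRdd = spark.sparkContext.parallelize(Seq(json))

// json(RDD[String]) is deprecated since Spark 2.2.0 but still compiles (with a warning).
val dfFromRdd = spark.read.json(jsonRdd)
dfFromRdd.show()
```

On Spark 2.2.0 and later, the Dataset[String] overload is the preferred choice.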

To create a DataFrame object, we first need to convert the JSON string to a Dataset[String].

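import org.apache.spark.sql._
import org.apache.spark.sql.types._
// Provides the .toDS() conversion; spark-shell imports this automatically.
import spark.implicits._

val json = """[
{"ID":1,"ATTR1":"ABC"},
{"ID":2,"ATTR1":"DEF"},
{"ID":3,"ATTR1":"GHI"}]"""

// Wrap the string in a Seq and convert it to a Dataset[String] with a single element.
val jsonDataset = Seq(json).toDS()

The output of jsonDataset looks like the following: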

jsonDataset: org.apache.spark.sql.Dataset[String] = [value: string]

Now we can use the read method of the SparkSession object to read directly from the above dataset:

val df = spark.read.json(jsonDataset)
df: org.apache.spark.sql.DataFrame = [ATTR1: string, ID: bigint]

Spark inferred the schema from the JSON data and mapped the values to Spark data types: string for ATTR1 and bigint (LongType) for ID. The inferred columns are listed in alphabetical order.

The content of the data frame looks like the following:

scala> df.show()
+-----+---+
|ATTR1| ID|
+-----+---+
|  ABC|  1|
|  DEF|  2|
|  GHI|  3|
+-----+---+
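When the inferred types are not the ones you want (for example, ID as an integer rather than bigint), a schema can be passed to the reader instead of relying on inference. The sketch below assumes that this is the goal. The names schema and typedDf, the application name, and the single-line form of the sample JSON are illustrative and not part of the original example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

// In spark-shell, `spark` and its implicits are already available,
// so the next two lines can be skipped there.
val spark = SparkSession.builder().master("local[*]").appName("json-explicit-schema").getOrCreate()
import spark.implicits._ // provides .toDS()

val json = """[{"ID":1,"ATTR1":"ABC"},{"ID":2,"ATTR1":"DEF"},{"ID":3,"ATTR1":"GHI"}]"""

// Explicit schema: read ID as IntegerType instead of the inferred LongType.
val schema = StructType(Seq(
  StructField("ID", IntegerType, nullable = true),
  StructField("ATTR1", StringType, nullable = true)
))

// With a user-supplied schema, Spark skips the schema inference pass.
val typedDf = spark.read.schema(schema).json(Seq(json).toDS())
typedDf.printSchema()
```

With an explicit schema the columns follow the order declared in the StructType rather than the alphabetical order produced by inference.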

Read from multiple JSON string variables

In the above example, we read data from only one JSON string. We can use Seq to combine multiple JSON strings into a single Dataset[String].

The following code snippet shows how to do that:

val json1 = """[
{"ID":4,"ATTR1":"123"},
{"ID":5,"ATTR1":"456"},
{"ID":6,"ATTR1":"789"}]"""

spark.read.json(Seq(json,json1).toDS()).show()

Output:

scala> spark.read.json(Seq(json,json1).toDS()).show()
+-----+---+
|ATTR1| ID|
+-----+---+
|  ABC|  1|
|  DEF|  2|
|  GHI|  3|
|  123|  4|
|  456|  5|
|  789|  6|
+-----+---+

The schema of this DataFrame contains two fields, ATTR1 and ID, with data types StringType and LongType respectively:

scala> spark.read.json(Seq(json,json1).toDS()).schema
res5: org.apache.spark.sql.types.StructType = StructType(StructField(ATTR1,StringType,true), StructField(ID,LongType,true))
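Each element of the Dataset[String] is parsed as its own JSON document, so the elements do not have to be arrays. As a small sketch under that assumption, the snippet below builds a DataFrame from strings that each hold a single JSON object. The record values and the names records and recordsDf are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell, `spark` and its implicits are already available,
// so the next two lines can be skipped there.
val spark = SparkSession.builder().master("local[*]").appName("json-one-object-per-string").getOrCreate()
import spark.implicits._ // provides .toDS()

// One JSON object per string; each string becomes one row.
val records = Seq(
  """{"ID":7,"ATTR1":"JKL"}""",
  """{"ID":8,"ATTR1":"MNO"}"""
)

val recordsDf = spark.read.json(records.toDS())
recordsDf.show()
```

This form is convenient when the JSON strings arrive one record at a time, for example from a message queue or a column of another dataset.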

Summary

When reading data directly from a database or from structured files, similar read functions can be used to convert the input dataset to a Spark DataFrame.
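For the file case, a minimal sketch might look like the following. It writes the sample JSON to a temporary file purely so that the example is self-contained; the file name prefix, the application name, and the name fileDf are illustrative. Because the sample is a JSON array spread over several lines, the multiLine option is enabled; without it, Spark expects one JSON document per line of the file.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// In spark-shell a session named `spark` already exists, so this line can be skipped there.
val spark = SparkSession.builder().master("local[*]").appName("json-from-file").getOrCreate()

val json = """[
{"ID":1,"ATTR1":"ABC"},
{"ID":2,"ATTR1":"DEF"},
{"ID":3,"ATTR1":"GHI"}]"""

// Write the sample to a temporary file so there is something to read back.
val tmp = Files.createTempFile("sample-json-", ".json")
Files.write(tmp, json.getBytes(StandardCharsets.UTF_8))

// multiLine lets a single JSON document span several lines of the file.
val fileDf = spark.read.option("multiLine", true).json(tmp.toUri.toString)
fileDf.show()
```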

Last modified by Raymond 7 months ago.