Scala: Parse JSON String as Spark DataFrame


This article shows how to convert a JSON string to a Spark DataFrame using Scala. It can be used to process small in-memory JSON strings. 

Sample JSON string

The following sample JSON string will be used. It is a simple JSON array with three items. Each item has two attributes, ID and ATTR1, with integer and string data types respectively. 

[
{"ID":1,"ATTR1":"ABC"},
{"ID":2,"ATTR1":"DEF"},
{"ID":3,"ATTR1":"GHI"}
]

Read JSON string

In Spark, the DataFrameReader object can be used to read JSON. 

def json(jsonDataset: Dataset[String]): DataFrame

Refer to the following official documentation for more details about this function. 

Spark 3.0.1 ScalaDoc - org.apache.spark.sql.DataFrameReader

Note: this function is only available in Spark 2.0 and later. 

To create a DataFrame object, we first need to convert the JSON string to Dataset[String].

import org.apache.spark.sql._
import org.apache.spark.sql.types._
// spark.implicits._ provides toDS(); it is pre-imported in spark-shell,
// but must be imported explicitly in a standalone application.
import spark.implicits._

val json = """[
{"ID":1,"ATTR1":"ABC"},
{"ID":2,"ATTR1":"DEF"},
{"ID":3,"ATTR1":"GHI"}]"""

val jsonDataset = Seq(json).toDS()

The output of jsonDataset is like the following:

jsonDataset: org.apache.spark.sql.Dataset[String] = [value: string]

Now we can use the read method of the SparkSession object to read directly from the above dataset:

val df = spark.read.json(jsonDataset)
df: org.apache.spark.sql.DataFrame = [ATTR1: string, ID: bigint]

Spark automatically inferred the schema of the JSON and mapped the attributes to Spark data types.

The content of the DataFrame looks like the following:

scala> df.show()
+-----+---+
|ATTR1| ID|
+-----+---+
|  ABC|  1|
|  DEF|  2|
|  GHI|  3|
+-----+---+
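
Schema inference maps the JSON integer values to LongType (bigint). If you prefer to control the types yourself, for example to keep ID as an integer, you can pass an explicit schema to the reader. The following is a minimal sketch; the field names match the sample data, while the choice of IntegerType is only an illustration:

// Define an explicit schema instead of relying on inference.
val schema = StructType(Seq(
  StructField("ID", IntegerType, nullable = true),
  StructField("ATTR1", StringType, nullable = true)
))

// Read the same Dataset[String] using the explicit schema.
val dfTyped = spark.read.schema(schema).json(jsonDataset)
dfTyped.printSchema()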

Read from multiple JSON string variables

In the above example, we only read data from one JSON string. We can use Seq to construct a Dataset from multiple JSON strings.

The following code snippet shows how to do that:

val json1 = """[
{"ID":4,"ATTR1":"123"},
{"ID":5,"ATTR1":"456"},
{"ID":6,"ATTR1":"789"}]"""

spark.read.json(Seq(json,json1).toDS()).show()

Output:

scala> spark.read.json(Seq(json,json1).toDS()).show()
+-----+---+
|ATTR1| ID|
+-----+---+
|  ABC|  1|
|  DEF|  2|
|  GHI|  3|
|  123|  4|
|  456|  5|
|  789|  6|
+-----+---+

The schema of the DataFrame contains two fields with data types StringType and LongType respectively:

scala> spark.read.json(Seq(json,json1).toDS()).schema
res5: org.apache.spark.sql.types.StructType = StructType(StructField(ATTR1,StringType,true), StructField(ID,LongType,true))
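
If you want to work with the parsed records in a strongly typed way, the DataFrame can also be converted into a Dataset of a case class. The case class name below is an assumption; its fields are chosen to match the inferred schema:

// Case class matching the inferred schema (ID is bigint/Long, ATTR1 is String).
case class Record(ID: Long, ATTR1: String)

// as[Record] relies on the implicit encoders from spark.implicits._
val ds = spark.read.json(Seq(json, json1).toDS()).as[Record]
ds.show()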

Summary

When reading data directly from databases or structured files, similar read functions can be used to easily convert the input dataset to a Spark DataFrame.
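
For example, reading JSON stored in a file follows the same pattern; the path below is just a placeholder:

// Read JSON records from a file path instead of an in-memory string.
val dfFromFile = spark.read.json("/path/to/data.json")
dfFromFile.show()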
