By using this site, you acknowledge that you have read and understand our Cookie and Privacy policy. Your use of Kontext website is subject to this policy. Accept

PySpark: Convert JSON String Column to Array of Object (StructType) in Data Frame

421 views last modified about 3 months ago Raymond Tang

python spark

This post shows how to derive new column in a Spark data frame from a JSON array string column. I am running the code in Spark 2.2.1 though it is compatible with Spark 1.6.0 (with less JSON SQL functions).

Prerequisites

Refer to the following post to install Spark in Windows.

Install Spark 2.2.1 in Windows

*If you are using Linux or UNIX, the code should also work.

Requirement

Convert the following list to a data frame:

source = [{"attr_1": 1, "attr_2": "[{\"a\":1,\"b\":1},{\"a\":2,\"b\":2}]"}, {

"attr_1": 2, "attr_2": "[{\"a\":3,\"b\":3},{\"a\":4,\"b\":4}]"}]

The data frame should have two column:

  • attr_1: column type is IntegerType
  • attr_2: column type is ArrayType (element type is StructType with two StructField).

And the schema of the data frame should look like the following:

root
  |-- attr_1: long (nullable = true)
  |-- attr_2: array (nullable = true)
  |    |-- element: struct (containsNull = true)
  |    |    |-- a: integer (nullable = false)
  |    |    |-- b: integer (nullable = false)

Resolution

Convert list to data frame

First, let’s convert the list to a data frame in Spark by using the following code:

# Read the list into data frame

df = sqlContext.read.json(sc.parallelize(source))

df.show()

df.printSchema()

JSON is read into a data frame through sqlContext. The output is:

+------+--------------------+
|attr_1|              attr_2|
+------+--------------------+
|     1|[{"a":1,"b":1},{"...|
|     2|[{"a":3,"b":3},{"...|
+------+--------------------+

root
  |-- attr_1: long (nullable = true)
  |-- attr_2: string (nullable = true)

At current stage, column attr_2 is string type instead of array of struct.

Create a function to parse JSON to list

For column attr_2, the value is JSON array string. Let’s create a function to parse JSON string and then convert it to list.

# Function to convert JSON array string to a list

import json

def parse_json(array_str):

json_obj = json.loads(array_str)

for item in json_obj:

yield (item["a"], item["b"])

Define the schema of column attr_2

# Define the schema

from pyspark.sql.types import ArrayType, IntegerType, StructType, StructField

json_schema = ArrayType(StructType([StructField('a', IntegerType(

), nullable=False), StructField('b', IntegerType(), nullable=False)]))

Based on the JSON string, the schema is defined as an array of struct with two fields.

Create an UDF

Now, we can create an UDF with function parse_json and schema json_schema.

# Define udf

from pyspark.sql.functions import udf

udf_parse_json = udf(lambda str: parse_json(str), json_schema)

Create a new data frame

Finally, we can create a new data frame using the defined UDF.

# Generate a new data frame with the expected schema

df_new = df.select(df.attr_1, udf_parse_json(df.attr_2).alias("attr_2"))

df_new.show()

df_new.printSchema()

The output is as the following:

+------+--------------+
|attr_1|        attr_2|
+------+--------------+
|     1|[[1,1], [2,2]]|
|     2|[[3,3], [4,4]]|
+------+--------------+

root
  |-- attr_1: long (nullable = true)
  |-- attr_2: array (nullable = true)
  |    |-- element: struct (containsNull = true)
  |    |    |-- a: integer (nullable = false)
  |    |    |-- b: integer (nullable = false)

Summary

The following is the complete code:

from pyspark import SparkContext, SparkConf, SQLContext

appName = "JSON Parse Example"

master = "local[2]"

conf = SparkConf().setAppName(appName).setMaster(master)

sc = SparkContext(conf=conf)

sqlContext = SQLContext(sc)

source = [{"attr_1": 1, "attr_2": "[{\"a\":1,\"b\":1},{\"a\":2,\"b\":2}]"}, {

"attr_1": 2, "attr_2": "[{\"a\":3,\"b\":3},{\"a\":4,\"b\":4}]"}]

# Read the list into data frame

df = sqlContext.read.json(sc.parallelize(source))

df.show()

df.printSchema()

# Function to convert JSON array string to a list

import json

def parse_json(array_str):

json_obj = json.loads(array_str)

for item in json_obj:

yield (item["a"], item["b"])

# Define the schema

from pyspark.sql.types import ArrayType, IntegerType, StructType, StructField

json_schema = ArrayType(StructType([StructField('a', IntegerType(

), nullable=False), StructField('b', IntegerType(), nullable=False)]))

# Define udf

from pyspark.sql.functions import udf

udf_parse_json = udf(lambda str: parse_json(str), json_schema)

# Generate a new data frame with the expected schema

df_new = df.select(df.attr_1, udf_parse_json(df.attr_2).alias("attr_2"))

df_new.show()

df_new.printSchema()

Save the code as file parse_json.py and then you can use the following command to run it in Spark:

spark-submit parse_json.py

The following screenshot is captured from my local environment (Spark 2.2.1 & Python 3.6.4 in Windows ).

image

Related pages

Debug PySpark Code in Visual Studio Code

21 views   0 comments last modified about 16 days ago

The page summarizes the steps required to run and debug PySpark (Spark for Python) in Visual Studio Code. Install Python and pip Install Python from the official website: https://...

View detail

Implement SCD Type 2 Full Merge via Spark Data Frames

307 views   0 comments last modified about 2 months ago

Overview For SQL developers that are familiar with SCD and merge statements, you may wonder how to implement the same in big data platforms, considering database or storages in Hadoop are not designed/optimised for record level updates and inserts. In this post, I’m going to demons...

View detail

Write and Read Parquet Files in Spark/Scala

7228 views   2 comments last modified about 2 years ago

In this page, I’m going to demonstrate how to write and read parquet files in Spark/Scala by using Spark SQLContext class. Reference What is parquet format? Go the following project site to understand more about parquet. ...

View detail

Install Big Data Tools (Spark, Zeppelin, Hadoop) in Windows for Learning and Practice

1188 views   2 comments last modified about 11 months ago

Are you a Windows/.NET developer and willing to learn big data concepts and tools in your Windows? If yes, you can follow the links below to install them in your PC. The installations are usually easier to do in Linux/UNIX but they are not difficult to implement in Windows either since the...

View detail

Load Data into HDFS from SQL Server via Sqoop

1073 views   0 comments last modified about 12 months ago

This page shows how to import data from SQL Server into Hadoop via Apache Sqoop. Prerequisites Please follow the link below to install Sqoop in your machine if you don’t have one environment ready. ...

View detail

Write and Read Parquet Files in HDFS through Spark/Scala

4053 views   0 comments last modified about 2 years ago

In my previous post, I demonstrated how to write and read parquet files in Spark/Scala. The parquet file destination is a local folder. Write and Read Parquet Files in Spark/Scala In this page...

View detail

Add comment

Comments (0)

No comments yet.

Contacts

  • enquiry[at]kontext.tech

Subscribe