PySpark DataFrame - Convert JSON Column to Row using json_tuple

Kontext Kontext event 2022-08-16 visibility 2,049
more_vert

Code description

PySpark SQL functions json_tuple can be used to convert DataFrame JSON string columns to tuples (new rows in the DataFrame). 

Syntax of this function looks like the following:

pyspark.sql.functions.json_tuple(col, *fields)

The first parameter is the JSON string column name in the DataFrame and the second is the filed name list to extract.

If you need to extract complex JSON documents like JSON arrays, you can follow this article - PySpark: Convert JSON String Column to Array of Object (StructType) in DataFrame.

Output

StructType([StructField('id', LongType(), True), StructField('c0', StringType(), True), StructField('c1', StringType(), True), StructField('c2', StringType(), True)])

+---+---+------+----------+
| id| c0|    c1|        c2|
+---+---+------+----------+
|  1|  1|10.201|2021-01-01|
|  2|  2|20.201|2022-01-01|
+---+---+------+----------+

Code snippet

from pyspark.sql import SparkSession
from pyspark.sql.functions import json_tuple

app_name = "PySpark json_tuple sql functions"
master = "local"

spark = SparkSession.builder \
    .appName(app_name) \
    .master(master) \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")

# Create a DataFrame
df = spark.createDataFrame(
    [[1, '{"Attr_INT":1, "ATTR_DOUBLE":10.201, "ATTR_DATE": "2021-01-01"}'],
     [2, '{"Attr_INT":2, "ATTR_DOUBLE":20.201, "ATTR_DATE": "2022-01-01"}']], ['id', 'json_col'])

# Extract JSON values
df = df.select(df.id, json_tuple(
    df.json_col, 'Attr_INT', 'ATTR_DOUBLE', 'ATTR_DATE'))
print(df.schema)
df.show()
More from Kontext
comment Comments
No comments yet.

Please log in or register to comment.

account_circle Log in person_add Register

Log in with external accounts