PySpark DataFrame - Extract JSON Value using get_json_object Function

PySpark SQL functions `get_json_object `can be used to extract JSON values from a JSON string column in Spark DataFrame. This is equivalent as using Spark SQL directly: [Spark SQL - Extract Value from JSON String](https://kontext.tech/article/584/spark-sql-extract-value-from-json-string). Syntax of this function looks like the following: ``` pyspark.sql.functions.get_json_object(col, path) ``` The first parameter is the JSON string column name in the DataFrame and the second is the JSON path. This code snippet shows you how to extract JSON values using JSON path. If you need to extract complex JSON documents like JSON arrays, you can follow this article - [PySpark: Convert JSON String Column to Array of Object (StructType) in DataFrame](https://kontext.tech/article/284/pyspark-convert-json-string-column-to-array-of-object-structtype-in-data-frame). ### Output ``` StructType([StructField('id', LongType(), True), StructField('json_col', StringType(), True), StructField('ATTR_INT_0', StringType(), True), StructField('ATTR_DATE_1', StringType(), True)]) +---+--------------------+----------+-----------+ | id| json_col|ATTR_INT_0|ATTR_DATE_1| +---+--------------------+----------+-----------+ | 1|[{"Attr_INT":1, "...| 1| 2022-01-01| +---+--------------------+----------+-----------+ ```

Kontext Kontext 0 6108 5.83 index 8/16/2022

Code description

PySpark SQL functions get_json_object can be used to extract JSON values from a JSON string column in Spark DataFrame. This is equivalent as using Spark SQL directly: Spark SQL - Extract Value from JSON String.

Syntax of this function looks like the following:

    pyspark.sql.functions.get_json_object(col, path)

The first parameter is the JSON string column name in the DataFrame and the second is the JSON path.

This code snippet shows you how to extract JSON values using JSON path. If you need to extract complex JSON documents like JSON arrays, you can follow this article - PySpark: Convert JSON String Column to Array of Object (StructType) in DataFrame.

Output

    StructType([StructField('id', LongType(), True), StructField('json_col', StringType(), True), StructField('ATTR_INT_0', StringType(), True), StructField('ATTR_DATE_1', StringType(), True)])
    
    +---+--------------------+----------+-----------+
    | id|            json_col|ATTR_INT_0|ATTR_DATE_1|
    +---+--------------------+----------+-----------+
    |  1|[{"Attr_INT":1, "...|         1| 2022-01-01|
    +---+--------------------+----------+-----------+

Code snippet

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import get_json_object
    
    app_name = "PySpark get_json_object sql functions"
    master = "local"
    
    spark = SparkSession.builder         .appName(app_name)         .master(master)         .getOrCreate()
    
    spark.sparkContext.setLogLevel("WARN")
    
    json_str = """[{"Attr_INT":1, "ATTR_DOUBLE":10.201, "ATTR_DATE": "2021-01-01"},
    {"Attr_INT":2, "ATTR_DOUBLE":20.201, "ATTR_DATE": "2022-01-01"}]"""
    
    # Create a DataFrame
    df = spark.createDataFrame(
        [[1, json_str]], ['id', 'json_col'])
    
    # Extract JSON values
    df = df.withColumn('ATTR_INT_0',
                       get_json_object('json_col', '$[0].Attr_INT'))
    df = df.withColumn('ATTR_DATE_1',
                       get_json_object('json_col', '$[1].ATTR_DATE'))
    print(df.schema)
    df.show()
    
pyspark spark-sql-function

Join the Discussion

View or add your thoughts below

Comments