Read JSON file as Spark DataFrame in Python / Spark


Spark provides easy, fluent APIs for reading data from a JSON file as a DataFrame object.

In this example, a JSON file named 'example.json' has the following content:

[ { "Category": "Category A", "Count": 100, "Description": "This is category A" }, { "Category": "Category B", "Count": 120, "Description": "This is category B" }, { "Category": "Category C", "Count": 150, "Description": "This is category C" } ]



The file is loaded as a Spark DataFrame using the SparkSession.read.json function.

The multiLine=True argument is important because the JSON records span multiple lines; without it, Spark expects JSON Lines input, i.e. one complete JSON object per line.
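For contrast, here is a minimal sketch of reading the default JSON Lines layout, where no multiLine argument is needed. The file name 'data/example_lines.json' and its contents are hypothetical, invented for this illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('JSON Lines example') \
    .master('local') \
    .getOrCreate()

# Hypothetical file 'data/example_lines.json', one JSON object per line:
# {"Category": "Category A", "Count": 100, "Description": "This is category A"}
# {"Category": "Category B", "Count": 120, "Description": "This is category B"}

# No multiLine argument is needed for this layout.
df_lines = spark.read.json('data/example_lines.json')
df_lines.show()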

Code snippet

from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, StringType, IntegerType

appName = "PySpark Example - JSON file to Spark Data Frame"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# Define an explicit schema for the DataFrame
schema = StructType([
    StructField('Category', StringType(), True),
    StructField('Count', IntegerType(), True),
    StructField('Description', StringType(), True)
])

# Read the multi-line JSON file into a DataFrame using the schema
json_file_path = 'data/example.json'
df = spark.read.json(json_file_path, schema, multiLine=True)
print(df.schema)
df.show()
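Running the snippet should print output similar to the following; the exact rendering of the schema string varies between Spark versions:

StructType(List(StructField(Category,StringType,true),StructField(Count,IntegerType,true),StructField(Description,StringType,true)))
+----------+-----+------------------+
|  Category|Count|       Description|
+----------+-----+------------------+
|Category A|  100|This is category A|
|Category B|  120|This is category B|
|Category C|  150|This is category C|
+----------+-----+------------------+

The same read can also be written with the option-style reader API, which is equivalent to the keyword arguments above: spark.read.schema(schema).option('multiLine', 'true').json(json_file_path).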