Spark Read JSON Lines (.jsonl) File


About JSON Lines 

A JSON Lines text file is a newline-delimited sequence of JSON values, one per line. The format is common in data products: for example, Spark reads JSON Lines by default, and BigQuery provides APIs to load JSON Lines files.

JSON Lines has the following requirements:

  • UTF-8 encoded.
  • Each line is a valid JSON value, for example, a JSON object or a JSON array.
  • The line separator is '\n'.

The recommended file extension is .jsonl.
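The requirements above can be checked with a few lines of plain Python. The following is a minimal sketch, not part of any JSON Lines tooling: the function name validate_jsonl is hypothetical, and it assumes that reporting the offending line numbers is enough. Because Python's JSON parser ignores surrounding whitespace, a trailing '\r' is tolerated, while a blank line is reported as invalid.

```python
import json


def validate_jsonl(path):
    """Return a list of (line_number, message) for lines that break the rules.

    validate_jsonl is a hypothetical helper written for illustration.
    """
    problems = []
    # Read bytes and split on b'\n' so encoding errors are caught per line.
    with open(path, "rb") as f:
        for number, raw in enumerate(f, start=1):
            try:
                text = raw.decode("utf-8")
            except UnicodeDecodeError as exc:
                problems.append((number, f"not UTF-8: {exc}"))
                continue
            try:
                # Each line must parse as a JSON value on its own.
                json.loads(text)
            except json.JSONDecodeError as exc:
                problems.append((number, f"invalid JSON: {exc.msg}"))
    return problems
```

An empty result means every line decoded as UTF-8 and parsed as JSON.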

Example JSON Lines document

The following content is from the official JSON Lines documentation:

{"name": "Gilbert", "wins": [["straight", "7♣"], ["one pair", "10♥"]]}
{"name": "Alexa", "wins": [["two pair", "4♠"], ["two pair", "9♠"]]}
{"name": "May", "wins": []}
{"name": "Deloise", "wins": [["three of a kind", "5♣"]]}

Save the document locally as example.jsonl. We will use PySpark to read the file.
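If you prefer to create the file from code rather than by copy and paste, a sketch like the following works. The helper name write_jsonl is hypothetical; the sketch assumes the file should be written to the current working directory, UTF-8 encoded, with '\n' as the line separator on every platform.

```python
import json

# The four records from the example document above.
records = [
    {"name": "Gilbert", "wins": [["straight", "7♣"], ["one pair", "10♥"]]},
    {"name": "Alexa", "wins": [["two pair", "4♠"], ["two pair", "9♠"]]},
    {"name": "May", "wins": []},
    {"name": "Deloise", "wins": [["three of a kind", "5♣"]]},
]


def write_jsonl(path, rows):
    """Write one JSON value per line (hypothetical helper for illustration)."""
    # newline="\n" stops Python from translating the separator on Windows.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for row in rows:
            # ensure_ascii=False keeps characters such as ♣ as-is in the file.
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


write_jsonl("example.jsonl", records)
```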

Read JSON Lines in Spark

Spark reads JSON Lines by default when you use the json API (or format 'json'). The following is a sample script:

from pyspark.sql import SparkSession
appName = "PySpark - Read JSON Lines"
master = "local"
# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()
# Create data frame (forward slashes avoid backslash escape issues in the Windows path)
df = spark.read.json('file:///F:/Projects/Python/PySpark/example.jsonl')
print(df.schema)
df.show()

You can replace the local file path with an HDFS file path.

Run the script and the following schema is printed:

StructType(List(StructField(name,StringType,true),StructField(wins,ArrayType(ArrayType(StringType,true),true),true)))

By default, Spark will infer the schema. You can also customize the schema. Refer to this article for an example: Read JSON file as Spark DataFrame in Python / Spark.
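As a rough illustration of a custom schema for this particular file, the reader's schema method accepts a DDL-formatted string, so the inferred types can be spelled out explicitly. This is a sketch under assumptions: the names WINS_SCHEMA_DDL and read_jsonl_with_schema are hypothetical, and spark is expected to be an existing SparkSession such as the one created in the script above.

```python
# DDL description of the example file: a string column and an array of string arrays.
# WINS_SCHEMA_DDL and read_jsonl_with_schema are hypothetical names for illustration.
WINS_SCHEMA_DDL = "name STRING, wins ARRAY<ARRAY<STRING>>"


def read_jsonl_with_schema(spark, path):
    """Read a JSON Lines file with an explicit schema instead of inference.

    `spark` is assumed to be an existing SparkSession.
    """
    # Supplying a schema skips the extra pass over the data that inference needs.
    return spark.read.schema(WINS_SCHEMA_DDL).json(path)
```

With the session from the sample script, read_jsonl_with_schema(spark, 'example.jsonl') should return a DataFrame with the same two columns as the inferred version.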

The DataFrame looks like the following:

+-------+--------------------+
|   name|                wins|
+-------+--------------------+
|Gilbert|[[straight, 7♣], ...|
|  Alexa|[[two pair, 4♠], ...|
|    May|                  []|
|Deloise|[[three of a kind...|
+-------+--------------------+

References

JSON Lines
