Spark Read JSON Lines (.jsonl) File

About JSON Lines 

A JSON Lines text file is a newline-delimited sequence of JSON values, usually one JSON object per line. The format is widely used in data products: for example, Spark reads JSON Lines by default when loading JSON, and BigQuery provides APIs to load JSON Lines files.

JSON Lines has the following requirements:

  • UTF-8 encoded.
  • Each line is a valid JSON value, for example, a JSON object or a JSON array.
  • The line separator is '\n'.

The recommended file extension is .jsonl.
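
The requirements above can be checked with nothing more than Python's standard json module. The following is a minimal sketch, not part of the JSON Lines specification: the function name parse_json_lines is hypothetical, and skipping blank lines is an assumption made here for convenience.

```python
import json


def parse_json_lines(text):
    """Parse JSON Lines text into a list of Python values, one per non-empty line."""
    records = []
    for line_number, line in enumerate(text.split("\n"), start=1):
        if not line.strip():
            # Assumption: tolerate blank lines, such as the empty piece
            # that follows a trailing newline.
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_number} is not valid JSON: {exc}") from exc
    return records


sample = (
    '{"name": "May", "wins": []}\n'
    '{"name": "Deloise", "wins": [["three of a kind", "5♣"]]}\n'
)
records = parse_json_lines(sample)
print([record["name"] for record in records])
```

Each line is parsed independently, which is why a reader such as Spark can split a large file on line boundaries and process the pieces in parallel.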

Example JSON Lines document

The following content is from the official JSON Lines documentation:

{"name": "Gilbert", "wins": [["straight", "7♣"], ["one pair", "10♥"]]}
{"name": "Alexa", "wins": [["two pair", "4♠"], ["two pair", "9♠"]]}
{"name": "May", "wins": []}
{"name": "Deloise", "wins": [["three of a kind", "5♣"]]}

Save the document locally as example.jsonl. We will use PySpark to read the file.
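
If you would rather generate the file from Python than copy and paste it, the sketch below writes the same four records as UTF-8 with '\n' line endings. The helper name write_json_lines and the RECORDS constant are hypothetical names introduced for illustration.

```python
import json

# The four records from the example document above.
RECORDS = [
    {"name": "Gilbert", "wins": [["straight", "7♣"], ["one pair", "10♥"]]},
    {"name": "Alexa", "wins": [["two pair", "4♠"], ["two pair", "9♠"]]},
    {"name": "May", "wins": []},
    {"name": "Deloise", "wins": [["three of a kind", "5♣"]]},
]


def write_json_lines(path, rows):
    """Write one JSON value per line, UTF-8 encoded, separated by '\\n'."""
    # newline="\n" stops Python from translating line endings on Windows.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for row in rows:
            # ensure_ascii=False keeps characters such as ♣ readable
            # instead of emitting \uXXXX escapes.
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    write_json_lines("example.jsonl", RECORDS)
```

Either form of the suit characters (literal or \uXXXX escaped) is valid JSON; ensure_ascii=False is used here only so that the file matches the example text.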

Read JSON Lines in Spark

By default, Spark reads JSON Lines when you use the json API (or the 'json' format). The following is a sample script:

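from pyspark.sql import SparkSession

appName = "PySpark - Read JSON Lines"
master = "local"

# Create a Spark session that runs locally
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# Read the JSON Lines file into a DataFrame; forward slashes avoid
# backslash-escape problems in the Windows path
df = spark.read.json('file:///F:/Projects/Python/PySpark/example.jsonl')

# Print the inferred schema and the first rows
print(df.schema)
df.show()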

You can replace the local file path with an HDFS file path.

Run the script and it prints the following schema (the exact formatting depends on the Spark version):

StructType(List(StructField(name,StringType,true),StructField(wins,ArrayType(ArrayType(StringType,true),true),true)))

By default, Spark infers the schema. You can also specify a custom schema; for an example, refer to this article: Read JSON file as Spark DataFrame in Python / Spark.

The DataFrame looks like the following:

+-------+--------------------+
|   name|                wins|
+-------+--------------------+
|Gilbert|[[straight, 7♣], ...|
|  Alexa|[[two pair, 4♠], ...|
|    May|                  []|
|Deloise|[[three of a kind...|
+-------+--------------------+
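
As mentioned earlier, the schema can be supplied instead of inferred. One possible way, sketched below, is to pass a DDL-formatted string to DataFrameReader.schema. The DDL string is written to match the inferred schema printed above; the names EXAMPLE_SCHEMA_DDL and read_jsonl_with_schema are hypothetical, and the function assumes it receives an existing SparkSession.

```python
# DDL form of the schema that Spark inferred for example.jsonl:
# a string column plus an array of string arrays.
EXAMPLE_SCHEMA_DDL = "name STRING, wins ARRAY<ARRAY<STRING>>"


def read_jsonl_with_schema(spark, path, schema=EXAMPLE_SCHEMA_DDL):
    """Read a JSON Lines file with an explicit schema instead of inference.

    `spark` is assumed to be an existing SparkSession, and `path` any
    location Spark can read (a local file:// path or an HDFS path).
    """
    return spark.read.schema(schema).json(path)
```

With the session from the script above, calling read_jsonl_with_schema(spark, path) with the path to example.jsonl should return a DataFrame with the same two columns. Supplying the schema spares Spark the extra pass over the data that inference needs.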

References

JSON Lines
