Raymond Raymond

Spark Read JSON Lines (.jsonl) File

2021-12-21

About JSON Lines 

A JSON Lines text file is a newline-delimited sequence of JSON values, typically one JSON object per line. The format is widely used in data products: for example, Spark reads JSON Lines by default, and BigQuery provides APIs to load JSON Lines files.

JSON Lines has the following requirements:

  • UTF-8 encoded.
  • Each line is a valid JSON value, for example, a JSON object or a JSON array.
  • The line separator is '\n'.

The recommended file extension is .jsonl.
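The requirements above can be checked with nothing but the Python standard library. The following is a minimal sketch, not an official validator: the helper name parse_json_lines is hypothetical, and skipping blank lines is a leniency chosen here for illustration.

```python
import json


def parse_json_lines(raw: bytes) -> list:
    """Parse JSON Lines content and return one Python value per line."""
    # Requirement 1: the content must be UTF-8 (raises UnicodeDecodeError otherwise).
    text = raw.decode("utf-8")
    records = []
    # Requirement 3: lines are separated by '\n'.
    for line_no, line in enumerate(text.split("\n"), start=1):
        if not line.strip():
            # Skip blank lines, such as the empty string after a trailing newline.
            # A stricter checker could reject them instead.
            continue
        try:
            # Requirement 2: each line must be a valid JSON value on its own.
            records.append(json.loads(line))
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_no} is not valid JSON: {exc}") from exc
    return records


sample = (
    '{"name": "May", "wins": []}\n'
    '{"name": "Deloise", "wins": [["three of a kind", "5♣"]]}\n'
).encode("utf-8")

records = parse_json_lines(sample)
print(len(records))  # 2
```

A pretty-printed JSON document that spans several lines would fail this check, because its individual lines are not valid JSON on their own.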

Example JSON Lines document

The following content is from JSON Lines official documentation:

{"name": "Gilbert", "wins": [["straight", "7♣"], ["one pair", "10♥"]]}
{"name": "Alexa", "wins": [["two pair", "4♠"], ["two pair", "9♠"]]}
{"name": "May", "wins": []}
{"name": "Deloise", "wins": [["three of a kind", "5♣"]]}

Save the document locally as example.jsonl. We will use PySpark to read the file.

Read JSON Lines in Spark

By default, Spark reads JSON Lines when you use the json API (or the 'json' format). The following is a sample script:

from pyspark.sql import SparkSession

appName = "PySpark - Read JSON Lines"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# Create data frame; forward slashes avoid backslash-escape issues in the Windows path
df = spark.read.json('file:///F:/Projects/Python/PySpark/example.jsonl')

# Print the inferred schema, then the first rows
print(df.schema)
df.show()

You can replace the local file path with an HDFS file path.

Run the script and it prints the following schema:

StructType(List(StructField(name,StringType,true),StructField(wins,ArrayType(ArrayType(StringType,true),true),true)))

By default, Spark will infer the schema. You can also customize the schema. Refer to this article for an example: Read JSON file as Spark DataFrame in Python / Spark.

The DataFrame looks like the following:

+-------+--------------------+
|   name|                wins|
+-------+--------------------+
|Gilbert|[[straight, 7♣], ...|
|  Alexa|[[two pair, 4♠], ...|
|    May|                  []|
|Deloise|[[three of a kind...|
+-------+--------------------+

References

JSON Lines
