Spark Read JSON Lines (.jsonl) File

Raymond Tang 12/21/2021

About JSON Lines

A JSON Lines text file is a newline-delimited document in which each line holds one JSON value. The format is widely used in data products: for example, Spark reads JSON Lines by default, and BigQuery provides APIs to load JSON Lines files.

JSON Lines has the following requirements:

  • UTF-8 encoded.
  • Each line is a valid JSON value, for example, a JSON object or a JSON array.
  • The line separator is '\n'.

The recommended file extension is .jsonl.
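
The requirements above can be checked mechanically. The following is a minimal sketch that uses only the Python standard library; the function name find_invalid_lines is hypothetical and not part of any JSON Lines tooling. It reads the file as UTF-8, splits on '\n' only, and reports every line that does not parse as JSON.

```python
import json


def find_invalid_lines(path):
    """Return (line_number, message) pairs for lines that are not valid JSON."""
    errors = []
    # encoding="utf-8" enforces the encoding rule: other encodings may raise
    # UnicodeDecodeError. newline="\n" makes '\n' the only line terminator.
    with open(path, encoding="utf-8", newline="\n") as handle:
        for line_number, line in enumerate(handle, start=1):
            try:
                # json.loads ignores surrounding whitespace, including the
                # trailing '\n'; an empty line is reported as invalid.
                json.loads(line)
            except json.JSONDecodeError as exc:
                errors.append((line_number, exc.msg))
    return errors
```

For a well-formed file, the function returns an empty list.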

Example JSON Lines document

The following content is from the official JSON Lines documentation:

{"name": "Gilbert", "wins": [["straight", "7♣"], ["one pair", "10♥"]]}
{"name": "Alexa", "wins": [["two pair", "4♠"], ["two pair", "9♠"]]}
{"name": "May", "wins": []}
{"name": "Deloise", "wins": [["three of a kind", "5♣"]]}

Save the document locally as example.jsonl. We will use PySpark to read the file.
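
If you prefer to create the file from Python instead of a text editor, the following is a minimal sketch using the standard library. The helper name write_jsonl is hypothetical, and the script assumes the current working directory is writable. It writes one JSON object per line, UTF-8 encoded, with '\n' as the line separator.

```python
import json

# The four records from the example document.
records = [
    {"name": "Gilbert", "wins": [["straight", "7♣"], ["one pair", "10♥"]]},
    {"name": "Alexa", "wins": [["two pair", "4♠"], ["two pair", "9♠"]]},
    {"name": "May", "wins": []},
    {"name": "Deloise", "wins": [["three of a kind", "5♣"]]},
]


def write_jsonl(path, rows):
    """Write each row as one JSON value per line."""
    # newline="\n" prevents Windows from translating '\n' to '\r\n'.
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for row in rows:
            # ensure_ascii=False keeps characters such as '♣' as UTF-8 text
            # instead of \uXXXX escapes.
            handle.write(json.dumps(row, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    write_jsonl("example.jsonl", records)
```

Both the escaped and unescaped forms are valid JSON; ensure_ascii=False is used here only so that the file matches the example text.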

Read JSON Lines in Spark

By default, Spark reads JSON Lines when you use the json API (or format 'json'). The following is a sample script:

from pyspark.sql import SparkSession

appName = "PySpark - Read JSON Lines"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# Create data frame (raw string so the backslashes in the Windows path are kept literally)
df = spark.read.json(r'file:///F:\Projects\Python\PySpark\example.jsonl')

# Print the inferred schema and the first rows
print(df.schema)
df.show()

You can replace the local file path with an HDFS file path.

Run the script and the following schema will print out:

StructType(List(StructField(name,StringType,true),StructField(wins,ArrayType(ArrayType(StringType,true),true),true)))

By default, Spark will infer the schema. You can also customize the schema. Refer to this article for an example: Read JSON file as Spark DataFrame in Python / Spark.

The DataFrame looks like the following:

+-------+--------------------+
|   name|                wins|
+-------+--------------------+
|Gilbert|[[straight, 7♣], ...|
|  Alexa|[[two pair, 4♠], ...|
|    May|                  []|
|Deloise|[[three of a kind...|
+-------+--------------------+

References

JSON Lines