Spark Read JSON Lines (.jsonl) File
About JSON Lines
A JSON Lines text file is a newline-delimited sequence of JSON values. It is commonly used in many data-related products: for example, Spark reads JSON Lines documents by default, and BigQuery provides APIs to load JSON Lines (newline-delimited JSON) files.
JSON Lines has the following requirements:
- UTF-8 encoded.
- Each line is a valid JSON value, for example a JSON object or a JSON array.
- The line separator is '\n'.
The recommended file extension is .jsonl.
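Because each line must stand on its own as a valid JSON value, a document can be parsed and validated line by line. A minimal sketch using Python's standard json module (the sample lines are made up for illustration):

```python
import json

# Two lines of a hypothetical JSON Lines document: one object, one array.
lines = [
    '{"name": "Gilbert", "score": 7}',
    '[1, 2, 3]',
]

# Each line parses independently; no line depends on any other line.
records = [json.loads(line) for line in lines]
print(records[0]["name"])  # Gilbert
```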
Example JSON Lines document
The following content is from JSON Lines official documentation:
{"name": "Gilbert", "wins": [["straight", "7♣"], ["one pair", "10♥"]]}
{"name": "Alexa", "wins": [["two pair", "4♠"], ["two pair", "9♠"]]}
{"name": "May", "wins": []}
{"name": "Deloise", "wins": [["three of a kind", "5♣"]]}
Save the document locally as example.jsonl. We will use PySpark to read the file.
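If you prefer to create the file programmatically rather than saving it by hand, the records can be written with one json.dumps call per line. A sketch, assuming the file is written to the current working directory:

```python
import json

# The four records from the example document above.
records = [
    {"name": "Gilbert", "wins": [["straight", "7♣"], ["one pair", "10♥"]]},
    {"name": "Alexa", "wins": [["two pair", "4♠"], ["two pair", "9♠"]]},
    {"name": "May", "wins": []},
    {"name": "Deloise", "wins": [["three of a kind", "5♣"]]},
]

# Write one JSON object per line, '\n'-separated, UTF-8 encoded.
with open("example.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```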
Read JSON Lines in Spark
Spark reads JSON Lines by default when using the json API (or format 'json'). The following is a sample script:
from pyspark.sql import SparkSession
appName = "PySpark - Read JSON Lines"
master = "local"
# Create Spark session
spark = SparkSession.builder \
.appName(appName) \
.master(master) \
.getOrCreate()
# Create data frame
df = spark.read.json('file:///F:/Projects/Python/PySpark/example.jsonl')
print(df.schema)
df.show()
You can replace the local file path with an HDFS file path.
Run the script and the following schema prints out:
StructType(List(StructField(name,StringType,true),StructField(wins,ArrayType(ArrayType(StringType,true),true),true)))
By default, Spark infers the schema. You can also specify a schema explicitly. Refer to this article for an example: Read JSON file as Spark DataFrame in Python / Spark.
The DataFrame looks like the following:
+-------+--------------------+
|   name|                wins|
+-------+--------------------+
|Gilbert|[[straight, 7♣], ...|
|  Alexa|[[two pair, 4♠], ...|
|    May|                  []|
|Deloise|[[three of a kind...|
+-------+--------------------+