PySpark Read Multiple Lines Records from CSV

CSV is a common format used when extracting and exchanging data between systems and platforms. Once a CSV file is ingested into HDFS, you can easily read it as a DataFrame in Spark. However, there are a few options you need to pay attention to, especially if your source file:

  • Has records across multiple lines.
  • Has escaped characters in the field.
  • Fields contain delimiters.

This page shows you how to handle the above scenarios in Spark using Python as the programming language. If you prefer Scala or other Spark-compatible languages, the APIs are very similar.

Sample data file

The CSV file content looks like the following:

ID,Text1,Text2
1,Record 1,Hello World!
2,Record 2,Hello Hadoop!
3,Record 3,"Hello 
Kontext!"
4,Record 4,Hello!

For the third record, field Text2 spans two lines.

The file is ingested into my Hadoop instance using the following command:
hadoop fs -copyFromLocal data.csv /data.csv

Normal CSV file read

Let's create a Python script using the following code:

from pyspark.sql import SparkSession

appName = "Python Example - PySpark Read CSV"
master = 'local'

# Create Spark session
spark = SparkSession.builder \
    .master(master) \
    .appName(appName) \
    .getOrCreate()

# Read CSV file into a DataFrame
df = spark.read.format('csv') \
                .option('header',True) \
                .option('sep', ',') \
                .load('/data.csv')
df.show()

In the above code snippet, we used the 'read' API with CSV as the format and specified the following options:

  • header = True: this means there is a header line in the data file.
  • sep=',': comma is the delimiter/separator. Since our file uses commas, we don't actually need to specify this option because comma is the default.
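
If you prefer keyword arguments, the same read can also be written using the csv shorthand on the reader, which is equivalent to the format/option chain above:

df = spark.read.csv('/data.csv', header=True, sep=',')
df.show()

Both forms produce the same DataFrame.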

The output looks like the following:

+---------+--------+-------------+
|       ID|   Text1|        Text2|
+---------+--------+-------------+
|        1|Record 1| Hello World!|
|        2|Record 2|Hello Hadoop!|
|        3|Record 3|       Hello |
|Kontext!"|    null|         null|
|        4|Record 4|       Hello!|
+---------+--------+-------------+

This isn't what we are looking for, as it doesn't parse the multi-line record correctly.

Read multiple line records

It's very easy to read a CSV with multi-line records in Spark: we just need to set the multiLine option to True.

from pyspark.sql import SparkSession

appName = "Python Example - PySpark Read CSV"
master = 'local'

# Create Spark session
spark = SparkSession.builder \
    .master(master) \
    .appName(appName) \
    .getOrCreate()

# Read CSV file into a DataFrame
df = spark.read.format('csv') \
                .option('header',True) \
                .option('multiLine', True) \
                .load('/data.csv')
df.show()
print(f'Record count is: {df.count()}')

The output looks like the following:

+---+--------+---------------+
| ID|   Text1|          Text2|
+---+--------+---------------+
|  1|Record 1|   Hello World!|
|  2|Record 2|  Hello Hadoop!|
|  3|Record 3|Hello
Kontext!|
|  4|Record 4|         Hello!|
+---+--------+---------------+
Record count is: 4
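
If you want to confirm that the embedded newline is preserved in the parsed value, you can pull out the third record. This is just a quick sanity check; note that ID is read as a string because inferSchema is not enabled:

row = df.where(df.ID == '3').first()
print(repr(row.Text2))

The printed value should contain the \n character between 'Hello ' and 'Kontext!'.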

Different quote character 

Let's imagine the data file content looks like the following (double quote is replaced with @):

ID,Text1,Text2
1,Record 1,Hello World!
2,Record 2,Hello Hadoop!
3,Record 3,@Hello 
Kontext!@
4,Record 4,Hello!

Even if we specify the multiLine option, our previous script still reads the file as 5 records.

To fix this, we can simply specify another very useful option, 'quote':

from pyspark.sql import SparkSession

appName = "Python Example - PySpark Read CSV"
master = 'local'

# Create Spark session
spark = SparkSession.builder \
    .master(master) \
    .appName(appName) \
    .getOrCreate()

# Read CSV file into a DataFrame
df = spark.read.format('csv') \
                .option('header',True) \
                .option('multiLine', True) \
                .option('quote','@') \
                .load('/data.csv')
df.show()
print(f'Record count is: {df.count()}')

The output looks like the following:

+---+--------+---------------+
| ID|   Text1|          Text2|
+---+--------+---------------+
|  1|Record 1|   Hello World!|
|  2|Record 2|  Hello Hadoop!|
|  3|Record 3|Hello
Kontext!|
|  4|Record 4|         Hello!|
+---+--------+---------------+

Escape double quotes

Another commonly used option is the escape character.

Let's assume your CSV content looks like the following:

ID,Text1,Text2
1,Record 1,Hello World!
2,Record 2,Hello Hadoop!
3,Record 3,"Hello 
""Kontext""!"
4,Record 4,Hello!

Let's change the read function to use the default quote character '"':

df = spark.read.format('csv') \
                .option('header',True) \
                .option('multiLine', True) \
                .option('quote','"') \
                .load('/data.csv')

It doesn't read the content properly, though the record count is correct:

+---+--------+--------------------+
| ID|   Text1|               Text2|
+---+--------+--------------------+
|  1|Record 1|        Hello World!|
|  2|Record 2|       Hello Hadoop!|
|  3|Record 3|"Hello
""Kontext...|
|  4|Record 4|              Hello!|
+---+--------+--------------------+

To fix this, we can just specify the escape option:

df = spark.read.format('csv') \
                .option('header',True) \
                .option('multiLine', True) \
                .option('quote','"') \
                .option('escape','"') \
                .load('/data.csv')

It will output the correct format we are looking for:

+---+--------+-----------------+
| ID|   Text1|            Text2|
+---+--------+-----------------+
|  1|Record 1|     Hello World!|
|  2|Record 2|    Hello Hadoop!|
|  3|Record 3|Hello
"Kontext"!|
|  4|Record 4|           Hello!|
+---+--------+-----------------+

If your escape character is different, you can specify it accordingly.
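
For example, if the quoted field used a backslash to escape the embedded quotes instead (a hypothetical variant of the sample file, not the data shown above), the read could look like this:

df = spark.read.format('csv') \
                .option('header', True) \
                .option('multiLine', True) \
                .option('quote', '"') \
                .option('escape', '\\') \
                .load('/data.csv')

Here '\\' is the Python string literal for a single backslash character.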

Multiple character quotes

If your attributes are quoted using multiple characters in the CSV file, unfortunately the built-in CSV serializer/deserializer doesn't support that.

For example, let's assume the field is quoted with double double quotes:

ID,Text1,Text2
1,Record 1,Hello World!
2,Record 2,Hello Hadoop!
3,Record 3,""Hello 
Kontext!""
4,Record 4,Hello!

We will encounter an error if we use the following code to read it:

df = spark.read.format('csv') \
                .option('header',True) \
                .option('multiLine', True) \
                .option('quote','""') \
                .option('escape','"') \
                .load('/data.csv')

Error:

java.lang.RuntimeException: quote cannot be more than one character

Similarly, the escape option only supports a single character.

To resolve these problems, you need to implement your own text file deserializer, or pre-process the file before handing it to the CSV reader; a possible workaround is sketched below.
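
One possible workaround (a minimal sketch, not a full custom deserializer) is to pre-process the raw file so that the two-character quote "" collapses into a single ", then hand the cleaned text to the built-in CSV reader. This assumes each input file fits in memory (wholeTextFiles loads whole files) and that "" is not also used to escape quotes inside fields; the /data_cleaned path is just an example:

# Load each file as a single (path, content) pair and keep the content
raw = spark.sparkContext.wholeTextFiles('/data.csv').values()

# Collapse the two-character quote "" into a single "
cleaned = raw.map(lambda content: content.replace('""', '"'))

# Write the cleaned text back to HDFS (example path)
cleaned.saveAsTextFile('/data_cleaned')

# Now the standard single-character quote option works
df = spark.read.format('csv') \
               .option('header', True) \
               .option('multiLine', True) \
               .option('quote', '"') \
               .load('/data_cleaned')
df.show()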
