Load CSV File in PySpark

access_time 3 months ago visibility1043 comment 0

CSV is a commonly used data format. Spark provides rich APIs to load files from HDFS as data frame.  This page provides examples about how to load CSV from HDFS using Spark. If you want to read a local CSV file in Python, refer to this page Python: Load / Read Multiline CSV File instead.

Sample data file

The CSV file content looks like the following:

ID,Text1,Text2
1,Record 1,Hello World!
2,Record 2,Hello Hadoop!
3,Record 3,"Hello 
Kontext!"
4,Record 4,Hello!
For the third record, field Text2 is across two lines. 
The file is ingested into my Hadoop instance with location as:
hadoop fs -copyFromLocal data.csv /data.csv

Load CSV file

We can use 'read' API of SparkSession object to read CSV with the following options:

  • header = True: this means there is a header line in the data file.
  • sep=, : comma is the delimiter/separator. Since our file is using comma, we don't need to specify this as by default is is comma.
  • multiLine = True: this setting allows us to read multi-line records. 

Sample code

from pyspark.sql import SparkSession

appName = "Python Example - PySpark Read CSV"
master = 'local'

# Create Spark session
spark = SparkSession.builder \
    .master(master) \
    .appName(appName) \
    .getOrCreate()

# Convert list to data frame
df = spark.read.format('csv') \
                .option('header',True) \
                .option('multiLine', True) \
                .load('/data.csv')
df.show()
print(f'Record count is: {df.count()}')

Output

The output looks like the following:

+---+--------+---------------+
| ID|   Text1|          Text2|
+---+--------+---------------+
|  1|Record 1|   Hello World!|
|  2|Record 2|  Hello Hadoop!|
|  3|Record 3|Hello
Kontext!|
|  4|Record 4|         Hello!|
+---+--------+---------------+
Record count is: 4
info Last modified by Raymond at 3 months ago copyright This page is subject to Site terms.
Like this article?
Share on

Please log in or register to comment.

account_circle Log in person_add Register

Log in with external accounts

Kontext Column

Created for everyone to publish data, programming and cloud related articles.
Follow three steps to create your columns.


Learn more arrow_forward

More from Kontext

local_offer SQL Server local_offer python local_offer spark local_offer pyspark local_offer spark-database-connect

visibility 22253
thumb_up 4
access_time 2 years ago

Spark is an analytics engine for big data processing. There are various ways to connect to a database in Spark. This page summarizes some of common approaches to connect to SQL Server using Python as programming language. For each method, both Windows Authentication and SQL Server ...

local_offer spark local_offer pyspark

visibility 3483
thumb_up 0
access_time 2 years ago

In Spark, there are a number of settings/configurations you can specify including application properties and runtime parameters. https://spark.apache.org/docs/latest/configuration.html To retrieve all the current configurations, you can use the following code (Python): from pyspark.sql ...

local_offer spark local_offer scala

visibility 340
thumb_up 0
access_time 12 months ago

Parquet is columnar store format published by Apache. It's commonly used in Hadoop ecosystem. There are many programming language APIs that have been implemented to support writing and reading parquet files. 

About column

Apache Spark installation guides, performance tuning tips, general tutorials, etc.

*Spark logo is a registered trademark of Apache Spark.

rss_feed Subscribe RSS