Load CSV File in PySpark
CSV is a commonly used data format. Spark provides rich APIs to load files from HDFS as data frames. This page shows how to load a CSV file from HDFS using PySpark. To read a local CSV file in plain Python instead, refer to the page "Python: Load / Read Multiline CSV File".
Sample data file
The CSV file content looks like the following:
ID,Text1,Text2
1,Record 1,Hello World!
2,Record 2,Hello Hadoop!
3,Record 3,"Hello
Kontext!"
4,Record 4,Hello!
In the third record, the quoted Text2 field spans two lines.
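To see why a record that spans lines needs special handling, the following minimal sketch writes the sample data to a temporary local file and compares the number of physical lines with the number of logical CSV records. It uses only the Python standard library, not Spark. The helper names and the exact position of the line break inside the quoted field are assumptions made for illustration.

```python
import csv
import os
import tempfile

# Sample content. The third record's Text2 value contains a line break;
# the exact position of the break is an assumption for illustration.
SAMPLE = (
    "ID,Text1,Text2\n"
    "1,Record 1,Hello World!\n"
    "2,Record 2,Hello Hadoop!\n"
    '3,Record 3,"Hello\nKontext!"\n'
    "4,Record 4,Hello!\n"
)


def write_sample(path):
    """Write the sample CSV (hypothetical helper)."""
    # newline="" stops Python from translating the embedded line break.
    with open(path, "w", newline="") as f:
        f.write(SAMPLE)


def count_lines_and_records(path):
    """Return (number of physical lines, list of parsed CSV rows)."""
    with open(path, newline="") as f:
        physical_lines = sum(1 for _ in f)
    with open(path, newline="") as f:
        records = list(csv.reader(f))
    return physical_lines, records


path = os.path.join(tempfile.mkdtemp(), "data.csv")
write_sample(path)
lines, records = count_lines_and_records(path)
print(lines)         # 6 physical lines, header included
print(len(records))  # 5 logical rows, header included
```

A reader that treats every physical line as a record would split the third record in two, which is the problem the multiLine option addresses when the file is loaded with Spark.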
The file is copied into HDFS on my Hadoop instance at /data.csv with the following command:
hadoop fs -copyFromLocal data.csv /data.csv
Load CSV file
We can use the read API of the SparkSession object to read the CSV file with the following options:
- header = True: the data file has a header line.
- sep = ',': comma is the delimiter/separator. Since our file uses commas, we don't need to specify this option because comma is the default.
- multiLine = True: this setting allows Spark to read records that span multiple lines.
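The options above can also be collected in a dictionary and passed to the reader in one call. The sketch below assumes an existing SparkSession is passed in; `CSV_OPTIONS` and `load_csv` are hypothetical names introduced for illustration and are not part of PySpark. It relies on the reader's `format`, `options`, and `load` methods.

```python
# Reader options described in the list above.
CSV_OPTIONS = {
    "header": True,     # the first line holds the column names
    "sep": ",",         # default separator, listed only for clarity
    "multiLine": True,  # allow quoted fields to span several lines
}


def load_csv(spark, path, options=None):
    """Read a CSV file into a data frame (hypothetical helper).

    `spark` is expected to be an existing SparkSession.
    """
    reader = spark.read.format("csv").options(**(options or CSV_OPTIONS))
    return reader.load(path)
```

With a session named spark, a call such as load_csv(spark, '/data.csv') would read the sample file. Keeping the options in one dictionary makes it easy to reuse the same settings for several files.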
Sample code
from pyspark.sql import SparkSession

appName = "Python Example - PySpark Read CSV"
master = 'local'

# Create Spark session
spark = SparkSession.builder \
    .master(master) \
    .appName(appName) \
    .getOrCreate()

# Read the CSV file into a data frame
df = spark.read.format('csv') \
    .option('header', True) \
    .option('multiLine', True) \
    .load('/data.csv')

df.show()
print(f'Record count is: {df.count()}')
Output
The output looks like the following:
+---+--------+---------------+
| ID|   Text1|          Text2|
+---+--------+---------------+
|  1|Record 1|   Hello World!|
|  2|Record 2|  Hello Hadoop!|
|  3|Record 3|Hello
Kontext!|
|  4|Record 4|         Hello!|
+---+--------+---------------+
Record count is: 4
Last modified by Raymond 5 years ago