Spark provides rich APIs to save data frames in many file formats, such as CSV, Parquet, ORC, and Avro. CSV is still commonly used in data applications, although binary formats are gaining momentum. In this article, I am going to show you how to save a Spark data frame as CSV in both the local file system and HDFS.

Spark CSV parameters

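Refer to the following official documentation for all the parameters supported by the CSV API in PySpark. Note that the link anchors to DataFrameReader.csv; the options that apply when writing are documented under DataFrameWriter.csv.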

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=savemode#pyspark.sql.DataFrameReader.csv

Example code

In the following sample code, a data frame is created from a Python list. The data frame is then saved to both a local file path and HDFS. To save to a local path, prefix the path with 'file://'. A path without a scheme is resolved against the default file system, which is HDFS in this environment. Several options are also used:

  1. header: specifies whether to include a header row in the file.
  2. sep: specifies the delimiter.
  3. mode: specifies the behavior of the save operation when data already exists.
    • append: Append contents of this DataFrame to existing data.
    • overwrite: Overwrite existing data.
    • ignore: Silently ignore this operation if data already exists.
    • error or errorifexists (default): Throw an exception if data already exists.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, StringType, IntegerType, DecimalType
from decimal import Decimal

appName = "Python Example - PySpark Save DataFrame as CSV"
master = 'local'

# Create Spark session
spark = SparkSession.builder \
    .master(master) \
    .appName(appName) \
    .getOrCreate()

# List of rows; Decimal values are built from strings so the digits are exact
data = [('Category A', 1, Decimal('12.40')),
        ('Category B', 2, Decimal('30.10')),
        ('Category C', 3, Decimal('100.01')),
        ('Category A', 4, Decimal('110.01')),
        ('Category B', 5, Decimal('70.85'))
        ]

# Create a schema for the dataframe
schema = StructType([
    StructField('Category', StringType(), False),
    StructField('ItemID', IntegerType(), False),
    StructField('Amount', DecimalType(scale=2), True)
])

# Convert list to data frame
df = spark.createDataFrame(data, schema)
df.show()

# Save to a local folder; Spark creates a directory named output.csv that holds the part files
# The delimiter is , by default
df.write.format('csv').option('header', True).mode('overwrite').option('sep', ',').save('file:///home/tangr/output.csv')

# Save to HDFS, using | as the delimiter
df.write.format('csv').option('header', True).mode('overwrite').option('sep', '|').save('/output.csv')
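
One detail worth knowing when preparing values for a DecimalType column: building a Decimal from a float literal captures the float's binary approximation, while building it from a string keeps the digits exactly as written. The following is a minimal standard-library sketch (no Spark required) of that difference; the variable names are only illustrative.

```python
from decimal import Decimal

# Built from a float: carries the binary approximation of 12.40
from_float = Decimal(12.40)

# Built from a string: exactly two decimal places, as written
from_string = Decimal('12.40')

# The two values are not equal until the float-based one is rounded
print(from_float == from_string)

# Rounding to two decimal places recovers the intended amount
quantized = from_float.quantize(Decimal('0.01'))
print(quantized, from_string)
```

Passing strings to Decimal avoids relying on a later rounding step to get the intended two-decimal amounts.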

Check the results

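You can then check the results in HDFS and in the local file system. Note that Spark writes each output path as a directory that contains a _SUCCESS marker and one or more part files, not as a single CSV file.

The following output is from my WSL environment: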

tangr@raymond-pc:~$ hadoop fs -ls /
Found 4 items
drwxr-xr-x   - tangr supergroup          0 2019-12-03 20:40 /output.csv
drwxr-xr-x   - tangr supergroup          0 2019-08-25 12:11 /scripts
drwxrwxr-x   - tangr supergroup          0 2019-05-18 15:52 /tmp
drwxr-xr-x   - tangr supergroup          0 2019-08-25 09:35 /user
tangr@raymond-pc:~$ hadoop fs -ls /output.csv
Found 2 items
-rw-r--r--   1 tangr supergroup          0 2019-12-03 20:40 /output.csv/_SUCCESS
-rw-r--r--   1 tangr supergroup        120 2019-12-03 20:40 /output.csv/part-00000-508be2a7-a564-4603-b77c-f4de7c07dbcd-c000.csv
tangr@raymond-pc:~$ hadoop fs -cat /output.csv/part-00000-508be2a7-a564-4603-b77c-f4de7c07dbcd-c000.csv
Category|ItemID|Amount
Category A|1|12.40
Category B|2|30.10
Category C|3|100.01
Category A|4|110.01
Category B|5|70.85
tangr@raymond-pc:~$ cd output.csv/
tangr@raymond-pc:~/output.csv$ ls
_SUCCESS  part-00000-bfbb44b0-1880-4400-a9c1-9c03180553a2-c000.csv
tangr@raymond-pc:~/output.csv$ cat part-00000-bfbb44b0-1880-4400-a9c1-9c03180553a2-c000.csv
Category,ItemID,Amount
Category A,1,12.40
Category B,2,30.10
Category C,3,100.01
Category A,4,110.01
Category B,5,70.85
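
Because the output path is a folder of part files, reading the result back outside Spark means collecting every part file. The following is a rough sketch using only the Python standard library. The helper name read_spark_csv_dir is hypothetical, and the sketch assumes a local folder whose part files each start with a header row (as when the header option is enabled); the demo folder is a throwaway stand-in that mimics the layout shown above.

```python
import csv
import glob
import os
import tempfile


def read_spark_csv_dir(output_dir, sep=','):
    """Hypothetical helper: read every part file in a Spark CSV output folder."""
    rows = []
    # Data files are named part-*; _SUCCESS is only an empty marker file
    for part in sorted(glob.glob(os.path.join(output_dir, 'part-*.csv'))):
        with open(part, newline='') as f:
            # Assumes each part file starts with its own header row
            rows.extend(csv.DictReader(f, delimiter=sep))
    return rows


# Demo on a temporary folder that mimics the layout of the Spark output
demo_dir = tempfile.mkdtemp()
open(os.path.join(demo_dir, '_SUCCESS'), 'w').close()
with open(os.path.join(demo_dir, 'part-00000-demo-c000.csv'), 'w', newline='') as f:
    f.write('Category|ItemID|Amount\nCategory A|1|12.40\nCategory B|2|30.10\n')

rows = read_spark_csv_dir(demo_dir, sep='|')
print(len(rows), rows[0]['Category'], rows[0]['Amount'])
```

For the local output in this article, the call would be read_spark_csv_dir('/home/tangr/output.csv'). All values come back as strings, so numeric columns need to be converted by the caller.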