PySpark - Read and Write ORC Files
Apache ORC is a data serialization format that is considered among the smallest, fastest columnar storage formats for Hadoop workloads. It also supports ACID transactions, built-in indexes, native zstd compression, Bloom filters, and columnar encryption. This article provides some examples of reading and writing data in ORC format with Spark. Spark supports two ORC implementations: native and hive. The latter is used to work with Hive and to use Hive SerDes.
Environment
The sample code snippets in this article run on Spark 3.2.1 in a WSL 2 Ubuntu distro. You can follow the page Install Spark 3.2.1 on Linux or WSL to set up a Spark environment.
Save DataFrame as Orc format
The following code snippet creates a DataFrame in memory and then saves it in ORC format. The data is stored in the local file system instead of HDFS.
#orc-example.py
from pyspark.sql import SparkSession

appName = "PySpark Example - Read and Write Orc"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# List
data = [{'col1': 'Category A', 'col2': 100},
        {'col1': 'Category B', 'col2': 200},
        {'col1': 'Category C', 'col2': 300}]
df = spark.createDataFrame(data)
df.show()

# Save as ORC
df.write.format('orc').mode('overwrite').save(
    'file:///home/kontext/pyspark-examples/data/orc-test')
Run the script
We can then run the script using the spark-submit command. The ORC package is built into Spark, so there is no need to install an extra package as with the Avro format:
spark-submit orc-example.py
Once the script completes successfully, it creates the data files in the local file system, as the screenshot shows:
About *.orc.crc file
A *.orc.crc file is a checksum file that can be used to validate whether the data file has been modified after it was generated. It is a mechanism for protecting data integrity.
Load Orc files
Now we can read the data back using the ORC deserializer. This can be done by adding the following lines to the previous script:
# Read ORC
df2 = spark.read.format('orc').load(
    'file:///home/kontext/pyspark-examples/data/orc-test')
df2.show()
Run the script using the same command line:
spark-submit orc-example.py
Output:
+----------+----+ | col1|col2| +----------+----+ |Category A| 100| |Category B| 200| |Category C| 300| +----------+----+
Choose the ORC implementation
We can use the Spark configuration spark.sql.orc.impl
to specify the implementation. By default, Spark uses the native implementation instead of the Hive one.
# Create Spark session
spark = SparkSession.builder \
.appName(appName) \
.master(master) \
.config('spark.sql.orc.impl', 'native') \
.getOrCreate()