PySpark - Read and Write Orc Files

visibility 907 event 2022-06-18 access_time 4 months ago language English
more_vert

Apache Orc is a data serialization format that is considered as the smallest, fastest columnar storage for Hadoop workload. It also supports ACID, built-in indexes, native zstd compression, bloom filter and columnar encryption. This article provides some examples of reading and writing data with Orc format in Spark. Spark supports two Orc implementations: native and hive. The latter is used to work with Hive and to use Hive SerDe.

Environment

The sample code snippets in this article runs in Spark 3.2.1 in WSL 2 Ubuntu distro. You can follow this page Install Spark 3.2.1 on Linux or WSL to setup a Spark environment.

Save DataFrame as Orc format

The following code snippet creates a DataFrame in memory and then save it as Orc format. The data is stored in local file system instead of HDFS. 

#orc-example.py
from pyspark.sql import SparkSession

appName = "PySpark Example - Read and Write Orc"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# List
data = [{
    'col1': 'Category A',
    'col2': 100
}, {
    'col1': 'Category B',
    'col2': 200
}, {
    'col1': 'Category C',
    'col2': 300
}]

df = spark.createDataFrame(data)
df.show()

# Save as Orc
df.write.format('orc').mode('overwrite').save(
    'file:///home/kontext/pyspark-examples/data/orc-test')

Run the script

We can then run the script using spark-submit command. Orc package is built in Spark thus there is no need to install the package like Avro format:

spark-submit orc-example.py 

Once the script is executed successfully, the script will create data in the local file system as the screenshot shows:

2022061815957-image.png

About *.orc.crc file

*.orc.crc file is the checksum file which can be used to validate if the data file has been modified after it is generated. It is a method to protect data.

Load Orc files

Now we can also read the data using Orc data deserializer. This can be done by adding the following lines to the previous one:

# Read Orc
df2 = spark.read.format('orc').load(
    'file:///home/kontext/pyspark-examples/data/orc-test')
df2.show()

Run the script using the same command line:

spark-submit orc-example.py 

Output:

+----------+----+
|      col1|col2|
+----------+----+
|Category A| 100|
|Category B| 200|
|Category C| 300|
+----------+----+

Decide Orc implementation

We can use Spark configuration spark.sql.orc.impl to specify the implementation. By default, Spark utilizes the native implementation instead of Hive implementation. 

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .config('spark.sql.orc.impl', 'native') \
    .getOrCreate()

References

Apache Avro Data Source Guide

info Last modified by Kontext 4 months ago copyright This page is subject to Site terms.
Like this article?
Share on

Please log in or register to comment.

account_circle Log in person_add Register

Log in with external accounts