PySpark - Read and Write Orc Files

Apache Orc is a data serialization format that is considered the smallest, fastest columnar storage for Hadoop workloads. It also supports ACID transactions, built-in indexes, native zstd compression, bloom filters and columnar encryption. This article provides some examples of reading and writing data in Orc format with Spark. Spark supports two Orc implementations: native and hive. The latter is used to work with Hive and to use Hive SerDe.
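As a quick illustration of these features, the compression codec and Orc writer properties such as bloom filter columns can be passed as write options. The snippet below is only a sketch: it assumes a DataFrame named df already exists and uses a placeholder output path.

# Sketch only: write df with zstd compression and a bloom filter on col1.
# 'compression' and 'orc.bloom.filter.columns' are Orc data source options;
# the output path is a placeholder.
df.write.format('orc') \
    .option('compression', 'zstd') \
    .option('orc.bloom.filter.columns', 'col1') \
    .mode('overwrite') \
    .save('file:///tmp/orc-zstd-example')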

Environment

The sample code snippets in this article run on Spark 3.2.1 in a WSL 2 Ubuntu distro. You can follow the page Install Spark 3.2.1 on Linux or WSL to set up a Spark environment.

Save DataFrame as Orc format

The following code snippet creates a DataFrame in memory and then saves it in Orc format. The data is stored in the local file system instead of HDFS.

# orc-example.py
from pyspark.sql import SparkSession

appName = "PySpark Example - Read and Write Orc"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# Sample data as a list of dictionaries
data = [{
    'col1': 'Category A',
    'col2': 100
}, {
    'col1': 'Category B',
    'col2': 200
}, {
    'col1': 'Category C',
    'col2': 300
}]

df = spark.createDataFrame(data)
df.show()

# Save as Orc
df.write.format('orc').mode('overwrite').save(
    'file:///home/kontext/pyspark-examples/data/orc-test')

Run the script

We can then run the script using the spark-submit command. The Orc data source is built into Spark, so there is no need to install an external package as we do for the Avro format:

spark-submit orc-example.py 
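For comparison, the Avro data source does require an external package to be supplied at submit time; a command like the following would be needed for an equivalent Avro script (the script name and package version here are only illustrative):

spark-submit --packages org.apache.spark:spark-avro_2.12:3.2.1 avro-example.py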

Once the script executes successfully, it creates the Orc data files in the specified folder of the local file system.


About *.orc.crc files

A *.orc.crc file is a checksum file that can be used to validate whether the corresponding data file has been modified after it was generated. It is a mechanism to protect data integrity.
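To see these files, you can simply list the output folder. The following is a small sketch assuming the same output path as in the write example above:

# List the files Spark produced in the output folder (path from the example above)
import os

output_dir = '/home/kontext/pyspark-examples/data/orc-test'
for name in sorted(os.listdir(output_dir)):
    print(name)
# Typically this prints part-*.orc data files, their hidden .*.crc checksum
# files and a _SUCCESS marker file.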

Load Orc files

Now we can read the data back using the Orc data source. This can be done by adding the following lines to the previous script:

# Read Orc
df2 = spark.read.format('orc').load(
    'file:///home/kontext/pyspark-examples/data/orc-test')
df2.show()

Run the script using the same command line:

spark-submit orc-example.py 

Output:

+----------+----+
|      col1|col2|
+----------+----+
|Category A| 100|
|Category B| 200|
|Category C| 300|
+----------+----+
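PySpark also provides shorthand orc methods on the DataFrame reader and writer, which are equivalent to specifying format('orc'):

# Equivalent shorthand for the Orc data source
df.write.orc('file:///home/kontext/pyspark-examples/data/orc-test', mode='overwrite')
df2 = spark.read.orc('file:///home/kontext/pyspark-examples/data/orc-test')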

Choose the Orc implementation

We can use the Spark configuration spark.sql.orc.impl to specify the implementation. By default, Spark uses the native implementation instead of the hive one.

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .config('spark.sql.orc.impl', 'native') \
    .getOrCreate()
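To verify which implementation is in effect, the configuration can be read back at runtime:

# Print the active Orc implementation ('native' or 'hive')
print(spark.conf.get('spark.sql.orc.impl'))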

