Write and Read Parquet Files in Spark/Scala


On this page, I'm going to demonstrate how to write and read Parquet files in Spark/Scala using the Spark SQLContext class.


What is the Parquet format?

Apache Parquet is an open-source columnar storage format available to any project in the Hadoop ecosystem. Go to the project site, https://parquet.apache.org/, to learn more about Parquet.
Prerequisites

If you have not installed Spark, follow the page below to set it up:

Install Big Data Tools (Spark, Zeppelin, Hadoop) in Windows for Learning and Practice

Hadoop (Optional)

In this example, I am going to read CSV files from HDFS. You can set up a local Hadoop instance via the same link above.

Alternatively, you can change the file path to a local file.
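For example, once the sqlContext from the sample below is created, a local path can be passed to load() in place of an HDFS URI (the file name and location here are hypothetical):

// Hypothetical local CSV file; adjust the path to your environment
val df = sqlContext.read.format("csv").option("header", "true").load("file:///C:/data/Sales.csv")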

IntelliJ IDEA

I am using IntelliJ IDEA to write the Scala script. You can also test in the Scala shell instead of using an IDE. A Scala SDK is also required; in my case, I am using the Scala SDK distributed as part of my Spark installation.


A JDK is required to run Scala on the JVM.

Read and Write Parquet Files

In this example, I am using the Spark SQLContext object to read and write Parquet files.


import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}

object ParquetTest {
  def main(args: Array[String]): Unit = {
    // Run locally with two threads: local[2]
    val conf: SparkConf = new SparkConf().setMaster("local[2]").setAppName("ParquetTest")
    val sc: SparkContext = new SparkContext(conf)
    val sqlContext: SQLContext = new SQLContext(sc)
    writeParquet(sc, sqlContext)
    readParquet(sqlContext)
  }

  def writeParquet(sc: SparkContext, sqlContext: SQLContext): Unit = {
    // Read the CSV file as a DataFrame; replace "hdfs://" with your HDFS (or local) file path.
    // Note: sqlContext.read returns a DataFrame directly, so no RDD-to-DataFrame conversion is needed.
    val df: DataFrame = sqlContext.read.format("csv").option("header", "true").load("hdfs://")
    // Write the DataFrame to Parquet
    df.write.parquet("Sales.parquet")
  }

  def readParquet(sqlContext: SQLContext): Unit = {
    // Read the Parquet files back into a DataFrame
    val newDataDF = sqlContext.read.parquet("Sales.parquet")
    // Show the contents
    newDataDF.show()
  }
}
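As a side note, SQLContext still works in Spark 2.x, but SparkSession has been the recommended entry point since Spark 2.0 and wraps SQLContext. A minimal sketch of the same read/write flow using SparkSession (the HDFS path is a placeholder, as above):

import org.apache.spark.sql.SparkSession

object ParquetTestSession {
  def main(args: Array[String]): Unit = {
    // SparkSession supersedes SQLContext/SparkContext as the entry point since Spark 2.0
    val spark = SparkSession.builder().master("local[2]").appName("ParquetTest").getOrCreate()
    // Read CSV (placeholder path), write Parquet, then read it back and show it
    val df = spark.read.format("csv").option("header", "true").load("hdfs://")
    df.write.parquet("Sales.parquet")
    spark.read.parquet("Sales.parquet").show()
    spark.stop()
  }
}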

Before you run the code

Make sure the IntelliJ project has all the required SDKs and libraries set up. In my case:

  • JDK: version 1.8, installed on my C drive.
  • Scala SDK: version 2.11.8, distributed as part of my Spark installation (spark-2.2.1-bin-hadoop2.7).
  • Jars: all libraries in my Spark jars folder (for the Spark libraries used in the sample code); see the sbt sketch below for an alternative.
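If you prefer to let sbt manage dependencies instead of attaching the Spark jars folder manually, a minimal build.sbt matching the versions above could look like the following sketch (adjust versions to your own installation):

// Hypothetical build.sbt matching spark-2.2.1-bin-hadoop2.7 and Scala 2.11.8
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.2.1",
  "org.apache.spark" %% "spark-sql"  % "2.2.1"
)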


Run the code in IntelliJ

When the application runs, newDataDF.show() prints the contents of the DataFrame read back from the Parquet file to the console.


What was created?

In the example code, a local folder named Sales.parquet is created. Note that Spark writes Parquet output as a folder rather than a single file: it contains one part-*.parquet file per partition, plus a _SUCCESS marker file.
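To verify what was written, you can list the folder from Scala (a minimal sketch, assuming it runs from the working directory where Sales.parquet was created):

import java.io.File

// Print the files Spark produced inside the Sales.parquet output folder
new File("Sales.parquet").listFiles().foreach(f => println(f.getName))
// Expect a _SUCCESS marker plus one or more part-*.parquet files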


Run the code in Zeppelin

You can also run the same code in Zeppelin. If you don't have a Zeppelin instance to play with, you can follow the same link in the Prerequisites section to set one up.
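In Zeppelin, the code can go straight into a %spark paragraph; the interpreter already provides the sc and sqlContext variables, so the object and main wrapper are not needed. A minimal sketch (the HDFS path is again a placeholder):

%spark

val df = sqlContext.read.format("csv").option("header", "true").load("hdfs://")
// Use .mode("overwrite") in case the Sales.parquet folder already exists from the IntelliJ run
df.write.mode("overwrite").parquet("Sales.parquet")
sqlContext.read.parquet("Sales.parquet").show()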
