
Write and Read Parquet Files in Spark/Scala

Raymond Tang

spark scala parquet

In this post, I demonstrate how to write and read parquet files in Spark/Scala using the Spark SQLContext class.

Reference

What is the parquet format?

Visit the following project site to learn more about parquet:

https://parquet.apache.org/

Prerequisites

Spark

If you have not installed Spark, follow this page to set it up:

Install Big Data Tools (Spark, Zeppelin, Hadoop) in Windows for Learning and Practice

Hadoop (Optional)

In this example, I read a CSV file from HDFS. You can set up a local Hadoop instance via the same link above.

Alternatively, you can change the file path to a local file, as shown in the sketch below.
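
For instance, here is a minimal sketch of a local read, using the sqlContext created in the code further down. The file:// path is a hypothetical location, not one used in this post; point it at wherever your Sales.csv lives:

// Read from the local file system instead of HDFS; the path below is an
// assumed example location
val localDf = sqlContext.read
  .format("csv")
  .option("header", "true")
  .load("file:///C:/data/Sales.csv")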

IntelliJ IDEA

I am using IntelliJ IDEA to write the Scala script. You can also test in the Scala shell instead of using an IDE. A Scala SDK is also required; in my case, I am using the Scala SDK distributed as part of my Spark installation.

JDK

A JDK is required to run Scala on the JVM.

Read and Write parquet files

In this example, I use the Spark SQLContext class to read and write parquet files.

Code

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}

object ParquetTest {
  def main(args: Array[String]): Unit = {
    // Run Spark locally with two threads (local[2])
    val conf: SparkConf = new SparkConf().setMaster("local[2]").setAppName("ParquetTest")
    val sc: SparkContext = new SparkContext(conf)
    val sqlContext: SQLContext = new SQLContext(sc)
    writeParquet(sqlContext)
    readParquet(sqlContext)
  }

  def writeParquet(sqlContext: SQLContext): Unit = {
    // Read the CSV file from HDFS; sqlContext.read returns a DataFrame directly,
    // with column names taken from the header row
    val df: DataFrame = sqlContext.read
      .format("csv")
      .option("header", "true")
      .load("hdfs://0.0.0.0:19000/Sales.csv")
    // Write the DataFrame out in parquet format (creates the Sales.parquet folder)
    df.write.parquet("Sales.parquet")
  }

  def readParquet(sqlContext: SQLContext): Unit = {
    // Read the parquet files back into a DataFrame
    val newDataDF = sqlContext.read.parquet("Sales.parquet")
    // Print the contents to the console
    newDataDF.show()
  }
}
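
Note that SQLContext still works in Spark 2.x, but SparkSession has been the preferred entry point since Spark 2.0. The following is a minimal equivalent sketch, assuming the same file paths as above:

import org.apache.spark.sql.SparkSession

object ParquetTestSession {
  def main(args: Array[String]): Unit = {
    // SparkSession wraps SparkContext and SQLContext in Spark 2.x
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("ParquetTest")
      .getOrCreate()

    // Same CSV source and parquet destination as the SQLContext example
    val df = spark.read.option("header", "true").csv("hdfs://0.0.0.0:19000/Sales.csv")
    df.write.parquet("Sales.parquet")
    spark.read.parquet("Sales.parquet").show()

    spark.stop()
  }
}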

Before you run the code

Make sure the IntelliJ project has all the required SDKs and libraries set up. In my case:

  • JDK: version 1.8, installed on my C drive.
  • Scala SDK: version 2.11.8, distributed as part of my Spark installation (spark-2.2.1-bin-hadoop2.7).
  • Jars: all libraries in my Spark jars folder (for the Spark libraries used in the sample code).

[Screenshot: IntelliJ project settings showing the JDK, Scala SDK, and Spark jar dependencies]

Run the code in IntelliJ

The following screenshot shows the output:

[Screenshot: console output showing the contents of the DataFrame printed by show()]

What was created?

The example code creates a local folder named Sales.parquet (parquet output is a folder of part files, not a single file):

[Screenshot: the Sales.parquet folder in the file system]
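
For reference, the folder typically contains one part file per partition plus a _SUCCESS marker. The exact file names vary from run to run, so the listing below is only illustrative (Spark 2.x compresses parquet with snappy by default):

Sales.parquet/
  _SUCCESS
  part-00000-<uuid>.snappy.parquet
  part-00001-<uuid>.snappy.parquet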

Run the code in Zeppelin

You can also run the same code in Zeppelin. If you don’t have a Zeppelin instance to play with, you can follow the same link in the Prerequisites section to set one up.
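
In Zeppelin’s Spark interpreter, sc and sqlContext are pre-created, so a paragraph only needs the read/write logic itself. A minimal sketch, assuming the same paths as the example above:

%spark
// sc and sqlContext are provided by the Zeppelin Spark interpreter
val df = sqlContext.read.format("csv").option("header", "true").load("hdfs://0.0.0.0:19000/Sales.csv")
df.write.parquet("Sales.parquet")
sqlContext.read.parquet("Sales.parquet").show()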

Related pages

  • Install Big Data Tools (Spark, Zeppelin, Hadoop) in Windows for Learning and Practice
  • Load Data into HDFS from SQL Server via Sqoop
  • Write and Read Parquet Files in HDFS through Spark/Scala
  • Convert String to Date in Spark (Scala)
  • Read Text File from Hadoop in Zeppelin through Spark Context
  • Install Spark 2.2.1 in Windows


Comments (2)

RT Re: Write and Read Parquet Files in Spark/Scala

Raym*** about 7 months ago

@Ansh

Yes, you can.

For example, the following code is used to read parquet files from a Hadoop cluster.

def readParquet(sqlContext: SQLContext): Unit = {
  // Read the parquet files from the remote HDFS cluster into a DataFrame
  val newDataDF = sqlContext.read.parquet("hdfs://hdp-master:19000/user/hadoop/sqoop_test/blogs")
  // Print the contents to the console
  newDataDF.show()
}

The cluster was setup by following this post:

Configure Hadoop 3.1.0 in a Multi Node Cluster

Of course, hdp-master:19000 needs to be accessible from the server running the Spark/Scala code.

At the moment, my HDFS is readable by all servers/users in the LAN. In a production environment, you may need to manage permissions too.

Furthermore, you can also run Spark apps in a Spark cluster instead of in standalone/local mode. I will cover this in a future post.

A Re: Write and Read Parquet Files in Spark/Scala

An*** about 7 months ago

Can we connect to and read a remotely located HDFS parquet file using the above code?