In my previous post, I demonstrated how to write and read parquet files in Spark/Scala, with a local folder as the destination:

Write and Read Parquet Files in Spark/Scala

In this post, I am going to demonstrate how to write and read parquet files in HDFS.

Sample code

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}

object ParquetTest {
  def main(args: Array[String]) = {
    // Run locally with two threads: local[2]
    val conf: SparkConf = new SparkConf().setMaster("local[2]").setAppName("ParquetTest")
    val sc: SparkContext = new SparkContext(conf)
    val sqlContext: SQLContext = new SQLContext(sc)
    writeParquet(sc, sqlContext)
    readParquet(sqlContext)
  }

  def writeParquet(sc: SparkContext, sqlContext: SQLContext) = {
    // Read the CSV file from HDFS as a DataFrame
    // (complete the hdfs:// URI with your namenode address and file path)
    val df: DataFrame ="csv").option("header", "true").load("hdfs://")
    // Write the DataFrame out as parquet
    df.write.parquet("hdfs://")
  }

  def readParquet(sqlContext: SQLContext) = {
    // Read the parquet files back into a DataFrame
    val newDataDF ="hdfs://")
    // Show the contents
  }
}
The output should be similar to the previous example.
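SQLContext is the Spark 1.x entry point; on Spark 2.x and later, the same flow is usually written with SparkSession instead. Below is a minimal sketch of that variant; the truncated hdfs:// URIs are kept as placeholders to complete with your own namenode address and paths:

object ParquetTestSpark2 {
  def main(args: Array[String]): Unit = {
    val spark = org.apache.spark.sql.SparkSession.builder()
      .appName("ParquetTest")
      .getOrCreate()

    // Read the CSV file from HDFS; the header row becomes the column names
    val df ="header", "true").csv("hdfs://")

    // Write as parquet, then read it back and print the contents
    df.write.mode("overwrite").parquet("hdfs://")"hdfs://").show()

With SparkSession there is no need to create a SparkContext or SQLContext explicitly; both are available from the session if required.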

View the parquet files in HDFS

The following command can be used to list the parquet files:

F:\DataAnalytics\hadoop-3.0.0\sbin>hdfs dfs -ls /
Found 4 items
-rw-r--r--   1 fahao supergroup        167 2018-02-26 14:42 /Sales.csv
drwxr-xr-x   - fahao supergroup          0 2018-03-17 15:44 /Sales.parquet
-rw-r--r--   1 fahao supergroup        167 2018-02-26 14:11 /Sales2.csv
-rw-r--r--   1 fahao supergroup          9 2018-02-19 22:18 /test.txt
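Note that /Sales.parquet is listed as a directory (the leading `d` in the permissions): Spark writes a parquet "file" as a folder containing one or more part files plus a _SUCCESS marker. You can list inside the folder with the same CLI (the exact part-file names vary per run):

hdfs dfs -ls /Sales.parquet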

You can also use the HDFS web UI to view it:


Navigate into the parquet folder:


Last modified by Raymond 3 years ago.
