Write and Read Parquet Files in HDFS through Spark/Scala


In my previous post, I demonstrated how to write and read Parquet files in Spark/Scala. In that example, the Parquet files were written to a local folder.

Write and Read Parquet Files in Spark/Scala

On this page, I am going to demonstrate how to write and read Parquet files in HDFS.

Sample code

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}

object ParquetTest {
  def main(args: Array[String]): Unit = {
    // Run locally with two threads: local[2]
    val conf: SparkConf = new SparkConf().setMaster("local[2]").setAppName("ParquetTest")
    val sc: SparkContext = new SparkContext(conf)
    val sqlContext: SQLContext = new SQLContext(sc)
    writeParquet(sqlContext)
    readParquet(sqlContext)
  }

  def writeParquet(sqlContext: SQLContext): Unit = {
    // Read the CSV file from HDFS; the csv source returns a DataFrame directly,
    // so no RDD-to-DataFrame conversion (toDF) is needed
    val df: DataFrame = sqlContext.read.format("csv").option("header", "true").load("hdfs://0.0.0.0:19000/Sales.csv")
    // Write the DataFrame to HDFS in Parquet format
    df.write.parquet("hdfs://0.0.0.0:19000/Sales.parquet")
  }

  def readParquet(sqlContext: SQLContext): Unit = {
    // Read the Parquet files back into a DataFrame
    val newDataDF = sqlContext.read.parquet("hdfs://0.0.0.0:19000/Sales.parquet")
    // Show the contents
    newDataDF.show()
  }
}

The output should be similar to the previous example.
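
The code above uses the Spark 1.x entry points (SparkContext and SQLContext). On Spark 2.x and later, SparkSession is the preferred entry point; the following is a minimal sketch of the same flow, assuming the same HDFS paths (overwrite mode is added so re-runs don't fail if the output folder already exists):

import org.apache.spark.sql.SparkSession

object ParquetTestWithSession {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("ParquetTest").getOrCreate()
    // Read the CSV file from HDFS; csv is a built-in source in Spark 2.x+
    val df = spark.read.option("header", "true").csv("hdfs://0.0.0.0:19000/Sales.csv")
    // Write to Parquet, overwriting any previous output
    df.write.mode("overwrite").parquet("hdfs://0.0.0.0:19000/Sales.parquet")
    // Read the Parquet files back and show the contents
    spark.read.parquet("hdfs://0.0.0.0:19000/Sales.parquet").show()
    spark.stop()
  }
}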

View the Parquet files in HDFS

The following command can be used to list the files in the HDFS root folder, including the Parquet output:

F:\DataAnalytics\hadoop-3.0.0\sbin>hdfs dfs -ls /
Found 4 items
-rw-r--r--   1 fahao supergroup        167 2018-02-26 14:42 /Sales.csv
drwxr-xr-x   - fahao supergroup          0 2018-03-17 15:44 /Sales.parquet
-rw-r--r--   1 fahao supergroup        167 2018-02-26 14:11 /Sales2.csv
-rw-r--r--   1 fahao supergroup          9 2018-02-19 22:18 /test.txt
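
You can also inspect the output programmatically with the Hadoop FileSystem API instead of the shell. A minimal Scala sketch, assuming the same NameNode address (hdfs://0.0.0.0:19000) as in the example above:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object ListParquetFiles {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // Point the client at the same NameNode used in the Spark example
    conf.set("fs.defaultFS", "hdfs://0.0.0.0:19000")
    val fs = FileSystem.get(conf)
    // List the files inside the Parquet output directory
    fs.listStatus(new Path("/Sales.parquet")).foreach(status => println(status.getPath))
  }
}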

Another option is the HDFS web portal:

[Screenshot: the HDFS web portal file browser]

Navigate into the parquet folder; Sales.parquet is a directory that contains the part files Spark wrote, plus a _SUCCESS marker:

[Screenshot: contents of the /Sales.parquet folder]
