This site uses cookies to deliver our services. By using this site, you acknowledge that you have read and understand our Cookie and Privacy policy. Your use of Kontext website is subject to this policy. Allow Cookies and Dismiss

Write and Read Parquet Files in HDFS through Spark/Scala

1430 views 0 comments last modified about 7 months ago Raymond Tang

lite-log spark hdfs scala parquet

In my previous post, I demonstrated how to write and read parquet files in Spark/Scala. The parquet file destination is a local folder.

Write and Read Parquet Files in Spark/Scala

In this page, I am going to demonstrate how to write and read parquet files in HDFS.

Sample code

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}

object ParquetTest {
def main(args: Array[String]) = {
// Two threads local[2]
val conf: SparkConf = new SparkConf().setMaster("local[2]").setAppName("ParquetTest")
val sc: SparkContext = new SparkContext(conf)
val sqlContext: SQLContext = new SQLContext(sc)
writeParquet(sc, sqlContext)
readParquet(sqlContext)
}

def writeParquet(sc: SparkContext, sqlContext: SQLContext) = {
// Read file as RDD
val rdd = sqlContext.read.format("csv").option("header", "true").load("hdfs://0.0.0.0:19000/Sales.csv")
// Convert rdd to data frame using toDF; the following import is required to use toDF function.
val df: DataFrame = rdd.toDF()
// Write file to parquet
df.write.parquet("hdfs://0.0.0.0:19000/Sales.parquet");
}

def readParquet(sqlContext: SQLContext) = {
// read back parquet to DF
val newDataDF = sqlContext.read.parquet("hdfs://0.0.0.0:19000/Sales.parquet")
// show contents
newDataDF.show()
}
}

The output should be similar to the previous example.

View the parquet files in HDFS

The following command can be used to list the parquet files:

F:\DataAnalytics\hadoop-3.0.0\sbin>hdfs dfs -ls /
Found 4 items
-rw-r--r--   1 fahao supergroup        167 2018-02-26 14:42 /Sales.csv
drwxr-xr-x   - fahao supergroup          0 2018-03-17 15:44 /Sales.parquet
-rw-r--r--   1 fahao supergroup        167 2018-02-26 14:11 /Sales2.csv
-rw-r--r--   1 fahao supergroup          9 2018-02-19 22:18 /test.txt

You can also use the HDFS website portal to view it:

image

Navigate into the parquet folder:

image

Related pages

Install Hadoop 3.0.0 in Windows (Single Node)

4905 views   14 comments last modified about 8 months ago

This page summarizes the steps to install Hadoop 3.0.0 in your Windows environment. Reference page: https://wiki.apache.org/hadoop/Hadoop2OnWindows ...

View detail

Write and Read Parquet Files in Spark/Scala

2031 views   2 comments last modified about 7 months ago

In this page, I’m going to demonstrate how to write and read parquet files in Spark/Scala by using Spark SQLContext class. Reference What is parquet format? Go the following project site to understand more about parquet. ...

View detail

Resolve Hadoop RemoteException - Name node is in safe mode

94 views   0 comments last modified about 5 months ago

In Safe Mode, the HDFS cluster is read-only. After completion of block replication maintenance activity, the name node leaves safe mode automatically. If you try to delete files in safe mode, the following exception may raise: org.apache.hadoop.ipc.RemoteException(org.apac...

View detail

Configure Sqoop in a Edge Node of Hadoop Cluster

433 views   0 comments last modified about 5 months ago

This page continues with the following documentation about configuring a Hadoop multi-nodes cluster via adding a new edge node to configure administration or client tools. ...

View detail

Configure YARN and MapReduce Resources in Hadoop Cluster

180 views   0 comments last modified about 5 months ago

When configuring YARN and MapReduce in Hadoop cluster, it is very important to configure the memory and virtual processors correctly. If the configurations are incorrect, the nodes may not be able to start properly and the applications may not be able to run successfully. For example...

View detail

Configure Hadoop 3.1.0 in a Multi Node Cluster

1958 views   0 comments last modified about 5 months ago

Previously, I summarized the steps to install Hadoop in a single node Windows machine. Install Hadoop 3.0.0 in Windows (Single Node) In this page, I...

View detail

Add comment

Please login first to add comments.  Log in New user?  Register

Comments (0)

No comments yet.