
Write and Read Parquet Files in HDFS through Spark/Scala

Raymond Tang


In my previous post, I demonstrated how to write and read Parquet files in Spark/Scala. In that example, the Parquet files were written to a local folder.

Write and Read Parquet Files in Spark/Scala

On this page, I am going to demonstrate how to write and read Parquet files in HDFS.

Sample code

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}

object ParquetTest {
  def main(args: Array[String]) = {
    // Run Spark locally with two threads (local[2])
    val conf: SparkConf = new SparkConf().setMaster("local[2]").setAppName("ParquetTest")
    val sc: SparkContext = new SparkContext(conf)
    val sqlContext: SQLContext = new SQLContext(sc)
    writeParquet(sc, sqlContext)
    readParquet(sqlContext)
  }

  def writeParquet(sc: SparkContext, sqlContext: SQLContext) = {
    // Read the CSV file from HDFS; sqlContext.read returns a DataFrame
    // directly, so no RDD-to-DataFrame conversion is needed
    val df: DataFrame = sqlContext.read.format("csv").option("header", "true").load("hdfs://0.0.0.0:19000/Sales.csv")
    // Write the DataFrame to HDFS in Parquet format
    df.write.parquet("hdfs://0.0.0.0:19000/Sales.parquet")
  }

  def readParquet(sqlContext: SQLContext) = {
    // Read the Parquet files back into a DataFrame
    val newDataDF = sqlContext.read.parquet("hdfs://0.0.0.0:19000/Sales.parquet")
    // Show the contents
    newDataDF.show()
  }
}

The output should be similar to the previous example.
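
The code above uses the Spark 1.x SQLContext entry point. If you are running Spark 2.x or later, the same flow can be written against SparkSession instead; the following is a minimal sketch, assuming the same HDFS paths as the example above:

import org.apache.spark.sql.SparkSession

object ParquetTestSession {
  def main(args: Array[String]): Unit = {
    // SparkSession supersedes SQLContext as the entry point in Spark 2.x+
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("ParquetTest")
      .getOrCreate()

    // Read the CSV file from HDFS (same sample path as above)
    val df = spark.read.format("csv")
      .option("header", "true")
      .load("hdfs://0.0.0.0:19000/Sales.csv")

    // Write to Parquet on HDFS, then read it back and show the contents
    df.write.parquet("hdfs://0.0.0.0:19000/Sales.parquet")
    spark.read.parquet("hdfs://0.0.0.0:19000/Sales.parquet").show()

    spark.stop()
  }
}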

View the Parquet files in HDFS

The following command can be used to list the Parquet files:

F:\DataAnalytics\hadoop-3.0.0\sbin>hdfs dfs -ls /
Found 4 items
-rw-r--r--   1 fahao supergroup        167 2018-02-26 14:42 /Sales.csv
drwxr-xr-x   - fahao supergroup          0 2018-03-17 15:44 /Sales.parquet
-rw-r--r--   1 fahao supergroup        167 2018-02-26 14:11 /Sales2.csv
-rw-r--r--   1 fahao supergroup          9 2018-02-19 22:18 /test.txt
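
Note that Sales.parquet is a directory (marked with d in the listing) rather than a single file: Spark writes one or more part files, typically plus a _SUCCESS marker, inside it. You can list its contents with the same command (the exact part file names will vary by run):

F:\DataAnalytics\hadoop-3.0.0\sbin>hdfs dfs -ls /Sales.parquet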

You can also browse the files through the HDFS web portal, and navigate into the Sales.parquet folder to view the individual part files.

