Write and read Parquet files in Scala / Spark


Parquet is a columnar storage format published by Apache. It is commonly used in the Hadoop ecosystem. APIs for reading and writing Parquet files have been implemented in many programming languages.

You can easily use Spark to read or write Parquet files. 

Code snippet

import org.apache.spark.sql.SparkSession

val appName = "Scala Parquet Example"
val master = "local"

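/* Create a Spark session (Hive support is not enabled here). */
val spark = SparkSession.builder.appName(appName).master(master).getOrCreate()
/* Read the CSV file, treating the first line as the header. */
val df = spark.read.format("csv").option("header", "true").load("Sales.csv")
/* Write the DataFrame as Parquet; Spark creates a directory named Sales.parquet. */
df.write.parquet("Sales.parquet")
/* Read the Parquet data back and print the first rows. */
val df2 = spark.read.parquet("Sales.parquet")
df2.show()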
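
Running the snippet above a second time fails, because Spark's default save mode raises an error when the output path already exists. The following is a minimal sketch of one way around that, using SaveMode.Overwrite. The object name, the helper writeAndRead, the sample rows, the column names and the temporary output path are all hypothetical and are not part of the original snippet; the in-memory rows only stand in for Sales.csv so that the example is self-contained.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

object ParquetOverwriteExample {
  // Hypothetical helper: writes df to path as Parquet, replacing any
  // existing output, then reads the data back from the same path.
  def writeAndRead(spark: SparkSession, df: DataFrame, path: String): DataFrame = {
    df.write.mode(SaveMode.Overwrite).parquet(path)
    spark.read.parquet(path)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("Scala Parquet Overwrite Example")
      .master("local")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample rows standing in for the contents of Sales.csv.
    val df = Seq(("apple", 3), ("pear", 5)).toDF("item", "quantity")

    // Hypothetical output location inside a fresh temporary directory.
    val path = Files.createTempDirectory("parquet-example")
      .resolve("Sales.parquet")
      .toString

    // Writing twice shows that the second write does not fail on the existing path.
    writeAndRead(spark, df, path)
    val df2 = writeAndRead(spark, df, path)

    // Parquet stores the schema with the data, so quantity is read back as an integer column.
    df2.printSchema()
    df2.show()

    spark.stop()
  }
}
```

SaveMode.Overwrite replaces whatever is already at the path, so it suits repeatable jobs but not output that should be kept between runs; SaveMode.Append is the alternative when existing data must be preserved.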

Last modified by Raymond 2 years ago.