Write and read parquet files in Python / Spark


Parquet is a columnar storage format published by Apache. It is commonly used in the Hadoop ecosystem, and APIs in many programming languages have been implemented for writing and reading Parquet files.

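For plain Python (outside Spark), one common option is the pandas API backed by pyarrow. The snippet below is a minimal sketch, assuming the pandas and pyarrow packages are installed and using a hypothetical Sales.csv input file.

import pandas as pd

# Read a CSV file into a pandas DataFrame (Sales.csv is a hypothetical input file)
df = pd.read_csv("Sales.csv")

# Write the DataFrame to a Parquet file; this requires pyarrow (or fastparquet)
df.to_parquet("Sales.parquet", engine="pyarrow", index=False)

# Read the Parquet file back and inspect the first rows
df2 = pd.read_parquet("Sales.parquet", engine="pyarrow")
print(df2.head())
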
You can also use PySpark to read or write Parquet files.

Code snippet

from pyspark.sql import SparkSession

appName = "PySpark Parquet Example"
master = "local"

# Create a Spark session
spark = SparkSession.builder.appName(appName).master(master).getOrCreate()

# Read a CSV file (with a header row) into a DataFrame
df = spark.read.format("csv").option("header", "true").load("Sales.csv")

# Write the DataFrame out as Parquet
df.write.parquet("Sales.parquet")

# Read the Parquet output back and show the contents
df2 = spark.read.parquet("Sales.parquet")
df2.show()
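
If you need more control over how the Parquet output is written, DataFrameWriter options such as mode and partitionBy can be used. The following is a minimal sketch that assumes the same df as above and a hypothetical Country column in Sales.csv.

# Overwrite any existing output and partition the Parquet files by a column
# ("Country" is a hypothetical column; replace it with one that exists in your data)
df.write.mode("overwrite").partitionBy("Country").parquet("Sales_partitioned.parquet")

# Reading the partitioned folder back recovers the partition column automatically
df3 = spark.read.parquet("Sales_partitioned.parquet")
df3.printSchema()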