About Configuration spark.sql.optimizer.metadataOnly

access_time 27 days ago visibility49 comment 0

Recently, one of my colleague asked me one question about Spark: for the same SQL statement on finding max value of partition column, different values are returned in Spark SQL and Hive/Impala SQL.

The SQL statement looks like the following:


PART_COL is the partition column of external table HiveDb.TestSQL. The table is stored as Parquet format in HDFS.

Look into partitions 

Run the following statement to show all the partitions in Hive metadata:

show partitions HiveDb.TestSQL;

The result shows the last partition was added into the metadata but the actual HDFS path doesn't exist. When running the SQL statements in Hive, the MapReduce job actually scans the partitions in HDFS. Since the last partition HDFS path doesn't exist, Hive engine returns the second last partition in the result.

Spark configuration 

Spark SQL engine will try to optimize query plans.

For this scenario, the following configuration item impacts:

  • spark.sql.optimizer.metadataOnly: When true, enable the metadata-only query optimization that use the table's metadata to produce the partition columns instead of table scans. It applies when all the columns scanned are partition columns and the query has an aggregate operator that satisfies distinct semantics.

The default value is true, which means Spark will directly read from the Hive metadata for the above case. 

To fix the issue, we can simply set the above configuration as false when establishing Spark session.

from pyspark.sql import SparkSession

appName = "PySpark Hive Example"
master = "local"

# Create Spark session with Hive supported.
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .config('spark.sql.optimizer.metadataOnly','false') \
    .enableHiveSupport() \


Command MSCK REPAIR TABLE can be used to fix partitions in Hive metadata however it only adds the missing ones but will not remove the ones added in metadata but not existing in HDFS.




Spark SQL and DataFrame

copyright This page is subject to Site terms.
Like this article?
Share on

Please log in or register to comment.

account_circle Log in person_add Register

Log in with external accounts

Want to publish your article on Kontext?

Learn more

Kontext Column

Created for everyone to publish data, programming and cloud related articles.
Follow three steps to create your columns.

Learn more arrow_forward

More from Kontext

Spark Structured Streaming - Read from and Write into Kafka Topics

local_offer spark local_offer kafka

visibility 398
thumb_up 0
access_time 2 months ago

Spark structured streaming provides rich APIs to read from and write to Kafka topics. When reading from Kafka, Kafka sources can be created for both streaming and batch queries. When writing into Kafka, Kafka sinks can be created as destination for both streaming and batch queries too.  ...

local_offer python local_offer spark local_offer hadoop local_offer pyspark

visibility 1513
thumb_up 0
access_time 2 years ago

In one of my previous articles about Password Security Solution for Sqoop , I mentioned creating credential using hadoop credential command. The credentials are stored in JavaKeyStoreProvider. Credential providers are used to separate the use of sensitive tokens, secrets and passwords from the ...

local_offer spark local_offer pyspark local_offer partitioning local_offer spark-advanced

visibility 7061
thumb_up 3
access_time 2 years ago

In my previous post about Data Partitioning in Spark (PySpark) In-depth Walkthrough , I mentioned how to repartition data frames in Spark using repartition or coalesce functions. In this post, I am going to explain how Spark partition data using partitioning functions. Partitioner class is ...

About column

Apache Spark installation guides, performance tuning tips, general tutorials, etc.

*Spark logo is a registered trademark of Apache Spark.

rss_feed Subscribe RSS