About Configuration spark.sql.optimizer.metadataOnly

visibility 514 comment 0 access_time 2y languageEnglish

Recently, one of my colleague asked me one question about Spark: for the same SQL statement on finding max value of partition column, different values are returned in Spark SQL and Hive/Impala SQL.

The SQL statement looks like the following:


PART_COL is the partition column of external table HiveDb.TestSQL. The table is stored as Parquet format in HDFS.

Look into partitions 

Run the following statement to show all the partitions in Hive metadata:

show partitions HiveDb.TestSQL;

The result shows the last partition was added into the metadata but the actual HDFS path doesn't exist. When running the SQL statements in Hive, the MapReduce job actually scans the partitions in HDFS. Since the last partition HDFS path doesn't exist, Hive engine returns the second last partition in the result.

Spark configuration 

Spark SQL engine will try to optimize query plans.

For this scenario, the following configuration item impacts:

  • spark.sql.optimizer.metadataOnly: When true, enable the metadata-only query optimization that use the table's metadata to produce the partition columns instead of table scans. It applies when all the columns scanned are partition columns and the query has an aggregate operator that satisfies distinct semantics.

The default value is true, which means Spark will directly read from the Hive metadata for the above case. 

To fix the issue, we can simply set the above configuration as false when establishing Spark session.

from pyspark.sql import SparkSession

appName = "PySpark Hive Example"
master = "local"

# Create Spark session with Hive supported.
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .config('spark.sql.optimizer.metadataOnly','false') \
    .enableHiveSupport() \


Command MSCK REPAIR TABLE can be used to fix partitions in Hive metadata however it only adds the missing ones but will not remove the ones added in metadata but not existing in HDFS.




Spark SQL and DataFrame

copyright This page is subject to Site terms.

Please log in or register to comment.

account_circle Log in person_add Register

Log in with external accounts

More from Kontext
visibility 10,094
thumb_up 0
access_time 3y
visibility 2,916
thumb_up 0
access_time 4y
visibility 1,003
thumb_up 0
access_time 12m