About Configuration spark.sql.optimizer.metadataOnly

Recently, one of my colleague asked me one question about Spark: for the same SQL statement on finding max value of partition column, different values are returned in Spark SQL and Hive/Impala SQL.

The SQL statement looks like the following:

SELECT MAX(PART_COL) FROM HiveDb.TestSQL;

PART_COL is the partition column of external table HiveDb.TestSQL. The table is stored as Parquet format in HDFS.

Look into partitions

Run the following statement to show all the partitions in Hive metadata:

show partitions HiveDb.TestSQL;

The result shows the last partition was added into the metadata but the actual HDFS path doesn't exist. When running the SQL statements in Hive, the MapReduce job actually scans the partitions in HDFS. Since the last partition HDFS path doesn't exist, Hive engine returns the second last partition in the result.

Spark configuration

Spark SQL engine will try to optimize query plans.

For this scenario, the following configuration item impacts:

spark.sql.optimizer.metadataOnly: When true, enable the metadata-only query optimization that use the table's metadata to produce the partition columns instead of table scans. It applies when all the columns scanned are partition columns and the query has an aggregate operator that satisfies distinct semantics.

The default value is true, which means Spark will directly read from the Hive metadata for the above case.

To fix the issue, we can simply set the above configuration as false when establishing Spark session.

from pyspark.sql import SparkSession

appName = "PySpark Hive Example"
master = "local"

# Create Spark session with Hive supported.
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .config('spark.sql.optimizer.metadataOnly','false') \
    .enableHiveSupport() \
    .getOrCreate()

MSCK REPAIR TABLE

Command MSCK REPAIR TABLE can be used to fix partitions in Hive metadata however it only adds the missing ones but will not remove the ones added in metadata but not existing in HDFS.

Example:

MSCK REPAIR TABLE HiveDb.TestSQL;

References

Spark SQL and DataFrame

Look into partitions

Spark configuration

MSCK REPAIR TABLE

References

In this article