About Configuration spark.sql.optimizer.metadataOnly


Recently, one of my colleagues asked me a question about Spark: for the same SQL statement that finds the maximum value of a partition column, Spark SQL and Hive/Impala SQL return different values.

The SQL statement looks like the following:

SELECT MAX(PART_COL) FROM HiveDb.TestSQL;

PART_COL is the partition column of the external table HiveDb.TestSQL. The table is stored in Parquet format on HDFS.

Look into partitions 

Run the following statement in Hive to show all the partitions registered in the Hive metastore:

SHOW PARTITIONS HiveDb.TestSQL;

The result shows that the last partition was added to the metastore, but its HDFS path no longer exists. When the SQL statement runs in Hive, the MapReduce job actually scans the partition directories in HDFS. Since the HDFS path of the last partition doesn't exist, the Hive engine returns the second-to-last partition value in the result.
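To spot such stale entries, it can help to compare what the metastore reports with the directories that actually exist on HDFS. Below is a minimal sketch, assuming a Hive-enabled session named spark and a hypothetical table location /data/HiveDb/TestSQL; it also uses PySpark's internal _jsc and _jvm handles to reach the Hadoop FileSystem API, which is not a public interface.

# Partitions registered in the Hive metastore, e.g. 'part_col=2020-01-01'
metastore_parts = {r[0] for r in spark.sql("SHOW PARTITIONS HiveDb.TestSQL").collect()}

# Partition directories that actually exist under the table location on HDFS
hadoop_conf = spark._jsc.hadoopConfiguration()
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_conf)
table_path = spark._jvm.org.apache.hadoop.fs.Path("/data/HiveDb/TestSQL")
hdfs_parts = {str(status.getPath().getName()) for status in fs.listStatus(table_path)}

# Entries known to the metastore whose HDFS directory is missing
print(metastore_parts - hdfs_parts)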

Spark configuration 

The Spark SQL engine tries to optimize query plans.

For this scenario, the behaviour is controlled by the following configuration item:

  • spark.sql.optimizer.metadataOnly: When true, enables the metadata-only query optimization that uses the table's metadata to produce the partition columns instead of table scans. It applies when all the columns scanned are partition columns and the query has an aggregate operator that satisfies distinct semantics.

The default value is true, which means that for the above query Spark reads the partition values directly from the Hive metastore instead of scanning the data, as illustrated below.
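The effect can be demonstrated by toggling the configuration in an existing Hive-enabled session. This is a minimal sketch, assuming a session variable named spark; spark.sql.optimizer.metadataOnly is a runtime SQL configuration, so it can be changed without recreating the session.

# Metadata-only optimization on (the default): MAX(PART_COL) is answered
# from the metastore, so the dangling partition value is returned.
spark.conf.set("spark.sql.optimizer.metadataOnly", "true")
spark.sql("SELECT MAX(PART_COL) FROM HiveDb.TestSQL").show()

# Optimization off: the query reads the partitions that actually exist,
# matching the Hive/Impala result.
spark.conf.set("spark.sql.optimizer.metadataOnly", "false")
spark.sql("SELECT MAX(PART_COL) FROM HiveDb.TestSQL").show()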

To fix the issue, we can simply set the above configuration to false when establishing the Spark session:

from pyspark.sql import SparkSession

appName = "PySpark Hive Example"
master = "local"

# Create Spark session with Hive support enabled.
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .config('spark.sql.optimizer.metadataOnly', 'false') \
    .enableHiveSupport() \
    .getOrCreate()
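With the session created this way, running the query again no longer relies on the metastore alone, so the result matches what Hive/Impala return:

spark.sql("SELECT MAX(PART_COL) FROM HiveDb.TestSQL").show()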

MSCK REPAIR TABLE

The MSCK REPAIR TABLE command can be used to fix partitions in the Hive metastore; however, it only adds partitions that exist in HDFS but are missing from the metastore. It will not remove partitions that are registered in the metastore but no longer exist in HDFS.

Example:

MSCK REPAIR TABLE HiveDb.TestSQL;
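The same command can also be issued from a Hive-enabled Spark session. A minimal sketch, assuming the spark session from above, which re-checks what the metastore reports afterwards:

spark.sql("MSCK REPAIR TABLE HiveDb.TestSQL")
spark.sql("SHOW PARTITIONS HiveDb.TestSQL").show(truncate=False)

Note that partitions whose HDFS directories were deleted will still appear in the listing after the repair, since the command only adds missing entries.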

References

Spark SQL and DataFrame
