Recently, one of my colleague asked me one question about Spark: for the same SQL statement on finding max value of partition column, different values are returned in Spark SQL and Hive/Impala SQL.
The SQL statement looks like the following:
SELECT MAX(PART_COL) FROM HiveDb.TestSQL;
PART_COL is the partition column of external table HiveDb.TestSQL. The table is stored as Parquet format in HDFS.
Look into partitions
Run the following statement to show all the partitions in Hive metadata:
show partitions HiveDb.TestSQL;
The result shows the last partition was added into the metadata but the actual HDFS path doesn't exist. When running the SQL statements in Hive, the MapReduce job actually scans the partitions in HDFS. Since the last partition HDFS path doesn't exist, Hive engine returns the second last partition in the result.
Spark SQL engine will try to optimize query plans.
For this scenario, the following configuration item impacts:
- spark.sql.optimizer.metadataOnly: When true, enable the metadata-only query optimization that use the table's metadata to produce the partition columns instead of table scans. It applies when all the columns scanned are partition columns and the query has an aggregate operator that satisfies distinct semantics.
The default value is true, which means Spark will directly read from the Hive metadata for the above case.
To fix the issue, we can simply set the above configuration as false when establishing Spark session.
from pyspark.sql import SparkSession appName = "PySpark Hive Example" master = "local" # Create Spark session with Hive supported. spark = SparkSession.builder \ .appName(appName) \ .master(master) \ .config('spark.sql.optimizer.metadataOnly','false') \ .enableHiveSupport() \ .getOrCreate()
MSCK REPAIR TABLE
Command MSCK REPAIR TABLE can be used to fix partitions in Hive metadata however it only adds the missing ones but will not remove the ones added in metadata but not existing in HDFS.
MSCK REPAIR TABLE HiveDb.TestSQL;