Differences between Hive External and Internal (Managed) Tables

Raymond Raymond event 2020-02-22 visibility 1,607
more_vert

In Hive, there are two types of tables can be created - internal and external table. Internal tables are also called managed tables. Different features are available to different types. This article lists some of the common differences. 

Internal table

By default, Hive creates internal tables. These tables' properties and data layout will and can only be changed via Hive commands. The data for managed tables are stored under the hive.metastore.warehouse.dir path. You can also specify a location via LOCATION sub-statement when creating tables. 

CREATE [TEMPORARY] TABLE [IF NOT EXISTS] [db_name.]table_name    -- (Note: TEMPORARY available in Hive 0.14.0 and later)
  [(col_name data_type [column_constraint_specification] [COMMENT col_comment], ... [constraint_specification])]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
  [SKEWED BY (col_name, col_name, ...)                  -- (Note: Available in Hive 0.10.0 and later)]
     ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
     [STORED AS DIRECTORIES]
  [
   [ROW FORMAT row_format] 
   [STORED AS file_format]
     | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]  -- (Note: Available in Hive 0.6.0 and later)
  ]
  [LOCATION hdfs_path]
  [TBLPROPERTIES (property_name=property_value, ...)]   -- (Note: Available in Hive 0.6.0 and later)
  [AS select_statement];   -- (Note: Available in Hive 0.5.0 and later; not supported for external tables)
 
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
  LIKE existing_table_or_view_name
  [LOCATION hdfs_path];

For complete syntax, please refer to Hive DDL Language Manual.

For example, the following statement creates a Hive internal table with storage format as Parquet. The data is stored in HDFS path: /data/internaltable.

CREATE TABLE IF NOT EXISTS `internal-table`(
`id` int,
`name` string)
STORED AS PARQUET
LOCATION '/data/internaltable'

External table

For creating external tables, we need to specify EXTERNAL in CREATE TABLE statement. An external table only describes the metadata/schema on external files. MSCK REPAIR TABLE statement can be used to refresh table metadata information when the structure of partitions of an external table has changed.

The following code snippet creates a Hive external table with data stored in /data/externaltable.

CREATE EXTERNAL TABLE IF NOT EXISTS `external-table`(
`id` int,
`name` string)
STORED AS PARQUET
LOCATION '/data/externaltable'

The external sources can be HDFS (hdfs://), Azure Storage (wasb:///), Google Cloud Storage (gs://), AWS S3 (s3://), etc. The following cloud storage examples can be run in each cloud's managed Hadoop/Hive instances. When file protocol is not specified, HDFS will be used by default. 

Azure Storage example

CREATE EXTERNAL TABLE IF NOT EXISTS `external-table`(
`id` int,
`name` string)
STORED AS PARQUET
LOCATION 'wasb:///data/externaltable'

Google Cloud Storage example

CREATE EXTERNAL TABLE IF NOT EXISTS `external-table`(
`id` int,
`name` string)
STORED AS PARQUET
LOCATION 'gs://myproject-warehouse/data/externaltable'

AWS S3 example

CREATE EXTERNAL TABLE IF NOT EXISTS `external-table`(
`id` int,
`name` string)
STORED AS PARQUET
LOCATION 's3://data/externaltable/'

Identify internal and external tables

Internal and external tables can be identified via the following SQL statements:

DESCRIBE FORMATTEED table_name;
SHOW TABLE table_name;

Each of the statement will show you whether the table is created as external table.

Differences between internal and external tables

The following table lists incomplete items of the differences:

FeatureInternal TableExternal Table
Statements: 
ARCHIVE
UNARCHIVE
TRUNCATE
MERGE
CONCATENATE

Yes
DROP deletes data from storageYes
DROP deletes only metadata
Yes
ACID/transnational actions support Yes
Query Result CachingYes
RELY constraints (cost-based-optimization)
Yes
Temporary tableYes
For scenarios that data is also used outside of Hive
Yes
CREATE TABLE AS SELECTYes
More from Kontext
comment Comments
No comments yet.

Please log in or register to comment.

account_circle Log in person_add Register

Log in with external accounts