Differences between Hive External and Internal (Managed) Tables

access_time 12 months ago visibility427 comment 0

In Hive, there are two types of tables can be created - internal and external table. Internal tables are also called managed tables. Different features are available to different types. This article lists some of the common differences. 

Internal table

By default, Hive creates internal tables. These tables' properties and data layout will and can only be changed via Hive commands. The data for managed tables are stored under the hive.metastore.warehouse.dir path. You can also specify a location via LOCATION sub-statement when creating tables. 

CREATE [TEMPORARY] TABLE [IF NOT EXISTS] [db_name.]table_name    -- (Note: TEMPORARY available in Hive 0.14.0 and later)
  [(col_name data_type [column_constraint_specification] [COMMENT col_comment], ... [constraint_specification])]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
  [SKEWED BY (col_name, col_name, ...)                  -- (Note: Available in Hive 0.10.0 and later)]
     ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
     [STORED AS DIRECTORIES]
  [
   [ROW FORMAT row_format] 
   [STORED AS file_format]
     | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]  -- (Note: Available in Hive 0.6.0 and later)
  ]
  [LOCATION hdfs_path]
  [TBLPROPERTIES (property_name=property_value, ...)]   -- (Note: Available in Hive 0.6.0 and later)
  [AS select_statement];   -- (Note: Available in Hive 0.5.0 and later; not supported for external tables)
 
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
  LIKE existing_table_or_view_name
  [LOCATION hdfs_path];

For complete syntax, please refer to Hive DDL Language Manual.

For example, the following statement creates a Hive internal table with storage format as Parquet. The data is stored in HDFS path: /data/internaltable.

CREATE TABLE IF NOT EXISTS `internal-table`(
`id` int,
`name` string)
STORED AS PARQUET
LOCATION '/data/internaltable'

External table

For creating external tables, we need to specify EXTERNAL in CREATE TABLE statement. An external table only describes the metadata/schema on external files. MSCK REPAIR TABLE statement can be used to refresh table metadata information when the structure of partitions of an external table has changed.

The following code snippet creates a Hive external table with data stored in /data/externaltable.

CREATE EXTERNAL TABLE IF NOT EXISTS `external-table`(
`id` int,
`name` string)
STORED AS PARQUET
LOCATION '/data/externaltable'

The external sources can be HDFS (hdfs://), Azure Storage (wasb:///), Google Cloud Storage (gs://), AWS S3 (s3://), etc. The following cloud storage examples can be run in each cloud's managed Hadoop/Hive instances. When file protocol is not specified, HDFS will be used by default. 

Azure Storage example

CREATE EXTERNAL TABLE IF NOT EXISTS `external-table`(
`id` int,
`name` string)
STORED AS PARQUET
LOCATION 'wasb:///data/externaltable'

Google Cloud Storage example

CREATE EXTERNAL TABLE IF NOT EXISTS `external-table`(
`id` int,
`name` string)
STORED AS PARQUET
LOCATION 'gs://myproject-warehouse/data/externaltable'

AWS S3 example

CREATE EXTERNAL TABLE IF NOT EXISTS `external-table`(
`id` int,
`name` string)
STORED AS PARQUET
LOCATION 's3://data/externaltable/'

Identify internal and external tables

Internal and external tables can be identified via the following SQL statements:

DESCRIBE FORMATTEED table_name;
SHOW TABLE table_name;

Each of the statement will show you whether the table is created as external table.

Differences between internal and external tables

The following table lists incomplete items of the differences:

FeatureInternal TableExternal Table
Statements: 
ARCHIVE
UNARCHIVE
TRUNCATE
MERGE
CONCATENATE

Yes
DROP deletes data from storageYes
DROP deletes only metadata
Yes
ACID/transnational actions support Yes
Query Result CachingYes
RELY constraints (cost-based-optimization)
Yes
Temporary tableYes
For scenarios that data is also used outside of Hive
Yes
CREATE TABLE AS SELECTYes
copyright This page is subject to Site terms.
Like this article?
Share on

Please log in or register to comment.

account_circle Log in person_add Register

Log in with external accounts

Follow Kontext

Get our latest updates on LinkedIn or Twitter.

Want to publish your article on Kontext?

Learn more

More from Kontext

visibility 853
thumb_up 0
access_time 2 years ago

In Oracle database, you can select from dual table if you only want to return a one row result set. In many other databases, the query engine supports select directly from constant values without specifying a table name. 

visibility 3164
thumb_up 0
access_time 2 years ago

This post shows you how to connect to HiveServer2 via Hive JDBC driver in Java. *The way to connect to HiveServer1 is very similar though the driver names are different: Version Driver Name HiveServer1 org.apache.hadoop.hive.jdbc.HiveDriver HiveServer2 ...

Install Hadoop 3.3.0 on Windows 10 Step by Step Guide
visibility 6340
thumb_up 7
access_time 6 months ago

This detailed step-by-step guide shows you how to install the latest Hadoop v3.3.0 on Windows 10. It leverages Hadoop 3.3.0 winutils tool and WSL is not required. This version was released on July 14 2020. It is the first release of Apache Hadoop 3.3 line. There are significant changes compared with Hadoop 3.2.0, such as Java 11 runtime support, protobuf upgrade to 3.7.1, scheduling of opportunistic containers, non-volatile SCM support in HDFS cache directives, etc.