Differences between Hive External and Internal (Managed) Tables


In Hive, two types of tables can be created: internal and external. Internal tables are also called managed tables. Different features are available to each type, and this article lists some of the common differences.

Internal table

By default, Hive creates internal tables. Their properties and data layout can only be changed via Hive commands. The data for managed tables is stored under the path configured by hive.metastore.warehouse.dir, unless a different location is specified via the LOCATION clause when creating the table.
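
To check where the warehouse directory points on your cluster, you can print the configuration value from a Hive session; the value shown below is just an illustrative default.

SET hive.metastore.warehouse.dir;
-- Typical output (the actual value varies per cluster):
-- hive.metastore.warehouse.dir=/user/hive/warehouse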

CREATE [TEMPORARY] TABLE [IF NOT EXISTS] [db_name.]table_name    -- (Note: TEMPORARY available in Hive 0.14.0 and later)
  [(col_name data_type [column_constraint_specification] [COMMENT col_comment], ... [constraint_specification])]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
  [SKEWED BY (col_name, col_name, ...)                  -- (Note: Available in Hive 0.10.0 and later)]
     ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
     [STORED AS DIRECTORIES]
  [
   [ROW FORMAT row_format] 
   [STORED AS file_format]
     | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]  -- (Note: Available in Hive 0.6.0 and later)
  ]
  [LOCATION hdfs_path]
  [TBLPROPERTIES (property_name=property_value, ...)]   -- (Note: Available in Hive 0.6.0 and later)
  [AS select_statement];   -- (Note: Available in Hive 0.5.0 and later; not supported for external tables)
 
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
  LIKE existing_table_or_view_name
  [LOCATION hdfs_path];

For the complete syntax, please refer to the Hive DDL Language Manual.
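
As a quick illustration of the LIKE variant in the syntax above, the following sketch creates an empty table that copies only the schema and storage format of an existing one; both table names here are hypothetical.

-- Creates an empty table with the same column definitions as `sales`;
-- no data is copied:
CREATE TABLE IF NOT EXISTS sales_backup LIKE sales;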

For example, the following statement creates a Hive internal table stored in Parquet format. The data is stored in the HDFS path /data/internaltable. Note that Hive table names may only contain alphanumeric characters and underscores, so underscores are used in the table names below.

CREATE TABLE IF NOT EXISTS `internal_table`(
  `id` int,
  `name` string)
STORED AS PARQUET
LOCATION '/data/internaltable';

External table

To create an external table, specify the EXTERNAL keyword in the CREATE TABLE statement. An external table only describes the metadata/schema for files that live outside of Hive's control. The MSCK REPAIR TABLE statement can be used to refresh the table's partition metadata when the partition directories of an external table have changed on the file system, as sketched below.
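
For example, assuming a hypothetical external table logs partitioned by dt, partition directories written directly to storage are not visible to Hive until the metastore is synchronized:

-- Hypothetical partitioned external table:
CREATE EXTERNAL TABLE IF NOT EXISTS logs (`msg` string)
PARTITIONED BY (`dt` string)
STORED AS PARQUET
LOCATION '/data/logs';

-- A process writes files to /data/logs/dt=2020-06-01/ outside of Hive.
-- Hive cannot see the new partition yet:
SELECT COUNT(*) FROM logs WHERE dt = '2020-06-01';   -- returns 0

-- Synchronize the partition metadata with the directories found on storage:
MSCK REPAIR TABLE logs;

-- The new partition and its data are now queryable.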

The following code snippet creates a Hive external table with data stored in /data/externaltable.

CREATE EXTERNAL TABLE IF NOT EXISTS `external_table`(
  `id` int,
  `name` string)
STORED AS PARQUET
LOCATION '/data/externaltable';

The external location can be HDFS (hdfs://), Azure Storage (wasb://), Google Cloud Storage (gs://), AWS S3 (s3://), etc. The following cloud storage examples can be run on each cloud's managed Hadoop/Hive service. When no file system protocol is specified, the cluster's default file system (typically HDFS) is used.

Azure Storage example

CREATE EXTERNAL TABLE IF NOT EXISTS `external_table`(
  `id` int,
  `name` string)
STORED AS PARQUET
LOCATION 'wasb:///data/externaltable';

Google Cloud Storage example

CREATE EXTERNAL TABLE IF NOT EXISTS `external_table`(
  `id` int,
  `name` string)
STORED AS PARQUET
LOCATION 'gs://myproject-warehouse/data/externaltable';

AWS S3 example

CREATE EXTERNAL TABLE IF NOT EXISTS `external_table`(
  `id` int,
  `name` string)
STORED AS PARQUET
LOCATION 's3://data/externaltable/';

Identifying internal and external tables

Internal and external tables can be identified via the following SQL statements:

DESCRIBE FORMATTED table_name;
SHOW CREATE TABLE table_name;

Each statement will show you whether the table was created as an external table.
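
For instance, DESCRIBE FORMATTED prints a Table Type field (output heavily trimmed; the exact layout varies by Hive version):

DESCRIBE FORMATTED external_table;
-- ...
-- Table Type:           EXTERNAL_TABLE
-- ...
-- For an internal table, the same field shows MANAGED_TABLE.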

Differences between internal and external tables

The following table lists some of the differences (the list is not exhaustive):

Feature                                                 Internal Table   External Table
Statements ARCHIVE/UNARCHIVE/TRUNCATE/MERGE/CONCATENATE Yes
DROP deletes data from storage                          Yes
DROP deletes only metadata                                               Yes
ACID/transactional actions support                      Yes
Query Result Caching                                    Yes
RELY constraints (cost-based optimization)                               Yes
Temporary table                                         Yes
For scenarios where data is also used outside of Hive                    Yes
CREATE TABLE AS SELECT                                  Yes
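
The DROP behavior is the difference that most often surprises users. As a minimal sketch using the two tables created earlier:

-- Removes the metadata AND deletes the files under /data/internaltable:
DROP TABLE internal_table;

-- Removes only the metadata; the Parquet files under /data/externaltable
-- remain on HDFS and can still be read by other tools:
DROP TABLE external_table;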