Apache Hive manages two different types of tables. Internal table and External table. The internal table is called the Manage table as well and for External tables, Hive assumes that it does not manage the data. Managed and External tables can be identified using the DESCRIBE FORMATTED table_name command, which will display either Manage table or External table depending on table type.
Let us see each table type in detail.
1. Internal Table
When a user creates a table in Hive without specifying the “external” keyword in the create table statement then by default table is created as an Internal table. It is saved in the hive.metastore.warehouse.dir path property and the folder would be /user/hive/warehouse/databasename.db/tablename/. If a user drops an internal table then the data and metadata both are dropped.
The Internal table should be used in the following conditions.
- We should create an Internal table when generating temporary tables.
- We should create an Internal table when Hive should manage the lifecycle of the table.
Let us create an Internal Table and then perform LOAD, DESCRIBE, and DROP operations.
Create an Internal Table Statement:
We will create a table with the name “internaltable”.
Table Load Statement:
We will load data in “internaltable”.
Table Describe Statement:
When we describe the table with the FORMATTED option we can see it's default location would be “hdfs://localhost:9000/user/hive/warehouse/internaltable”.
Command Output:
Drop Internal Table:
We can drop Internal tables using the Drop table command. Once we drop the table, it’s data and metadata also gets deleted.
2. External Table
An External table is managed by another process and not by Apache Hive. For external tables, the data is stored on external sources such as remote Hadoop file system(HDFS) or cloud storage such as Azure Storage Volumes. If a user drops the external table then the data remains but the metadata entry is dropped.
The following are the conditions in which the External table is used.
- We should create an External table when data is not owned by HIVE.
- We should create an External table when we don’t want to drop data even after the DROP table.
- We should create an External table as an independent table and not based on some other table.
- We should create an External table when we want to access data outside of Hive.
Let us create an External Table and then perform LOAD, DESCRIBE, and DROP operations.
Create External Table Statement:
We will create a table with the name “externaltable”.
Table Load Statement:
We will load data in “externaltable”.
Table Describe Statement:
When we describe the table with the FORMATTED option we can see its location is not defaulted and moved to “hdfs://localhost:9000/hive/warehouse” per.
Drop External Table:
When we describe the table with the FORMATTED option we can see its location is not defaulted and moved to “hdfs://localhost:9000/hive/warehouse”. After dropping the External table its metadata only gets deleted but data remains the same.
Difference Between Internal and External Tables
The following is the difference between Apache Hive Internal and External tables.
Internal Table | External Table | In the Internal table, data is moved to the “/user/hive/warehouse” directory. | In the External table, data does not move to the “/user/hive/warehouse” directory. | If we drop the Internal table then its data along with metadata also gets deleted. | If we drop the External table then its metadata only gets deleted but the data remains the same. | Internal table supports TRUNCATE operation. | The external table does not support the TRUNCATE operation. | Internal table supports ACID transactions. | The external table does not support ACID transactions. | Query Results Caching only works for Internal tables. | Query Results Caching does not work for Internal tables. |
---|