Hive SQL - Virtual Columns
Hive provides a number of virtual columns that can be used to retrieve metadata information of the data, for example, the input file for the record.
Virtual columns
The following columns are supported since Hive 0.8.0:
- INPUT__FILE__NAME: The input file name for the mapper task.
- BLOCK__OFFSET__INSIDE__FILE: Current file global position. As Hadoop store files as blocks, this virtual column can be used to decide the current block's first byte's offset in the global file.
- ROW__OFFSET__INSIDE__BLOCK: Row offset inside the current block where the row locates in.
- RAW__DATA__SIZE: Raw data size. This one is not visible at the moment in Hive 3.
- ROW__ID: Row ID of the record.
- GROUPING__ID: Grouping ID of the record.
Some of the virtual columns can only be used with certain syntax.
Examples
Use INPUT__FILE__NAME
The following HQL uses INPUT__FILE__NAME:
hive> select *, INPUT__FILE__NAME from transactions; OK 101 10.01 2021-01-01 hdfs://localhost:9000/user/hive/warehouse/hivesql.db/transactions/000000_0 101 102.01 2021-01-01 hdfs://localhost:9000/user/hive/warehouse/hivesql.db/transactions/000000_0 102 93.00 2021-01-01 hdfs://localhost:9000/user/hive/warehouse/hivesql.db/transactions/000000_0 103 913.10 2021-01-02 hdfs://localhost:9000/user/hive/warehouse/hivesql.db/transactions/000000_0 102 913.10 2021-01-02 hdfs://localhost:9000/user/hive/warehouse/hivesql.db/transactions/000000_0 101 900.56 2021-01-03 hdfs://localhost:9000/user/hive/warehouse/hivesql.db/transactions/000000_0
As the result shows, all the 6 records of Hive table transactions
are stored in the same HDFS file: hdfs://localhost:9000/user/hive/warehouse/hivesql.db/transactions/000000_0.
Use BLOCK__OFFSET__INSIDE__FILE
The following is another example that uses the offset related virtual columns.
hive> select *, BLOCK__OFFSET__INSIDE__FILE from transactions; OK 101 10.01 2021-01-01 0 101 102.01 2021-01-01 21 102 93.00 2021-01-01 43 103 913.10 2021-01-02 64 102 913.10 2021-01-02 86 101 900.56 2021-01-03 108
The result includes a column that shows the offset of the record in the block.
Use GROUPPING_ID
select acct, count(*),GROUPING__ID from transactions group by acct with cube;
The above HQL uses GROUP BY WITH CUBE and then GROUPING__ID can be used.
The result looks like the following:
NULL 6 1 101 3 0 102 2 0 103 1 0
Note, there are are partitions without any records thus a NULL group with ID 1 was created.
Use ROW_OFFSET__INSIDE_BLOCK
hive> set hive.exec.rowoffset=true; hive> select *,ROW__OFFSET__INSIDE__BLOCK from transactions; OK 101 10.01 2021-01-01 0 101 102.01 2021-01-01 0 102 93.00 2021-01-01 0 103 913.10 2021-01-02 0 102 913.10 2021-01-02 0 101 900.56 2021-01-03 0
The result is always 0.
Use ROW__ID
This virtual column will only return values for Orc format. For Parquet, the result is always NULL.
hive> select *,ROW__ID from transactions; OK 101 10.01 2021-01-01 NULL 101 102.01 2021-01-01 NULL 102 93.00 2021-01-01 NULL 103 913.10 2021-01-02 NULL 102 913.10 2021-01-02 NULL 101 900.56 2021-01-03 NULL