Hive SQL - Virtual Columns

Hive provides a number of virtual columns that can be used to retrieve metadata about the data being read, for example the input file a record comes from.

Virtual columns

Virtual columns have been available since Hive 0.8.0. The following columns are supported:

  • INPUT__FILE__NAME: The name of the input file for the mapper task.
  • BLOCK__OFFSET__INSIDE__FILE: The current global file position. For a block-compressed file, it is the file offset of the current block's first byte.
  • ROW__OFFSET__INSIDE__BLOCK: The row offset inside the current block.
  • RAW__DATA__SIZE: The raw data size. This column is not available in Hive 3 at the moment.
  • ROW__ID: The row ID of the record.
  • GROUPING__ID: The grouping ID of the record.

Some of the virtual columns can only be used with certain syntax.

Examples
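
The examples below query a small transactions table. The following is a minimal sketch of a table definition they could run against; the column names and types are assumptions, not taken from the actual table.

create table transactions (
  acct      int,
  amount    decimal(10,2),
  tran_date date
)
stored as textfile;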

Use INPUT__FILE__NAME

The following HQL uses INPUT__FILE__NAME:

hive> select *, INPUT__FILE__NAME from transactions;
OK
101     10.01   2021-01-01      hdfs://localhost:9000/user/hive/warehouse/hivesql.db/transactions/000000_0
101     102.01  2021-01-01      hdfs://localhost:9000/user/hive/warehouse/hivesql.db/transactions/000000_0
102     93.00   2021-01-01      hdfs://localhost:9000/user/hive/warehouse/hivesql.db/transactions/000000_0
103     913.10  2021-01-02      hdfs://localhost:9000/user/hive/warehouse/hivesql.db/transactions/000000_0
102     913.10  2021-01-02      hdfs://localhost:9000/user/hive/warehouse/hivesql.db/transactions/000000_0
101     900.56  2021-01-03      hdfs://localhost:9000/user/hive/warehouse/hivesql.db/transactions/000000_0

As the result shows, all six records of the Hive table transactions are stored in the same HDFS file: hdfs://localhost:9000/user/hive/warehouse/hivesql.db/transactions/000000_0.
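
INPUT__FILE__NAME can also be referenced like an ordinary column, for example to count how many records each file contributes. The following is a minimal sketch against the same transactions table:

select INPUT__FILE__NAME, count(*) as record_count
from transactions
group by INPUT__FILE__NAME;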

Use BLOCK__OFFSET__INSIDE__FILE

The following is another example, which uses the offset-related virtual column BLOCK__OFFSET__INSIDE__FILE.

hive> select *, BLOCK__OFFSET__INSIDE__FILE from transactions;
OK
101     10.01   2021-01-01      0
101     102.01  2021-01-01      21
102     93.00   2021-01-01      43
103     913.10  2021-01-02      64
102     913.10  2021-01-02      86
101     900.56  2021-01-03      108

The result includes a column that shows the byte offset at which each record starts within the file.
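
The offset can be combined with INPUT__FILE__NAME to pinpoint where a particular record sits in the underlying file, which can help when investigating bad input rows. A minimal sketch (the filter on acct is illustrative only):

select *, INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE
from transactions
where acct = 103;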

Use GROUPING__ID

select acct, count(*), GROUPING__ID from transactions group by acct with cube;

The above HQL uses GROUP BY ... WITH CUBE, which is the syntax that makes GROUPING__ID meaningful.

The result looks like the following:

NULL    6       1
101     3       0
102     2       0
103     1       0

Note that WITH CUBE also generates a grand-total grouping set in which acct is not grouped; in that row acct appears as NULL and GROUPING__ID is 1.
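
If only the per-account rows are wanted, GROUPING__ID can be used to filter out the CUBE totals. A minimal sketch using a subquery:

select acct, cnt
from (
  select acct, count(*) as cnt, GROUPING__ID as gid
  from transactions
  group by acct with cube
) t
where t.gid = 0;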

Use ROW__OFFSET__INSIDE__BLOCK

hive> set hive.exec.rowoffset=true;
hive> select *,ROW__OFFSET__INSIDE__BLOCK from transactions;
OK
101     10.01   2021-01-01      0
101     102.01  2021-01-01      0
102     93.00   2021-01-01      0
103     913.10  2021-01-02      0
102     913.10  2021-01-02      0
101     900.56  2021-01-03      0

For this table, the result is always 0. Note that the property hive.exec.rowoffset must be set to true (as shown above) before this virtual column can be used.

Use ROW__ID

This virtual column only returns values for the ORC format; for Parquet, the result is always NULL.

hive> select *,ROW__ID from transactions;
OK
101     10.01   2021-01-01      NULL
101     102.01  2021-01-01      NULL
102     93.00   2021-01-01      NULL
103     913.10  2021-01-02      NULL
102     913.10  2021-01-02      NULL
101     900.56  2021-01-03      NULL
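
For comparison, the following is a minimal sketch of a transactional ORC table in which ROW__ID is populated. The table name is hypothetical, and it assumes ACID transactions are enabled on the cluster.

-- requires Hive ACID support to be enabled
create table transactions_orc (
  acct      int,
  amount    decimal(10,2),
  tran_date date
)
stored as orc
tblproperties ('transactional'='true');

insert into transactions_orc select * from transactions;

-- for ACID ORC tables, ROW__ID is a struct containing the write id, bucket id and row id
select *, ROW__ID from transactions_orc;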