Kontext Kontext

Extract Values from XML Column in Hive Tables

event 2022-07-23 visibility 881 comment 0 insights toc
more_vert
insights Stats
Extract Values from XML Column in Hive Tables
Apache Hive is a broadly used data warehousing product based on Hadoop ecosystem. It provides built-in functions like xpath, xpath_short, xpath_int, xpath_long, xpath_float, xpath_double, xpath_number, xpath_string and so on to extract values from XML. All these functions are powered by underlying JDK class Java XPath library javax.xml.xpath.

Syntax of these functions

All these functions follow the following syntax:

xpath_*(xml_string, xpath_expression_string)

These functions can be directly applied to string literals or XML string columns in your Hive table. Now let's start to look into some examples. For simplicity, I will directly use string literals and you can replace them with your table columns as necessary.

XPath is a W3C standard to query XML data. This article is not about XPath itself. If you want to learn more about XPath, refer to the XPath official documentation.

Use xpath

Function xpath returns a array of strings. It is used to return a list of node text values of attribute values.

Example:

hive> select xpath('<records><rec id="1" value="A"/><rec id="2" value="B"/><rec id="3" value="C"/><rec id="4" value="D"/></records>', '//rec/@id');
OK
["1","2","3","4"]

The above query returns a list of values for attribute node id.

If we change the XPath to retrieve nodes, it will just return an empty array.

Use xpath_string

Function xpath_string can be used to return the text of the first match node.

The following are some examples.

hive> select xpath_string('<records><rec id="1" value="A">Node A</rec><rec id="2" value="B">Node B</rec></records>', '//
rec');
OK
Node A

Return node's text that has a certain attribute:

hive> select xpath_string('<records><rec id="1" value="A">Node A</rec><rec id="2" value="B">Node B</rec></records>', '//
rec[@id=2]');
OK
Node B

Use xpath_boolean

Function xpath_boolean returns true is the XPath expression is evaluated to true or if a matching node is found.

hive> select xpath_boolean('<records><rec id="1" value="A">Node A</rec><rec id="2" value="B">Node B</rec></records>', '//rec');
OK
true

hive> select xpath_boolean('<records><rec id="1" value="A">Node A</rec><rec id="2" value="B">Node B</rec></records>', '//rec[@id=3]');
OK
false

hive> select xpath_boolean('<records><rec id="1" value="A">Node A</rec><rec id="2" value="B">Node B</rec></records>', '//rec[@id=2]');
OK
true

Use xpath_short, xpath_int, xpath_long

These three functions can be used extract integer values from an XML string. It returns 0 if no match is found.

hive> select xpath_short('<records><rec id="1" value="A">Node A</rec><rec id="2" value="B">Node B</rec></records>', '//rec[@value="A"]/@id');
OK
1

hive> select xpath_int('<records><rec id="1" value="A">Node A</rec><rec id="2" value="B">Node B</rec></records>', '//rec[@value="A"]/@id');
OK
1

hive> select xpath_long('<records><rec id="1" value="A">Node A</rec><rec id="2" value="B">Node B</rec></records>', '//rec[@value="A"]/@id');
OK
1

The above three HQL queries return the same result as the value of id attribute is very small and can fit into all three types.

Use xpath_float, xpath_double, xpath_number

Similar as the above integer functions, these three functions are used to extract float numbers. Function xpath_number is the alias of xpath_double. If the extracted value is not numeric, the returned value will be NaN.

The following are some examples to use these three functions in Hive QL:

hive> select xpath_float('<records><rec id="1" value="A">Node A</rec><rec id="2" value="B">Node B</rec></records>', '//rec[@value="A"]/@id');
OK
1.0

hive> select xpath_double('<records><rec id="1" value="A">Node A</rec><rec id="2" value="B">Node B</rec></records>', '//rec[@value="A"]/@id');
OK
1.0

hive> select xpath_number('<records><rec id="1" value="A">Node A</rec><rec id="2" value="B">Node B</rec></records>', '//rec/@value');
OK
NaN

The last statement returns NaN as the value is not numeric.

More from Kontext
comment Comments
No comments yet.

Please log in or register to comment.

account_circle Log in person_add Register

Log in with external accounts