Apache Hive is a broadly used data warehousing product based on Hadoop ecosystem. It provides built-in functions like xpath, xpath_short, xpath_int, xpath_long, xpath_float, xpath_double, xpath_number, xpath_string and so on to extract values from XML. All these functions are powered by underlying JDK class Java XPath library javax.xml.xpath.
Syntax of these functions
All these functions follow the following syntax:
xpath_*(xml_string, xpath_expression_string)
These functions can be directly applied to string literals or XML string columns in your Hive table. Now let's start to look into some examples. For simplicity, I will directly use string literals and you can replace them with your table columns as necessary.
XPath is a W3C standard to query XML data. This article is not about XPath itself. If you want to learn more about XPath, refer to the XPath official documentation.
Use xpath
Function xpath returns a array of strings. It is used to return a list of node text values of attribute values.
Example:
hive> select xpath('<records><rec id="1" value="A"/><rec id="2" value="B"/><rec id="3" value="C"/><rec id="4" value="D"/></records>', '//rec/@id');
OK
["1","2","3","4"]
The above query returns a list of values for attribute node id.
If we change the XPath to retrieve nodes, it will just return an empty array.
Use xpath\_string
Function xpath_string can be used to return the text of the first match node.
The following are some examples.
hive> select xpath_string('<records><rec id="1" value="A">Node A</rec><rec id="2" value="B">Node B</rec></records>', '//
rec');
OK
Node A
Return node's text that has a certain attribute:
hive> select xpath_string('<records><rec id="1" value="A">Node A</rec><rec id="2" value="B">Node B</rec></records>', '//
rec[@id=2]');
OK
Node B
Use xpath\_boolean
Function xpath_boolean returns true is the XPath expression is evaluated to true or if a matching node is found.
hive> select xpath_boolean('<records><rec id="1" value="A">Node A</rec><rec id="2" value="B">Node B</rec></records>', '//rec');
OK
true
hive> select xpath_boolean('<records><rec id="1" value="A">Node A</rec><rec id="2" value="B">Node B</rec></records>', '//rec[@id=3]');
OK
false
hive> select xpath_boolean('<records><rec id="1" value="A">Node A</rec><rec id="2" value="B">Node B</rec></records>', '//rec[@id=2]');
OK
true
Use xpath\_short, xpath\_int, xpath\_long
These three functions can be used extract integer values from an XML string. It returns 0 if no match is found.
hive> select xpath_short('<records><rec id="1" value="A">Node A</rec><rec id="2" value="B">Node B</rec></records>', '//rec[@value="A"]/@id');
OK
1
hive> select xpath_int('<records><rec id="1" value="A">Node A</rec><rec id="2" value="B">Node B</rec></records>', '//rec[@value="A"]/@id');
OK
1
hive> select xpath_long('<records><rec id="1" value="A">Node A</rec><rec id="2" value="B">Node B</rec></records>', '//rec[@value="A"]/@id');
OK
1
The above three HQL queries return the same result as the value of id attribute is very small and can fit into all three types.
Use xpath\_float, xpath\_double, xpath\_number
Similar as the above integer functions, these three functions are used to extract float numbers. Function xpath_number is the alias of xpath_double. If the extracted value is not numeric, the returned value will be NaN.
The following are some examples to use these three functions in Hive QL:
hive> select xpath_float('<records><rec id="1" value="A">Node A</rec><rec id="2" value="B">Node B</rec></records>', '//rec[@value="A"]/@id');
OK
1.0
hive> select xpath_double('<records><rec id="1" value="A">Node A</rec><rec id="2" value="B">Node B</rec></records>', '//rec[@value="A"]/@id');
OK
1.0
hive> select xpath_number('<records><rec id="1" value="A">Node A</rec><rec id="2" value="B">Node B</rec></records>', '//rec/@value');
OK
NaN
The last statement returns NaN as the value is not numeric.