Extract Values from XML Column in Hive Tables
javax.xml.xpath
.Syntax of these functions
All these functions follow the following syntax:
xpath_*(xml_string, xpath_expression_string)
These functions can be directly applied to string literals or XML string columns in your Hive table. Now let's start to look into some examples. For simplicity, I will directly use string literals and you can replace them with your table columns as necessary.
XPath is a W3C standard to query XML data. This article is not about XPath itself. If you want to learn more about XPath, refer to the XPath official documentation.
Use xpath
Function xpath
returns a array of strings. It is used to return a list of node text values of attribute values.
Example:
hive> select xpath('<records><rec id="1" value="A"/><rec id="2" value="B"/><rec id="3" value="C"/><rec id="4" value="D"/></records>', '//rec/@id'); OK ["1","2","3","4"]
The above query returns a list of values for attribute node id
.
If we change the XPath to retrieve nodes, it will just return an empty array.
Use xpath_string
Function xpath_string
can be used to return the text of the first match node.
The following are some examples.
hive> select xpath_string('<records><rec id="1" value="A">Node A</rec><rec id="2" value="B">Node B</rec></records>', '// rec'); OK Node A
Return node's text that has a certain attribute:
hive> select xpath_string('<records><rec id="1" value="A">Node A</rec><rec id="2" value="B">Node B</rec></records>', '// rec[@id=2]'); OK Node B
Use xpath_boolean
Function xpath_boolean
returns true is the XPath expression is evaluated to true or if a matching node is found.
hive> select xpath_boolean('<records><rec id="1" value="A">Node A</rec><rec id="2" value="B">Node B</rec></records>', '//rec'); OK true hive> select xpath_boolean('<records><rec id="1" value="A">Node A</rec><rec id="2" value="B">Node B</rec></records>', '//rec[@id=3]'); OK false hive> select xpath_boolean('<records><rec id="1" value="A">Node A</rec><rec id="2" value="B">Node B</rec></records>', '//rec[@id=2]'); OK true
Use xpath_short, xpath_int, xpath_long
These three functions can be used extract integer values from an XML string. It returns 0 if no match is found.
hive> select xpath_short('<records><rec id="1" value="A">Node A</rec><rec id="2" value="B">Node B</rec></records>', '//rec[@value="A"]/@id'); OK 1 hive> select xpath_int('<records><rec id="1" value="A">Node A</rec><rec id="2" value="B">Node B</rec></records>', '//rec[@value="A"]/@id'); OK 1 hive> select xpath_long('<records><rec id="1" value="A">Node A</rec><rec id="2" value="B">Node B</rec></records>', '//rec[@value="A"]/@id'); OK 1
The above three HQL queries return the same result as the value of id
attribute is very small and can fit into all three types.
Use xpath_float, xpath_double, xpath_number
Similar as the above integer functions, these three functions are used to extract float numbers. Function xpath_number
is the alias of xpath_double
. If the extracted value is not numeric, the returned value will be NaN.
The following are some examples to use these three functions in Hive QL:
hive> select xpath_float('<records><rec id="1" value="A">Node A</rec><rec id="2" value="B">Node B</rec></records>', '//rec[@value="A"]/@id'); OK 1.0 hive> select xpath_double('<records><rec id="1" value="A">Node A</rec><rec id="2" value="B">Node B</rec></records>', '//rec[@value="A"]/@id'); OK 1.0 hive> select xpath_number('<records><rec id="1" value="A">Node A</rec><rec id="2" value="B">Node B</rec></records>', '//rec/@value'); OK NaN
The last statement returns NaN as the value is not numeric.