2015-06-30

Using libxml for XPath

As more and more data is presented in XML, I find that parsing it gets harder as more ingredients (such as attributes and namespaces) are introduced.  I still find XPath the best way to refer to XML nodes, because it reduces a complicated data hierarchy to the familiar slash-separated form, like:
/A/B/C
which selects the C element in the XML:
<A>
  <B>
    <C/>
  </B>
</A>
(example borrowed from https://en.wikipedia.org/wiki/XPath)

Although I have used Java's built-in XPath features before, for a C project I have to resort to an external library for the job.  I finally chose libxml.  One of the headaches with libxml is its memory management: several calls hand back objects that you must free yourself, otherwise you will leak memory easily.
This post describes what I learned.

The source code of my example is xpath_demo.c, and the full listing is included in another post (link).

Compilation
Most installations of libxml live under /usr/local, so the sample program xpath_demo.c is compiled with the following switches:
cc -o xpath_demo -L/usr/local/lib -R/usr/local/lib -lxml2 -I/usr/local/include/libxml2 xpath_demo.c

Program Structure
The program has only two functions: main (which contains most of the logic) and register_namespaces (which is copied from the libxml site and handles the namespace registration).

Program Usage
The simplest usage is:
xpath_demo xml_filename xpath_expression
If the XML uses namespaces, the usage is:
xpath_demo xml_filename xpath_expression name_space_list

Pseudo Code

Each step below shows the libxml function invoked, its input/output, and the outstanding objects (allocated but not yet freed) after the step.

1. xmlParseFile
   Input: xml filename
   Output: xmlDocPtr
   Outstanding objects: xmlDocPtr

2. xmlXPathNewContext
   Input: xmlDocPtr
   Output: xmlXPathContextPtr
   Outstanding objects: xmlDocPtr, xmlXPathContextPtr

3. xmlXPathRegisterNs (only applicable for XML with namespaces)
   Input: xmlXPathContextPtr, namespace_prefix, namespace_URL
   Outstanding objects: (unchanged)

4. xmlXPathEvalExpression
   Input: XPath_Expression, xmlXPathContextPtr
   Output: xmlXPathObjectPtr
   Outstanding objects: xmlDocPtr, xmlXPathContextPtr, xmlXPathObjectPtr

5. xmlXPathFreeContext
   Outstanding objects: xmlDocPtr, xmlXPathObjectPtr

6. Check if xmlXPathNodeSetIsEmpty
   Input: xmlXPathObjectPtr->nodesetval

7. Retrieve the node
   xmlXPathObjectPtr->nodesetval->nodeTab[0]

8. Retrieve the text of the node: xmlNodeGetContent
   Input: xmlNode *
   Output: xmlChar *
   Outstanding objects: xmlDocPtr, xmlXPathObjectPtr, xmlChar * node_text

9. xmlFree(node_text) and xmlXPathFreeObject(xmlXPathObjectPtr)
   Outstanding objects: xmlDocPtr

10. Final clean up: xmlFreeDoc(xmlDocPtr); xmlCleanupParser();
   Outstanding objects: Nil

The "simplified" output for various inputs is shown below:
cat data.xml
<?xml version='1.0'?>
<Envelope>
<Header>Header_Text</Header>
<Body attribute1='funny'>
<Field1>Value1</Field1>
<Field2>Value2</Field2>
</Body>
</Envelope>

xpath_demo error.xml /FIELD1
I/O warning : failed to load external entity "error.xml"
Error: Document not parsed successfully.

xpath_demo data.xml /Envelope/Body
node-text: "
Value1
Value2
"
Remark: According to the specification, the text of a node includes the text of all its descendant nodes as well, which is why the line breaks between the child elements also show up in the output.

xpath_demo data.xml /Envelope/Body/Field1
node-text: "Value1"

xpath_demo data.xml /Envelope/Body/Field3
Empty

Cases with Name Space
Personally I do not like namespaces in XML because they are awkward.  Anyway, libxml does support them, with an additional step to register the namespace list.

An XML file (ns_data.xml) with a namespace is shown below:

<?xml version='1.0'?>
<Envelope xmlns:ns1='http://www.domain.com/ns/sample'>
<Header>Header_Text</Header>
<Body name='value'>
<ns1:Field1>Value1 in ns1</ns1:Field1>
<Field1>Value1 without NS</Field1>
</Body>
</Envelope>

A namespace ns1 is defined in the root element <Envelope>.  You can see there are two tags named Field1, one of which is in the namespace ns1.  They can be accessed separately as follows:

xpath_demo ns_data.xml /Envelope/Body/ns1:Field1 ns1=http://www.domain.com/ns/sample
node-text: "Value1 in ns1"

xpath_demo ns_data.xml /Envelope/Body/Field1 ns1=http://www.domain.com/ns/sample
node-text: "Value1 without NS"