Wednesday, February 13, 2013

How to XML-ify a tab separated text file with xsltproc (revisited)

My last post described an XSLT 1.0 solution for transforming a plain tab-separated text file into XML.

Here I present a solution that uses two features available as extensions (XSLT 2.0 provides the same capabilities natively), namely tokenize and node-set. tokenize lets me split a string into tokens in one go rather than calling substring-before and substring-after repeatedly. In a way this runs against the template thinking of XSLT, but it is of course useful. node-set is a mighty tool: it converts a variable holding a result tree fragment into a node-set, which makes it possible to use proper XPath expressions and functions on its nodes.
The xsltproc version on my Mac contains some EXSLT extensions (visible via xsltproc --dumpextensions), so here are the required namespaces, which need to be declared at the beginning of the script:
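
With the prefixes strings and common used throughout this post, the declarations would look roughly like this. It is a sketch: the URIs are the standard EXSLT ones, and the extension-element-prefixes attribute is an assumption, added to keep the two namespaces out of the generated output.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:strings="http://exslt.org/strings"
    xmlns:common="http://exslt.org/common"
    extension-element-prefixes="strings common">
  <!-- strings: gives access to strings:tokenize(),
       common:  gives access to common:node-set() -->
</xsl:stylesheet>
```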


And here is how to use them:

tokenize: I use tokenize in a for-each loop to split $someText at each $newline
<xsl:for-each select="strings:tokenize($someText,$newline)" >
node-set: Transform the contents of a variable $lines into a node-set $lineNodes
<xsl:variable name="lineNodes" select="common:node-set($lines)" />
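
To see both extensions working together outside the full script, here is a minimal sketch. The input string '3,4,5', the comma delimiter, and the names items, item and itemNodes are made up for illustration, and it assumes an xsltproc build that ships the EXSLT strings and common extensions.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:strings="http://exslt.org/strings"
    xmlns:common="http://exslt.org/common"
    extension-element-prefixes="strings common">

  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- Build a result tree fragment: one <item> per comma-separated token -->
    <xsl:variable name="items">
      <xsl:for-each select="strings:tokenize('3,4,5', ',')">
        <item><xsl:value-of select="."/></item>
      </xsl:for-each>
    </xsl:variable>

    <!-- Convert the fragment into a node-set so that XPath can navigate it -->
    <xsl:variable name="itemNodes" select="common:node-set($items)"/>

    <!-- Ordinary XPath functions now work on the generated nodes -->
    <xsl:value-of select="count($itemNodes/item)"/>
    <xsl:text> items, sum </xsl:text>
    <xsl:value-of select="sum($itemNodes/item)"/>
    <xsl:text>&#xA;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

The stylesheet ignores its input, so any well-formed XML file will do as the source document. The run should print "3 items, sum 12".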

All the work is done in the parseDelimited template, and it pretty much follows old-style programming conventions. One loop splits the complete input by newline. The first line is split by the delimiter into the header names. All other lines are then split by the delimiter into their individual fields. Everything is wrapped into elements as follows and put into a variable. The pseudo-code is already close to the implementation.

element "data"
  for each line tokenize the line by delimiter
    element "row"
      for each field n
        element "header n"
          content of field n
        end of element "header n"
      end for
    end of element "row"
  end for
end of element "data"

Here is the complete code.

<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:strings="http://exslt.org/strings"
    xmlns:common="http://exslt.org/common"
    extension-element-prefixes="strings common">
<!-- From strings we use:  tokenize
     From common we use:  node-set -->

<!-- Define delimiter and newline -->
<xsl:variable name="delim"   select="'&#x9;'" />
<xsl:variable name="newline" select="'&#xA;'" />

<!-- Define node1 and node2 for the output -->
<xsl:variable name="node1"   select="'data'" />
<xsl:variable name="node2"   select="'row'" />

<xsl:template match="/">
  <!-- Take whatever input is coming, don't care about 'fakeroot' -->
  <xsl:call-template name="root"/>
</xsl:template>

<xsl:template name="root">
    <!-- Call the line parser with the whole content of the file -->
    <xsl:call-template name="parseDelimited">
      <xsl:with-param name="delimitedText" select="." />
    </xsl:call-template>
</xsl:template>

<xsl:template name="parseDelimited">
  <xsl:param name="delimitedText" />

  <!-- Split the file content by newline, one <line> element per line -->
  <xsl:variable name="lines">
    <xsl:for-each select="strings:tokenize($delimitedText,$newline)" >
      <line><xsl:value-of select='.' /></line>
    </xsl:for-each>
  </xsl:variable>
  <!-- Create a node-set out of the previous 'lines'
       in order to be able to use them as an XPath var -->
  <xsl:variable name="lineNodes" select="common:node-set($lines)" />

  <!-- The first line containing the header fields,
       one <head> element per header -->
  <xsl:variable name="first" select='$lineNodes/line[1]' />
  <xsl:variable name="headers" >
    <xsl:for-each select="strings:tokenize($first,$delim)" >
      <head><xsl:value-of select='.' /></head>
    </xsl:for-each>
  </xsl:variable>
  <!-- Create a node-set out of the previous 'headers'
       in order to be able to use them as an XPath var -->
  <xsl:variable name="headerNodes" select="common:node-set($headers)" />

  <!-- Loop through all lines, we can do this since it is a node set.
       This creates the actual XML content -->
  <xsl:variable name="output" >
    <!-- Start tag <data> -->
    <xsl:element name="{$node1}">
    <xsl:value-of select='$newline' />

    <xsl:for-each select="$lineNodes/line">
      <!-- Skip the first line of course -->
      <xsl:if test="position() > 1">

        <!-- Start tag <row> -->
        <xsl:element name="{$node2}">

        <!-- Split the line by 'delim'
             and create an element for each entry.
             The element name is coming from the header line -->
        <xsl:for-each select="strings:tokenize(.,$delim)" >
          <xsl:variable name="p" select="position()" />
          <xsl:element name="{$headerNodes/head[$p]}">
            <!-- Print the actual content , phew! -->
            <xsl:value-of select="." />
          </xsl:element>
        </xsl:for-each>
        <!-- End tag <row> -->
        </xsl:element>
        <xsl:value-of select='$newline' />

      </xsl:if>
    </xsl:for-each>

    <!-- End tag </data> -->
    </xsl:element>
    <xsl:value-of select='$newline' />

  </xsl:variable>

  <xsl:variable name="all" select="common:node-set($output)" />
  <!-- Output of nodified elements -->
  <xsl:copy-of select="($all)/*" />

  <!-- With a node-set one can now use its advantages
       e.g. sum up all Num values -->
  <xsl:value-of select='$newline' />
  <xsl:element name="Sum_Num">
  <xsl:value-of disable-output-escaping="yes"  select="sum(common:node-set($output)/data/row/Num)"/>
  </xsl:element>

</xsl:template>
</xsl:stylesheet>


There are two interesting pieces here.
  • How to get the header names into the game? The inner for-each loop tokenizes a line. Each field has an index, which you get via position() in XSLT. An element is created and named after the header field with that same index (this works because the header line has the same number of fields as every other line): <xsl:element name="{$headerNodes/head[$p]}">. The variable 'p' is not just there for readability: inside the predicate, position() would refer to the head nodes rather than to the current field, so the position has to be stored beforehand.
  • The last lines of the template show how to use the XPath function sum to get the total of the Num fields.
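
The header lookup by position can also be tried in isolation. The following is only a sketch: the header line 'Name,Num', the data line 'apple,5' and the comma delimiter are hypothetical values chosen for illustration, not data from this post.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:strings="http://exslt.org/strings"
    xmlns:common="http://exslt.org/common"
    extension-element-prefixes="strings common">

  <xsl:output method="xml" omit-xml-declaration="yes"/>

  <xsl:template match="/">
    <!-- Hypothetical header line, turned into <head> nodes -->
    <xsl:variable name="headers">
      <xsl:for-each select="strings:tokenize('Name,Num', ',')">
        <head><xsl:value-of select="."/></head>
      </xsl:for-each>
    </xsl:variable>
    <xsl:variable name="headerNodes" select="common:node-set($headers)"/>

    <row>
      <!-- Hypothetical data line: field n is named after header n -->
      <xsl:for-each select="strings:tokenize('apple,5', ',')">
        <!-- Store the field index before entering the predicate -->
        <xsl:variable name="p" select="position()"/>
        <xsl:element name="{$headerNodes/head[$p]}">
          <xsl:value-of select="."/>
        </xsl:element>
      </xsl:for-each>
    </row>
  </xsl:template>
</xsl:stylesheet>
```

Run against any well-formed XML file, this should produce a single row element with a Name child containing apple and a Num child containing 5.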

    The complete stylesheet, call it data.xsl, needs to be fed the same wrapped input as before. Here is the wrapper script, which I omitted last time.

    # A shell wrapper for non-xml parsing with xslt
    # $FILE is expected to name the tab separated input file
    FAKEROOT=fakeroot   # Important for XML completeness but will be skipped by XSLT
    (
    echo "<?xml version=\"1.0\"?>"
    printf "<$FAKEROOT>"
    cat "$FILE"
    echo "</$FAKEROOT>"
    )  |
    xsltproc data.xsl -

    The result is as follows. Note the 71 in the last line which is the sum of Num (this makes the output non-XML, it's just there to show the possibilities).

    <?xml version="1.0"?>