Tuesday, February 5, 2013

How to XML-ify a tab separated text file with xsltproc

In the past I have done a lot with plain - somehow delimited - text files, CSV where the comma can stand for any kind of character to play the delimiter.

I used the ususal UNIX tools or Perl or what and I always wondered whether the file manipulation couldn't be done with XSLT as well. I never took the time to really look into it since my normal toolset worked so well and I didn't see the need though the curiosity persisted.

So assume you have this file which consists of a header line and tab separated data (it is not really important what the entries mean, and you can see one of the well-known issues of human-readability: the column headers do not sit exactly on top of their columns in contrast to fixed field files.)

Date    Num     Duration
20120415        13      2310
20120510        9       1470
20120526        16      3817
20120701        5       2269
20120831        28      4505
and you want to transform it into an XML file (something I would describe as XML-ifying though I'm not sure this term exists elsewhere) like this
<?xml version="1.0"?>
<data>
<row><Date>20120415</Date><Num>13</Num><Duration>2310</Duration></row>
<row><Date>20120510</Date><Num>9</Num><Duration>1470</Duration></row>
<row><Date>20120526</Date><Num>16</Num><Duration>3817</Duration></row>
<row><Date>20120701</Date><Num>5</Num><Duration>2269</Duration></row>
<row><Date>20120831</Date><Num>28</Num><Duration>4505</Duration></row>
</data>
You can see that I want the header fields in the first line to become the enclosing tags for the data. For lack of something definite the root is called <data> and the various entries are <row>.

The first important consideration when working with XSLT is the question of platform and tool which will determine the XSLT version and probable extensions.
In my case that was MacOS 10.5.8 and its standarad xsltproc was on libs with XSLT 1.0.

The second important issue is of course that XSLT requires XML input which my delimited text isn't at all.

So can I be successful and what do I need?
In XSLT terms all I have is a big string. XSLT will happily recognize it if it is wrapped a little to make it look like XML. So putting the following around the file (one can use whatever tool fits, I used simple shell echo/printf) will get XSLT started. The node name 'fakeroot' is not important, the script will not check for it.

<?xml version="1.0"?>
<fakeroot>
...
</fakeroot>

Now XSLT needs to parse the string properly. XLST 2.0 or extensions of 1.0 (e.g. EXSLT 'common') would have provided nice string functions like tokenize or node-set which would easily dissect the string and allow its chunks to be used in various ways. I wanted a pure XSLT 1.0 solution though. And this meant to handcraft the tokenization and also somehow manage the header line whose fields should become the node names for each line.

The solution takes into account that the big input string can be split by newlines (they are part of the content) into lines and each line can then further be split by the delimiting character. Aside from the first 'match="/"' template I need three templates.

  • the first template 'match="fakeroot"' takes the whole input, chops off the first line and passes the remaining lines to the next template
  • the second template 'parseDelimited' does the split by newline: it chops off one line and feeds it to the next template and then calls itself recursively with one line less
  • the third template 'parseLine' does the split by delimiter: it chops off the content before the first delimiter and slso the header line before the delimiter,puts out the <tag>...</tag> entries and calls itself recursively with the remainder of the line after the delimiter

    The functions substring-before() and substring-after() and the recursive call of templates are the main ingredients in this whole process.

    <?xml version="1.0" encoding="ISO-8859-1"?>
    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    
    <!-- Define delimiter, newline and root -->  
    <xsl:variable name="delim"   select="'&#x9;'" />
    <xsl:variable name="newline" select="'&#xA;'" />
    
    <!-- Define node1 and node2 for the output -->  
    <xsl:variable name="node1"   select="'data'" />
    <xsl:variable name="node2"   select="'row'" />
    
          
    <xsl:template match="/">
      <!-- Take whatever input is coming, don't care about 'fakeroot' -->
      <xsl:call-template name="root"/>
    </xsl:template>
    
    <xsl:template name="root">
      <xsl:value-of disable-output-escaping="yes" select="concat('&lt;',$node1,'&gt;')" />
      <xsl:value-of select="$newline" />
      <xsl:call-template name="parseDelimited">
        <!-- Chop off the first line with the header fields
             It will be passed as a parameter to all other templates -->
        <xsl:with-param name="headerLine" select="substring-before(.,$newline)" />
        <xsl:with-param name="delimitedText" select="substring-after(.,$newline)" />
      </xsl:call-template>
      <xsl:value-of disable-output-escaping="yes" select="concat('&lt;/',$node1,'&gt;')" />
    </xsl:template>
      
        
    <xsl:template name="parseDelimited">   
      <xsl:param name="headerLine" />
      <xsl:param name="delimitedText" />
    
      <xsl:variable name="line" select="substring-before($delimitedText,$newline)" />
      <xsl:variable name="remaining" select="substring-after($delimitedText,$newline)" />
         
      <!-- Handle one line which has been chopped off -->
      <xsl:if test="string-length($line) > 0">
        <xsl:value-of disable-output-escaping="yes" select="concat('&lt;',$node2,'&gt;')" />
    
        <xsl:call-template name="parseLine">
          <xsl:with-param name="headerLine" select="concat($headerLine,$delim)" />
          <xsl:with-param name="line" select="concat($line,$delim)" />
        </xsl:call-template>
    
        <xsl:value-of disable-output-escaping="yes" select="concat('&lt;/',$node2,'&gt;')" />
        <xsl:value-of select="$newline" />
      </xsl:if>
    
      <!-- Call the template recursively with the remaining lines -->
      <xsl:if test="string-length($remaining) > 0">
        <xsl:call-template name="parseDelimited">
          <xsl:with-param name="headerLine" select="$headerLine" />
          <xsl:with-param name="delimitedText" select="$remaining" />    
        </xsl:call-template>
      </xsl:if>
    
    </xsl:template>
    
    
    <xsl:template name="parseLine">
      <xsl:param name="headerLine" />
      <xsl:param name="line" />
    
      <!-- Retrieve the fields before the delimiter -->
      <xsl:variable name="fieldName" select="substring-before($headerLine,$delim)" />
      <xsl:variable name="field" select="substring-before($line,$delim)" />
    
      <xsl:if test="string-length($fieldName) > 0">
        <!-- This is the actual output -->
        <xsl:value-of disable-output-escaping="yes" select="concat('&lt;',$fieldName,'&gt;',$field,'&lt;/',$fieldName,'&gt;')" />
    
        <!-- Call the template recursively with the remaining fields -->
        <xsl:call-template name="parseLine">
          <xsl:with-param name="headerLine" select="substring-after($headerLine,$delim)" />
          <xsl:with-param name="line" select="substring-after($line,$delim)" />
        </xsl:call-template>
      </xsl:if>
    
    </xsl:template>
    
    </xsl:stylesheet>
    

    There are a few tricks being used here.

  • the header line is always passed as a parameter to the templates because it cannot be stored in a global variable or an array (that is XSLT after all, not an imperative programming language)
  • in order to catch the last field I simply append a delimiter at the end of each line. This ensures that substring-before will always catch something. Otherwise I would have had to use some if-else-logic to handle the case of the last field

    Of course one can add newlines, indents etc. to change the look but that is not important for this exercise as well as parameterize the script: the delimiter could be passed as --stringparam to xsltproc in order to make it more flexible and also the output node names 'data' and 'row' could be coming from command line parameters in a true production script.


    Now looking at this of course I could have achieved the XML-ifying much simpler with this little awk script which took me less than 10 minutes to write. The XSLT exercise was nice but since this was eventually all about string handling one is probably still better off with the traditional tools.
    BEGIN { FS="TAB"; }      # Put in the real TAB character here
    NR==1 { split($0,header,FS); printf "<data>\n"}
    NR>1  {
            printf "<row>";
            for(i=1;i<=NF;i++)  printf "<%s>%s</%s>", header[i],$i,header[i];
            printf "</row>\n";
    }
    END   { printf "</data>\n" }
    
    which will result too in
    <data>
    <row><Date>20120415</Date><Num>13</Num><Duration>2310</Duration></row>
    <row><Date>20120510</Date><Num>9</Num><Duration>1470</Duration></row>
    <row><Date>20120526</Date><Num>16</Num><Duration>3817</Duration></row>
    <row><Date>20120701</Date><Num>5</Num><Duration>2269</Duration></row>
    <row><Date>20120831</Date><Num>28</Num><Duration>4505</Duration></row>
    </data>
    
  • No comments:

    Post a Comment