I used the ususal UNIX tools or Perl or what and I always wondered whether the file manipulation couldn't be done with XSLT as well. I never took the time to really look into it since my normal toolset worked so well and I didn't see the need though the curiosity persisted.
So assume you have this file which consists of a header line and tab separated data (it is not really important what the entries mean, and you can see one of the well-known issues of human-readability: the column headers do not sit exactly on top of their columns in contrast to fixed field files.)
Date Num Duration 20120415 13 2310 20120510 9 1470 20120526 16 3817 20120701 5 2269 20120831 28 4505and you want to transform it into an XML file (something I would describe as XML-ifying though I'm not sure this term exists elsewhere) like this
<?xml version="1.0"?> <data> <row><Date>20120415</Date><Num>13</Num><Duration>2310</Duration></row> <row><Date>20120510</Date><Num>9</Num><Duration>1470</Duration></row> <row><Date>20120526</Date><Num>16</Num><Duration>3817</Duration></row> <row><Date>20120701</Date><Num>5</Num><Duration>2269</Duration></row> <row><Date>20120831</Date><Num>28</Num><Duration>4505</Duration></row> </data>You can see that I want the header fields in the first line to become the enclosing tags for the data. For lack of something definite the root is called
<data> and the various entries are <row>.
The first important consideration when working with XSLT is the question of platform and tool which will determine the XSLT version and probable extensions.
In my case that was MacOS 10.5.8 and its standarad xsltproc was on libs with XSLT 1.0.
The second important issue is of course that XSLT requires XML input which my delimited text isn't at all.
So can I be successful and what do I need?
In XSLT terms all I have is a big string.
XSLT will happily recognize it if it is wrapped a little to make it look like XML. So putting the following around the file (one can use whatever tool fits, I used simple shell echo/printf) will get XSLT started. The node name 'fakeroot' is not important, the script will not check for it.
<?xml version="1.0"?> <fakeroot> ... </fakeroot>
Now XSLT needs to parse the string properly. XLST 2.0 or extensions of 1.0 (e.g. EXSLT 'common') would have provided nice string functions like tokenize or node-set which would easily dissect the string and allow its chunks to be used in various ways. I wanted a pure XSLT 1.0 solution though. And this meant to handcraft the tokenization and also somehow manage the header line whose fields should become the node names for each line.
The solution takes into account that the big input string can be split by newlines (they are part of the content) into lines and each line can then further be split by the delimiting character. Aside from the first 'match="/"' template I need three templates.
The functions substring-before() and substring-after() and the recursive call of templates are the main ingredients in this whole process.
<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<!-- Define delimiter, newline and root -->
<xsl:variable name="delim" select="'	'" />
<xsl:variable name="newline" select="'
'" />
<!-- Define node1 and node2 for the output -->
<xsl:variable name="node1" select="'data'" />
<xsl:variable name="node2" select="'row'" />
<xsl:template match="/">
<!-- Take whatever input is coming, don't care about 'fakeroot' -->
<xsl:call-template name="root"/>
</xsl:template>
<xsl:template name="root">
<xsl:value-of disable-output-escaping="yes" select="concat('<',$node1,'>')" />
<xsl:value-of select="$newline" />
<xsl:call-template name="parseDelimited">
<!-- Chop off the first line with the header fields
It will be passed as a parameter to all other templates -->
<xsl:with-param name="headerLine" select="substring-before(.,$newline)" />
<xsl:with-param name="delimitedText" select="substring-after(.,$newline)" />
</xsl:call-template>
<xsl:value-of disable-output-escaping="yes" select="concat('</',$node1,'>')" />
</xsl:template>
<xsl:template name="parseDelimited">
<xsl:param name="headerLine" />
<xsl:param name="delimitedText" />
<xsl:variable name="line" select="substring-before($delimitedText,$newline)" />
<xsl:variable name="remaining" select="substring-after($delimitedText,$newline)" />
<!-- Handle one line which has been chopped off -->
<xsl:if test="string-length($line) > 0">
<xsl:value-of disable-output-escaping="yes" select="concat('<',$node2,'>')" />
<xsl:call-template name="parseLine">
<xsl:with-param name="headerLine" select="concat($headerLine,$delim)" />
<xsl:with-param name="line" select="concat($line,$delim)" />
</xsl:call-template>
<xsl:value-of disable-output-escaping="yes" select="concat('</',$node2,'>')" />
<xsl:value-of select="$newline" />
</xsl:if>
<!-- Call the template recursively with the remaining lines -->
<xsl:if test="string-length($remaining) > 0">
<xsl:call-template name="parseDelimited">
<xsl:with-param name="headerLine" select="$headerLine" />
<xsl:with-param name="delimitedText" select="$remaining" />
</xsl:call-template>
</xsl:if>
</xsl:template>
<xsl:template name="parseLine">
<xsl:param name="headerLine" />
<xsl:param name="line" />
<!-- Retrieve the fields before the delimiter -->
<xsl:variable name="fieldName" select="substring-before($headerLine,$delim)" />
<xsl:variable name="field" select="substring-before($line,$delim)" />
<xsl:if test="string-length($fieldName) > 0">
<!-- This is the actual output -->
<xsl:value-of disable-output-escaping="yes" select="concat('<',$fieldName,'>',$field,'</',$fieldName,'>')" />
<!-- Call the template recursively with the remaining fields -->
<xsl:call-template name="parseLine">
<xsl:with-param name="headerLine" select="substring-after($headerLine,$delim)" />
<xsl:with-param name="line" select="substring-after($line,$delim)" />
</xsl:call-template>
</xsl:if>
</xsl:template>
</xsl:stylesheet>
There are a few tricks being used here.
Of course one can add newlines, indents etc. to change the look but that is not important for this exercise as well as parameterize the script: the delimiter could be passed as --stringparam to xsltproc in order to make it more flexible and also the output node names 'data' and 'row' could be coming from command line parameters in a true production script.
Now looking at this of course I could have achieved the XML-ifying much simpler with this little awk script which took me less than 10 minutes to write. The XSLT exercise was nice but since this was eventually all about string handling one is probably still better off with the traditional tools.
BEGIN { FS="TAB"; } # Put in the real TAB character here
NR==1 { split($0,header,FS); printf "<data>\n"}
NR>1 {
printf "<row>";
for(i=1;i<=NF;i++) printf "<%s>%s</%s>", header[i],$i,header[i];
printf "</row>\n";
}
END { printf "</data>\n" }
which will result too in
<data> <row><Date>20120415</Date><Num>13</Num><Duration>2310</Duration></row> <row><Date>20120510</Date><Num>9</Num><Duration>1470</Duration></row> <row><Date>20120526</Date><Num>16</Num><Duration>3817</Duration></row> <row><Date>20120701</Date><Num>5</Num><Duration>2269</Duration></row> <row><Date>20120831</Date><Num>28</Num><Duration>4505</Duration></row> </data>
No comments:
Post a Comment