Friday, November 23, 2012

A little exercise about a recent forum question (input field handling in awk and Perl)

Just recently the following question was posted to the UNIX scripting group on LinkedIn:
remove all duplicate entries in a colon separated list of strings
e.g. a: b:c: b: d:a:a: b:e::f should be transformed to a: b:c: d:e::f

Some of the fields contain spaces, which should of course be preserved in the output. There is also an empty field, which (to me and other authors) indicates that the fields are not necessarily ordered. I won't discuss the suggested solutions here, and I did not reply to the original posting because I read it a month too late.

But reading the question had already set my brain working, and I could not help trying it myself. The obvious tool for exercises like this is awk, because awk has built-in mechanisms for treating lines as a sequence of fields with a configurable field separator.

A solution could be

BEGIN {
  FS=":" ;    # field separator
  ORS=":"     # output record separator
}
{ for(i=1;i<=NF;i++) {  # for all input fields
    if( f[$i] ) {       # check if array entry for field already exists
     continue;          # if yes: go to next field
    } else {
     print $i;          # if no: print the field content
     f[$i] = 1;         # and record it in array 'f'
    }  }
}

which leads to this output:
a: b:c: d:e::f:

The script can be shortened by omitting superfluous braces and 'else' to

BEGIN { FS=":" ; ORS=":" } 
{ for(i=1;i<=NF;i++) { if(f[$i]) continue; f[$i]=1; print $i; } } 

The script uses very simple, straightforward logic: loop through all input fields; if a field is new, print it, otherwise skip it. This is achieved by storing each field in an associative array 'f' when it first occurs.
Using the field separator FS for splitting the input line and the output record separator ORS when printing (you need to know that 'print' automatically appends ORS) makes this an easy task.
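
As a small illustration of how FS and ORS interact, here is a sketch with a made-up input line. The input 'x:y:z' and the choice of "|" as ORS are assumptions made only to make the appended separator visible; they are not from the original question.

```shell
# Hypothetical demo input; "|" is used as ORS only to make it visible.
out=$(printf 'x:y:z\n' | awk 'BEGIN { FS=":"; ORS="|" } { for (i = 1; i <= NF; i++) print $i }')
printf '%s\n' "$out"
# prints: x|y|z|
```

Every 'print' ends with ORS, including the last one, which is where the trailing separator comes from.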

There is one issue though: this solution adds an extra colon at the very end (compared to the requested output). Depending on the context of the request this may or may not matter, so one might prefer this code:

BEGIN { FS=":" } 
{ printf "%s", $1; f[$1]=1;   # "%s" keeps a % inside a field literal
  for(i=2;i<=NF;i++) { if(f[$i]) continue; f[$i]=1; printf "%s%s", FS, $i } }

which uses a slightly different logic: the first field is printed straight away (and recorded), the loop checks the remaining fields 2..NF and prints the field separator as a prefix to the field content. This code also works for the extreme case where there is just one field and no colon.
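
The prefix logic described above can be tried out with a small shell wrapper. This is only a sketch: the function name dedup_fields is hypothetical, and the script passes the field through "%s" so that a '%' in the data is not taken as a format directive.

```shell
# dedup_fields is a hypothetical wrapper name, not part of the original post.
dedup_fields() {
  awk 'BEGIN { FS=":" }
  { printf "%s", $1; f[$1] = 1            # first field: print and record
    for (i = 2; i <= NF; i++) {
      if (f[$i]) continue                 # already seen: skip
      f[$i] = 1
      printf "%s%s", FS, $i               # separator goes in front of the field
    } }'
}

sample=$(printf '%s' 'a: b:c: b: d:a:a: b:e::f' | dedup_fields)
single=$(printf '%s' 'a' | dedup_fields)
printf '[%s]\n[%s]\n' "$sample" "$single"
# prints:
# [a: b:c: d:e::f]
# [a]
```

Note that the array 'f' is never reset, so the sketch assumes a single input line; with several lines, fields seen on earlier lines would be suppressed on later ones as well.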

I then wondered if this couldn't be done equivalently or even shorter in Perl but my best solution is a little bit lengthier because I have to use 'split' to get the individual fields.

$FS = ":";   # field separator, used for split and for output
@s = split($FS,<>);
for($i=0;$i<=$#s;$i++) {$e=$s[$i]; next if(exists($f{$e})); $f{$e}=1; print $e,$FS }

I could have used command line options "-a -F:" to avoid the 'split' but I need FS to be defined anyway for the output (I don't know if the split pattern defined by -F can be accessed in Perl).
I use 'split' to chop up the input line and put it into an array 's'. Then the same logic applies as in awk. In place of awk's associative array I use a Perl hash 'f'. The variable 'e' is only there to avoid repeated occurrences of $s[$i]. In the end it's a matter of personal preference which solution you take.

It should be noted that I tested with

echo -n "...." | awk '...' or perl -e '...'

which feeds a string without a trailing newline into the pipe. This let me avoid 'chomp' in Perl, which would otherwise be needed to remove the newline from the last field.
