Wednesday, February 2, 2011

Extracting strings: field delimited vs. regular expression solutions

Imagine that you have the following string describing a directory name
  tmp dir: /var/tmp/somedir
and you want to extract the directory name.

In shell scripts one can often find quick solutions like
... | awk -F: '{print $2}'
or
... | awk -F: '{print $NF}'
each leading to
 /var/tmp/somedir
(note the leading blank).

This solution fails as soon as the directory name contains a colon:
echo "  tmp dir: /var/tmp/some:dir"  | awk -F: '{print $2}'
 /var/tmp/some

echo "  tmp dir: /var/tmp/some:dir"  | awk -F: '{print $NF}'
dir

The awk solution assumes that the input line is a set of fields separated by a well-defined delimiter that can never be part of a field value. (One could equally use cut or any other tool that handles field-delimited input.)
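To illustrate the remark about cut, here is a minimal sketch of the same field-based extraction. The variable names line and second_field are only chosen for this example; the point is that cut truncates the value at the embedded colon just like awk does.

```shell
# Field-based extraction with cut instead of awk.
# -d: sets the field delimiter, -f2 selects the second field.
line="  tmp dir: /var/tmp/some:dir"
second_field=$(printf '%s\n' "$line" | cut -d: -f2)

# The brackets make the leading blank and the truncation visible.
printf '[%s]\n' "$second_field"
```

This should print [ /var/tmp/some], i.e. the same truncated value (with leading blank) as awk -F: '{print $2}'.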

Though this looks nice and easy, there is a safer solution that does not rely on this assumption but uses a regular expression to extract the necessary data:
s/[^:]*: //
   Remove everything up to and including the first colon and the space after it.
s/[^:]*tmp dir: //
   This is more precise if you know that the input line always contains
   the string 'tmp dir: ':
   remove everything up to and including the first 'tmp dir: '.

Or, if you want to capture the remainder explicitly, use something like this:
s/[^:]*: \(.*\)/\1/
   (this is the sed syntax; in Perl the group is written (.*) and referenced as $1)

Both can be used in sed, Perl, or any other tool that handles strings and regular expressions, and they have the nice side effect of removing the blank at the beginning of the resulting string.

echo "  tmp dir: /var/tmp/some:dir"  | sed 's/[^:]*: //'
/var/tmp/some:dir

echo "  tmp dir: /var/tmp/some:dir"  | perl -n -e 's/[^:]*: //; print'
/var/tmp/some:dir
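The two other expressions from above can be tried the same way. This is only a sketch; the variable names line, by_label and by_group are made up for the example.

```shell
line="  tmp dir: /var/tmp/some:dir"

# Anchor on the literal label 'tmp dir: ' instead of just the first colon.
by_label=$(printf '%s\n' "$line" | sed 's/[^:]*tmp dir: //')

# Capture the remainder in a group and replace the whole line with it.
by_group=$(printf '%s\n' "$line" | sed 's/[^:]*: \(.*\)/\1/')

printf '%s\n' "$by_label" "$by_group"
```

Both variants should print /var/tmp/some:dir, with the colon inside the directory name left intact.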

The field delimiter solution seems more accessible, and people are sometimes afraid of regular expressions. But the regular expression solution achieves the same result and gives you more freedom: the field delimiter may turn up as part of the value, just in case.

It is also important to test how a solution behaves when the input does not follow the expected format. In the case above: what happens if there is no colon in the input line?
echo "  tmp dir /var/tmp/somedir" | awk -F: '{print $2}'
     <-- There is an empty string here

echo "  tmp dir /var/tmp/somedir" | awk -F: '{print $NF}'
  tmp dir /var/tmp/somedir
     This is a copy of the original input line

echo "  tmp dir /var/tmp/somedir" | sed 's/[^:]*: //'
  tmp dir /var/tmp/somedir
     This is a copy of the original input line

echo "  tmp dir /var/tmp/somedir" | perl -n -e 's/[^:]*: //; print'
  tmp dir /var/tmp/somedir
     This is a copy of the original input line

You need to take this into account when you use the result of the extraction later in a script. If you want the regular expression solution to yield an empty string rather than the original line when the input does not adhere to the format, do this:
echo "  tmp dir /var/tmp/somedir" | sed -n 's/[^:]*: //p'
     <-- There is no output here (an empty string when captured in a variable)
     The -n ensures that sed prints only what you ask for,
     the p prints the changed input line if the substitution succeeded:
     no match, no output

echo "  tmp dir /var/tmp/somedir" | perl -n -e 's/[^:]*: // && print'
     In Perl you can simply print only if the preceding substitution was successful
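
As a rough sketch of how the "no match, no output" behaviour can be used further on in a script: the function name extract_dir and the messages are hypothetical, and the check assumes that a valid line never has an empty value after the colon.

```shell
# Hypothetical helper: prints the value after the first ': ',
# or nothing at all if the line contains no ': '.
extract_dir() {
    printf '%s\n' "$1" | sed -n 's/[^:]*: //p'
}

for line in "  tmp dir: /var/tmp/some:dir" "  tmp dir /var/tmp/somedir"; do
    dir=$(extract_dir "$line")
    if [ -z "$dir" ]; then
        # Nothing was extracted: the line did not have the expected format.
        echo "unexpected format: '$line'" >&2
    else
        echo "directory: $dir"
    fi
done
```

With the plain sed 's/[^:]*: //' (no -n, no p) the second line would be passed on unchanged, and the script would treat the whole input line as a directory name.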
