Andreas' Technical Tidbits: String extracts in Perl with split, match and regular expressions

Monday, April 8, 2013

String extracts in Perl with split, match and regular expressions

Lately I had to solve the following problem:
extract the process id (pid) and the program name from the header line of pmap output.

The strings can take these forms from simple to complex:

123:     cmd
123:     cmd -x foo
123:     /usr/bin/cmd
123:     /usr/bin/cmd -x foo

and more complex ones with more parameters, which are trickier to parse:

123:     /usr/bin/cmd -x /home/foo
123:     /usr/bin/cmd -x 456: -d /home/foo

I.e., very generally speaking, there is a pid followed by a colon and then a more or less complex command line, where the program name can be fully qualified and carry a number of parameters. The last example deliberately introduces digits and a colon again as a parameter.

Here is an attempt to describe the string in words, as a sequence of

a number of digits

a colon

a tab

a program name, optionally qualified

optionally: an arbitrary number of space-separated parameters (could be multiple spaces)

There are various solutions to this in Perl; here I'll show two.

# Example string (\t is the tab after the colon)
$str = "123:\t/usr/bin/cmd -x /home/foo";

# First I split the string using an optional colon :* 
# and a sequence of white space \s+ as field delimiters.
# This will give me the pid and the program name and strip off the parameters
($pid,$cmd) = split /:*\s+/,$str;

# In case of a fully qualified program name
# everything up to the last slash needs to be removed
$cmd =~ s/.*\///;

print "pid = $pid  X  cmd = $cmd\n";
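To see how the split copes with the sample lines from the top of the post, including the one that repeats digits and a colon among the parameters, here is a small sketch. The sub name split_header is my own wrapper around the two statements above, and the tab is written as \t.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the split and the substitution shown above.
sub split_header {
    my ($str) = @_;
    my ($pid, $cmd) = split /:*\s+/, $str;
    $cmd =~ s/.*\///;    # drop everything up to the last slash
    return ($pid, $cmd);
}

for my $line (
    "123:\tcmd",
    "123:\t/usr/bin/cmd -x foo",
    "123:\t/usr/bin/cmd -x 456: -d /home/foo",
) {
    my ($pid, $cmd) = split_header($line);
    print "pid = $pid  X  cmd = $cmd\n";
}
```

All three lines come out as pid 123 and program name cmd: only the first two fields of the split are used, so the 456: further to the right does no harm.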

Always looking for more concise code, I wondered whether these two lines couldn't be shortened. Here is a one-liner, which of course requires some explanation.

# Example string (\t is the tab after the colon)
$str = "123:\t/usr/bin/cmd -x /home/foo";

# I try to match the following regular expression:
#   a sequence of digits    (\d+)    which will become $1 if successful
#   a colon and a tab
#   an optional sequence of characters ending in a slash   (\S+\/)*
#                which will become $2
#   a sequence of non-space characters   (\S+)    which will become $3
# The remainder of the string is ignored: the regular expression is
# anchored only at the beginning and (\S+) stops at the first white space.
$str =~ /^(\d+):\t(\S+\/)*(\S+)/ ;

print "pid = $1  X  cmd = $3\n";
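As an aside, the explanation can also live inside the pattern itself. The sketch below spreads the same regular expression out with the /x modifier and picks up the captures through a list assignment instead of $1 and $3. The names $header_re, $pid and $cmd are mine. Note that comments inside such a pattern must not contain $ or @, since they would be interpolated.

```perl
use strict;
use warnings;

# Same pattern as above, written with /x so that white space and
# comments inside the pattern are ignored.
my $header_re = qr{
    ^ (\d+)     # a sequence of digits: first capture, the pid
    : \t        # a colon and a tab
    (\S+ /)*    # optional path prefix ending in a slash: second capture
    (\S+)       # the program name: third capture
}x;

my $str = "123:\t/usr/bin/cmd -x /home/foo";

# In list context a successful match returns its captures;
# the second one (the path prefix) is skipped with undef.
my ($pid, undef, $cmd) = $str =~ $header_re;

print "pid = $pid  X  cmd = $cmd\n";
```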

For readability I would have preferred the first code, but on closer inspection I found a flaw in it: the handling of malformed strings. Take the string below, where the colon is missing and an extra word sits between pid and program name:

$str = "123 xyz        /usr/bin/cmd -x 456:  /home/foo";

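Run as shown above, the two codes print

# Code 1
pid = 123  X  cmd = xyz

# Code 2
pid =   X  cmd =

Code 1 splits at every run of white space, so it quietly takes xyz for the program name. Code 2 does not match at all, so $1 and $3 are not set by it; what gets printed is whatever they held before (nothing in a fresh script; my own test run printed pid = /home/, presumably a leftover from an earlier match).
In both cases the extraction goes wrong, with unforeseeable results.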
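What makes the unchecked match so unpredictable is that Perl's capture variables keep the values of the last successful match; a failed match does not reset them. A minimal sketch of the effect, using the two strings from this post (the variable names are mine):

```perl
use strict;
use warnings;

my $re   = qr/^(\d+):\t(\S+\/)*(\S+)/;
my $good = "123:\t/usr/bin/cmd -x /home/foo";
my $bad  = "123 xyz        /usr/bin/cmd -x 456:  /home/foo";

$good =~ $re;    # succeeds and sets the capture variables
$bad  =~ $re;    # fails and leaves them exactly as they were

# These are still the values captured from $good
my ($pid, $cmd) = ($1, $3);
print "pid = $pid  X  cmd = $cmd\n";
```

The print reports pid 123 and program name cmd although the string matched last contains no valid header at all.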
The second code can be turned to advantage, though, by checking whether the match succeeded.

if( $str =~ /^(\d+):\t(\S+\/)*(\S+)/ ) {
  print "pid = $1  X  cmd = $3\n";
}

I.e., I use the captured values only when the regular expression has really matched. The check gives me assurance.
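Wrapped in a small sub, the checked match lets the caller tell good lines from bad ones. This is only a sketch; the name parse_pmap_header and the convention of returning an empty list on failure are my assumptions, not something pmap or Perl prescribes.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (pid, cmd) on success, an empty list otherwise.
sub parse_pmap_header {
    my ($str) = @_;
    if ( $str =~ /^(\d+):\t(\S+\/)*(\S+)/ ) {
        return ($1, $3);
    }
    return;    # no match: empty list in list context
}

for my $str (
    "123:\t/usr/bin/cmd -x 456: -d /home/foo",
    "123 xyz        /usr/bin/cmd -x 456:  /home/foo",
) {
    if ( my ($pid, $cmd) = parse_pmap_header($str) ) {
        print "pid = $pid  X  cmd = $cmd\n";
    }
    else {
        print "not a pmap header line: $str\n";
    }
}
```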
I can't do this with the split in the first code, other than by a post-check of whether the pid really consists of digits etc., which would lengthen the code.
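For comparison, here is a sketch of what such a post-check could look like. The sub name split_and_check and the choice of checks (two fields present, pid made of digits only) are my reading of the "etc." above, not a complete validation.

```perl
use strict;
use warnings;

# Hypothetical helper: split as in the first code, then post-check.
sub split_and_check {
    my ($str) = @_;
    my ($pid, $cmd) = split /:*\s+/, $str;

    return unless defined $pid && defined $cmd;    # need two fields
    return unless $pid =~ /^\d+$/;                 # pid: digits only

    $cmd =~ s/.*\///;    # drop everything up to the last slash
    return ($pid, $cmd);
}

for my $str (
    "123:\t/usr/bin/cmd -x foo",
    "abc:\t/usr/bin/cmd",
    "123 xyz        /usr/bin/cmd -x 456:  /home/foo",
) {
    my @result = split_and_check($str);
    if (@result) {
        print "pid = $result[0]  X  cmd = $result[1]\n";
    }
    else {
        print "rejected: $str\n";
    }
}
```

The second string is rejected, but the third one still slips through as pid 123 and program name xyz, because its first field happens to be all digits. Catching it would need a further check that the colon is really there, and by then the split version is no shorter than the regular expression.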

So I decided to use the regular expression in my code, since it is still fairly readable when extracting just three parts of the overall string.
If I wanted to extract more, say five or eight components, I would probably fall back to the split and a subsequent validity check.
