Monday, April 8, 2013

String extracts in Perl with split, match and regular expressions

Lately I had to solve the following problem:
extract the process id (pid) and the program name from the header line of pmap output.

The strings can take these forms from simple to complex:

123:     cmd
123:     cmd -x foo
123:     /usr/bin/cmd
123:     /usr/bin/cmd -x foo
and more complex with more parameters which are trickier to parse
123:     /usr/bin/cmd -x /home/foo
123:     /usr/bin/cmd -x 456: -d /home/foo
i.e. very generally speaking there is a pid followed by a colon and then a more or less complex command line, where the program name can be fully qualified and carry a number of parameters. The last example deliberately repeats the digits-and-colon pattern as a parameter.

Here is an attempt to describe the string in words, as a sequence of

  • a number of digits
  • a colon
  • a tab
  • a program name, optionally qualified
  • optionally: an arbitrary number of space-separated parameters (could be multiple spaces)

    There are various solutions to this in Perl and here I'll show two.

    # Example string
    $str = "123:     /usr/bin/cmd -x /home/foo";
    #           ^ should be a tab here
    
    # First I split the string using an optional colon :*
    # and a sequence of white space \s+ as field delimiters.
    # This gives me the pid and the program name and strips off the parameters
    ($pid,$cmd) = split /:*\s+/,$str;
    
    # In case of a fully qualified program name
    # everything up to the last slash needs to be removed
    $cmd =~ s/.*\///;
    
    print "pid = $pid  X  cmd = $cmd\n";
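
To see how the split behaves on each of the header forms listed at the top, here is a minimal sketch that wraps the two lines in a helper and runs it over all of them. The sub name pid_and_cmd is my own choice for illustration, and the sample lines are written with an explicit \t after the colon.

```perl
use strict;
use warnings;

# Hypothetical helper (the name is an assumption, not part of pmap or any module):
# returns (pid, bare program name) using the split + substitution shown above.
sub pid_and_cmd {
    my ($str) = @_;
    my ($pid, $cmd) = split /:*\s+/, $str;   # only the first two fields are kept
    $cmd =~ s/.*\///;                        # drop everything up to the last slash
    return ($pid, $cmd);
}

# The sample header lines, with a tab after the colon
my @lines = (
    "123:\tcmd",
    "123:\tcmd -x foo",
    "123:\t/usr/bin/cmd",
    "123:\t/usr/bin/cmd -x /home/foo",
    "123:\t/usr/bin/cmd -x 456: -d /home/foo",
);

for my $line (@lines) {
    my ($pid, $cmd) = pid_and_cmd($line);
    print "pid = $pid  X  cmd = $cmd\n";
}
```

Every one of these well-formed lines should yield pid 123 and program name cmd, including the last one with 456: among the parameters.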
    

    Always looking for more concise code, I wondered whether these two lines couldn't be shortened. Here is a one-liner, which of course requires some explanation.

    # Example string
    $str = "123:\t/usr/bin/cmd -x /home/foo";
    #           ^^ the tab after the colon, written as \t
    
    # I try to match the following regular expression
    #   a sequence of digits    (\d+)    which will become $1 if successful
    #   a colon and a tab
    #   an optional sequence of characters ending in slash   (\S+\/)*   
    #                which will become $2
    #   a sequence of characters   (\S+)    which will become $3
    # The remainder of the string is ignored since
    # the regular expression is anchored only at the beginning.
    $str =~ /^(\d+):\t(\S+\/)*(\S+)/ ;
    
    print "pid = $1  X  cmd = $3\n";
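
The same pattern can also be laid out with the /x modifier, so that the explanation sits next to each piece. This is only a sketch of an alternative layout. The pattern is unchanged, the variable names are my own, and it relies on a match in list context returning the captured groups. It is run here on the trickiest sample line.

```perl
use strict;
use warnings;

my $str = "123:\t/usr/bin/cmd -x 456: -d /home/foo";

# In list context a successful match returns the three captures
my ($pid, $dir, $cmd) = $str =~ m{
    ^ (\d+)      # capture 1: the pid, digits at the start of the line
    : \t         # a colon and a tab
    (\S+ /)*     # capture 2: optional directory part ending in a slash
    (\S+)        # capture 3: the bare program name
}x;

print "pid = $pid  X  cmd = $cmd\n";
```

One caveat with /x: variables are still interpolated inside the comments, so the comments say "capture 1" rather than naming the capture variables.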
    

    For readability I would have preferred the first version, but on a closer look I found a flaw in it, namely the handling of malformed strings. Take the string below, where the colon is missing and an extra word sits between the pid and the program name:

    $str = "123 xyz        /usr/bin/cmd -x 456:  /home/foo";
    
    The two versions produce:
    # Code 1
    pid = 123 xyz /usr/bin/cmd -x 456  X  cmd = foo
    
    # Code 2
    pid = /home/  X  cmd =
    
    In both cases the extraction goes wrong, with unpredictable results.
    The second version, though, can be turned to advantage by adding a check:
    if( $str =~ /^(\d+):\t(\S+\/)*(\S+)/ ) {
      print "pid = $1  X  cmd = $3\n";
    }
    
    i.e. only when the regular expression actually matches do I use its captures; a failed match sets neither $1 nor $3, so printing them unchecked is meaningless. The check gives me assurance.
    I can't do this with the split in the first version other than by a post-check, e.g. testing whether the pid really consists of digits, which would lengthen the code.
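
For comparison, such a post-check on a split-based version might look like the sketch below. It is not the code from above: the helper name split_checked is hypothetical, and I assume a split pattern that requires the colon, so that a missing colon shows up as a non-numeric pid.

```perl
use strict;
use warnings;

# Hypothetical helper: split-based extraction followed by a validity check.
# Returns (pid, cmd) on success and an empty list for a malformed line.
sub split_checked {
    my ($str) = @_;

    # Assumption: split once, at the first colon followed by white space
    my ($pid, $rest) = split /:\s+/, $str, 2;

    # Post-check: the pid must be all digits and something must follow it
    return unless defined $pid  && $pid =~ /^\d+$/;
    return unless defined $rest && length $rest;

    my ($cmd) = split ' ', $rest;   # first word of the command line
    $cmd =~ s/.*\///;               # strip the directory part
    return ($pid, $cmd);
}

for my $str ( "123:\t/usr/bin/cmd -x /home/foo",
              "123 xyz        /usr/bin/cmd -x 456:  /home/foo" ) {
    if ( my ($pid, $cmd) = split_checked($str) ) {
        print "pid = $pid  X  cmd = $cmd\n";
    }
    else {
        print "rejected: $str\n";
    }
}
```

The well-formed line is accepted and the malformed one is rejected, but the extraction has grown from two lines to a small subroutine.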

    So I decided to use the regular expression in my code, since it is still fairly readable when extracting just three parts of the overall string.
    If I wanted to extract more, say five or eight components, I would probably fall back to split and a subsequent validity check.
