Monday, April 8, 2013

String extracts in Perl with split, match and regular expressions

Recently I had to solve the following problem:
extract the process ID (pid) and the program name from the header line of pmap output.

The strings can take these forms from simple to complex:

123:     cmd
123:     cmd -x foo
123:     /usr/bin/cmd
123:     /usr/bin/cmd -x foo
and more complex forms, with more parameters, which are trickier to parse:
123:     /usr/bin/cmd -x /home/foo
123:     /usr/bin/cmd -x 456: -d /home/foo
I.e., very generally speaking, there is a pid followed by a colon and then a more or less complex command line, in which the program name can be fully qualified and can carry any number of parameters. The last example deliberately repeats the digits-and-colon pattern as a parameter.

Here is an attempt to describe the string in words, as a sequence of

  • a number of digits
  • a colon
  • a tab
  • a program name, optionally qualified
  • optionally: an arbitrary number of space-separated parameters (there could be multiple spaces)

    There are various solutions to this in Perl; here I'll show two.

    # Example string
    $str = "123:\t/usr/bin/cmd -x /home/foo";
    # (\t is the tab after the colon)
    # First I split the string using an optional colon :*
    # and a sequence of white space \s+ as field delimiters.
    # This gives me the pid and the program name and strips off the parameters.
    ($pid,$cmd) = split /:*\s+/,$str;
    # In case of a fully qualified program name
    # everything up to the last slash needs to be removed
    $cmd =~ s/.*\///;
    print "pid = $pid  X  cmd = $cmd\n";
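To check the split approach against all the forms listed at the top, it can be wrapped in a small helper and run over each sample. This is only a sketch: the name `pid_cmd_split` is made up for illustration, and the sample strings assume a real tab (`\t`) after the colon.

```perl
use strict;
use warnings;

# Hypothetical helper: the split-based extraction from the text.
sub pid_cmd_split {
    my ($line) = @_;
    # Fields are separated by optional colons followed by white space;
    # only the first two fields are kept, the parameters are dropped.
    my ($pid, $cmd) = split /:*\s+/, $line;
    # Remove everything up to the last slash of a qualified name.
    $cmd =~ s{.*/}{};
    return ($pid, $cmd);
}

for my $line ("123:\tcmd",
              "123:\tcmd -x foo",
              "123:\t/usr/bin/cmd",
              "123:\t/usr/bin/cmd -x /home/foo",
              "123:\t/usr/bin/cmd -x 456: -d /home/foo") {
    my ($pid, $cmd) = pid_cmd_split($line);
    print "pid = $pid  X  cmd = $cmd\n";   # pid = 123  X  cmd = cmd, each time
}
```

The helper does no validation at all, so it is only suitable for lines that are known to be well formed.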

    Always looking for more concise code, I wondered whether these two lines could be shortened. Here is a one-liner, which of course requires some explanation.

    # Example string
    $str = "123:\t/usr/bin/cmd -x /home/foo";
    # (\t is the tab after the colon)
    # I try to match the following regular expression:
    #   a sequence of digits    (\d+)    which will become $1 if successful
    #   a colon and a tab
    #   an optional sequence of characters ending in a slash   (\S+\/)*
    #                which will become $2
    #   a sequence of non-blank characters   (\S+)    which will become $3
    # The remainder of the string is ignored because the regular expression
    # is anchored only at the beginning, not at the end.
    $str =~ /^(\d+):\t(\S+\/)*(\S+)/ ;
    print "pid = $1  X  cmd = $3\n";
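The same expression can also be written with the /x modifier, so that the explanation lives inside the pattern itself. The following is a sketch under the same assumption of a tab after the colon; the helper name `pid_cmd_match` is hypothetical. Note that the comments inside the pattern avoid the `$` sign, because Perl interpolates variables in a pattern before it looks at /x comments.

```perl
use strict;
use warnings;

# Hypothetical helper: regex-based extraction, same pattern as in the
# text, spread out with /x so that each part carries its own comment.
sub pid_cmd_match {
    my ($line) = @_;
    return unless $line =~ m{
        ^ (\d+)      # capture 1: the pid, a sequence of digits
        : \t         # a colon and a tab
        (\S+ /)*     # capture 2: optional path parts, each ending in a slash
        (\S+)        # capture 3: the program name
    }x;
    return ($1, $3);
}

for my $line ("123:\tcmd",
              "123:\t/usr/bin/cmd",
              "123:\t/usr/bin/cmd -x 456: -d /home/foo") {
    my ($pid, $cmd) = pid_cmd_match($line);
    print "pid = $pid  X  cmd = $cmd\n";   # pid = 123  X  cmd = cmd, each time
}
```

The helper returns an empty list when the line does not match, so the caller can tell a failed match from a successful one.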

    For readability I would have preferred the first code, but on closer inspection I found a flaw in it: the handling of malformed strings. Take the string below, where the colon after the pid is missing and an extra word sits between the pid and the program name:

    $str = "123 xyz        /usr/bin/cmd -x 456:  /home/foo";
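    With the split pattern shown above, the two codes result in
    # Code 1
    pid = 123  X  cmd = xyz
    # Code 2
    pid = /home/  X  cmd =
    In both cases the result is wrong and nothing indicates it. Code 1 splits at the first blank and takes "xyz" for the program name. In Code 2 the expression does not match at all, so the print shows stale values rather than anything taken from this string.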
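A failed match does not reset Perl's capture variables: $1, $2 and $3 keep the values of the last successful match in the same scope. That is why printing them unchecked is misleading. A minimal sketch of this behaviour follows; the first, well-formed string is made up for the demonstration.

```perl
use strict;
use warnings;

my $re = qr/^(\d+):\t(\S+\/)*(\S+)/;

# A well-formed line (made up for this demonstration) matches and sets $1.
"999:\t/bin/old" =~ $re;
my $before = $1;

# The malformed line does not match, so $1 is left untouched.
my $ok = "123 xyz        /usr/bin/cmd -x 456:  /home/foo" =~ $re;
my $after = $1;

print "matched: ", ($ok ? "yes" : "no"), "\n";   # matched: no
print "before = $before, after = $after\n";      # before = 999, after = 999
```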
    The second code, however, has an advantage: I can wrap the match in a check.
    if( $str =~ /^(\d+):\t(\S+\/)*(\S+)/ ) {
      print "pid = $1  X  cmd = $3\n";
    }
    i.e. only when the regular expression really matches will I use its values. The check gives me assurance.
    I can't do this with the split in the first code, other than by adding a post-check (whether the pid really consists of digits, etc.), which would make the code longer.
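For comparison, here is roughly what a split with a post-check could look like. This is a sketch, not the code from this post: the helper name `pid_cmd_split_checked` is hypothetical, and it deliberately splits on a mandatory colon (`:\s+`) with a limit of 2, so that a missing colon leaves a "pid" field that fails the digits check.

```perl
use strict;
use warnings;

# Hypothetical helper: split first, then validate the pieces.
sub pid_cmd_split_checked {
    my ($line) = @_;
    # Split once at the first colon followed by white space:
    # pid on the left, the whole command line on the right.
    my ($pid, $rest) = split /:\s+/, $line, 2;
    # Post-check: there must be a command line, and the pid must be digits only.
    return unless defined $rest && $pid =~ /^\d+\z/;
    # The first white-space separated word is the program; strip its path.
    my ($cmd) = split ' ', $rest;
    return unless defined $cmd;
    $cmd =~ s{.*/}{};
    return ($pid, $cmd);
}

my @good = pid_cmd_split_checked("123:\t/usr/bin/cmd -x 456: -d /home/foo");
my @bad  = pid_cmd_split_checked("123 xyz        /usr/bin/cmd -x 456:  /home/foo");
print "good: @good\n";                                  # good: 123 cmd
print "bad:  ", (@bad ? "@bad" : "rejected"), "\n";     # bad:  rejected
```

It works, but it takes two splits, a substitution and two checks to do what the single checked match does in one line.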

    So I decided to use the regular expression in my code: since it extracts just three parts of the string, it is still fairly readable.
    If I wanted to extract more, say five or eight components, I would probably fall back to the split with a subsequent validity check.

