Andreas' Technical Tidbits: sed scripts

Thursday, December 16, 2010

sed scripts

I always wanted to use complex sed scripts in parts of my work. I wrote some, but always ended up replacing them with something else (mostly awk). So maybe this page is superfluous, but having spent some time on it, I wanted to keep a record.

In scripts, sed usage mostly comes down to these cases:

simple substitution (one instance or whole line): 's/.../.../' or 's/.../.../g'
deletion of ranges or special matches, e.g. empty lines: '/^$/d'
multiple substitutions in sequential order: -e 's/.../.../' -e 's/.../.../'
capture partial strings: 's/...\(...\).../\1/'
output only matching lines: sed -n '....p'
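
As a quick illustration of the cases listed above, here is a minimal sketch. The sample strings (foo, bar, key=value and so on) are made up for this example and are not taken from any real script.

```shell
# Sample inputs are hypothetical; each command reads from a pipe.

# simple substitution: first match only, then every match on the line
printf 'foo foo\n' | sed 's/foo/bar/'      # prints: bar foo
printf 'foo foo\n' | sed 's/foo/bar/g'     # prints: bar bar

# deletion of special matches: drop empty lines
printf 'a\n\nb\n' | sed '/^$/d'            # prints: a, then b

# multiple substitutions applied in sequential order
printf 'foo baz\n' | sed -e 's/foo/bar/' -e 's/baz/qux/'   # prints: bar qux

# capture a partial string with \( \) and reuse it as \1
printf 'key=value\n' | sed 's/^key=\(.*\)$/\1/'            # prints: value

# output only matching lines: -n suppresses the default print
printf 'keep 1\ndrop\nkeep 2\n' | sed -n '/^keep/p'        # prints: keep 1, keep 2
```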

One can handle much more complex patterns with sed by using its concept of hold space and pattern space, which are basically two internal buffers that can store one or more lines.
Their usage is a bit tricky, though, and thus it has not found widespread use (no offense to dedicated sed scripters out there, but in my 20 years in the UNIX world I've seen a lot of scripts and only very rare occurrences of this).
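
To give a feeling for the two buffers, here is a small sketch with made-up input (the letters a, b, c). The first command is the well-known idiom for reversing the order of lines. The other two show that H puts a newline in front of whatever it appends, which is why collected text starts with an empty line unless that newline is removed.

```shell
# Reverse the line order:
#   1!G  on every line but the first, append hold space to pattern space
#   h    copy pattern space into hold space
#   $p   on the last line, print the accumulated pattern space
printf 'a\nb\nc\n' | sed -n '1!G;h;$p'          # prints: c, b, a

# H appends a newline plus the current line to hold space, so the
# collected text begins with a newline (an empty first line on output)
printf 'a\nb\n' | sed -n 'H;${x;p;}'            # prints: (empty line), a, b

# Removing that first newline gives the lines back unchanged
printf 'a\nb\n' | sed -n 'H;${x;s/^\n//;p;}'    # prints: a, b
```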

Imagine the following example:
you have the output of timex in a file with some text before and some text after.
You want to put a start tag just before the timex output and an ending tag right after.

some text might appear here
real        32.04
user        0.14
sys         0.22
... and some more text there

Your script should achieve the following output:

<start_timex>
real        32.04
user        0.14
sys         0.22
<end_timex>
some text might appear here
... and some more text there

The script below does this; it became complex because I wanted to handle several cases:

there might be zero, one or multiple lines before the timex output
there might be zero, one or multiple lines after the timex output (the case of 'sys' being the last line in particular made the script very complex and basically unreadable)
there might be one or many sections of timex output with arbitrary text in between which all should appear (separately tagged) at the beginning

For example, this input:

some text might appear here
real        32.04
user        0.14
sys         0.22
and something in between
real        5.00
user        0.01
sys         0.01
... and some more text there

should end up like this

<start_timex>
real        32.04
user        0.14
sys         0.22
<end_timex>
<start_timex>
real        5.00
user        0.01
sys         0.01
<end_timex>
some text might appear here
and something in between
... and some more text there

This script does it (and it still has one bug: when there is no surrounding text at all, it adds an empty line at the bottom of the output, which is due to the last 'p' command in the 'sys' block).

sed -n '
# We call sed with -n so that we control exactly what we print
/^real/ {
        # print start tag 
        i\
<start_timex>
        # print current line (in pattern space) and go to next line of input
        p
        d
}
/^user/ {
        # print current line (in pattern space) and go to next line of input
        p
        d
}
/^sys/ {
        # print current line (in pattern space)
        p
        # print end tag
        i\
<end_timex>
     $ { # Check if sys is the last line
        # Exchange hold space and pattern space, there might be something left in hold space
        x
        # remove first newline in ex-hold space
        s/^\n//
        # print pattern space
        p
     }
     # Go to next line of input
     d
}
$ { # Last line
    # at the end check if something is still in hold space and print that too
    # Append last line to hold space
    H
    # Exchange hold space and pattern space
    x
    # remove first newline in ex-hold space
    s/^\n//
    # print pattern space
    p
    # Go to next line, there is none ie. end the program here
    d
}
{       # This action is processed for each(!) line (thus we needed to delete the previous matches)
        # add pattern space to hold space where it will stay until we call it back
        H
        # delete pattern space and start next cycle
        d
}'  filename
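
For comparison, here is a sketch of a shorter sed variant that should avoid the stray empty line when there is no surrounding text. It is not the script above but a restructured alternative: every block branches to one shared ending, and the collected text is only printed if it is non-empty. The wrapper name tag_timex_sed and the label name are made up for this sketch, and I assume a sed whose i\ command writes its text immediately (GNU sed does).

```shell
# tag_timex_sed is a hypothetical wrapper name.
# It reads the files given as arguments, or stdin if there are none.
tag_timex_sed() {
    sed -n '
/^real/ {
    i\
<start_timex>
    p
    b end
}
/^user/ {
    p
    b end
}
/^sys/ {
    p
    i\
<end_timex>
    b end
}
# any other line is collected in hold space
H
:end
# on every line but the last, start the next cycle without printing
$!d
# last line: fetch the collected text and drop the newline the first H added
x
s/^\n//
# print only if something was collected, so no stray empty line appears
/./p
' "$@"
}

# Usage: timex output with text before and after
printf 'before\nreal 1\nuser 2\nsys 3\nafter\n' | tag_timex_sed
```

One trade-off of the final /./p test: if the only surrounding text is a single empty line, that line is dropped as well.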

Compare this to the following nawk script:

/^real/   { print "<start_timex>"; print; next }
/^user/   { print; next }
/^sys/    { print; print "<end_timex>"; next }
{  # capture non-timex lines in buffer 'text'

   if(text=="") text=$0;
   else text=text "\n" $0;
}
END { if(text!="") print text }

This is, in my view, more readable and easier to understand, mainly because of the better control structures.
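
To try the nawk version on the two-section example from above, it can be wrapped in a shell function like this. This is only a sketch: tag_timex_awk is a made-up name, and I assume a POSIX awk is installed as awk (use nawk instead where that is the appropriate binary).

```shell
# tag_timex_awk is a hypothetical wrapper name.
# It reads the files given as arguments, or stdin if there are none.
tag_timex_awk() {
    awk '
    /^real/ { print "<start_timex>"; print; next }
    /^user/ { print; next }
    /^sys/  { print; print "<end_timex>"; next }
    {   # capture non-timex lines in the buffer "text"
        if (text == "") text = $0
        else text = text "\n" $0
    }
    END { if (text != "") print text }
    ' "$@"
}

# Feed it the example input with two timex sections
printf '%s\n' \
    'some text might appear here' \
    'real        32.04' \
    'user        0.14' \
    'sys         0.22' \
    'and something in between' \
    'real        5.00' \
    'user        0.01' \
    'sys         0.01' \
    '... and some more text there' | tag_timex_awk
```

The timex sections are printed as soon as they are read, while all other lines are held back until the END block, which is what moves the tagged sections to the top.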

The hold and pattern space management (append to, exchange with, automatic addition of newlines, etc.) is confusing and adds so little value that, in my view, there is no real reason to learn it.
Don't get me wrong: I use sed all the time for the usages outlined above. But anything more complex, like rearranging lines, should be left to other tools.
