Wednesday, January 18, 2012

How to track sub processes

I write a lot of scripts, and one of the common problems (at least in my area of work) is how to keep track of all sub processes and how to clean up every process a script might have started.
(Note: this was developed on a Solaris 10 system, which provides the ptree command to easily check the process tree of a given process.)
In this article I will only deal with tracking sub processes. Eventually one would want to kill them if needed, which is either fairly easy with kill -9 (at the risk of leftovers like temporary files) or can become complex if a script spawns new processes when it receives a weaker kill signal.

So here is the scenario:
Script a.sh runs another script b.sh.
Before a.sh exits it wants to ensure that b.sh has not left any processes behind, i.e. it wants to identify b.sh and all of its child processes (so that they can be killed if still running).

Running another shell script in the background


Example 1: a.sh runs b.sh in the background

ptree suffices in such a case
b.sh:
#!/bin/sh
sleep 200

a.sh:
#!/bin/sh
b.sh&
ptree $!
ps -u $USER -o pid,ppid,args |grep $!
(ptree should show the process tree of the last background process.
The ps command should show the process id (pid), parent process id (ppid) and command line (args) of all processes of user $USER; grep then keeps the lines containing the pid of b.sh.)

Output of ptree:
59691 /bin/csh -c a.sh
   59716 /bin/sh a.sh
     59717 /bin/sh b.sh
       59719 sleep 200

Output of ps:
59731 59716 grep 59717
59717 59716 /bin/sh b.sh
59719 59717 sleep 200

So both b.sh and its sub process 'sleep' are shown in the process list and one could get the pids and kill them if needed.
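
On systems without ptree, the same tree can be rebuilt from the pid and ppid columns of ps. The following is only a minimal sketch of that idea, not a replacement for ptree: the function name descendants is made up for this illustration, and it assumes the input consists of "pid ppid" pairs, one per line.

```shell
#!/bin/sh
# descendants ROOT
# Reads "pid ppid" pairs on stdin and prints every pid whose chain of
# parent ids leads back to ROOT (hypothetical helper, not a standard tool).
descendants() {
    awk -v root="$1" '
        { parent[$1] = $2 }            # remember the parent of every pid
        END {
            for (pid in parent) {
                p = pid
                hops = 0
                # climb the chain of parents; the hop limit guards against loops
                while ((p in parent) && hops++ < 1000) {
                    p = parent[p]
                    if (p == root) { print pid; break }
                }
            }
        }' | sort -n
}
```

In a.sh this could be called as "ps -e -o pid,ppid | descendants $!", assuming a ps that accepts -e and -o. Like ptree, it only sees processes that are still connected to the root via their parent ids.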

There are more complex situations where ptree/ps don't help, and these are covered in the next parts.

Sub process detaching

This time we consider an example where the sub process detaches itself from the current process tree.

What do I mean by that?
Every process has a parent id, so if a process spawns a process which spawns another process, they are all connected via their parent ids (the id of process 1 becomes the parent id of process 2, the id of process 2 becomes the parent id of process 3, and so on).
A process can drop out of this chain, though, and detach itself from its parent so that it gets the init pid 1 as parent id (all processes can be traced back to pid 1 in a UNIX system).

Example 2: here b.sh runs a process in the background itself
b.sh:
#!/bin/sh
sleep 200&

If you run this script and check your process list, you will find something like this: a sleep process with ppid 1.
7968     1 sleep 200
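
The effect can be reproduced without a separate b.sh. This is a small sketch under assumptions: spawn_orphan is a name invented for this illustration, and the adoption by pid 1 is what the Solaris 10 output above shows; other systems may hand orphans to a different reaper process.

```shell
#!/bin/sh
# spawn_orphan: hypothetical helper for this illustration.
# Starts a short-lived intermediate shell that puts a sleep into the
# background and exits right away, then prints the pid of that sleep.
spawn_orphan() {
    # The redirections keep the sleep from holding open the pipe of a
    # command substitution that captures the printed pid.
    sh -c 'sleep 30 >/dev/null 2>&1 & echo $!'
}
```

After pid=`spawn_orphan`, the command "ps -o pid,ppid,args -p $pid" should show the sleep with a parent id that is no longer the intermediate shell; "kill $pid" removes it again.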

If you run a.sh from the previous example with the new b.sh your ptree and ps output will look as follows:
Output of ptree:
8932  /bin/sh ./a.sh
   8933  <defunct>
Output of ps:
8936  8932 grep 8933
i.e. ps shows nothing apart from the grep itself, and ptree shows a.sh with a defunct sub process. This defunct sub process is the leftover of b.sh.
Why is it defunct? Because it has ended but its parent a.sh has not (yet) waited for it to finish.

Here is a new a.sh which solves that (remember this rule: a defunct process is always due to bad code in the parent, not the process which became defunct):
#!/bin/sh
./b.sh&
wait
ptree $!
ps -u $USER -o pid,ppid,args |grep $!
Running this a.sh generates no ptree output at all:
b.sh has already finished by the time ptree runs, and the sleep process is detached from the b.sh process hierarchy.
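
The wait builtin also hands back the exit status of the child, which is the piece of information a defunct entry is kept around for. A minimal sketch (run_and_reap is a name invented here, not part of the original scripts):

```shell
#!/bin/sh
# run_and_reap CMD [ARGS...]: hypothetical helper for this illustration.
# Runs CMD in the background, waits for it and reports its exit status.
run_and_reap() {
    "$@" &
    child=$!
    # Collect the child; without a wait, a finished child may linger
    # as <defunct> in the process list.
    wait "$child"
    status=$?
    echo "pid $child exited with status $status"
}
```

It could be used as "run_and_reap ./b.sh". Note that it only reaps the direct child; a background process started by b.sh is not affected.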

So how can we track down the 'sleep' process?
We need to use another process attribute: the process group id (pgid).

In the new a.sh I have removed the ptree call (since it won't return anything, as shown above) and extended the ps command to also show the pgid, this time grepping for the pid of a.sh (rather than that of b.sh as before).
#!/bin/sh
./b.sh&
wait
ps -u $USER -o pid,ppid,pgid,args |grep $$
Output of ps:
18028 32741 18028 /bin/sh a.sh
18030     1 18028 sleep 200
18031 18028 18028 grep 18028
18032 18031 18028 ps -u andreash -o pid,ppid,pgid,args
So the sleep process can be found in the list of processes with pgid 18028 (the pid of a.sh) since all sub processes of a.sh seem to be grouped by pgid.
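
grep matches the number anywhere on the line, so it would also report a process that merely carries the number in its pid column or among its arguments. A slightly more careful sketch compares only the pgid column. The function name pids_in_group is made up for this illustration, and the input is assumed to have the same pid, ppid, pgid, args column layout as the ps call above.

```shell
#!/bin/sh
# pids_in_group PGID
# Reads "pid ppid pgid args..." lines on stdin and prints the pid of every
# process whose pgid column equals PGID (hypothetical helper).
pids_in_group() {
    awk -v pgid="$1" '$3 == pgid { print $1 }'
}
```

In a.sh the last line could then read "ps -u $USER -o pid,ppid,pgid,args | pids_in_group $$". The result still includes a.sh itself and the ps and awk processes of the pipeline, since they belong to the same group.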

Happy? Not quite. The next part will show that this solution also might fail.

What if there is no pgid?

The previous example works only under certain assumptions:
you need to run a.sh from a shell which supports pgid creation (csh, ksh); it does not work if you run it from the Bourne shell.
(All the examples above were tested in csh, the standard user's working shell in our environment.)

sunflower% sh
$ ./a.sh
27103 27099 27098 grep 27099
27099 27098 27098 /bin/sh ./a.sh
$ ps -o pid,ppid,pgid,args|grep sleep
27277 27098 27098 grep sleep
27102     1 27098 sleep 200
What you notice is that the sleep process has pgid 27098, which is also the parent pid of a.sh, i.e. a.sh did not create its own process group. Searching for processes with pgid equal to the pid of a.sh is futile.

The solution is to write a script which puts its sub processes into a process group of their own. One way to do it is the monitor option of ksh:
set -m
This puts b.sh (and all sub processes of b.sh) into a process group with pgid equal to b.sh's pid,
i.e. I'm grepping for $! again (rather than for $$ as in the previous example).

a.sh:
#!/bin/ksh
set -m
./b.sh&
wait
ps -u $USER -o pid,ppid,pgid,args |grep $!
This leads to the following output of ps:
31103     1 31102 sleep 200
31105 31101 31101 grep 31102

This seemed to me a very nice solution until it dawned upon me how this could fail too.

Recursive use of pgid creation

Using the same technique as described in the last part, a sub process can not only detach itself from the process hierarchy but also create its own process group, and then the original script has lost track of it completely.

Replace b.sh by the following code:
b.sh:
#!/bin/ksh
set -m
sleep 200&

Output of a.sh will look like this (just the grep command):
38331 38327 38327 grep 38328
and when you check the 'sleep' process it shows its pid also as pgid:
% ps -o pid,ppid,pgid,args |grep sleep
38329     1 38329 sleep 200

How can such a process be identified as being a grandchild of a.sh?

Up to now I don't have an answer. It seems to me that a process can completely hide its origins and thus cannot be tracked or followed.
(A long time ago I posted the question to comp.unix.shell but didn't receive an answer at the time.)

If you have wondered throughout the article why I bother at all:
very often I face the scenario that I have to write script a.sh (i.e. I own it and control what it does) but script b.sh comes from a colleague, a different department or even another company. I need/want to ensure that, if I start other scripts in my script, no processes are left behind when my script ends. This cannot be guaranteed.

Why it is impossible to track all sub processes

Over time I got suggestions to use newtask (and then kill off all processes found by pkill -T taskid), or to write a C program using setsid or a Perl program using POSIX::setsid to create a new session leader. The idea is basically that all child processes are tagged with the same kind of attribute, which can then be used to identify them and do something about them.

All of these suggestions have the same flaw as the one with pgid which I described above, and the following argument should prove that it is impossible to track all sub processes and their sub processes (if the sub processes can be any kind of process and their code is not controlled by you).

Assume that your flavour of UNIX offers a way to create a sub process with a certain attribute which distinguishes the sub process and its offspring from the current process (and possible parent processes).
In the same fashion a sub process of the sub process can use this technique to distinguish itself from the sub process. The current process will find the sub process but it cannot find the sub process of the sub process anymore.

Solutions would be that the OS restricts the setting of that attribute in a way that the current process can set it for its sub processes but sub processes of the sub process are blocked from setting it, or that processes have to notify their parent processes about attribute changes somehow. Neither is available in any of the UNIXes I know.

Summary:
  • a process can track (and kill) all of its sub processes
  • a process can track (and kill) all of a sub process's descendants
    • if the sub process sets a certain attribute equal to its process id
    • if none of the sub process descendants changes that attribute

Even if you think you are in (code) control of all sub processes and their descendants, you might not be aware of all side effects: a process might unknowingly start a daemon.
Just envision a call to gconfd: it will be started if it is not running yet. The process which actually caused the start of gconfd will very likely have no idea that it is there, since it is only trying to get a service. That the service required a daemon, that proper cleanup would mean killing the daemon, and that the daemon may be serving other processes too (and thus should not be killed) are all considerations with no easy answers.
