Monday, December 10, 2018

How to write better/safer functions for UNIX commands through the example of 'mkdir'

UNIX shell scripts contain loads of UNIX commands. Very often these commands are invoked in a simple manner that does not quite meet the standards of production worthiness.
I don't want to discuss here what those standards are, but I do want to show a few steps to improve your scripts from a very simple command invocation to a more refined usage of the command.
My main consideration is this:
  • ease of troubleshooting: scripts will fail. The script should behave in a way that makes it easy for a troubleshooter to identify the root cause of the failure.

This implies that errors should be handled properly, i.e. they should be
  • caught,
  • documented e.g. in log files and
  • finished. Finishing means e.g. undoing actions performed beforehand, cleaning up temporary files, continuing with other actions or exiting the script (if appropriate). A minimal sketch of this pattern follows right after this list.
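
For example (a sketch; the log file name, the copied file and the temporary work file are made-up placeholders):
LOGFILE=script.log

cp data.txt /some/target/dir      # any command that might fail
rc=$?
if [ $rc -ne 0 ] ; then
  # caught: the failure does not go unnoticed
  # documented: write a message to the log file
  echo "ERROR - cp of data.txt failed (rc=$rc)" >> "$LOGFILE"
  # finished: clean up temporary stuff and exit (or continue, if appropriate)
  rm -f /tmp/workfile.$$
  exit 1
fi
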
Before I start with my example I'd like to highlight an important point which is also a main driver for this tutorial: many UNIX commands fail in a very simple manner. There can be a number of reasons for failure, but the command always exits with exit code 1 and writes a more or less descriptive error message (to stderr). Not all failure reasons should lead to the same behaviour though: reason A might be grave and imply exiting the script, while reason B might be safe to skip.

I'll explain this through the example of mkdir. As a teaser, here is a common scenario:
scripts often create temporary directories at the start. If the temporary directory already exists you probably don't care and would like to continue despite the mkdir failure. If mkdir fails because the script is executed in the wrong directory, one where you don't have write permissions, you probably don't want to continue and would rather exit. This should be handled in your code.

Here is the starting point: a simple invocation of mkdir.
mkdir tmpdir
If this command fails the script will show an error message and possibly exit, e.g. if set -e is in effect in bash.
If you want to catch the error and decide yourself what to do, there are several options. You could use the shell's logical operators AND (&&) and OR (||).

  • you definitely want to exit: mkdir tmpdir || exit 1
  • you do not want to exit the script but write a specific message:
    mkdir tmpdir || echo "WARNING - mkdir of tmpdir failed" >&2
  • you want to define some actions (which might include an exit):
    mkdir tmpdir
    if [ $? -ne 0 ] ; then
      # Do something here e.g.
      echo "ERROR - mkdir of tmpdir failed. Exiting." >&2
      exit 1
    fi
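A variant of the last option (just a sketch): instead of inspecting $? you can test the command directly in the if statement:
if ! mkdir tmpdir ; then
  echo "ERROR - mkdir of tmpdir failed. Exiting." >&2
  exit 1
fi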
    
Now assume that sometimes you want to exit if mkdir fails and sometimes not. An idea is to write a wrapper function which contains all the logic.
# Function 'mkDir'
# $1: Y/N flag to exit (Y) or not (N) in case of failure
# $2: the directory to be created
function mkDir() {
  mkdir "$2"     # execute mkdir command
  rc=$?          # capture return code
  # If successful return immediately
  [ $rc -eq 0 ] && return 0
  # Check if exit is required
  if [ "$1" = "Y" -o "$1" = "y" ] ; then
    echo "ERROR - mkdir failed for $2" >&2
    exit 1
  fi
  echo "WARNING - mkdir failed for $2" >&2
  return $rc
}

# invocation

mkDir Y tmpdir

# A loop to create a number of directories.
# We want to loop through all directories
# regardless of whether a mkdir fails or not
# and create a tmpfile in each of them.
for tmpdir in a b c ; do
  mkDir n $tmpdir || continue
  touch $tmpdir/tmpfile
done
A trickier kind of handling is to distinguish between different errors. Two main reasons for a mkdir failure are:
  • the directory already exists
  • the directory cannot be created because of issues with the parent directory of the new directory (which could be . or an absolute path), such as missing write permissions (the w bit missing for owner, group or other, depending on who runs the script and where)

So here is an extended version of our wrapper function, also adding different return codes for various findings.
# Function 'mkDir'
# $1: Y/N flag to exit (Y) or not (N) in case of failure
# $2: the directory to be created
# Return codes:
#   0: all ok, directory could be created
#   1: mkdir failed (for a different reason than the assertions)
#   2: directory already exists
function mkDir() {
  DIR="$2"

  # Before executing mkdir we run a couple of assertions.

  # Check if directory already exists
  [ -d "$DIR" ] && echo "WARNING - directory $DIR already exists" >&2 && return 2

  # Check if parent directory exists and is writable
  # If not: we exit here 
  PARENTDIR=$(dirname "$DIR")
  [ ! -d "$PARENTDIR" ] && echo "ERROR - parent directory $PARENTDIR does not exist" >&2 && exit 1
  [ ! -w "$PARENTDIR" ] && echo "ERROR - parent directory $PARENTDIR is not writable" >&2 && exit 1

  # Now we try to run the mkdir command
  mkdir "$DIR"   
  rc=$?          # capture return code
  # If successful return immediately
  [ $rc -eq 0 ] && return 0

  # Here we know: mkdir failed but not for the two reasons we checked earlier
  # Check if exit is required
  if [ "$1" = "Y" -o "$1" = "y" ] ; then
    echo "ERROR - mkdir of $DIR failed" >&2
    exit 1
  fi
  echo "WARNING - mkdir of $DIR failed" >&2
  return $rc
}


# Invocations

mkDir n tmpdir
mkDir n tmpdir      # this one fails because tmpdir already exists

mkDir n abc/tmpdir  # this one fails if abc does not exist

mkDir n /tmpdir     # this one fails because we cannot write to a root owned directory
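
Since the function now returns distinct codes, the caller can react differently to each cause, e.g. treat "directory already exists" as harmless. A small sketch:
mkDir n tmpdir
rc=$?
case $rc in
  0) echo "INFO - tmpdir created" ;;
  2) echo "INFO - tmpdir already exists, re-using it" ;;
  *) echo "ERROR - could not create tmpdir (rc=$rc). Exiting." >&2
     exit 1 ;;
esac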

The examples above mainly showed how to catch and finish failures. I will probably write about documenting errors in a separate blog post which will cover separation or unification of stderr and stdout, redirecting messages to a proper log file and, if needed, showing messages on the terminal in parallel.
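As a small teaser (a sketch, assuming bash): stdout and stderr can be unified, appended to a log file and still shown on the terminal with process substitution and tee.
exec > >(tee -a script.log) 2>&1   # all further output goes to script.log and to the terminal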

Conclusion

In order to make life easier when troubleshooting script failures it is advisable to wrap UNIX commands into functions and add some error handling logic for known and common error causes. In the long run, and in particular if the script is used by a number of different people, you will save time since the script will emit a specific, well-defined error message. The various error causes of a command can also drive different behaviour in the script, which helps write better logic flows.

Note 1

Using other programming languages does not necessarily help. The Java java.io or java.nio packages would throw an IOException, which again is not very distinguishable and too simplistic.

Note 2

I wonder why there is no shell library around to cover this topic, basically a shell library of UNIX command wrapper functions. Maybe I have not searched enough.

Thursday, December 6, 2018

Thoughts about error messages and troubleshooting

In this article I will discuss a couple of thoughts about error messages and their helpfulness in troubleshooting the real issue, or, as it often turns out, their inadequacy for quickly troubleshooting errors. At the very end I also give an example of how to do better.

My example is: remove a directory (in UNIX)
The usual command is something like
rm -r somedir
Things can go wrong and you will end up with a
rm: somedir: Permission denied

I bet many of the readers will have encountered this in their UNIX lives (sysadmin or not).
The unfortunate thing about troubleshooting this error is that there can be multiple causes for it. Depending on your experience you can more or less quickly spot the root cause in your real-world scenario. The not so obvious fact - at least to the inexperienced user - is that in order to remove a directory you need sufficient permissions for the directory and also sufficient permissions for its parent directory.
In more abstract terms, going beyond directory removal: any action failing with "permission denied" probably requires permissions (1) for the object to be handled but also (2) for one or more contextual objects.
Going back to the directory removal, here are a number of permission issues which are all violations of the necessary rule: the user executing the rm needs read and write permissions on the directory and on its parent directory (which requires execute permission too). This can be achieved via user, group or other permission and ownership settings.
(Note: I am not discussing even more refined settings via ACLs, which add further complexity.)
Even if both the directory and its parent directory are owned by the executing user, there are a number of error causes:

  • directory does not have write permissions
  • directory has write permissions but no read permission
  • parent directory does not have write or execute permissions
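A quick way to check for all three causes is to inspect the directory and its parent in one go:
ls -ld somedir .    # shows the permission bits of the directory and of its parent (here the current directory)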
Now, after this long explanation, I am coming to my actual message: error messages are often not precise enough to help in quick troubleshooting. You need some meta knowledge and experience to identify the root cause which will allow you to resolve the issue.
In the example above the standard UNIX rm maps a list of different root causes onto a single "permission denied" error message, leaving it to the user to know and investigate the various causes.
I tried a different approach using Java and the java.io.File and java.nio.file.Files classes to delete a directory, but their exceptions (IOException) are equally simplistic and do not help to identify the root cause quickly.

I wonder why this is the case in general, not just in this example.
My experience in Application Support is that quickly identifying root causes is what matters most. Why should a user be left poking around rather than being given the actual root cause right away?
If the rm command knows that the write permission for the directory is missing, why doesn't it tell us right away?
I see this way of thinking at all levels of software development. Rather than being given all the information about an error scenario, the user is stuck with some general blurb and needs to spend time investigating. My example is from a rather low level, and things get more frustrating the further one moves up the software ladder; complex GUI-based applications are even more frustrating to troubleshoot.


Here is a script to generate some directories with wrong permissions and test rm. I am creating 3 directories with different permissions. Two of the removals will fail:
mkdir -m 000 somedir1
mkdir -m 200 somedir2
mkdir -m 600 somedir3
ls -la
rm -r somedir1 somedir2 somedir3
ls -la

Here is the output:
total 0
drwxr-x---   5 ahaupt  2006807681  170 Dec  6 16:51 .
drwxr-xr-x  10 ahaupt  2006807681  340 Dec  6 13:46 ..
d---------   2 ahaupt  2006807681   68 Dec  6 16:51 somedir1
d-w-------   2 ahaupt  2006807681   68 Dec  6 16:51 somedir2
drw-------   2 ahaupt  2006807681   68 Dec  6 16:51 somedir3

rm: somedir1: Permission denied
rm: somedir2: Permission denied

drwxr-x---   4 ahaupt  2006807681  136 Dec  6 16:51 .
drwxr-xr-x  10 ahaupt  2006807681  340 Dec  6 13:46 ..
d---------   2 ahaupt  2006807681   68 Dec  6 16:51 somedir1
d-w-------   2 ahaupt  2006807681   68 Dec  6 16:51 somedir2
Now let's re-create somedir3 but also remove the write permission from the current directory. All 3 removals will fail:
mkdir -m 600 somedir3
chmod u-w .
ls -la
rm -r somedir1 somedir2 somedir3
ls -la

dr-xr-x---   5 ahaupt  2006807681  170 Dec  6 16:55 .
drwxr-xr-x  10 ahaupt  2006807681  340 Dec  6 13:46 ..
d---------   2 ahaupt  2006807681   68 Dec  6 16:51 somedir1
d-w-------   2 ahaupt  2006807681   68 Dec  6 16:51 somedir2
drw-------   2 ahaupt  2006807681   68 Dec  6 16:55 somedir3

rm: somedir1: Permission denied
rm: somedir2: Permission denied
rm: somedir3: Permission denied

total 0
dr-xr-x---   5 ahaupt  2006807681  170 Dec  6 16:55 .
drwxr-xr-x  10 ahaupt  2006807681  340 Dec  6 13:46 ..
d---------   2 ahaupt  2006807681   68 Dec  6 16:51 somedir1
d-w-------   2 ahaupt  2006807681   68 Dec  6 16:51 somedir2
drw-------   2 ahaupt  2006807681   68 Dec  6 16:55 somedir3


Finally, here is a shell snippet which gives proper information by catching some of the root causes of a failing rm. This is not exhaustive but should give you an idea of what could be done to improve the error handling and make it more user friendly.
function rmDir {
  DIR="$1"
  PARENTDIR=$(dirname "$DIR")
  MSG=""
  # Check write permission of dir
  [ ! -w "$DIR" ] && MSG="$MSG
ERROR - not writable $DIR"
  # Check read permission of dir
  [ ! -r "$DIR" ] && MSG="$MSG
ERROR - not readable $DIR"
  # Check write permission of parent dir
  [ ! -w "$PARENTDIR" ] && MSG="$MSG
ERROR - not writable $PARENTDIR"
  # Check execute permission of parent dir
  [ ! -x "$PARENTDIR" ] && MSG="$MSG
ERROR - not executable $PARENTDIR"
  [ -n "$MSG" ] && echo "$MSG" >&2 && exit 1
  rm -r "$DIR"
}
In a scenario like the one below you get adequate error messages, and the real rm is not even invoked:
dr-xr-x---   4 ahaupt  2006807681  136 Dec  6 22:52 .
drwxr-xr-x  11 ahaupt  2006807681  374 Dec  6 22:57 ..
d-w-------   2 ahaupt  2006807681   68 Dec  6 16:51 somedir2

rmDir somedir2

ERROR - not readable somedir2
ERROR - not writable .

Of course it is difficult to think of all possible error cases in advance, but I claim that there are many cases where it would be possible to enhance the code and show more detailed and precise messages rather than a simple mapping of many kinds of errors onto one vague error statement.