Thursday, December 6, 2018

Thoughts about error messages and troubleshooting

In this article I discuss a few thoughts about error messages: how helpful they are for troubleshooting the real issue or, as it often turns out, how inadequate they are for troubleshooting quickly. At the very end I also give an example of how to do better.

My example is: remove a directory (in UNIX)
The usual command is something like
rm -r somedir
Things can go wrong and you will end up with a
rm: somedir: Permission denied

I bet many of the readers will have encountered this in their UNIX lives (sysadmin or not).
The unfortunate thing for troubleshooting this error is that there can be multiple causes for it. Depending on your experience you can spot the root cause in your real-world scenario more or less quickly. The not so obvious fact, at least to the inexperienced user, is that in order to remove a directory you need sufficient permissions for the directory and also sufficient permissions for its parent directory.
In more abstract terms going beyond directory removal: any action failing with permission denied probably requires permissions (1) for the object to be handled but (2) also for one or more contextual objects.
Going back to the directory removal, here are a number of permission issues, all of which violate the necessary rule: the user executing rm needs read and write permissions on the directory, and write and execute permissions on its parent directory. These can be granted via the user, group or other permission bits together with the ownership settings.
(Note: I am not discussing the even more fine-grained settings via ACLs, which add further complexity.)
Even if both the directory and its parent directory are owned by the executing user, there are a number of error causes:

  • directory does not have write permissions
  • directory has write permissions but no read permission
  • parent directory does not have write or execute permissions
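To tell these cases apart by hand, it helps to look at the owner permission bits of both the directory and its parent. The following is only a minimal sketch, assuming the executing user owns both directories; the helper names ownerBits and showRmPerms are made up for this illustration.

```shell
# ownerBits is a hypothetical helper: print the owner's rwx bits of a path,
# taken from characters 2-4 of the mode string printed by ls -ld.
ownerBits() {
  ls -ld -- "$1" | cut -c2-4
}

# showRmPerms is a hypothetical helper: show the owner bits of a directory
# and of its parent directory, the two objects that matter for rm -r.
showRmPerms() {
  rp_dir="$1"
  rp_parent=$(dirname "$rp_dir")
  echo "$rp_dir: $(ownerBits "$rp_dir")"
  echo "$rp_parent: $(ownerBits "$rp_parent")"
}
```

Calling showRmPerms somedir prints one line for the directory and one for its parent. This only inspects the owner bits; it does not evaluate group or other permissions, or ACLs.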
Now, after this long explanation, I am coming to my actual message: error messages are often not precise enough to help in quick troubleshooting. You need some meta knowledge and experience to identify the root cause, which will then allow you to resolve the issue.
In the example above the standard UNIX rm maps a list of different root causes onto a single permission denied error message, leaving it to the user to know and investigate the various causes.
I tried a different approach using Java and its java.io.File and java.nio.file.Files classes to delete a directory, but their exceptions (IOException) are equally simplistic and do not help to identify the root cause quickly.

I wonder why this is the case in general, not just in this example.
My experience in Application Support is that it is most important to identify root causes quickly. Why should users be left poking around rather than being given the actual root cause right away?
If the rm command knows that the write permission for the directory is missing, why doesn't it tell us right away?
I see this way of thinking at all kinds of levels in software development. Rather than being given all the information about an error scenario, the user is stuck with some general blurb and needs to spend time investigating. My example is from a rather low level, and things get more frustrating the further one moves up the software ladder. Complex GUI-based applications are even more frustrating to troubleshoot.


Here is a script to generate some directories with wrong permissions and test rm. I create three directories with different permissions. Two removals will fail.
mkdir -m 000 somedir1
mkdir -m 200 somedir2
mkdir -m 600 somedir3
ls -la
rm -r somedir1 somedir2 somedir3
ls -la

Here is the output:
total 0
drwxr-x---   5 ahaupt  2006807681  170 Dec  6 16:51 .
drwxr-xr-x  10 ahaupt  2006807681  340 Dec  6 13:46 ..
d---------   2 ahaupt  2006807681   68 Dec  6 16:51 somedir1
d-w-------   2 ahaupt  2006807681   68 Dec  6 16:51 somedir2
drw-------   2 ahaupt  2006807681   68 Dec  6 16:51 somedir3

rm: somedir1: Permission denied
rm: somedir2: Permission denied

drwxr-x---   4 ahaupt  2006807681  136 Dec  6 16:51 .
drwxr-xr-x  10 ahaupt  2006807681  340 Dec  6 13:46 ..
d---------   2 ahaupt  2006807681   68 Dec  6 16:51 somedir1
d-w-------   2 ahaupt  2006807681   68 Dec  6 16:51 somedir2
Now let's re-create somedir3 but remove the write permission for the current directory. All three removals will fail.
mkdir -m 600 somedir3
chmod u-w .
ls -la
rm -r somedir1 somedir2 somedir3
ls -la

dr-xr-x---   5 ahaupt  2006807681  170 Dec  6 16:55 .
drwxr-xr-x  10 ahaupt  2006807681  340 Dec  6 13:46 ..
d---------   2 ahaupt  2006807681   68 Dec  6 16:51 somedir1
d-w-------   2 ahaupt  2006807681   68 Dec  6 16:51 somedir2
drw-------   2 ahaupt  2006807681   68 Dec  6 16:55 somedir3

rm: somedir1: Permission denied
rm: somedir2: Permission denied
rm: somedir3: Permission denied

total 0
dr-xr-x---   5 ahaupt  2006807681  170 Dec  6 16:55 .
drwxr-xr-x  10 ahaupt  2006807681  340 Dec  6 13:46 ..
d---------   2 ahaupt  2006807681   68 Dec  6 16:51 somedir1
d-w-------   2 ahaupt  2006807681   68 Dec  6 16:51 somedir2
drw-------   2 ahaupt  2006807681   68 Dec  6 16:55 somedir3
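After this experiment the three directories are still there and the current directory is read-only. One way to clean up is to restore the permissions first and then remove the directories. This is a small sketch; the function name cleanupTestDirs is made up here, and it assumes you are still in the directory where the test was run and own everything in it.

```shell
# cleanupTestDirs is a hypothetical helper: undo the permission changes
# made by the test script above, then remove the test directories.
cleanupTestDirs() {
  # Make the current directory writable again so entries can be removed.
  chmod u+w . || return 1
  # Give the owner full access to the test directories themselves.
  chmod u+rwx somedir1 somedir2 somedir3 || return 1
  rm -r somedir1 somedir2 somedir3
}
```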


Finally, here is a *shell snippet* that gives proper information by catching some of the root causes of a failing rm. It is not exhaustive but should give you an idea of what could be done to improve the error handling and make it more user-friendly.
function rmDir {
  local DIR="$1"
  local PARENTDIR
  PARENTDIR=$(dirname "$1")
  local MSG=""
  # Check write permission of dir
  [ ! -w "$DIR" ] && MSG="$MSG
ERROR - not writable $DIR"
  # Check read permission of dir
  [ ! -r "$DIR" ] && MSG="$MSG
ERROR - not readable $DIR"
  # Check write permission of parent dir
  [ ! -w "$PARENTDIR" ] && MSG="$MSG
ERROR - not writable $PARENTDIR"
  # Check execute permission of parent dir
  [ ! -x "$PARENTDIR" ] && MSG="$MSG
ERROR - not executable $PARENTDIR"
  # Report all problems found and stop before invoking rm.
  # Use return rather than exit so that an interactive shell stays open.
  [ -n "$MSG" ] && echo "$MSG" >&2 && return 1
  rm -r "$DIR"
}
In a scenario like the one below you get adequate error messages, and the real rm is not even invoked.
dr-xr-x---   4 ahaupt  2006807681  136 Dec  6 22:52 .
drwxr-xr-x  11 ahaupt  2006807681  374 Dec  6 22:57 ..
d-w-------   2 ahaupt  2006807681   68 Dec  6 16:51 somedir2

rmDir somedir2

ERROR - not readable somedir2
ERROR - not writable .
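A variation on rmDir is to let rm try first and explain the likely causes only when it fails. That avoids refusing a removal that would actually have succeeded; on many systems, for instance, an empty directory can be removed even though it is not writable itself. The sketch below is an illustration under that assumption, not a definitive implementation, and the name explainRm and its message texts are made up here.

```shell
# explainRm is a hypothetical helper: run rm -r and, only if it fails,
# print the permission problems that are the likely causes.
explainRm() {
  er_dir="$1"
  er_parent=$(dirname "$er_dir")
  # Try the real removal first; rm prints its own message on failure.
  if rm -r "$er_dir"; then
    return 0
  fi
  echo "ERROR - could not remove $er_dir" >&2
  # Without search permission on the parent nothing else can be checked.
  if [ ! -x "$er_parent" ]; then
    echo "  cause: not executable $er_parent" >&2
    return 1
  fi
  if [ ! -e "$er_dir" ]; then
    echo "  cause: $er_dir does not exist" >&2
    return 1
  fi
  [ -w "$er_parent" ] || echo "  cause: not writable $er_parent" >&2
  [ -r "$er_dir" ] || echo "  cause: not readable $er_dir" >&2
  [ -w "$er_dir" ] || echo "  cause: not writable $er_dir" >&2
  [ -x "$er_dir" ] || echo "  cause: not executable $er_dir" >&2
  return 1
}
```

The trade-off is that rm -r may already have removed part of a larger tree before it hits the failing entry, whereas the pre-check in rmDir leaves everything untouched.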

Of course it is difficult to think of all possible error cases in advance, but I claim that there are many cases where it would be possible to enhance the code and show more detailed and more precise messages, rather than simply mapping many kinds of errors onto one vague error statement.
