Saturday, March 9, 2013

How LC_NUMERIC affects various awk versions

Many people write scripts to be executed in just one place by one person: on the script writer's machine. Writing portable scripts (to be executed by many people on their machines) poses bigger challenges, since one has to think about possible dependencies which could influence the proper execution of the scripts.

In this article I will discuss the relationship between the environment variable LC_NUMERIC and various versions of awk.

My example is a very simple one: calculate the integer part and the square root of a given value:

echo 5.29 | awk '{ print int($1), sqrt($1) }'
The expected output:
5 2.3

Now imagine a user in Germany who is using a German environment, part of which is the following setting:

LC_NUMERIC=de_DE.UTF-8
The output of the code above will be this:
0 0
What a surprise, one might think. Whatever your real code would have done with these values, it would have ended up somewhere other than where you wanted it to be.
(The examples above have been run in Solaris (10 or 11) with the old original awk.)
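The effect is easy to reproduce by setting the locale per command. A minimal sketch (it assumes the de_DE.UTF-8 locale is installed on the system; the second result depends on which awk implementation you have):

```shell
# Same one-liner under two numeric locales: first the C locale
# (dot as decimal separator), then a German locale (comma).
echo 5.29 | LC_NUMERIC=C awk '{ print int($1), sqrt($1) }'
# prints: 5 2.3
echo 5.29 | LC_NUMERIC=de_DE.UTF-8 awk '{ print int($1), sqrt($1) }'
# result varies by awk implementation
```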

Let's see what we get with Solaris' nawk, MacOS' awk (on 10.5.8) and mawk on Linux (Ubuntu 11):

                            LC_NUMERIC not set       LC_NUMERIC=de_DE.UTF-8
                            or e.g. en_US            (de_DE.utf8 on Ubuntu)
  Solaris awk               5 2.3                    0 0
  Solaris nawk or MacOS awk 5 2.3                    5 2,23607
  Ubuntu mawk               5 2.3                    5 2.3

So what you should note:

  • with LC_NUMERIC not set or set to an English locale, everything is OK
  • with LC_NUMERIC set to German, the results not only differ from the English setting but also differ drastically depending on which version of awk one is using
  • the output of the square root for Solaris nawk and MacOS awk contains a comma
  • Ubuntu's mawk is the only one to produce the same result in both cases

How can these seemingly strange results be explained?

Well, languages and cultures differ in many things, and one of them is the mark used as the decimal separator. In the English-speaking world the '.' (dot) is used, whereas others, e.g. Germans, use the ',' (comma). That has important consequences when working in a German environment: a string like '5.29' is not viewed as a real number but as a plain string. nawk and MacOS awk try to extract as much 'number' from this string as possible, which is simply the '5' before the dot; the integer and square root functions are then applied to '5' (not to '5.29'), and of these two the square root result is off track.
The original awk for some reason treats the whole '5.29' as a string, and numeric functions applied to a string yield zero for both functions.
Why Ubuntu's mawk does not honor locale settings at all seems strange to me after having seen how the other awks behave. I have not researched whether this is known, a bug, or considered a feature, and I have not tested on newer Ubuntu or other Linux versions.
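This prefix-parsing behavior can be observed without any locale setup: in the C locale the comma is not a decimal separator, so awk's numeric conversion of '5,29' stops at the comma, mirroring what nawk does to '5.29' under a German locale. A sketch:

```shell
# Numeric conversion of "5,29" in the C locale stops at the comma,
# so both functions operate on 5, not on 5.29.
echo 5,29 | LC_ALL=C awk '{ print int($1), sqrt($1) }'
# prints: 5 2.23607
```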

Conclusion: even when writing seemingly simple scripts, the environment (or more precisely: the locale settings) plays an important role in correct functioning and portability.
If one works on a particular system for some time, one tends to take for granted functionality which is not present, or behaves differently, on other systems.
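One defensive pattern (my sketch, not taken from the measurements above) is to pin the numeric locale at the top of a script, so it behaves the same for every user:

```shell
#!/bin/sh
# Force dot-decimal parsing and printing for every command in this
# script, regardless of the caller's LC_NUMERIC or LANG.
LC_ALL=C
export LC_ALL
echo 5.29 | awk '{ print int($1), sqrt($1) }'
# prints: 5 2.3
```

Setting LC_ALL rather than LC_NUMERIC is the safer choice here, since an exported LC_ALL in the caller's environment would override a mere LC_NUMERIC assignment.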

    An input value of '5,29' would conversely work well with the german setting but not the english one.

Note that the LANG variable has a similar influence on the handling of real numbers: it supplies the numeric locale whenever LC_NUMERIC is not set (and LC_ALL overrides both).

Food for thought:
the decimal separator issue (and others as well, e.g. the date format) can create bigger problems. Think about the simple case of a text file created by users entering decimal numbers for this and that. You have one text file from the US and one from Germany, and you want to use their contents with one script. But one file delivers dot-separated and the other comma-separated input fields! A new hurdle.
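For such mixed files, one pragmatic option is a small normalization pass that turns comma decimals into dot decimals before any arithmetic. A sketch, assuming TAB-separated fields in which commas occur only as decimal separators:

```shell
# Replace the decimal comma in field 2 with a dot, keeping TAB as the
# field separator, so downstream arithmetic sees C-style numbers.
printf 'widget\t5,29\n' |
awk -F '\t' 'BEGIN { OFS = FS } { gsub(/,/, ".", $2); print }'
```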
Also, a CSV file in its original sense, 'comma separated', does not work well in a German environment, where the field separator would be equal to the decimal separator; thus another field separator, one that cannot occur inside the numbers, should be chosen (TAB is preferable anyway).
Therefore more complex business applications with attached databases, and even office programs like LibreOffice Calc, store numbers in just one format and use the presentation layer to show numbers in whichever way the actual user wants to see them. This guarantees that the raw data stored in the databases can be mixed and used together regardless of the language settings of the users.

And of course there are a couple more languages out there with other peculiarities represented in the locale settings, creating possible pitfalls for the unaware coder.


Update April 2017

Having familiarized myself a bit more with gawk: the issue is how awk versions interpret input and format output (which can differ, at least in gawk). gawk supports a particular option, --use-lc-numeric, to apply the locale's decimal separator. Below I compare inputs in C numeric (dot as decimal separator) and in de_DE.utf8 numeric (comma as decimal separator). For all commands I had set LC_NUMERIC="de_DE.utf8".
(Each row: input locale and value, command, output locale and value.)

1. Input (C): 5.29
   echo 5.29 | awk '{ print int($1), sqrt($1) }'
   Output (C): 5 2.3
2. Input (C): 5.29
   echo 5.29 | awk --use-lc-numeric '{ print int($1), sqrt($1) }'
   Output (de_DE.utf8): 5 2,23607
3. Input (de_DE.utf8): 5,29
   echo 5,29 | awk '{ print int($1), sqrt($1) }'
   Output (C): 5 2.23607
4. Input (de_DE.utf8): 5,29
   echo 5,29 | awk --use-lc-numeric '{ print int($1), sqrt($1) }'
   Output (de_DE.utf8): 5 2,3

  • By default gawk assumes that input is in C numeric, and it will also output in C numeric (row 1).
  • If your input is not in C numeric, you need to tell gawk explicitly to consider the LC_NUMERIC setting (row 4); gawk does not honor LC_NUMERIC automatically.
  • All other combinations lead to wrong results (rows 2 and 3).
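Row 1 is straightforward to verify, since the default case requires no installed locale: without --use-lc-numeric, gawk falls back to C-style parsing and printing even with a German LC_NUMERIC in the environment. A sketch (the command uses plain awk; on the system described above that is gawk):

```shell
# With a German numeric locale in the environment but no
# --use-lc-numeric flag, awk still treats "5.29" as 5.29 and
# prints with a dot (row 1 of the table above).
echo 5.29 | LC_NUMERIC=de_DE.utf8 awk '{ print int($1), sqrt($1) }'
# prints: 5 2.3
```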