In this article I will discuss the relationship between the environment variable LC_NUMERIC and various versions of awk.
My example is a very simple one: calculate the integer part and the square root of a given value
echo 5.29 | awk '{ print int($1), sqrt($1) }'The expected output:
5 2.3
Now imagine a user in Germany who is using a 'german' environment and part of it is the following setting:
LC_NUMERIC=de_DE.UTF-8The output of the code above will be this:
0 0What a surprise one might think. Whatever your real code would have been doing with these values it would have ended up somewhere else where you wanted it to be.
(The examples above have been run in Solaris (10 or 11) with the old original awk.)
Let's see what we get with Solaris' nawk, MacOS' awk (on 10.5.8) and Linux' (Ubuntu 11) mawk:
LC_NUMERIC | ||
---|---|---|
not set or e.g. en_US | de_DE.UTF-8 (de_DE.utf8 on Ubuntu) | |
Solaris awk | 5 2.3 | 0 0 |
Solaris nawk or MacOS awk | 5 2.3 | 5 2,23607 |
Ubuntu mawk | 5 2.3 | 5 2.3 |
So what you should note:
How can these seemingly strange results be explained?
Well, languages and cultures differ in many things and one of them is the mark which is used as decimal separator.
In the english speaking world the '.' (dot) is being used whereas others like e.g. germans are using ',' (comma).
That has important consequences when working in a german environment: a string like '5.29' is not viewed as a real number but as a string.
nawk and MacOS awk try to get as much 'number' out of this string as possible which is simply the '5' before the dot and the integer and square root functions are applied to '5' (not to '5.29') and of these two functions the square root is off track.
The original awk for some reasons treats the whole '5.29' as a string and numeric functions applied to a string lead to zero for both functions.
Why Ubuntu's mawk does not honor locale settings at all seems strange to me after having seen how the other awks behave. I have not researched though whether this is known, a bug or considered to be a feature and I haven't tested on newer Ubuntu or other Linux versions.
Conclusion: even when writing seemingly simple scripts the environment (or more precise: locale setting) plays an important role in the well-functioning and portability.
If one is working for some time on a particular system one tends to take for granted some of the functionality which is not present or different on other systems.
An input value of '5,29' would conversely work well with the german setting but not the english one.
Note that also the LANG parameter has the same influence on the proper working of real numbers.
Food for thought:
the decimal separator issue (and others as well e.g. date format) can create bigger problems.
Think about the simple case of a text file which was created by users entering decimal numbers for this and that.
You have one text file from the US and one from Germany and you want to use its contents with one script.
But one file delivers dot separated and the other comma separated input fields! A new hurdle.
Also a CSV file in its original sense 'comma separated' does not work well in a german environment where the field separator would be equal to the decimal separator and thus another non-textual field separator should be chosen (TAB is preferrable anyway).
Therefore more complex business applications with attached databases or even the Office programs like LibreOffice (calc) store numbers in just one format and they use the presentation layer to show numbers in whichever way the actual user wants to see them.
This guarantees that the raw data stored in the databases can be mixed and used together regardless of the language setting of the users.
And of course there are a couple of more languages out there with other features being represented in the locale settings and thus creating possible pitfalls to the unaware coder.
Update April 2017
Having familiarized myself a bit more with gawk the issue is how awk versions interpret input and output (which can be different, at least in gawk). gawk supports a particular option --use-lc-numeric
to interpret the output.
Below I am comparing inputs in C numeric with dot as a decimal separator and de_DE.utf8 numeric with comma as a decimal separator.
For all commands I had set LC_NUMERIC="de_DE.utf8"
.
Input | Command | Output | ||
---|---|---|---|---|
C | 5.29 | echo 5.29 | | C | 5 2.3 |
C | 5.29 | echo 5.29 | | de_DE.utf8 | 5 2,23607 |
de_DE.utf8 | 5,29 | echo 5.29 | | C | 5 2.23607 |
de_DE.utf8 | 5,29 | echo 5.29 | | de_DE.utf8 | 5 2,3 |
Thanks for the good explanation and tips on this subject! Cheers!
ReplyDelete