classification
Title: locale.format bug if thousand separator is space (french separator as example)
Type: behavior Stage:
Components: Library (Lib) Versions: Python 2.6
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: loewis Nosy List: edlm10, eric.smith, georg.brandl, loewis, mark.dickinson, pitrou, waveform, wdoekes
Priority: normal Keywords: patch

Created on 2007-09-30 14:15 by edlm10, last changed 2009-03-18 21:30 by eric.smith. This issue is now closed.

Files
File name Uploaded Description Edit
thousands_sep.patch pitrou, 2008-07-25 21:02
Messages (23)
msg56200 - (view) Author: Éric de la Musse (edlm10) Date: 2007-09-30 17:17
locale.format function delete spaces in result which is a problem when
thousand separator is space (in French for example).

The problem seems in the code below (extract from locale.py):

145     while seps:
146         # If the number was formatted for a specific width, then it
147         # might have been filled with spaces to the left or right. If
148         # so, kill as much spaces as there where separators.
149         # Leading zeroes as fillers are not yet dealt with, as it is
150         # not clear how they should interact with grouping.
151         sp = result.find(" ")
152         if sp==-1:break
153         result = result[:sp]+result[sp+1:]
154         seps -= 1

Example :

>>> import locale
>>> locale.setlocale(locale.LC_NUMERIC)
'C'
>>> locale.setlocale(locale.LC_NUMERIC, "fr_FR.UTF-8")
'fr_FR.UTF-8'
>>> locale.format("%.2f", 12345.67, True)
'12345,67'

The correct result is '12 345,67' not '12345,67'

and if I call

>>> locale.format("%9.2f", 12345.67, True)
'12 345,67'

the result is correct

Is this behavior correct or a bug?
msg59829 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-01-12 17:18
Just tried on 2.5.1 and on trunk, I can't reproduce it. On both I get:

>>> import locale
>>> locale.setlocale(locale.LC_NUMERIC, "fr_FR.UTF-8")
'fr_FR.UTF-8'
>>> locale.format("%9.2f", 12345.67, True)
' 12345,67'
>>> locale.format("%.2f", 12345.67, True)
'12345,67'
>>> locale.localeconv()['thousands_sep']
''

Which means that there is no thousands separator for the French locale
on my system (a Mandriva Linux box).
msg60084 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-01-18 10:49
Ok, I manage to reproduce it on an Ubuntu box with both 2.5.2a0 and SVN
trunk.
IMO it's a bug since the expected behaviour would be to enforce the
locale settings for number formatting. That's the whole purpose of
locale.format() after all. (no to mention the discrepancy between %.2f
and %9.2f)
msg60093 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-01-18 14:18
Eric, your diagnostic looks right, format() gets confused when it tries
to remove padding characters to account for the added thousands
separators. It does not check that there were padding characters in the
first place, and it assumes that the thousands separator is not a space
character itself.

Since Georg Brandl and Martin von Loewis are listed as co-authors, I add
them to the cc list ;)
msg60136 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2008-01-19 09:52
The space removal was there before my change; assigning to Martin.
msg61929 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-01-31 21:16
Here is a patch. It is fairly big and I'm sorry about it, here is why:
1. it reworks the test suite using unittest, and makes it more comprehensive
2. it adds a feature in locale.py which allows overriding the localeconv
dictionary (this is necessary for proper testing)
3. it fixes the present bug, and also fixes integer formatting with
grouping and padding (which was broken, as well as untested)

Point 2 above may be controversial, it is just a dictionary (by default
empty) named _override_localeconv, if you prefer we can make it a real
API, e.g. _set_localeconv().
msg61956 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2008-02-01 10:55
Looks good to me.
msg65558 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-04-16 18:20
Here is an updated patch against trunk.
msg65582 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-04-17 10:12
I see that http://wiki.python.org/moin/PythonBugDay says "need to test
on non-MacOS platform". Actually, I've tested the patch under Mandriva
and Debian Linux.
msg65912 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-04-28 12:00
I've had the opportunity to test on a Windows box and there are various
failures in the TestStringMethods test case. If someone with more
knowledge of the Windows world could take a lookm it would be nice.
msg65914 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-04-28 12:30
Ok, here is an updated patch for Windows compatibility of the test suite.
msg70173 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-07-23 14:57
Would someone object to committing this before beta3?
For clarity I would first commit the rewrite of test_locale to use
unittest, and then the fix for the thousands separator bug.
msg70272 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2008-07-25 20:01
I'd recommend breaking the patch into the 3 parts mentioned in msg61929,
then committing the first 2 parts.  That should make the 3rd part much
easier to evaluate.
msg70274 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-07-25 20:55
The first 2 parts have been committed in r65237. I'll soon provide a
patch for the third part.
msg70275 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-07-25 21:02
Here is the patch for the third part.
msg75854 - (view) Author: Walter Doekes (wdoekes) Date: 2008-11-14 09:46
@Antoine Pitrou

Regarding "# XXX is the trailing space a bug?": I'm inclined to believe
that it is. Man 7 locale does not mention that p_sep_by_space should be
used for non-int currency symbols, nor that it shouldn't. However, it
does say:

""" char *int_curr_symbol; /* First three chars are a currency symbol
from ISO 4217.  Fourth char is the separator. Fifth char is ’ ’. */ """

I haven't seen that fifth character, and in the libc sources I can't
find one either:

glibc-2.7/localedata/locales/nl_NL:66:int_curr_symbol
"<U0045><U0055><U0052><U0020>"

I do however see a separator.

In libc documentation I found here (
http://www.chemie.fu-berlin.de/chemnet/use/info/libc/libc_19.html#SEC325
), it says the following:

""" In other words, treat all nonzero values alike in these members.
These members apply only to currency_symbol. When you use
int_curr_symbol, you never print an additional space, because
int_curr_symbol itself contains the appropriate separator. The POSIX
standard says that these two members apply to the int_curr_symbol as
well as the currency_symbol. But an example in the ANSI C standard
clearly implies that they should apply only to the
currency_symbol---that the int_curr_symbol contains any appropriate
separator, so you should never print an additional space. Based on what
we know now, we recommend you ignore these members when printing
international currency symbols, and print no extra space. """

This is probably not right either, because this forces you to use an
n_sign_posn and p_sign_posn that have the symbol on the same side of the
value. (Although that might not be such an awful assumption.) And, more
importantly, a grep through the sources reveal that no language has a
preceding space. (But they do, I assume, have *_sign_posn's that want a
trailing symbol.)

""" glibc-2.7/localedata/locales$ grep ^int_curr_symbol * | grep -vE
'(<U0020>| )"' | wc -l
0 """

That leaves us with two more options. Option three: the fourth character
is optional and defines what the separator is but not _where_ it should
be. I.e. you might have to move it according to what *_sign_posn says.

And finally, option four: international formatting should ignore all of
*_cs_precedes, *_sep_by_space and *_sign_posn. Locale(7) explicitly
mentions currency_symbol, not int_cur_symbol. Perhaps one should assume
that international notation is universal. (I would guess that most
common is:
<int_curr_symbol><space><optional_minus><num><mon_thousands_sep><num><mon_decimal_point><num>)


Personally I would go with option three. It has the least impact on
formatting because it only cleans up spaces. I'm guessing however that
option four is the Right One.
msg83549 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-03-14 00:16
Committed in r70356, r70357, r70358, r70359. I didn't try to change
anything about the trailing space since it's minor anyway.
msg83753 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-03-18 15:50
r70357 seems to have caused the sparc solaris py3k buildbot to start failing on 
test_locale:


======================================================================
FAIL: test_integer_grouping_and_padding (test.test_locale.TestNumberFormatting)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home2/buildbot/slave/3.x.loewis-sun/build/Lib/test/test_locale.py", line 181, 
in test_integer_grouping_and_padding
    out=('4%s200' % self.sep).rjust(10))
  File "/home2/buildbot/slave/3.x.loewis-sun/build/Lib/test/test_locale.py", line 143, 
in _test_format
    func=locale.format, **format_opts)
  File "/home2/buildbot/slave/3.x.loewis-sun/build/Lib/test/test_locale.py", line 139, 
in _test_formatfunc
    func(format, value, **format_opts), out)
AssertionError: '     4200' != '      4200'


Any idea what's going on here?
msg83757 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-03-18 17:02
Hi Mark,

> AssertionError: '     4200' != '      4200'
> 
> Any idea what's going on here?

The problem seems to be that the thousands separator on the Solaris
variant of en_US is an empty string (rather than a comma) (*), and
apparently it hits a bug in the padding mechanism (which perhaps assumes
that the thousands separator is always a 1-character string).

(*) (another reason not to use any C locale-based mechanism for
localization, by the way...)
msg83758 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-03-18 17:05
By the way, Martin, do you have any idea why the Solaris buildbot
doesn't seem to be enabled on trunk?
msg83761 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-03-18 17:11
Trying a fix in r70458.
msg83769 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-03-18 18:57
It looks like that did the trick.  Thank you!
msg83779 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2009-03-18 21:29
Antoine Pitrou wrote:
> The problem seems to be that the thousands separator on the Solaris
> variant of en_US is an empty string (rather than a comma) (*), and
> apparently it hits a bug in the padding mechanism (which perhaps assumes
> that the thousands separator is always a 1-character string).
> 
> (*) (another reason not to use any C locale-based mechanism for
> localization, by the way...)

I've come to believe this, too. I'm working on cleaning up the C 
implementations so I can do all of the locale-based formating without 
using the locale functions. I'll use the localeconv values, but that's it.
History
Date User Action Args
2009-03-18 21:30:00eric.smithsetmessages: + msg83779
title: locale.format bug if thousand separator is space (french separator as example) -> locale.format bug if thousand separator is space (french separator as example)
2009-03-18 18:57:53mark.dickinsonsetmessages: + msg83769
2009-03-18 17:11:42pitrousetmessages: + msg83761
2009-03-18 17:05:27pitrousetmessages: + msg83758
2009-03-18 17:02:31pitrousetmessages: + msg83757
2009-03-18 15:50:26mark.dickinsonsetnosy: + mark.dickinson
messages: + msg83753
2009-03-14 00:16:26pitrousetstatus: open -> closed

messages: + msg83549
resolution: fixed
2008-11-14 09:46:36wdoekessetmessages: + msg75854
2008-11-13 13:49:11wdoekessetnosy: + wdoekes
2008-07-25 21:02:50pitrousetfiles: - 1222.patch
2008-07-25 21:02:46pitrousetfiles: - bug1222.patch
2008-07-25 21:02:41pitrousetfiles: + thousands_sep.patch
messages: + msg70275
2008-07-25 20:55:45pitrousetmessages: + msg70274
2008-07-25 20:01:50eric.smithsetmessages: + msg70272
2008-07-23 14:57:28pitrousetpriority: normal
messages: + msg70173
2008-04-28 12:30:33pitrousetfiles: + bug1222.patch
messages: + msg65914
2008-04-28 12:01:02pitrousetmessages: + msg65912
2008-04-23 16:33:51eric.smithsetnosy: + eric.smith
2008-04-17 10:12:33pitrousetmessages: + msg65582
2008-04-16 18:20:52pitrousetfiles: + 1222.patch
keywords: + patch
messages: + msg65558
2008-04-16 18:20:06pitrousetfiles: - 1222.patch
2008-02-01 10:55:32georg.brandlsetmessages: + msg61956
2008-01-31 21:16:44pitrousetfiles: + 1222.patch
messages: + msg61929
2008-01-19 09:52:59georg.brandlsetassignee: loewis
messages: + msg60136
2008-01-18 14:18:34pitrousetnosy: + loewis, georg.brandl
messages: + msg60093
2008-01-18 10:49:30pitrousetversions: + Python 2.6, - Python 2.4
2008-01-18 10:49:10pitrousetmessages: + msg60084
2008-01-12 17:18:20pitrousetnosy: + pitrou
messages: + msg59829
2007-11-24 21:34:12waveformsetnosy: + waveform
2007-09-30 17:17:48edlm10setnosy: + edlm10
messages: + msg56200
2007-09-30 14:15:46edlm10create