Message190881
As a design principle, "accept what's unambiguous in any locale" is reasonable, but it is hard to apply consistently. I agree that the status quo is hard to defend. After a long discussion, it was agreed that fullwidth digits should be accepted, and now float(u'１２３') is valid, but float('＋１２３'), float('－１２３') and float('１２⒊') are not. The last example is
>>> '\N{FULLWIDTH DIGIT ONE}\N{FULLWIDTH DIGIT TWO}\N{DIGIT THREE FULL STOP}'
'１２⒊'
All these variations can be neatly addressed by applying NFKC or NFKD normalization to Unicode data before conversion:
>>> import unicodedata
>>> float(unicodedata.normalize('NFKD', '＋１２３'))
123.0
>>> float(unicodedata.normalize('NFKD', '－１２３'))
-123.0
>>> float(unicodedata.normalize('NFKC', '１２⒊'))
123.0
This would even allow parsing fullwidth hexadecimal numbers:
>>> float.fromhex(unicodedata.normalize('NFKC', '０ｘ⒈７ｐ３'))
11.5
>>> int(unicodedata.normalize('NFKC', '７Ｆ'), 16)
127
but would not help with the MINUS SIGN.
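To make the limitation concrete, here is a minimal sketch (not a proposed API; the helper name nfkc_float is made up for illustration). It relies on FULLWIDTH HYPHEN-MINUS having a compatibility mapping to the ASCII hyphen-minus, while MINUS SIGN has no decomposition at all and so passes through NFKC unchanged:

```python
import unicodedata

def nfkc_float(text):
    """Hypothetical helper: NFKC-normalize the text, then defer to float()."""
    return float(unicodedata.normalize('NFKC', text))

# FULLWIDTH HYPHEN-MINUS is folded to '-' by NFKC, so this parses.
fullwidth = '\N{FULLWIDTH HYPHEN-MINUS}\N{FULLWIDTH DIGIT ONE}'
print(nfkc_float(fullwidth))

# MINUS SIGN survives normalization untouched ...
minus = '\N{MINUS SIGN}1'
print(unicodedata.normalize('NFKC', minus) == minus)

# ... so the conversion is still left to float(), which is expected
# to reject it in the versions discussed here.
try:
    nfkc_float(minus)
except ValueError as exc:
    print('rejected:', exc)
```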
Allowing '\N{MINUS SIGN}' is particularly attractive because arguably Unicode text should prefer it to the ambiguous '\N{HYPHEN-MINUS}', but by the same token fractions.Fraction() should accept '\N{FRACTION SLASH}' in addition to the legacy '\N{SOLIDUS}'.
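One way a caller could get both behaviours today is to fold the two characters explicitly after normalizing. This is only a sketch under my own assumptions: the names _FOLD and fold_numeric are hypothetical, and which characters belong in the table is exactly the kind of question that would need a wider discussion.

```python
import unicodedata
from fractions import Fraction

# Assumed mapping, for illustration only: fold the "proper" Unicode
# operators onto the ASCII characters the stdlib parsers understand.
_FOLD = str.maketrans({
    '\N{MINUS SIGN}': '-',
    '\N{FRACTION SLASH}': '/',
})

def fold_numeric(text):
    """Hypothetical helper: NFKC-normalize, then apply the extra folds."""
    return unicodedata.normalize('NFKC', text).translate(_FOLD)

print(float(fold_numeric('\N{MINUS SIGN}1.5')))        # -1.5
print(Fraction(fold_numeric('3\N{FRACTION SLASH}4')))  # 3/4
```

A side effect worth noting: NFKC decomposes VULGAR FRACTION ONE HALF into '1', FRACTION SLASH, '2', so with this fold Fraction(fold_numeric('½')) comes out as Fraction(1, 2). Whether that is desirable is another question.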
Overall, I think this situation calls for a PEP-size proposal and discussion about handling Unicode numerical data throughout the stdlib, rather than a case-by-case discussion of the various quirks in the current version.
Date | User | Action | Args
2013-06-10 01:55:58 | belopolsky | set | recipients: + belopolsky, lemburg, loewis, rhettinger, terry.reedy, mark.dickinson, ggenellina, pitrou, eric.smith, ezio.melotti, skrah, lukasz.langa, tchrist
2013-06-10 01:55:58 | belopolsky | set | messageid: <1370829358.56.0.512052871642.issue6632@psf.upfronthosting.co.za>
2013-06-10 01:55:58 | belopolsky | link | issue6632 messages
2013-06-10 01:55:57 | belopolsky | create |