classification
Title: Charmap decoding of non-BMP characters
Type: behavior Stage: resolved
Components: Interpreter Core Versions: Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: amaury.forgeotdarc, benjamin.peterson, ezio.melotti, haypo, lemburg, pitrou, python-dev, serhiy.storchaka
Priority: low Keywords: needs review, patch

Created on 2012-07-17 08:15 by serhiy.storchaka, last changed 2012-11-17 20:17 by pitrou. This issue is now closed.

Files
File name Uploaded Description
decode_charmap_maxchar-3.3_2.patch serhiy.storchaka, 2012-09-21 20:14 Patch for 3.3
decode_charmap_maxchar-3.2_2.patch serhiy.storchaka, 2012-09-21 20:14 Patch for 3.2
decode_charmap_maxchar-2.7.patch serhiy.storchaka, 2012-10-02 16:39 Patch for 2.7
Messages (16)
msg165688 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-07-17 08:15
Yet another inconsistency in the charmap codec.

>>> import codecs
>>> codecs.charmap_decode(b'\x00', 'strict', '\U0002000B')
('𠀋', 1)
>>> codecs.charmap_decode(b'\x00', 'strict', {0: '\U0002000B'})
('𠀋', 1)
>>> codecs.charmap_decode(b'\x00', 'strict', {0: 0x2000B})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: character mapping must be in range(65536)

The suggested patch removes this unnecessary limitation in the charmap decoder.
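
For reference, here is a minimal sketch of the behavior the report asks for, assuming an interpreter that already contains the fix (Python 3.3 or later is assumed here). The variable names are illustrative only; the calls are the same ones shown in the session above. All three mapping forms are expected to produce the same result.

```python
import codecs

# U+2000B lies outside the Basic Multilingual Plane (above U+FFFF).
data = b'\x00'

# The same mapping expressed three ways: a string, an int -> str dict,
# and an int -> int dict (the form that used to raise TypeError).
from_string = codecs.charmap_decode(data, 'strict', '\U0002000B')
from_int2str = codecs.charmap_decode(data, 'strict', {0: '\U0002000B'})
from_int2int = codecs.charmap_decode(data, 'strict', {0: 0x2000B})

print(from_string == from_int2str == from_int2int)
```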
msg165690 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-07-17 08:54
Could you add a test to your patch?
Is the issue 3.3-specific?
msg165710 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-07-17 11:36
Fixing this for 3.2 and earlier is possible but expensive because of the narrow build limitation. If necessary, I will provide a patch, but it would be easier to mark this as "won't fix" for pre-3.3 versions.

Here are tests for charmap decoding. They cover not only this issue but all previously uncovered cases with int2str and int2int mappings.
msg165753 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2012-07-18 11:02
In 3.2, the narrow build is also broken when the "charmap" is a string:
>>> codecs.charmap_decode(b'\0', 'strict', '\U0002000B')
returns ('𠀋', 1) with a wide unicode build, but ('\ud840', 1) with a narrow build.

3.2 could be fixed to allow characters up to sys.maxunicode, though.
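
The '\ud840' seen on the narrow build is consistent with it being only the high half of the UTF-16 surrogate pair for U+2000B. The sketch below shows the standard UTF-16 split; the helper name to_surrogate_pair is hypothetical and is not taken from any of the patches.

```python
def to_surrogate_pair(code_point):
    """Split a non-BMP code point into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low

high, low = to_surrogate_pair(0x2000B)
print(hex(high), hex(low))  # 0xd840 0xdc0b
```

A narrow build stores both code units, so a decoder that writes a single unit per input byte keeps only the first one.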
msg165786 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-07-18 15:48
Well, here is a patch for 3.2.
msg165796 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2012-07-18 20:26
About the patch for 3.2:
   "needed = 6 - extrachars"
Where does this 6 come from?  There is another part which uses this "extrachars".  Why is the allocation strategy different here?
msg165798 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-07-18 20:33
It's the same strategy.
"needed = (targetsize - extrachars) + (targetsize << 2)". targetsize == 2.
msg165801 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2012-07-18 21:07
Ah, I was worried about possible quadratic behavior.  So the other (existing) case is quadratic as well (I was misled by the <<, which made me think there was something clever there).
That's good enough for 3.2, I guess.
msg170567 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-09-16 18:43
Ping.
msg170913 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-09-21 20:14
Patches updated: added a few new tests, used MAX_UNICODE, and slightly changed the extrachars growth step.
msg171069 - (view) Author: Roundup Robot (python-dev) Date: 2012-09-23 18:01
New changeset 620d23f7ad41 by Antoine Pitrou in branch '3.2':
Issue #15379: Fix passing of non-BMP characters as integers for the charmap decoder (already working as unicode strings).
http://hg.python.org/cpython/rev/620d23f7ad41

New changeset c64dec45d46f by Antoine Pitrou in branch 'default':
Issue #15379: Fix passing of non-BMP characters as integers for the charmap decoder (already working as unicode strings).
http://hg.python.org/cpython/rev/c64dec45d46f
msg171070 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-09-23 18:02
Thank you, I've committed the patches. There was a test failure in test_codeccallbacks in 3.2, which I fixed simply by replacing sys.maxunicode with a hardcoded 0x110000.
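
To illustrate the upper bound involved, the sketch below probes which integer targets the decoder accepts. It assumes a Python 3.3+ interpreter with the fix applied; the helper name decodes is hypothetical and only wraps the call used earlier in this issue.

```python
import codecs

def decodes(code_point):
    """Return True if the charmap decoder accepts this int as a mapping target."""
    try:
        codecs.charmap_decode(b'\x00', 'strict', {0: code_point})
    except TypeError:
        return False
    return True

# 0x10FFFF is the last valid code point; 0x110000 is one past it.
print(decodes(0x10FFFF), decodes(0x110000))
```

On a narrow build sys.maxunicode is 0xFFFF, so a test written against sys.maxunicode + 1 no longer probes the first rejected value, which is presumably why the hardcoded 0x110000 was needed.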
msg171814 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-10-02 16:39
We forgot about 2.7 (because I had not planned to apply it even to 3.2). Here is a backported patch.
msg173356 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-10-19 19:05
The 2.7 patch is just a backport of the 3.2 patch (including Antoine's last fix).  Please review and commit.
msg175802 - (view) Author: Roundup Robot (python-dev) Date: 2012-11-17 20:17
New changeset c7ce91756472 by Antoine Pitrou in branch '2.7':
Issue #15379: Fix passing of non-BMP characters as integers for the charmap decoder (already working as unicode strings).
http://hg.python.org/cpython/rev/c7ce91756472
msg175803 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-11-17 20:17
Thanks for the backport, committed!
History
Date User Action Args
2012-11-17 20:17:42 pitrou set stage: commit review -> resolved
2012-11-17 20:17:36 pitrou set status: open -> closed; messages: + msg175803
2012-11-17 20:17:14 python-dev set messages: + msg175802
2012-10-24 09:10:51 serhiy.storchaka set stage: resolved -> commit review; versions: - Python 3.2, Python 3.3
2012-10-19 19:05:41 serhiy.storchaka set messages: + msg173356
2012-10-02 16:40:09 serhiy.storchaka set versions: + Python 2.7
2012-10-02 16:39:25 serhiy.storchaka set status: closed -> open; files: + decode_charmap_maxchar-2.7.patch; messages: + msg171814
2012-09-23 18:02:19 pitrou set stage: patch review -> resolved
2012-09-23 18:02:05 pitrou set status: open -> closed; resolution: fixed; messages: + msg171070
2012-09-23 18:01:12 python-dev set nosy: + python-dev; messages: + msg171069
2012-09-21 20:16:26 serhiy.storchaka set files: - decode_charmap_maxchar-3.2.patch
2012-09-21 20:16:16 serhiy.storchaka set files: - decode_charmap_tests.patch
2012-09-21 20:16:09 serhiy.storchaka set files: - decode_charmap_maxchar.patch
2012-09-21 20:14:09 serhiy.storchaka set files: + decode_charmap_maxchar-3.3_2.patch, decode_charmap_maxchar-3.2_2.patch; messages: + msg170913
2012-09-16 18:43:26 serhiy.storchaka set messages: + msg170567
2012-08-05 10:48:52 serhiy.storchaka set stage: needs patch -> patch review
2012-08-05 10:47:32 serhiy.storchaka set keywords: + needs review; priority: normal -> low; stage: patch review -> needs patch
2012-07-18 21:07:56 amaury.forgeotdarc set messages: + msg165801
2012-07-18 20:33:37 serhiy.storchaka set messages: + msg165798
2012-07-18 20:26:53 amaury.forgeotdarc set messages: + msg165796
2012-07-18 15:48:30 serhiy.storchaka set files: + decode_charmap_maxchar-3.2.patch; messages: + msg165786; versions: + Python 3.2
2012-07-18 11:02:19 amaury.forgeotdarc set nosy: + amaury.forgeotdarc; messages: + msg165753
2012-07-17 11:36:02 serhiy.storchaka set files: + decode_charmap_tests.patch; messages: + msg165710
2012-07-17 08:54:29 pitrou set nosy: + lemburg, pitrou, haypo, benjamin.peterson, ezio.melotti; messages: + msg165690; stage: patch review
2012-07-17 08:15:57 serhiy.storchaka create