classification
Title: The inconsistency of codecs.charmap_decode
Type: behavior Stage: resolved
Components: Interpreter Core, Unicode Versions: Python 3.4, Python 3.2, Python 3.3, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: serhiy.storchaka Nosy List: doerwalter, eric.araujo, ezio.melotti, haypo, lemburg, loewis, python-dev, serhiy.storchaka
Priority: normal Keywords: needs review, patch

Created on 2012-05-18 14:46 by serhiy.storchaka, last changed 2013-01-15 14:09 by serhiy.storchaka. This issue is now closed.

Files
File name Uploaded Description Edit
decode_charmap_fffe-3.3.patch serhiy.storchaka, 2012-10-02 16:17 Patch for 3.3 and 3.4 review
decode_charmap_fffe-3.2.patch serhiy.storchaka, 2012-10-02 16:17 Patch for 3.2 review
decode_charmap_fffe-2.7.patch serhiy.storchaka, 2012-10-02 16:17 Patch for 2.7 review
Messages (14)
msg161054 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-05-18 14:46
codecs.charmap_decode behaves differently with native and user string as decode table.

>>> import codecs
>>> print(ascii(codecs.charmap_decode(b'\x00', 'replace', '\uFFFE')))
('\ufffd', 1)
>>> class S(str): pass
... 
>>> print(ascii(codecs.charmap_decode(b'\x00', 'replace', S('\uFFFE'))))
('\ufffe', 1)

It's because charmap decoder (function PyUnicode_DecodeCharmap in Objects/unicodeobject.c) uses different algorithms for exact strings and for other.

We need to fix it? If yes, what should return `codecs.charmap_decode(b'\x00', 'replace', {0:'\uFFFE'})`? What should return `codecs.charmap_decode(b'\x00', 'replace', {0:0xFFFE})`?
msg162574 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2012-06-10 01:34
What is the use case for passing a string subclass to charmap_decode?  Or in other words, how did you stumble upon the bug?
msg162580 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-06-10 07:59
U+FFFE is documented as representing an undefined mapping, see

http://docs.python.org/dev/c-api/unicode.html?highlight=charmap#PyUnicode_DecodeCharmap

So the base string case is correct; the derived string implementation also needs to invoke the error handler.
msg162582 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-06-10 10:51
> What is the use case for passing a string subclass to charmap_decode?  Or in other words, how did you stumble upon the bug?

I stumbled upon it, rewriting the charmap decoder (issue14874). Now
charmap decoder processes the two cases -- a more effective case of
string table and a general slower case of general mapping. I proposed a
more optimized case of 256-character UCS2 string (covers all standard
charmap encodings). If processing general strings and maps was
consistent, these cases can be merged. A string subclass is just an
example that illustrates the inconsistency.
msg162583 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-06-10 11:00
> U+FFFE is documented as representing an undefined mapping,

Yes, using U+FFFE for representing an undefined mapping in strings is
normal, the question was about string subclasses. And if we will correct
it for string subclasses, how far we go any further? How about general
mapping?
msg162584 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-06-10 11:26
>> U+FFFE is documented as representing an undefined mapping,
>
> Yes, using U+FFFE for representing an undefined mapping in strings is
> normal, the question was about string subclasses.

What is the question? U+FFFE also represents an undefined mapping in 
string subclasses.

 > And if we will correct it for string subclasses, how far we go any
 > further?

This is a single issue, a single bug. If the bug is fixed, it is fixed. 
No need to go further (unless there is another bug somewhere).
msg162588 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-06-10 15:59
> What is the question? U+FFFE also represents an undefined mapping in 
> string subclasses.

What about classes that not subclassed string but ducktyped string by
implementing all string method? What about list/tuple/array.array of
integers or 1-character strings? What about general mapping? Should any
of them have 0xFFFE or '\uFFFE' represent an undefined mapping?

> This is a single issue, a single bug. If the bug is fixed, it is fixed. 
> No need to go further (unless there is another bug somewhere).

My question, where is the limit of this bug.
msg162589 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-06-10 17:07
> integers or 1-character strings? What about general mapping? Should
 > any of them have 0xFFFE or '\uFFFE' represent an undefined mapping?

The documentation says that the parameter "can be a dictionary mapping 
byte or a unicode string, which is treated as a lookup table". So 
anything that supports GetItem with a small integer index can be passed.
It then says '...  U+FFFE “characters” are treated as “undefined mapping”'.

So the answer to your last question is "yes". I hope that the answer to
your other questions follows from that (strictly speaking, it's only
U+FFFE, not 0xFFFE, that is documented as indicating an undefined
mapping; a patch should probably fix that).

(I also wonder where the support for LookupError comes from - that 
appears to be undocumented)
msg162594 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-06-10 18:51
> So the answer to your last question is "yes". I hope that the answer to
> your other questions follows from that

Thank you, this is the answer to all my questions. I've prepared a patch
to treat U+FFFE in general mapping as “undefined mapping”.

>  (strictly speaking, it's only
> U+FFFE, not 0xFFFE, that is documented as indicating an undefined
> mapping; a patch should probably fix that).

As both integer 0xXXXX and string '\uXXXX' denote U+XXXX, I do not think
it necessary fixes.

> (I also wonder where the support for LookupError comes from - that 
> appears to be undocumented)

I believe, this is what is meant by the words "undefined mapping".
msg171813 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-10-02 16:17
Patch updated to resolve conflict with issue15379. Added tests. Added patches 
for 3.2 and 2.7.
msg173357 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-10-19 19:10
Does anyone have objections against the idea or the implementation of the patch?  Please review.
msg178321 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-12-27 20:36
I no one objects I will commit this next year.
msg180017 - (view) Author: Roundup Robot (python-dev) Date: 2013-01-15 13:40
New changeset 33a8ef498b1e by Serhiy Storchaka in branch '2.7':
Issue #14850: Now a chamap decoder treates U+FFFE as "undefined mapping"
http://hg.python.org/cpython/rev/33a8ef498b1e

New changeset 13cd78a2a17b by Serhiy Storchaka in branch '3.2':
Issue #14850: Now a chamap decoder treates U+FFFE as "undefined mapping"
http://hg.python.org/cpython/rev/13cd78a2a17b

New changeset 6ac4f1609847 by Serhiy Storchaka in branch '3.3':
Issue #14850: Now a chamap decoder treates U+FFFE as "undefined mapping"
http://hg.python.org/cpython/rev/6ac4f1609847

New changeset 03e22cc9407a by Serhiy Storchaka in branch 'default':
Issue #14850: Now a chamap decoder treates U+FFFE as "undefined mapping"
http://hg.python.org/cpython/rev/03e22cc9407a
msg180019 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-01-15 14:08
Fixed. Thank you for your answers, Martin.
History
Date User Action Args
2013-01-15 14:09:39serhiy.storchakasetstatus: open -> closed
resolution: fixed
stage: patch review -> resolved
2013-01-15 14:08:56serhiy.storchakasetmessages: + msg180019
2013-01-15 13:40:02python-devsetnosy: + python-dev
messages: + msg180017
2012-12-27 20:48:27serhiy.storchakasetassignee: serhiy.storchaka
2012-12-27 20:36:51serhiy.storchakasetmessages: + msg178321
2012-10-24 09:05:58serhiy.storchakasetstage: patch review
2012-10-19 23:38:00pitrousetnosy: + haypo
2012-10-19 19:10:20serhiy.storchakasetmessages: + msg173357
2012-10-02 16:20:25serhiy.storchakasetkeywords: + needs review
components: + Unicode
versions: + Python 3.4
2012-10-02 16:17:31serhiy.storchakasetfiles: + decode_charmap_fffe-3.3.patch, decode_charmap_fffe-3.2.patch, decode_charmap_fffe-2.7.patch

messages: + msg171813
2012-10-02 15:05:15serhiy.storchakasetfiles: - decode_charmap_fffe.patch
2012-06-16 16:31:55pitrousetnosy: + ezio.melotti
2012-06-10 18:51:28serhiy.storchakasetfiles: + decode_charmap_fffe.patch
keywords: + patch
messages: + msg162594
2012-06-10 17:07:59loewissetmessages: + msg162589
2012-06-10 15:59:14serhiy.storchakasetmessages: + msg162588
2012-06-10 11:26:39loewissetmessages: + msg162584
2012-06-10 11:00:13serhiy.storchakasetmessages: + msg162583
2012-06-10 10:51:16serhiy.storchakasetmessages: + msg162582
2012-06-10 07:59:24loewissetmessages: + msg162580
2012-06-10 01:34:04eric.araujosetnosy: + eric.araujo
messages: + msg162574
2012-05-18 20:59:10terry.reedysetnosy: + lemburg, doerwalter
2012-05-18 14:53:23pitrousetnosy: + loewis
2012-05-18 14:46:35serhiy.storchakacreate