The inconsistency of codecs.charmap_decode #59055
Comments
codecs.charmap_decode behaves differently with a native string and a user-defined string subclass as the decode table.
>>> import codecs
>>> print(ascii(codecs.charmap_decode(b'\x00', 'replace', '\uFFFE')))
('\ufffd', 1)
>>> class S(str): pass
...
>>> print(ascii(codecs.charmap_decode(b'\x00', 'replace', S('\uFFFE'))))
('\ufffe', 1)
This is because the charmap decoder (function PyUnicode_DecodeCharmap in Objects/unicodeobject.c) uses different algorithms for exact strings and for other objects. Do we need to fix it? If yes, what should return |
What is the use case for passing a string subclass to charmap_decode? Or in other words, how did you stumble upon the bug? |
U+FFFE is documented as representing an undefined mapping, see http://docs.python.org/dev/c-api/unicode.html?highlight=charmap#PyUnicode_DecodeCharmap So the base string case is correct; the derived string implementation also needs to invoke the error handler. |
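As a rough illustration of what "invoke the error handler" means for a plain string table, here is a minimal sketch, assuming a current CPython 3 interpreter. The table contents and the helper name `decode_with` are made up for this example and do not come from the issue.

```python
import codecs

# Decoding table given as an exact str: byte 0 maps to U+FFFE,
# which the C API docs describe as an undefined mapping.
TABLE = '\ufffe'

def decode_with(errors):
    """Decode the single byte 0x00 with the given error handler (illustrative helper)."""
    return codecs.charmap_decode(b'\x00', errors, TABLE)

replaced = decode_with('replace')   # handler substitutes U+FFFD
ignored = decode_with('ignore')     # handler drops the undecodable byte

try:
    decode_with('strict')
    strict_raised = False
except UnicodeDecodeError:
    strict_raised = True            # strict handler raises instead of decoding

print(ascii(replaced), ascii(ignored), strict_raised)
```

In every case the U+FFFE entry is never returned as decoded text; what happens instead is decided by the `errors` argument.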
I stumbled upon it while rewriting the charmap decoder (bpo-14874). Now |
Yes, using U+FFFE for representing an undefined mapping in strings is |
What is the question? U+FFFE also represents an undefined mapping in
> And if we will correct it for string subclasses, how far we go any
This is a single issue, a single bug. If the bug is fixed, it is fixed. |
What about classes that do not subclass string but duck-type a string by
My question is: where is the limit of this bug? |
The documentation says that the parameter "can be a dictionary mapping
So the answer to your last question is "yes". I hope that the answer to
(I also wonder where the support for LookupError comes from - that |
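To make the dictionary case concrete, the sketch below decodes with a mapping whose values are a string, an integer code point, and None, plus one byte that has no key at all. It assumes a current CPython 3 interpreter; the mapping contents are invented for illustration.

```python
import codecs

# Illustrative mapping: str value, int code point, None, and a missing key (3).
mapping = {
    0: 'a',     # string value is used as-is
    1: 0x62,    # integer is interpreted as a code point ('b')
    2: None,    # None marks an undefined mapping
}

# Byte 3 has no entry; the resulting lookup failure is also treated
# as an undefined mapping and handed to the error handler.
text, consumed = codecs.charmap_decode(b'\x00\x01\x02\x03', 'replace', mapping)

print(ascii(text), consumed)
```

With `'replace'`, both the None entry and the missing key end up as U+FFFD rather than raising.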
Thank you, this is the answer to all my questions. I've prepared a patch
As both integer 0xXXXX and string '\uXXXX' denote U+XXXX, I do not think
I believe this is what is meant by the words "undefined mapping". |
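The equivalence of the two spellings can be checked with a short sketch. It assumes an interpreter that already includes the fix from this issue, where a mapping value of integer 0xFFFE and a one-character string '\ufffe' are both treated as undefined.

```python
import codecs

data = b'\x00'

# Same undefined marker, spelled once as an int and once as a str.
via_int = codecs.charmap_decode(data, 'replace', {0: 0xFFFE})
via_str = codecs.charmap_decode(data, 'replace', {0: '\ufffe'})

print(ascii(via_int), ascii(via_str))
```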
Patch updated to resolve conflict with bpo-15379. Added tests. Added patches |
Does anyone have objections against the idea or the implementation of the patch? Please review. |
If no one objects, I will commit this next year. |
New changeset 33a8ef498b1e by Serhiy Storchaka in branch '2.7':
New changeset 13cd78a2a17b by Serhiy Storchaka in branch '3.2':
New changeset 6ac4f1609847 by Serhiy Storchaka in branch '3.3':
New changeset 03e22cc9407a by Serhiy Storchaka in branch 'default': |
Fixed. Thank you for your answers, Martin. |
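On an interpreter that includes these changesets, the original reproduction can be turned into a small consistency check. This is only a sketch of such a check, not the test that was committed with the patch.

```python
import codecs

class S(str):
    """Trivial str subclass, as in the original report."""
    pass

# The same one-entry table, once as an exact str and once as a subclass.
exact = codecs.charmap_decode(b'\x00', 'replace', '\ufffe')
subclass = codecs.charmap_decode(b'\x00', 'replace', S('\ufffe'))

print(ascii(exact), ascii(subclass))
```

Both calls are expected to hand the U+FFFE entry to the error handler, so the two results should compare equal.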