Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The inconsistency of codecs.charmap_decode #59055

Closed
serhiy-storchaka opened this issue May 18, 2012 · 14 comments
Closed

The inconsistency of codecs.charmap_decode #59055

serhiy-storchaka opened this issue May 18, 2012 · 14 comments
Assignees
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-unicode type-bug An unexpected behavior, bug, or error

Comments

@serhiy-storchaka
Copy link
Member

BPO 14850
Nosy @malemburg, @loewis, @doerwalter, @vstinner, @ezio-melotti, @merwok, @serhiy-storchaka
Files
  • decode_charmap_fffe-3.3.patch: Patch for 3.3 and 3.4
  • decode_charmap_fffe-3.2.patch: Patch for 3.2
  • decode_charmap_fffe-2.7.patch: Patch for 2.7
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/serhiy-storchaka'
    closed_at = <Date 2013-01-15.14:09:39.024>
    created_at = <Date 2012-05-18.14:46:35.980>
    labels = ['interpreter-core', 'type-bug', 'expert-unicode']
    title = 'The inconsistency of codecs.charmap_decode'
    updated_at = <Date 2013-01-15.14:09:39.024>
    user = 'https://github.com/serhiy-storchaka'

    bugs.python.org fields:

    activity = <Date 2013-01-15.14:09:39.024>
    actor = 'serhiy.storchaka'
    assignee = 'serhiy.storchaka'
    closed = True
    closed_date = <Date 2013-01-15.14:09:39.024>
    closer = 'serhiy.storchaka'
    components = ['Interpreter Core', 'Unicode']
    creation = <Date 2012-05-18.14:46:35.980>
    creator = 'serhiy.storchaka'
    dependencies = []
    files = ['27387', '27388', '27389']
    hgrepos = []
    issue_num = 14850
    keywords = ['patch', 'needs review']
    message_count = 14.0
    messages = ['161054', '162574', '162580', '162582', '162583', '162584', '162588', '162589', '162594', '171813', '173357', '178321', '180017', '180019']
    nosy_count = 8.0
    nosy_names = ['lemburg', 'loewis', 'doerwalter', 'vstinner', 'ezio.melotti', 'eric.araujo', 'python-dev', 'serhiy.storchaka']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue14850'
    versions = ['Python 2.7', 'Python 3.2', 'Python 3.3', 'Python 3.4']

    @serhiy-storchaka
    Copy link
    Member Author

    codecs.charmap_decode behaves differently with native and user string as decode table.

    >>> import codecs
    >>> print(ascii(codecs.charmap_decode(b'\x00', 'replace', '\uFFFE')))
    ('\ufffd', 1)
    >>> class S(str): pass
    ... 
    >>> print(ascii(codecs.charmap_decode(b'\x00', 'replace', S('\uFFFE'))))
    ('\ufffe', 1)

    It's because charmap decoder (function PyUnicode_DecodeCharmap in Objects/unicodeobject.c) uses different algorithms for exact strings and for other.

    We need to fix it? If yes, what should return codecs.charmap_decode(b'\x00', 'replace', {0:'\uFFFE'})? What should return codecs.charmap_decode(b'\x00', 'replace', {0:0xFFFE})?

    @serhiy-storchaka serhiy-storchaka added interpreter-core (Objects, Python, Grammar, and Parser dirs) type-bug An unexpected behavior, bug, or error labels May 18, 2012
    @merwok
    Copy link
    Member

    merwok commented Jun 10, 2012

    What is the use case for passing a string subclass to charmap_decode? Or in other words, how did you stumble upon the bug?

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Jun 10, 2012

    U+FFFE is documented as representing an undefined mapping, see

    http://docs.python.org/dev/c-api/unicode.html?highlight=charmap#PyUnicode_DecodeCharmap

    So the base string case is correct; the derived string implementation also needs to invoke the error handler.

    @serhiy-storchaka
    Copy link
    Member Author

    What is the use case for passing a string subclass to charmap_decode? Or in other words, how did you stumble upon the bug?

    I stumbled upon it, rewriting the charmap decoder (bpo-14874). Now
    charmap decoder processes the two cases -- a more effective case of
    string table and a general slower case of general mapping. I proposed a
    more optimized case of 256-character UCS2 string (covers all standard
    charmap encodings). If processing general strings and maps was
    consistent, these cases can be merged. A string subclass is just an
    example that illustrates the inconsistency.

    @serhiy-storchaka
    Copy link
    Member Author

    U+FFFE is documented as representing an undefined mapping,

    Yes, using U+FFFE for representing an undefined mapping in strings is
    normal, the question was about string subclasses. And if we will correct
    it for string subclasses, how far we go any further? How about general
    mapping?

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Jun 10, 2012

    > U+FFFE is documented as representing an undefined mapping,

    Yes, using U+FFFE for representing an undefined mapping in strings is
    normal, the question was about string subclasses.

    What is the question? U+FFFE also represents an undefined mapping in
    string subclasses.

    > And if we will correct it for string subclasses, how far we go any
    > further?

    This is a single issue, a single bug. If the bug is fixed, it is fixed.
    No need to go further (unless there is another bug somewhere).

    @serhiy-storchaka
    Copy link
    Member Author

    What is the question? U+FFFE also represents an undefined mapping in
    string subclasses.

    What about classes that not subclassed string but ducktyped string by
    implementing all string method? What about list/tuple/array.array of
    integers or 1-character strings? What about general mapping? Should any
    of them have 0xFFFE or '\uFFFE' represent an undefined mapping?

    This is a single issue, a single bug. If the bug is fixed, it is fixed.
    No need to go further (unless there is another bug somewhere).

    My question, where is the limit of this bug.

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Jun 10, 2012

    integers or 1-character strings? What about general mapping? Should
    > any of them have 0xFFFE or '\uFFFE' represent an undefined mapping?

    The documentation says that the parameter "can be a dictionary mapping
    byte or a unicode string, which is treated as a lookup table". So
    anything that supports GetItem with a small integer index can be passed.
    It then says '... U+FFFE “characters” are treated as “undefined mapping”'.

    So the answer to your last question is "yes". I hope that the answer to
    your other questions follows from that (strictly speaking, it's only
    U+FFFE, not 0xFFFE, that is documented as indicating an undefined
    mapping; a patch should probably fix that).

    (I also wonder where the support for LookupError comes from - that
    appears to be undocumented)

    @serhiy-storchaka
    Copy link
    Member Author

    So the answer to your last question is "yes". I hope that the answer to
    your other questions follows from that

    Thank you, this is the answer to all my questions. I've prepared a patch
    to treat U+FFFE in general mapping as “undefined mapping”.

    (strictly speaking, it's only
    U+FFFE, not 0xFFFE, that is documented as indicating an undefined
    mapping; a patch should probably fix that).

    As both integer 0xXXXX and string '\uXXXX' denote U+XXXX, I do not think
    it necessary fixes.

    (I also wonder where the support for LookupError comes from - that
    appears to be undocumented)

    I believe, this is what is meant by the words "undefined mapping".

    @serhiy-storchaka
    Copy link
    Member Author

    Patch updated to resolve conflict with bpo-15379. Added tests. Added patches
    for 3.2 and 2.7.

    @serhiy-storchaka
    Copy link
    Member Author

    Does anyone have objections against the idea or the implementation of the patch? Please review.

    @serhiy-storchaka
    Copy link
    Member Author

    I no one objects I will commit this next year.

    @serhiy-storchaka serhiy-storchaka self-assigned this Dec 27, 2012
    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Jan 15, 2013

    New changeset 33a8ef498b1e by Serhiy Storchaka in branch '2.7':
    Issue bpo-14850: Now a chamap decoder treates U+FFFE as "undefined mapping"
    http://hg.python.org/cpython/rev/33a8ef498b1e

    New changeset 13cd78a2a17b by Serhiy Storchaka in branch '3.2':
    Issue bpo-14850: Now a chamap decoder treates U+FFFE as "undefined mapping"
    http://hg.python.org/cpython/rev/13cd78a2a17b

    New changeset 6ac4f1609847 by Serhiy Storchaka in branch '3.3':
    Issue bpo-14850: Now a chamap decoder treates U+FFFE as "undefined mapping"
    http://hg.python.org/cpython/rev/6ac4f1609847

    New changeset 03e22cc9407a by Serhiy Storchaka in branch 'default':
    Issue bpo-14850: Now a chamap decoder treates U+FFFE as "undefined mapping"
    http://hg.python.org/cpython/rev/03e22cc9407a

    @serhiy-storchaka
    Copy link
    Member Author

    Fixed. Thank you for your answers, Martin.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-unicode type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    2 participants