Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

disallow the "surrogatepass" handler for non utf-* encodings #58124

Open
kennyluck mannequin opened this issue Jan 31, 2012 · 20 comments
Open

disallow the "surrogatepass" handler for non utf-* encodings #58124

kennyluck mannequin opened this issue Jan 31, 2012 · 20 comments
Assignees
Labels
topic-unicode type-feature A feature request or enhancement

Comments

@kennyluck
Copy link
Mannequin

kennyluck mannequin commented Jan 31, 2012

BPO 13916
Nosy @loewis, @vstinner, @ezio-melotti, @serhiy-storchaka
Files
  • surrogatepass_non_utf.patch
  • surrogatepass_cp_utf8.patch
  • surrogatepass_cp65001.patch
  • cp_encoding_name.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/vstinner'
    closed_at = None
    created_at = <Date 2012-01-31.23:51:10.468>
    labels = ['type-feature', 'expert-unicode']
    title = 'disallow the "surrogatepass" handler for non utf-* encodings'
    updated_at = <Date 2014-08-17.15:19:53.801>
    user = 'https://bugs.python.org/kennyluck'

    bugs.python.org fields:

    activity = <Date 2014-08-17.15:19:53.801>
    actor = 'serhiy.storchaka'
    assignee = 'vstinner'
    closed = False
    closed_date = None
    closer = None
    components = ['Unicode']
    creation = <Date 2012-01-31.23:51:10.468>
    creator = 'kennyluck'
    dependencies = []
    files = ['35257', '35259', '35262', '35264']
    hgrepos = []
    issue_num = 13916
    keywords = ['patch']
    message_count = 20.0
    messages = ['152416', '159520', '159525', '159528', '218597', '218600', '218601', '218602', '218603', '218605', '218611', '218612', '218613', '218615', '218617', '218655', '218658', '218659', '218740', '225448']
    nosy_count = 6.0
    nosy_names = ['loewis', 'vstinner', 'ezio.melotti', 'python-dev', 'kennyluck', 'serhiy.storchaka']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = None
    status = 'open'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue13916'
    versions = ['Python 3.5']

    @kennyluck
    Copy link
    Mannequin Author

    kennyluck mannequin commented Jan 31, 2012

    Currently the "surrogatepass" handler always encodes the surrogates in UTF-8 and hence the behavior for, say, "\udc80".encode("latin-1", "surrogatepass").decode("latin-1") might be unexpected and I don't even know what would, say, "\udc80\udc80".encode("big5", "surrogatepass").decode("big5"), return. Regardless of the fact that the documentation says "surrogatepass" is specific to utf-8", the currently behavior is arguably not too harmful thanks to PyBytesObject's '\0' ending (so that ((p[0] & 0xf0) == 0xe0 || (p[1] & 0xc0) == 0x80 || (p[2] & 0xc0) == 0x80) in PyCodec_SurrogatePassErrors would not crash).

    However, I suggest we have the system either 1) raise early LookupError 2) raise the original Unicode(Decode|Encoding)Exception as soon as PyCodec_SurrogatePassErrors is called. I prefer the former.

    Having this could shorten PyCodec_SurrogatePassErrors significantly in the patch I will shortly submit for issue bpo-12892 as all the error conditions for utf-8, utf-16 and utf-32 are predicable* and almost all the conditionals could be removed. (The * statement is arguable if someone initializes interp->codec_search_path before _PyCodecRegistry_Init and the utf-16/32 encoders are overwritten. I don't think we need to worry about this too much though. Or am I wrong here?)

    @kennyluck kennyluck mannequin added type-bug An unexpected behavior, bug, or error topic-unicode labels Jan 31, 2012
    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Apr 28, 2012

    I fail to see the problem. If the error handler does not produce meaningful results in some context, then just don't use it.

    The whole point of error handlers is that they handle errors; using them shouldn't ever cause errors/exceptions.

    @serhiy-storchaka
    Copy link
    Member

    The problem is that "surrogatepass" specific to utf-8 and there is no standard way to decode alone surrogates in utf-16.

    >>> "\udc80\udc80".encode("utf-16", "surrogatepass").decode("utf-16", "surrogatepass")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'utf16' codec can't decode bytes in position 2-3: illegal encoding

    With utf-32 this "works" only thanks to the bug -- utf-32 allows alone surrogates (issue bpo-12892).

    If the "surrogatepass" worked with utf-16 and utf-32, it would be natural to throw ValueError for other encodings.

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Apr 28, 2012

    I see. The proper reaction for a codec that can't handle a certain error then is to raise the original exception. I'm -1 on raising LookupError when trying to find the error handler - this would suggest that the error handler does not exist, which is not true.

    As for simplifying the implementation: it might be reasonable to special-case surrogatepass inside the individual codecs, rather than looking up the error handler. Then the error handler could just be identical to "strict", except that UTF-8, UTF-16, and UTF-32 individually special-case this error handler in their encoders and decoders.

    @serhiy-storchaka
    Copy link
    Member

    This issue was mainly resolved in bpo-12892. The surrogatepass error handler now works with UTF-16* and UTF-32* encodings. But for other encodings it behaves as for UTF-8 (preserve old behavior). Should we change the behavior for non-UTF encodings end raise an exception (as for "strict")?

    @serhiy-storchaka
    Copy link
    Member

    Here is a patch which disallows the surrogatepass handler for non-utf encodings. Please test it on Windows.

    @serhiy-storchaka serhiy-storchaka added type-feature A feature request or enhancement and removed type-bug An unexpected behavior, bug, or error labels May 15, 2014
    @vstinner
    Copy link
    Member

    Serhiy Storchaka wrote:

    Here is a patch

    I don't see your patch.

    @serhiy-storchaka
    Copy link
    Member

    Oh, sorry.

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented May 15, 2014

    LGTM

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented May 15, 2014

    New changeset 5e98a50e0f55 by Serhiy Storchaka in branch 'default':
    Issue bpo-13916: Disallowed the surrogatepass error handler for non UTF-*
    http://hg.python.org/cpython/rev/5e98a50e0f55

    @vstinner
    Copy link
    Member

    It makes sense to restrict surrogatepass to UTF-* encodings. UTF-8, UTF-16 and UTF-32 encoders reject surrogate characters, but not UTF-7. Is it a bug? I'm asking because PyCodec_SurrogatePassErrors() doesn't support UTF-7.

    IMO your change is important enough to be mentionned in What's new Python 3.5 document, and maybe also in the documentation of the codecs module.

    @vstinner
    Copy link
    Member

    Windows buildbots are unhappy.

    http://buildbot.python.org/all/builders/x86%20Windows7%203.x/builds/8355/steps/test/logs/stdio

    ======================================================================
    ERROR: test_surrogatepass_handler (test.test_codecs.CP65001Test)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\test_codecs.py", line 883, in test_surrogatepass_handler
        self.assertEqual("abc\ud800def".encode("cp65001", "surrogatepass"),
    UnicodeEncodeError: 'CP_UTF8' codec can't encode character '\ud800' in position 3: invalid character

    ======================================================================
    FAIL: test_encode (test.test_codecs.CP65001Test)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\test_codecs.py", line 818, in test_encode
        encoded = text.encode('cp65001', errors)
    UnicodeEncodeError: 'CP_UTF8' codec can't encode character '\udc80' in position 0: invalid character
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\test_codecs.py", line 821, in test_encode
        'errors=%r: %s' % (text, errors, err))
    AssertionError: Unable to encode '\udc80' to cp65001 with errors='surrogatepass': 'CP_UTF8' codec can't encode character '\udc80' in position 0: invalid character

    ======================================================================
    FAIL: test_cp1252 (test.test_codecs.CodePageTest)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\test_codecs.py", line 2849, in test_cp1252
        (b'[\x98]', 'surrogatepass', None),
      File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\test_codecs.py", line 2781, in check_decode
        codecs.code_page_decode, cp, raw, errors, True)
    AssertionError: UnicodeDecodeError not raised by code_page_decode

    @vstinner vstinner reopened this May 15, 2014
    @serhiy-storchaka
    Copy link
    Member

    Here is a patch, which adds support for cp65001 and fixes test_cp1252. Please test it on Windows Vista.

    Lone surrogates are not illegal in UTF-7 (see RFC 1642), so error handler is not called and explicit support of UTF-7 is not needed.

    Could you please help with documenting this change in What's new Python 3.5 document? I don't think this change is worth special mentioning in codecs documentation, it is already documented that surrogatepass is supported only for utf-8, utf-16* and utf-32*.

    @vstinner
    Copy link
    Member

    Here is a patch, which adds support for cp65001

    The name of the encoding is "cp65001", not something like "cp-utf8".

    And there is no alias like "cp_65001", there is only "cp65001".

    @serhiy-storchaka
    Copy link
    Member

    But an exception reports about CP_UTF8.

    @serhiy-storchaka serhiy-storchaka changed the title disallow the "surrogatepass" handler for non utf-* encodings disallow the "surrogatepass" handler for non utf-* encodings May 15, 2014
    @serhiy-storchaka
    Copy link
    Member

    Here is a patch which tests encoding name with "cp65001" instead of "CP_UTF8". I can't test on Windows and don't know which of two patches are correct.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented May 16, 2014

    New changeset 8ee2b73cda7a by Victor Stinner in branch 'default':
    Issue bpo-13916: Fix surrogatepass error handler on Windows
    http://hg.python.org/cpython/rev/8ee2b73cda7a

    @vstinner
    Copy link
    Member

    But an exception reports about CP_UTF8.

    Oh, that's my fault! And it is a bug: "CP_UTF8" is the Windows constant, but it is not a valid Python codec name.

    Attached patch cp_encoding_name.patch fixes this issue.

    I don't think that Py_LOWER() is needed because the encoding name of Unicode errors from the code page codec is "cpXXX". It cannot be "CPXXX", except if you pass create manually an Unicode error exception.

    What do you think? Py_LOWER or not?

    @serhiy-storchaka
    Copy link
    Member

    I have no opinion.

    @serhiy-storchaka serhiy-storchaka removed their assignment May 18, 2014
    @serhiy-storchaka
    Copy link
    Member

    Could you please finish this issue Victor?

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    topic-unicode type-feature A feature request or enhancement
    Projects
    Development

    No branches or pull requests

    2 participants