Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ill-formed surrogates not treated as errors during encoding/decoding #47922

Closed
Rhamphoryncus mannequin opened this issue Aug 24, 2008 · 17 comments
Closed

Ill-formed surrogates not treated as errors during encoding/decoding #47922

Rhamphoryncus mannequin opened this issue Aug 24, 2008 · 17 comments
Assignees
Labels
release-blocker topic-unicode type-bug An unexpected behavior, bug, or error

Comments

@Rhamphoryncus
Copy link
Mannequin

Rhamphoryncus mannequin commented Aug 24, 2008

BPO 3672
Nosy @malemburg, @loewis, @pitrou, @benjaminp, @jwilk, @ezio-melotti
Files
  • surrogates.diff
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/loewis'
    closed_at = <Date 2009-05-02.18:57:28.745>
    created_at = <Date 2008-08-24.21:56:51.166>
    labels = ['type-bug', 'expert-unicode', 'release-blocker']
    title = 'Ill-formed surrogates not treated as errors during encoding/decoding'
    updated_at = <Date 2016-09-08.12:47:50.963>
    user = 'https://bugs.python.org/Rhamphoryncus'

    bugs.python.org fields:

    activity = <Date 2016-09-08.12:47:50.963>
    actor = 'python-dev'
    assignee = 'loewis'
    closed = True
    closed_date = <Date 2009-05-02.18:57:28.745>
    closer = 'loewis'
    components = ['Unicode']
    creation = <Date 2008-08-24.21:56:51.166>
    creator = 'Rhamphoryncus'
    dependencies = []
    files = ['13836']
    hgrepos = []
    issue_num = 3672
    keywords = ['patch']
    message_count = 17.0
    messages = ['71889', '86736', '86817', '86824', '86839', '86873', '86874', '86896', '86913', '86936', '86954', '86966', '86967', '86968', '86970', '86971', '275006']
    nosy_count = 9.0
    nosy_names = ['lemburg', 'loewis', 'Rhamphoryncus', 'pitrou', 'benjamin.peterson', 'jwilk', 'ezio.melotti', 'hippietrail', 'python-dev']
    pr_nums = []
    priority = 'release blocker'
    resolution = 'accepted'
    stage = 'test needed'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue3672'
    versions = ['Python 3.1']

    @Rhamphoryncus
    Copy link
    Mannequin Author

    Rhamphoryncus mannequin commented Aug 24, 2008

    The Unicode FAQ makes it quite clear that any surrogates in UTF-8 or
    UTF-32 should be treated as errors. Lone surrogates in UTF-16 should
    probably be treated as errors too (but only during encoding/decoding;
    unicode objects on UTF-16 builds should allow them to be created through
    slicing).

    http://unicode.org/faq/utf_bom.html#30
    http://unicode.org/faq/utf_bom.html#42
    http://unicode.org/faq/utf_bom.html#40

    Lone surrogate in UTF-8 (effectively CESU-8):
    >>> '\xED\xA0\x81'.decode('utf-8')
    u'\ud801'
    
    Surrogate pair in UTF-8:
    >>> '\xED\xA0\x81\xED\xB0\x80'.decode('utf-8')
    u'\ud801\udc00'
    
    On a UTF-32 build, encoding a surrogate pair with UTF-16, then decoding
    again will produce the proper non-surrogate scalar value.  This has
    security implications, although rare as characters outside the BMP are rare:
    >>> u'\ud801\udc00'.encode('utf-16').decode('utf-16')
    u'\U00010400'
    
    Also on a UTF-32 build, decoding of a lone surrogate in UTF-16 fails
    (correctly), but encoding one does not:
    >>> u'\ud801'.encode('utf-16')
    '\xff\xfe\x01\xd8'

    I have gotten a report of a user decoding bad data using
    x.decode('utf-8', 'replace'), then getting an error from Gtk+ when the
    ill-formed surrogates reached it.

    Fixing this would cause bpo-3297 to blow up loudly, rather than silently.

    @Rhamphoryncus Rhamphoryncus mannequin added topic-unicode type-bug An unexpected behavior, bug, or error labels Aug 24, 2008
    @pitrou
    Copy link
    Member

    pitrou commented Apr 28, 2009

    We could fix it for 3.1, and perhaps leave 2.7 unchanged if some people
    rely on this (for whatever reason).

    @malemburg
    Copy link
    Member

    While it's probably ok to fix the codecs, there's an issue which makes
    this difficult at least for the utf-8 codec:

    The marshal module uses utf-8 to write Unicode objects and these can and
    need to be able to store the full range of supported UCS2/UCS4 code
    points, including lone surrogates.

    If the utf-8 codec were changed to raise an error for these, marshal
    would no longer be able to write/read Unicode objects.

    It is likely that other existing Python code (outside the std lib) also
    relies on this ability.

    Changing this would only be possible in 3.1.

    The marshal module would then also have to be changed to use a different
    encoding which does support encoding lone surrogates.

    See bpo-3297 for a discussion of UTF-8/16 vs. UCS2/4, the
    implications, motivations, etc.

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Apr 29, 2009

    I think we could preserve the marshal format with yet another error
    handler - one that emits half surrogates into their intuitive form.

    @malemburg
    Copy link
    Member

    On 2009-04-29 22:39, Martin v. Löwis @psf.upfronthosting.co.za wrote:

    Martin v. Löwis <martin@v.loewis.de> added the comment:

    I think we could preserve the marshal format with yet another error
    handler - one that emits half surrogates into their intuitive form.

    That's a good idea. We could have an error handler which then let's
    the codec accept lone surrogates for utf-8 or just add a new codec
    which does this and use that for marshal.

    Still, this can only go into 3.1.

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented May 1, 2009

    Here is a patch that implements this proposed approach. It introduces a
    "surrogates" error handler, useful only for the utf-8 codec.

    If this is accepted, the implementation of PEP-383 can be simplified
    significantly, essentially removing the need for a separate utf-8b codec
    (as that could be done in the error handler, as for the other codecs).

    @loewis loewis mannequin assigned benjaminp May 1, 2009
    @loewis loewis mannequin added the release-blocker label May 1, 2009
    @loewis
    Copy link
    Mannequin

    loewis mannequin commented May 1, 2009

    rietveld: http://codereview.appspot.com/52081

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented May 1, 2009

    Fixed indexing error.

    @benjaminp
    Copy link
    Contributor

    http://codereview.appspot.com/52081/diff/1/5
    File Doc/library/codecs.rst (right):

    http://codereview.appspot.com/52081/diff/1/5#newcode326
    Line 326: In addition, the following error handlers are specific to only
    selected
    "In addition, the following error handlers are specific to a single
    codec." sounds better

    http://codereview.appspot.com/52081/diff/1/5#newcode335
    Line 335:
    There should probably be a versionchanged directive indicating that
    "surrogates" was added in 3.1.

    http://codereview.appspot.com/52081/diff/1/6
    File Lib/test/test_codecs.py (right):

    http://codereview.appspot.com/52081/diff/1/6#newcode544
    Line 544: def test_surrogates(self):
    I think this should be split into 2 tests. "test_lone_surrogates" and
    "test_surrogate_handler".

    http://codereview.appspot.com/52081/diff/1/4
    File Objects/unicodeobject.c (right):

    http://codereview.appspot.com/52081/diff/1/4#newcode157
    Line 157: static PyObject *unicode_encode_call_errorhandler(const char
    *errors,
    These prototypes are longer than 80 chars some places. I don't think the
    arguments need to line up with the starting parenthesis.

    http://codereview.appspot.com/52081/diff/1/4#newcode2393
    Line 2393: s, size, &exc, i-1, i, &newpos);
    "exc" is never Py_DECREFed.

    http://codereview.appspot.com/52081/diff/1/4#newcode4110
    Line 4110: if (!PyUnicode_Check(repunicode)) {
    Is there a test of this case somewhere?

    http://codereview.appspot.com/52081/diff/1/2
    File Python/codecs.c (right):

    http://codereview.appspot.com/52081/diff/1/2#newcode758
    Line 758: if (PyObject_IsInstance(exc, PyExc_UnicodeEncodeError)) {
    I believe PyErr_GivenExceptionMatches is more appropriate here, but
    given the rest of the file uses PyObject_IsInstance, it likely doesn't
    matter.

    http://codereview.appspot.com/52081/diff/1/2#newcode771
    Line 771: return NULL;
    This is leaks "object".

    http://codereview.appspot.com/52081

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented May 2, 2009

    Reviewers: report_bugs.python.org, Benjamin,

    Message:
    Issues fixed in r72188.

    http://codereview.appspot.com/52081/diff/1/5
    File Doc/library/codecs.rst (right):

    http://codereview.appspot.com/52081/diff/1/5#newcode326
    Line 326: In addition, the following error handlers are specific to only
    selected
    On 2009/05/01 21:25:44, Benjamin wrote:

    "In addition, the following error handlers are specific to a single
    codec."
    sounds better

    Done.

    http://codereview.appspot.com/52081/diff/1/5#newcode335
    Line 335:
    On 2009/05/01 21:25:44, Benjamin wrote:

    There should probably be a versionchanged directive indicating that
    "surrogates"
    was added in 3.1.

    Done.

    http://codereview.appspot.com/52081/diff/1/6
    File Lib/test/test_codecs.py (right):

    http://codereview.appspot.com/52081/diff/1/6#newcode544
    Line 544: def test_surrogates(self):
    On 2009/05/01 21:25:44, Benjamin wrote:

    I think this should be split into 2 tests. "test_lone_surrogates" and
    "test_surrogate_handler".

    Done.

    http://codereview.appspot.com/52081/diff/1/4
    File Objects/unicodeobject.c (right):

    http://codereview.appspot.com/52081/diff/1/4#newcode157
    Line 157: static PyObject *unicode_encode_call_errorhandler(const char
    *errors,
    On 2009/05/01 21:25:44, Benjamin wrote:

    These prototypes are longer than 80 chars some places. I don't think
    the
    arguments need to line up with the starting parenthesis.

    Done.

    http://codereview.appspot.com/52081/diff/1/4#newcode2393
    Line 2393: s, size, &exc, i-1, i, &newpos);
    On 2009/05/01 21:25:44, Benjamin wrote:

    "exc" is never Py_DECREFed.

    Done.

    http://codereview.appspot.com/52081/diff/1/4#newcode4110
    Line 4110: if (!PyUnicode_Check(repunicode)) {
    On 2009/05/01 21:25:44, Benjamin wrote:

    Is there a test of this case somewhere?

    No. This is temporary - for PEP-383, I will have to support error
    handlers returning bytes here, also.

    http://codereview.appspot.com/52081/diff/1/2
    File Python/codecs.c (right):

    http://codereview.appspot.com/52081/diff/1/2#newcode758
    Line 758: if (PyObject_IsInstance(exc, PyExc_UnicodeEncodeError)) {
    On 2009/05/01 21:25:44, Benjamin wrote:

    I believe PyErr_GivenExceptionMatches is more appropriate here, but
    given the
    rest of the file uses PyObject_IsInstance, it likely doesn't matter.

    No. The interface is that an exception instance must be passed;
    GivenExceptionMatches would also allow for tuples and types.

    http://codereview.appspot.com/52081/diff/1/2#newcode771
    Line 771: return NULL;
    On 2009/05/01 21:25:44, Benjamin wrote:

    This is leaks "object".

    Done.

    Please review this at http://codereview.appspot.com/52081

    Affected files:
    M Doc/library/codecs.rst
    M Lib/test/test_bytes.py
    M Lib/test/test_codecs.py
    M Lib/test/test_unicode.py
    M Lib/test/test_unicodedata.py
    M Objects/unicodeobject.c
    M Python/codecs.c
    M Python/marshal.c

    @benjaminp
    Copy link
    Contributor

    I think the new patch looks fine.

    @benjaminp benjaminp assigned loewis and unassigned benjaminp May 2, 2009
    @benjaminp
    Copy link
    Contributor

    Something I overlooked is that PyCodec_SurrogateErrors isn't exposed in
    any headers.

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented May 2, 2009

    Committed as r72208, blocked as r72209.

    As for PyCodec_SurrogateErrors: I'd rather make it static than expose it.

    @loewis loewis mannequin closed this as completed May 2, 2009
    @benjaminp
    Copy link
    Contributor

    2009/5/2 <"\"Martin v. Löwis\"
    <report@bugs.python.org>"@psf.upfronthosting.co.za>:

    Martin v. Löwis <martin@v.loewis.de> added the comment:

    Committed as r72208, blocked as r72209.

    As for PyCodec_SurrogateErrors: I'd rather make it static than expose it.

    Why? All the other error handlers are exposed.

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented May 2, 2009

    > As for PyCodec_SurrogateErrors: I'd rather make it static than expose it.

    Why? All the other error handlers are exposed.

    Sure - but what for? IMO, they all shouldn't be exposed.

    @benjaminp
    Copy link
    Contributor

    2009/5/2  <"\"Martin v. Löwis\"
    <report@bugs.python.org>"@psf.upfronthosting.co.za>:
    >
    > Martin v. Löwis <martin@v.loewis.de> added the comment:
    >
    >>> As for PyCodec_SurrogateErrors: I'd rather make it static than expose it.
    >>
    >> Why? All the other error handlers are exposed.
    >
    > Sure - but what for? IMO, they all shouldn't be exposed.

    The only reason I can think of is consistency, but I don't care that much.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Sep 8, 2016

    New changeset 2150eadb54c7 by Serhiy Storchaka in branch 'default':
    Remove old typo.
    https://hg.python.org/cpython/rev/2150eadb54c7

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    release-blocker topic-unicode type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants