UTF-16 and UTF-32 codecs should reject (lone) surrogates #57101

Closed
ezio-melotti opened this issue Sep 4, 2011 · 41 comments
Labels: topic-unicode, type-bug (An unexpected behavior, bug, or error)

Comments

@ezio-melotti
Member

BPO 12892
Nosy @malemburg, @gvanrossum, @loewis, @mjpieters, @pitrou, @vstinner, @ezio-melotti, @serhiy-storchaka
Files
  • utf_16_32_surrogates.patch
  • utf-16&32_reject_surrogates.patch: rejects surrogates and calls the error handlers
  • surrogatepass_for_utf-16&32.patch: surrogatepass error handler for utf-16/32
  • utf_16_32_surrogates_3.patch
  • utf_16_32_surrogates_4.patch
  • utf_16_32_surrogates_5.patch
  • utf_16_32_surrogates_6.patch
  Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/serhiy-storchaka'
    closed_at = <Date 2013-11-19.10:31:20.466>
    created_at = <Date 2011-09-04.10:49:30.301>
    labels = ['type-bug', 'expert-unicode']
    title = 'UTF-16 and UTF-32 codecs should reject (lone) surrogates'
    updated_at = <Date 2015-08-11.11:28:28.010>
    user = 'https://github.com/ezio-melotti'

    bugs.python.org fields:

    activity = <Date 2015-08-11.11:28:28.010>
    actor = 'tmp12342'
    assignee = 'serhiy.storchaka'
    closed = True
    closed_date = <Date 2013-11-19.10:31:20.466>
    closer = 'serhiy.storchaka'
    components = ['Unicode']
    creation = <Date 2011-09-04.10:49:30.301>
    creator = 'ezio.melotti'
    dependencies = []
    files = ['23810', '24368', '24384', '31557', '31984', '32047', '32202']
    hgrepos = []
    issue_num = 12892
    keywords = ['patch']
    message_count = 41.0
    messages = ['143490', '143495', '148603', '148613', '148614', '152310', '152316', '152420', '159129', '196785', '196788', '196790', '196798', '196803', '198803', '199151', '199179', '199183', '199184', '199186', '199187', '199188', '199189', '199190', '199191', '199192', '199193', '199194', '199195', '199197', '199371', '199375', '199500', '200233', '203344', '203353', '203375', '203591', '248362', '248392', '248404']
    nosy_count = 12.0
    nosy_names = ['lemburg', 'gvanrossum', 'loewis', 'mjpieters', 'pitrou', 'vstinner', 'ezio.melotti', 'python-dev', 'tchrist', 'kennyluck', 'serhiy.storchaka', 'tmp12342']
    pr_nums = []
    priority = 'high'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue12892'
    versions = ['Python 3.4']

    @ezio-melotti
    Member Author

From Chapter 3 of the Unicode Standard 6.0, D91:
    """
    • UTF-16 encoding form: The Unicode encoding form that assigns each Unicode scalar value in the ranges U+0000..U+D7FF and U+E000..U+FFFF to a single unsigned 16-bit code unit with the same numeric value as the Unicode scalar value, and that assigns each Unicode scalar value in the range U+10000..U+10FFFF to a surrogate pair, according to Table 3-5.
    • Because surrogate code points are not Unicode scalar values, isolated UTF-16 code units in the range 0xD800..0xDFFF are ill-formed.
    """
I.e. UTF-16 should be able to correctly decode a valid surrogate pair, and to encode a non-BMP character using a valid surrogate pair, but it should reject lone surrogates during both encoding and decoding.

On Python 3, the utf-16 codec can encode all the codepoints from U+0000 to U+10FFFF (including lone surrogates), but it's not able to decode lone surrogates (it's not clear whether this is by design or whether it just fails because it expects another, missing, surrogate).
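Concretely, the asymmetry looks like this on a pre-fix Python 3 (a sketch; the exact error message varies by version):

>>> '\ud800'.encode('utf-16-le')     # lone surrogate accepted on encode
b'\x00\xd8'
>>> b'\x00\xd8'.decode('utf-16-le')  # ...but rejected on decode
Traceback (most recent call last):
  ...
UnicodeDecodeError: ...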

    ----------------------------------------------

From Chapter 3 of the Unicode Standard 6.0, D90:
    """
    • UTF-32 encoding form: The Unicode encoding form that assigns each Unicode scalar value to a single unsigned 32-bit code unit with the same numeric value as the Unicode scalar value.
    • Because surrogate code points are not included in the set of Unicode scalar values, UTF-32 code units in the range 0x0000D800..0x0000DFFF are ill-formed.
    """
    I.e. UTF-32 should reject both lone surrogates and valid surrogate pairs, both during encoding and during decoding.

    On Python 3, the utf-32 codec can encode and decode all the codepoints from U+0000 to U+10FFFF (including surrogates).
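A sketch of that pre-fix behaviour (a lone surrogate round-trips freely through utf-32):

>>> '\ud800'.encode('utf-32-le')
b'\x00\xd8\x00\x00'
>>> b'\x00\xd8\x00\x00'.decode('utf-32-le')
'\ud800'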

    ----------------------------------------------

    I think that:

    • this should be fixed in 3.3;
• it's a bug, so the fix /might/ be backported to 3.2. However, it's also a fairly big change in behavior, so it might be better to leave it for 3.3 only;
• it's better to leave 2.7 alone, since even the utf-8 codec is broken there;
    • the surrogatepass error handler should work with the utf-16 and utf-32 codecs too.

Note that this was already reported in bpo-3672, but eventually only the utf-8 codec was fixed.

    @ezio-melotti ezio-melotti self-assigned this Sep 4, 2011
    @ezio-melotti ezio-melotti added topic-unicode type-bug An unexpected behavior, bug, or error labels Sep 4, 2011
    @malemburg
    Member

    Ezio Melotti wrote:

> New submission from Ezio Melotti <ezio.melotti@gmail.com>:
> [... full report quoted above ...]

    All UTF codecs should reject lone surrogates in strict error mode,
    but let them pass using the surrogatepass error handler (the UTF-8
    codec already does) and apply the usual error handling for ignore
    and replace.
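On a current Python 3, the UTF-8 codec already shows the intended semantics for all four handlers (a sketch):

>>> '\ud800'.encode('utf-8')                   # strict (the default): rejected
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' ...
>>> '\ud800'.encode('utf-8', 'surrogatepass')  # explicitly let through
b'\xed\xa0\x80'
>>> 'a\ud800b'.encode('utf-8', 'ignore')
b'ab'
>>> 'a\ud800b'.encode('utf-8', 'replace')
b'a?b'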

    @vstinner
    Member

    Python 3.3 has a strange behaviour:

    >>> '\uDBFF\uDFFF'.encode('utf-16-le').decode('utf-16-le')
    '\U0010ffff'
    >>> '\U0010ffff'.encode('utf-16-le').decode('utf-16-le')
    '\U0010ffff'

I would expect text.encode(encoding).decode(encoding) == text, or an encode or decode error.

So I agree that the encoder should reject lone surrogates.
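That invariant can be written as a small property test (a sketch; the helper name is made up):

def roundtrips(text, encoding):
    """True if encode/decode round-trips exactly or fails cleanly."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeError:  # an encode or decode error is also acceptable
        return True

# Before the fix this returned False for utf-16-le:
# the lone pair '\udbff\udfff' came back as '\U0010ffff'.
print(roundtrips('\udbff\udfff', 'utf-16-le'))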

    @vstinner
    Member

    Patch rejecting surrogates in UTF-16 and UTF-32 encoders.

    I don't think that Python 2.7 and 3.2 should be changed in a minor version.

    @vstinner
    Member

    Hum, my patch doesn't call the error handler.

    @kennyluck
    Mannequin

    kennyluck mannequin commented Jan 30, 2012

The attached patch does the following beyond what haypo's patch does:

    • call the error handler
    • reject 0xd800~0xdfff when decoding utf-32

The following items are on my TODO list, although this patch doesn't depend on any of them and can be reviewed and landed separately:

    • make the surrogatepass error handler work for utf-16 and utf-32. (I should be able to finish this by today)
• fix an error in the error handler for utf-16-le. (In Python 3.2, b'\xdc\x80\x00\x41'.decode('utf-16-be', 'ignore') returns "\x00" instead of "A" for some reason)
    • make unicode_encode_call_errorhandler return bytes so that we can simplify this patch. (This arguably belongs to a separate bug so I'll file it when needed)

    All UTF codecs should reject lone surrogates in strict error mode,

Should we really reject lone surrogates for UTF-7? There's a test in test_codecs.py that expects "\udc80" to be encoded as b"+3IA-". Given that UTF-7 is not really part of the Unicode Standard, and that to me it is more like a "data encoding" than a "text encoding", I am not sure it's a good idea.
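That encoding is easy to check (a sketch; UTF-7 stores the surrogate as base64-encoded UTF-16 code units between '+' and '-'):

>>> '\udc80'.encode('utf-7')
b'+3IA-'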

    but let them pass using the surrogatepass error handler (the UTF-8
    codec already does) and apply the usual error handling for ignore
    and replace.

For 'replace', the patch now emits b"\x00?" instead of b"?" so that the UTF-16 stream doesn't get corrupted. This is not "usual" and doesn't match the documentation, which says:

> Implements the replace error handling: malformed data is replaced with a suitable replacement character such as '?' in bytestrings and '\ufffd' in Unicode strings.

What do we do? Are there other encodings that are not ASCII compatible, besides UTF-7, UTF-16 and UTF-32, that Python supports? I think it would be better to use an encoded U+FFFD whenever possible and fall back to '?'. What do you think?
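To see why a bare b"?" corrupts a UTF-16 stream, here is a sketch: the replacement byte makes the stream length odd, so every later code unit is read misaligned:

>>> 'A€B'.encode('utf-16-le')               # well-formed: 2 bytes per unit
b'A\x00\xac B\x00'
>>> corrupted = b'A\x00' + b'?' + b'B\x00'  # '€' replaced by a bare b'?'
>>> corrupted.decode('utf-16-le')           # b'?' and b'B' pair into one bogus unit
Traceback (most recent call last):
  ...
UnicodeDecodeError: ...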

    Some other self comments on my patch:

• In the STORECHAR macro for utf-16 and utf-32, I changed all instances of "ch & 0xFF" to (unsigned char) ch. I don't have enough C knowledge to know if this is actually better or if it makes any difference at all.
• The code for utf-16 and utf-32 duplicates the utf-8 code, whose complexity comes from bpo-8092. I'm not sure if there are ways to simplify these; for example, are there suitable functions so that we don't need to check for integer overflow in these places?

    @ezio-melotti
    Member Author

    Thanks for the patch!

• fix an error in the error handler for utf-16-le. (In Python 3.2,
  b'\xdc\x80\x00\x41'.decode('utf-16-be', 'ignore') returns "\x00"
  instead of "A" for some reason)

This should probably be done in a separate patch that will be applied to 3.2/3.3 (assuming that it can go to 3.2). Rejecting surrogates will go in 3.3 only. (Note that a lot of Unicode-related code changed between 3.2 and 3.3.)

    Should we really reject lone surrogates for UTF-7?

    No, I meant only UTF-8/16/32; UTF-7 is fine as is.

    @kennyluck
    Mannequin

    kennyluck mannequin commented Feb 1, 2012

    The followings are on my TODO list, although this patch doesn't depend
    on any of these and can be reviewed and landed separately:

    • make the surrogatepass error handler work for utf-16 and utf-32. (I
      should be able to finish this by today)

    Unfortunately this took longer than I thought but here comes the patch.

> * fix an error in the error handler for utf-16-le. (In Python 3.2,
> b'\xdc\x80\x00\x41'.decode('utf-16-be', 'ignore') returns "\x00"
> instead of "A" for some reason)

This should probably be done in a separate patch that will be applied
to 3.2/3.3 (assuming that it can go to 3.2). Rejecting surrogates will
go in 3.3 only. (Note that a lot of Unicode-related code changed between
3.2 and 3.3.)

This turns out to be just a two-liner, so I fixed it on the way. I can create a separate patch with a separate test for 3.2 (certainly doable) and even for 3.3, but since the test is now part of test_lone_surrogates, I am less inclined to do that for 3.3.

    You might notice the codec naming inconsistency (utf-16-be and utf16be for encoding and decoding respectively). I have filed issue bpo-13913 for this.

Also, the strcmps are quite crappy. I am working on bpo-13916 (disallow the "surrogatepass" handler for non utf-* encodings). Once we have that, we can examine individual characters instead...

In this patch, the "encoding" attribute of UnicodeDecodeError is now changed to return utf16(be|le) for utf-16. This is necessary information for "surrogatepass" to work, although admittedly it is rather ugly. Any good ideas? A new attribute on Unicode(Decode|Encode)Error might be helpful, but utf-16/32 are fairly uncommon encodings anyway, and we should not add more burden to, say, utf-8.
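A sketch of the attribute in question, using the post-fix spelling (on a current Python 3; before this change the codec reported the "utf16le"-style names mentioned above):

try:
    b'\x00\xd8'.decode('utf-16-le')  # lone high surrogate
except UnicodeDecodeError as exc:
    # surrogatepass consults this to learn the unit size and byte order
    print(exc.encoding)              # 'utf-16-le'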

    > Should we really reject lone surrogates for UTF-7?

    No, I meant only UTF-8/16/32; UTF-7 is fine as is.

    Good to know.

    @serhiy-storchaka
    Member

• fix an error in the error handler for utf-16-le. (In Python 3.2, b'\xdc\x80\x00\x41'.decode('utf-16-be', 'ignore') returns "\x00" instead of "A" for some reason)

    The patch for bpo-14579 fixes this in Python 3.2.

    The patch for bpo-14624 fixes this in Python 3.3.

    @serhiy-storchaka
    Member

Here is a patch that combines both of Kang-Hao's patches, synchronized with tip, fixed, and optimized.

Unfortunately, even optimized, this patch slows down encoding/decoding of some data. Here are some benchmark results (benchmarking tools are here: https://bitbucket.org/storchaka/cpython-stuff/src/default/bench).

3.3            3.4 unpatched   3.4 patched
969 (+12%)     978 (+11%)      1087          encode utf-16le 'A'*10000
2453 (-62%)    2356 (-61%)     923           encode utf-16le '\u0100'*10000
222 (+12%)     224 (+11%)      249           encode utf-16le '\U00010000'+'\u0100'*9999
784 (+6%)      791 (+5%)       831           encode utf-16be 'A'*10000
750 (-4%)      752 (-4%)       719           encode utf-16be '\u0100'*10000
233 (+2%)      235 (+1%)       238           encode utf-16be '\U00010000'+'\u0100'*9999

531 (-7%)      545 (-9%)       494           encode utf-32le 'A'*10000
383 (-38%)     385 (-38%)      239           encode utf-32le '\u0100'*10000
324 (-24%)     325 (-25%)      245           encode utf-32le '\U00010000'+'\u0100'*9999
544 (-10%)     545 (-10%)      492           encode utf-32be 'A'*10000
384 (-38%)     384 (-38%)      239           encode utf-32be '\u0100'*10000
325 (-25%)     325 (-25%)      245           encode utf-32be '\U00010000'+'\u0100'*9999

682 (+5%)      679 (+5%)       715           decode utf-16le 'A'*10000
607 (+1%)      610 (+1%)       614           decode utf-16le '\u0100'*10000
550 (+1%)      554 (+0%)       556           decode utf-16le '\U00010000'+'\u0100'*9999
609 (+0%)      600 (+2%)       610           decode utf-16be 'A'*10000
464 (+1%)      466 (+1%)       470           decode utf-16be '\u0100'*10000
432 (+1%)      431 (+1%)       435           decode utf-16be '\U00010000'+'\u0100'*9999

103 (+272%)    374 (+2%)       383           decode utf-32le 'A'*10000
91 (+264%)     390 (-15%)      331           decode utf-32le '\u0100'*10000
90 (+257%)     393 (-18%)      321           decode utf-32le '\U00010000'+'\u0100'*9999
103 (+269%)    393 (-3%)       380           decode utf-32be 'A'*10000
91 (+263%)     406 (-19%)      330           decode utf-32be '\u0100'*10000
90 (+257%)     393 (-18%)      321           decode utf-32be '\U00010000'+'\u0100'*9999

Performance of utf-16 decoding is unchanged. The utf-16 encoder is 2.5 times slower for UCS2 data (it was just a memcpy) but still 3+ times faster than in 2.7 and 3.2 (bpo-15026). Due to additional optimization it is now even slightly faster for some other data. There is a patch for speeding up UTF-32 encoding (bpo-15027); it should help compensate for its performance degradation. The UTF-32 decoder is already 3-4 times faster than in 3.3 (bpo-14625).

I don't see a performance objection to this patch.
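For anyone who wants to reproduce measurements of this shape, a minimal timeit sketch over the same inputs (hypothetical; this is not the linked bench script, and its numbers are not comparable to the table above):

import timeit

for sample in ('A' * 10000, '\u0100' * 10000, '\U00010000' + '\u0100' * 9999):
    secs = timeit.timeit(lambda: sample.encode('utf-16-le'), number=10000)
    print('encode utf-16le, %d chars: %.3f s per 10000 runs' % (len(sample), secs))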

    @malemburg
    Member

    You should be able to squeeze out some extra cycles by
    avoiding the bit calculations using a simple range check
    for ch >= 0xd800:

    +# if STRINGLIB_MAX_CHAR >= 0xd800
    + if (((ch1 ^ 0xd800) &
    + (ch1 ^ 0xd800) &
    + (ch1 ^ 0xd800) &
    + (ch1 ^ 0xd800) & 0xf800) == 0)
    + break;
    +# endif
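The test relies on (ch ^ 0xd800) & 0xf800 == 0 being true exactly when a 16-bit code unit ch lies in the surrogate range 0xD800..0xDFFF; a quick Python check of that identity:

>>> all((((ch ^ 0xd800) & 0xf800) == 0) == (0xd800 <= ch <= 0xdfff)
...     for ch in range(0x10000))
True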

    @serhiy-storchaka
    Member

Oh, I was blind. Thank you, Marc-Andre. Here is a corrected patch. Unfortunately it is 1.4-1.5 times slower at UTF-16 encoding of UCS2 strings than the previous, wrong patch.

    @malemburg
    Member

    On 02.09.2013 18:56, Serhiy Storchaka wrote:

Oh, I was blind. Thank you, Marc-Andre. Here is a corrected patch. Unfortunately it is 1.4-1.5 times slower at UTF-16 encoding of UCS2 strings than the previous, wrong patch.

    I think it would be faster to do this in two steps:

    1. check the ranges of the input

    2. do a memcpy() if there are no lone surrogates

    Part 1 can be further optimized by adding a simple range
    check (ch >= 0xd800).

    @serhiy-storchaka
    Member

No, it isn't faster. I tested this variant; it is 1.5x slower.

    And simple range checking actually is slower.

    @serhiy-storchaka
    Member

Could you please do a review, Ezio?

    @serhiy-storchaka
    Member

    Updated whatsnew and Misc/ files.

    @pitrou
    Member

    pitrou commented Oct 8, 2013

    utf-16 isn't that widely used, so it's probably fine if it becomes a bit slower.

    @malemburg
    Member

    On 08.10.2013 10:46, Antoine Pitrou wrote:

    utf-16 isn't that widely used, so it's probably fine if it becomes a bit slower.

    It's the default encoding for Unicode text files and APIs on Windows,
    so I'd say it *is* widely used :-)

    http://en.wikipedia.org/wiki/UTF-16#Use_in_major_operating_systems_and_environments

    @pitrou
    Member

    pitrou commented Oct 8, 2013

    On 08.10.2013 10:46, Antoine Pitrou wrote:
    >
    > utf-16 isn't that widely used, so it's probably fine if it becomes
    > a bit slower.

    It's the default encoding for Unicode text files and APIs on Windows,
    so I'd say it *is* widely used :-)

    I've never seen any UTF-16 text files. Do you have other data?

APIs are irrelevant. You only pass very small strings to them (e.g.
file paths).

    @malemburg
    Member

    On 08.10.2013 11:03, Antoine Pitrou wrote:
    > 
    >>> utf-16 isn't that widely used, so it's probably fine if it becomes
    >>> a bit slower.
    >>
    >> It's the default encoding for Unicode text files and APIs on Windows,
    >> so I'd say it *is* widely used :-)
    > 
    > I've never seen any UTF-16 text files. Do you have other data?

    See the link I posted.

    MS Notepad and MS Office save Unicode text files in UTF-16-LE,
    unless you explicitly specify UTF-8, just like many other Windows
    applications that support Unicode text files:

    http://msdn.microsoft.com/en-us/library/windows/desktop/dd374101%28v=vs.85%29.aspx
    http://superuser.com/questions/294219/what-are-the-differences-between-linux-and-windows-txt-files-unicode-encoding

    This is simply due to the fact that MS introduced Unicode plain
    text files as UTF-16-LE files and only later added the possibility
    to also use UTF-8 with BOM versions.

APIs are irrelevant. You only pass very small strings to them (e.g.
file paths).

    You are forgetting that wchar_t is UTF-16 on Windows, so UTF-16
    is all around you when working on Windows, not only in the OS APIs,
    but also in most other Unicode APIs you find on Windows:

    http://msdn.microsoft.com/en-us/library/windows/desktop/dd374089%28v=vs.85%29.aspx
    http://msdn.microsoft.com/en-us/library/windows/desktop/dd374061%28v=vs.85%29.aspx

    @pitrou
    Member

    pitrou commented Oct 8, 2013

    MS Notepad and MS Office save Unicode text files in UTF-16-LE,
    unless you explicitly specify UTF-8, just like many other Windows
    applications that support Unicode text files:

    I'd be curious to know if people actually edit *text files* using
    Microsoft Word (rather than Word documents).
Same for Notepad, which is much too limited to edit anything other
than a 10-line configuration file.

    You are forgetting that wchar_t is UTF-16 on Windows, so UTF-16
    is all around you when working on Windows, not only in the OS APIs,
    but also in most other Unicode APIs you find on Windows:

Still, unless those APIs get passed rather large strings, the performance
difference should be irrelevant IMHO. We're talking about using those APIs
from Python, not from a raw optimized C program.

    @serhiy-storchaka
    Member

The UTF-16 codec is still fast enough. Let's first make it correct, and then we'll try to optimize it. I have an idea how to restore the 3.3 performance (if it's worth it; the code is already complicated enough).

Converting to/from wchar_t* uses different code.

    @malemburg
    Member

    On 08.10.2013 11:33, Antoine Pitrou wrote:


    > MS Notepad and MS Office save Unicode text files in UTF-16-LE,
    > unless you explicitly specify UTF-8, just like many other Windows
    > applications that support Unicode text files:

    I'd be curious to know if people actually edit *text files* using
    Microsoft Word (rather than Word documents).
Same for Notepad, which is much too limited to edit anything other
than a 10-line configuration file.

    The question is not so much which program they use for editing.
    The format "Unicode text file" is defined as UTF-16-LE on
    Windows (see the links I posted).

    > You are forgetting that wchar_t is UTF-16 on Windows, so UTF-16
    > is all around you when working on Windows, not only in the OS APIs,
    > but also in most other Unicode APIs you find on Windows:

Still, unless those APIs get passed rather large strings, the performance
difference should be irrelevant IMHO. We're talking about using those APIs
from Python, not from a raw optimized C program.

    Antoine, I'm just pointing out that your statement that UTF-16
    is not widely used may apply to the Unix world, but
    it doesn't apply to Windows. Java also uses UTF-16
    internally and makes this available via JNI as jchar*.

    The APIs on those platforms are used from Python (the interpreter
    and also by extensions) and do use the UTF-16 Python codec to
    convert the data to Python Unicode/string objects, so the fact
    that UTF-16 is used widely on some of the more popular
    platforms does matter.

    UTF-8, UTF-16 and UTF-32 codecs need to be as fast as possible
    in Python to not create performance problems when converting
    between platform Unicode data and the internal formats
    used in Python.

    The real question is: Can the UTF-16/32 codecs be made fast
    while still detecting lone surrogates ? Not whether UTF-16
    is widely used or not.

    @serhiy-storchaka
    Member

I repeat myself. Even with the patch, the UTF-16 codec is faster than the UTF-8 codec (except for ASCII-only data). It is the fastest Unicode codec in Python (perhaps UTF-32 can be made faster, but that is another issue).

    The real question is: Can the UTF-16/32 codecs be made fast
    while still detecting lone surrogates ? Not whether UTF-16
    is widely used or not.

Yes, they can. But let's defer this to other issues.

    @pitrou
    Member

    pitrou commented Oct 8, 2013

    UTF-8, UTF-16 and UTF-32 codecs need to be as fast as possible
    in Python to not create performance problems when converting
    between platform Unicode data and the internal formats
    used in Python.

    "As fast as possible" is a platonic dream.
    They only need to be fast enough not to be bottlenecks.
    If you know of a *Python* workload where UTF-16 decoding is the
    bottleneck, I'd like to know about it :-)

    @malemburg
    Member

    On 08.10.2013 11:42, Serhiy Storchaka wrote:

The UTF-16 codec is still fast enough. Let's first make it correct, and then we'll try to optimize it. I have an idea how to restore the 3.3 performance (if it's worth it; the code is already complicated enough).

    That's a good plan :-)

    @malemburg
    Member

    On 08.10.2013 12:30, Antoine Pitrou wrote:

    > UTF-8, UTF-16 and UTF-32 codecs need to be as fast as possible
    > in Python to not create performance problems when converting
    > between platform Unicode data and the internal formats
    > used in Python.

    "As fast as possible" is a platonic dream.
    They only need to be fast enough not to be bottlenecks.

    No, they need to be as fast as possible, without sacrificing
    correctness.

    This has always been our guideline for codec implementations
    and string methods. As a result, our implementations are some
    of the best out there.

    @vstinner
    Member

    vstinner commented Oct 8, 2013

I don't think that performance on a microbenchmark is the right question.
The right question is: does Python conform to Unicode? The answer is simple
and explicit: no. Encoding lone surrogates may lead to bugs and even
security vulnerabilities.

Please open a new performance issue after fixing this one if you have
another patch improving performance.

I haven't read the patch yet, but the strict, surrogatepass and
surrogateescape error handlers must be checked.

    @serhiy-storchaka
    Member

    Here is my idea: http://permalink.gmane.org/gmane.comp.python.ideas/23521.

I see that the discussion about how fast the UTF-16 codec should be is already larger than the discussion about the patches. Could you please review this not-so-simple patch instead?

One more thing I need help with is writing a note in the "Porting to Python 3.4" section of Doc/whatsnew/3.4.rst.

    @loewis
    Mannequin

    loewis mannequin commented Oct 8, 2013

Marc-Andre: please don't confuse "use in major operating systems" with "major use in operating systems". I agree with Antoine that UTF-16 isn't widely used on Windows, despite Notepad and Office supporting it. Most users of Notepad on Windows continue to use the ANSI code page, and most users of Word use Word files (instead of plain text).

Also, wchar_t on Windows isn't *really* UTF-16. Many APIs support lone surrogates just fine; they really are UCS-2 instead (e.g. the file system APIs). Only starting with Vista does MultiByteToWideChar complain about lone surrogates.

    @vstinner
    Member

I tested utf_16_32_surrogates_4.patch: surrogateescape as a decoder does not work as expected.

    >>> b'[\x00\x80\xdc]\x00'.decode('utf-16-le', 'ignore')
    '[]'
    >>> b'[\x00\x80\xdc]\x00'.decode('utf-16-le', 'replace')
    '[�]'
    >>> b'[\x00\x80\xdc]\x00'.decode('utf-16-le', 'surrogateescape')
    '[\udc80\udcdc\uffff'

    => I expected '[\udc80\udcdc]'.

With an encoder, surrogateescape does not work either:

    >>> '[\uDC80]'.encode('utf-16-le', 'surrogateescape')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'utf-16-le' codec can't encode character '\udc80' in position 1: surrogates not allowed

Following PEP 383, I expect that data.decode(encoding, 'surrogateescape') never fails, and that data.decode(encoding, 'surrogateescape').encode(encoding, 'surrogateescape') gives back data.
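For contrast, a sketch of the PEP 383 round-trip property holding with an ASCII-compatible codec:

>>> data = b'[\x80\xff]'
>>> text = data.decode('utf-8', 'surrogateescape')
>>> text
'[\udc80\udcff]'
>>> text.encode('utf-8', 'surrogateescape') == data
True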

    --

    With UTF-16, there is a corner case:

    >>> b'[\x00\x00'.decode('utf-16-le', 'surrogateescape')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/haypo/prog/python/default/Lib/encodings/utf_16_le.py", line 16, in decode
        return codecs.utf_16_le_decode(input, errors, True)
    UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x00 in position 2: truncated data
    >>> b'[\x00\x80'.decode('utf-16-le', 'surrogateescape')
    '[\udc80'

The incomplete sequence b'\x00' raises a decode error, whereas b'\x80' does not. Should we extend PEP 383 to bytes in the range [0; 127]? Or should we keep this behaviour?

    Sorry, this question is unrelated to this issue.

    @vstinner
    Member

Could you please review this not-so-simple patch instead?

    I did a first review of your code on rietveld.

    @serhiy-storchaka
    Member

    Updated patch addresses Victor's comments on Rietveld. Thank you Victor. The "surrogatepass" error handler now works with different spellings of encodings ("utf_32le", "UTF-32-LE", etc).

I tested utf_16_32_surrogates_4.patch: surrogateescape as a decoder does not work as expected.

Yes, surrogateescape doesn't work with ASCII-incompatible encodings, and it can't. For one, it can't represent the result of decoding b'\x00\xd8' from utf-16-le or b'ABCD' from utf-32*. This problem is worth a separate issue (or even a PEP) and a discussion on Python-Dev.

    @serhiy-storchaka
    Member

    Changed the documentation as was discussed with Ezio on IRC.

Ezio, do you want to commit this patch? Feel free to reword the documentation if you feel it can be better.

    @python-dev
    Mannequin

    python-dev mannequin commented Nov 19, 2013

    New changeset 0d9624f2ff43 by Serhiy Storchaka in branch 'default':
    Issue bpo-12892: The utf-16* and utf-32* codecs now reject (lone) surrogates.
    http://hg.python.org/cpython/rev/0d9624f2ff43

    @serhiy-storchaka
    Member

Ezio has approved the patch and I have committed it.

    Thank you Victor and Kang-Hao for your patches. Thanks all for the reviews.

    @python-dev
    Mannequin

    python-dev mannequin commented Nov 19, 2013

    New changeset 130597102dac by Serhiy Storchaka in branch 'default':
    Remove dead code committed in issue bpo-12892.
    http://hg.python.org/cpython/rev/130597102dac

    @vstinner
    Member

Thanks Ezio and Serhiy for having fixed the UTF-16 and UTF-32 codecs!

    @mjpieters
    Mannequin

    mjpieters mannequin commented Aug 10, 2015

I don't understand why encoding with surrogateescape still isn't supported; is it the fact that a surrogate would have to produce a single byte rather than two? E.g. b'\x80' -> '\udc80' -> b'\x80' doesn't work because that would mean the UTF-16 and UTF-32 codecs could end up producing an odd number of bytes?

    @serhiy-storchaka
    Member

    There are two causes:

1. UTF-16 and UTF-32 are based on 2- and 4-byte units. If the surrogateescape error handler supported UTF-16 and UTF-32, encoding could produce data that can't be decoded back correctly. For example '\udcac \udcac' -> b'\xac\x20\x00\xac' -> '\u20ac\uac00' == '€가' (see the sketch after this list).

2. ASCII bytes (0x00-0x7F) can't be escaped with surrogateescape, but ill-formed UTF-16 and UTF-32 data can contain such bytes (the 0x00 in b'\xD8\x00' in UTF-16-BE, or b'abcd' in UTF-32). For the same reason surrogateescape is not compatible with UTF-7 and CP037.
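Both causes can be checked directly (a sketch):

>>> b'\xac\x20\x00\xac'.decode('utf-16-le')     # cause 1: the escaped bytes
'€가'                                            # decode as *valid* UTF-16-LE
>>> b'\x80'.decode('ascii', 'surrogateescape')  # cause 2: bytes 0x80-0xFF escape
'\udc80'
>>> '\udc80'.encode('ascii', 'surrogateescape') # ...and round-trip,
b'\x80'
>>> # but bytes below 0x80, such as the 0x00 in b'\xd8\x00', have no escaped form.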

    @tmp12342
    Mannequin

    tmp12342 mannequin commented Aug 11, 2015

    Serhiy, I understand the first reason, but https://docs.python.org/3/library/codecs.html says

> applicable to text encodings:
> [...]
> This code will then be turned back into the same byte when the 'surrogateescape' error handler is used when encoding the data.

Shouldn't that be corrected? A text encoding is defined as "A codec which encodes Unicode strings to bytes."

And about the second one, could you explain a bit more? I mean, I don't know how to interpret it.

You say b'\xD8\x00' are invalid ASCII bytes, but of these two only 0xD8 is invalid. Also, we are talking about encoding here, str -> bytes, so who cares whether the resulting bytes are ASCII compatible or not?
