
classification
Title: _Py_DecodeUTF8Ex() creates surrogate pairs on Windows
Type:
Stage: resolved
Components: Interpreter Core
Versions: Python 3.8, Python 3.7
process
Status: closed
Resolution: not a bug
Dependencies:
Superseder:
Assigned To:
Nosy List: serhiy.storchaka, vstinner
Priority: normal
Keywords:

Created on 2018-06-21 10:55 by vstinner, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Messages (6)
msg320154 - Author: STINNER Victor (vstinner) (Python committer) Date: 2018-06-21 10:55
_Py_DecodeUTF8Ex() creates surrogate pairs with 16-bit wchar_t (on Windows), whereas input bytes should be escaped. I'm quite sure that it's a bug.
msg320155 - Author: STINNER Victor (vstinner) (Python committer) Date: 2018-06-21 10:57
Excerpt from the _Py_DecodeUTF8Ex() code; note the explicit "write a surrogate pair" comment:

#if SIZEOF_WCHAR_T == 4
        ch = ucs4lib_utf8_decode(&s, e, (Py_UCS4 *)unicode, &outpos);
#else
        ch = ucs2lib_utf8_decode(&s, e, (Py_UCS2 *)unicode, &outpos);
#endif
        if (ch > 0xFF) {
#if SIZEOF_WCHAR_T == 4
            Py_UNREACHABLE();
#else
            assert(ch > 0xFFFF && ch <= MAX_UNICODE);
            /* write a surrogate pair */
            unicode[outpos++] = (wchar_t)Py_UNICODE_HIGH_SURROGATE(ch);
            unicode[outpos++] = (wchar_t)Py_UNICODE_LOW_SURROGATE(ch);
#endif
        }
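
For reference, the two macros in the #else branch perform the standard UTF-16 split of a code point above U+FFFF, which is what the assert in that branch requires. A minimal standalone sketch, assuming a hypothetical helper name split_surrogate_pair() that is not part of CPython:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not a CPython function): split a code point above
   the BMP into a UTF-16 surrogate pair, mirroring the 16-bit wchar_t
   branch of the excerpt. */
void split_surrogate_pair(uint32_t ch, uint16_t *high, uint16_t *low)
{
    /* Same precondition as the assert in the excerpt. */
    assert(ch > 0xFFFF && ch <= 0x10FFFF);
    ch -= 0x10000;                              /* 20 significant bits remain */
    *high = (uint16_t)(0xD800 + (ch >> 10));    /* top 10 bits */
    *low  = (uint16_t)(0xDC00 + (ch & 0x3FF));  /* bottom 10 bits */
}
```

For example, U+1F600 splits into 0xD83D followed by 0xDE00. This path applies to a successfully decoded character, which is separate from the escaping of undecodable input bytes.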
msg320158 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2018-06-21 11:00
Could you show an example please?
msg320163 - Author: STINNER Victor (vstinner) (Python committer) Date: 2018-06-21 11:10
> Could you show an example please?

I spotted the issue while reading the code; I haven't tried to trigger it with real code yet.
msg320170 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2018-06-21 11:43
I don't see anything wrong.
msg320193 - Author: STINNER Victor (vstinner) (Python committer) Date: 2018-06-21 16:03
> I don't see anything wrong.

I wrote a C function to test _Py_DecodeUTF8Ex():

* surrogateescape=0 fails with a decoding error, as expected
* surrogateescape=1 escapes the bytes, as expected: '\udced\udcb2\udc80'

OK, I just misunderstood the code: the decoder is fine!
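
The escaped result is consistent with the surrogateescape rule from PEP 383, which maps each undecodable byte 0x80..0xFF to the lone surrogate U+DC80..U+DCFF. A minimal sketch of that mapping, assuming a hypothetical helper escape_bytes() (not a CPython function) and assuming the test input was the bytes 0xED 0xB2 0x80, which is inferred from the output rather than stated in the thread:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a CPython function): escape n undecodable
   bytes the way the surrogateescape error handler does, one lone
   surrogate per byte. Returns the number of code units written. */
size_t escape_bytes(const unsigned char *in, size_t n, uint16_t *out)
{
    for (size_t i = 0; i < n; i++) {
        /* Only non-ASCII bytes are ever escaped. */
        assert(in[i] >= 0x80);
        out[i] = (uint16_t)(0xDC00 + in[i]);  /* 0x80..0xFF -> U+DC80..U+DCFF */
    }
    return n;
}
```

Applied to the assumed input 0xED 0xB2 0x80, this yields 0xDCED, 0xDCB2, 0xDC80, matching the string reported above. Each code unit is a single low surrogate standing for one raw byte, so no surrogate pair is involved.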
History
Date                 User              Action  Args
2022-04-11 14:59:02  admin             set     github: 78109
2018-06-21 16:03:48  vstinner          set     status: open -> closed; resolution: not a bug; messages: + msg320193; stage: resolved
2018-06-21 11:43:19  serhiy.storchaka  set     messages: + msg320170
2018-06-21 11:10:16  vstinner          set     messages: + msg320163
2018-06-21 11:00:41  serhiy.storchaka  set     messages: + msg320158
2018-06-21 10:57:13  vstinner          set     messages: + msg320155
2018-06-21 10:55:58  vstinner          create