_Py_DecodeUTF8Ex() creates surrogate pairs on Windows #78109

Closed
vstinner opened this issue Jun 21, 2018 · 6 comments
Labels: 3.7 (EOL) end of life; 3.8 only security fixes; interpreter-core (Objects, Python, Grammar, and Parser dirs)

Comments

@vstinner
Member

BPO 33928
Nosy @vstinner, @serhiy-storchaka

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = <Date 2018-06-21.16:03:48.574>
created_at = <Date 2018-06-21.10:55:58.602>
labels = ['interpreter-core', 'invalid', '3.7', '3.8']
title = '_Py_DecodeUTF8Ex() creates surrogate pairs on Windows'
updated_at = <Date 2018-06-21.16:03:48.572>
user = 'https://github.com/vstinner'

bugs.python.org fields:

activity = <Date 2018-06-21.16:03:48.572>
actor = 'vstinner'
assignee = 'none'
closed = True
closed_date = <Date 2018-06-21.16:03:48.574>
closer = 'vstinner'
components = ['Interpreter Core']
creation = <Date 2018-06-21.10:55:58.602>
creator = 'vstinner'
dependencies = []
files = []
hgrepos = []
issue_num = 33928
keywords = []
message_count = 6.0
messages = ['320154', '320155', '320158', '320163', '320170', '320193']
nosy_count = 2.0
nosy_names = ['vstinner', 'serhiy.storchaka']
pr_nums = []
priority = 'normal'
resolution = 'not a bug'
stage = 'resolved'
status = 'closed'
superseder = None
type = None
url = 'https://bugs.python.org/issue33928'
versions = ['Python 3.7', 'Python 3.8']

@vstinner
Member Author

_Py_DecodeUTF8Ex() creates surrogate pairs when wchar_t is 16 bits wide (on Windows), whereas the input bytes should be escaped. I'm fairly sure this is a bug.

vstinner added the labels 3.7 (EOL) end of life; 3.8 only security fixes; interpreter-core (Objects, Python, Grammar, and Parser dirs) on Jun 21, 2018
@vstinner
Member Author

Excerpt from the _Py_DecodeUTF8Ex() code; it contains an explicit "write a surrogate pair" comment:

#if SIZEOF_WCHAR_T == 4
        ch = ucs4lib_utf8_decode(&s, e, (Py_UCS4 *)unicode, &outpos);
#else
        ch = ucs2lib_utf8_decode(&s, e, (Py_UCS2 *)unicode, &outpos);
#endif
        if (ch > 0xFF) {
#if SIZEOF_WCHAR_T == 4
            Py_UNREACHABLE();
#else
            assert(ch > 0xFFFF && ch <= MAX_UNICODE);
            /* write a surrogate pair */
            unicode[outpos++] = (wchar_t)Py_UNICODE_HIGH_SURROGATE(ch);
            unicode[outpos++] = (wchar_t)Py_UNICODE_LOW_SURROGATE(ch);
#endif
        }
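
For readers unfamiliar with the branch above: the assert shows it is only reached for code points above U+FFFF, which cannot fit in one 16-bit wchar_t. The following is a minimal sketch of the UTF-16 split such a branch performs. The helper names `high_surrogate` and `low_surrogate` are made up for illustration and are not CPython API; the arithmetic is the standard UTF-16 encoding rule.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers (not CPython API): split a code point above the
   Basic Multilingual Plane into a UTF-16 surrogate pair. */
static inline uint16_t high_surrogate(uint32_t ch)
{
    assert(ch > 0xFFFF && ch <= 0x10FFFF);
    /* Top 10 bits of (ch - 0x10000), offset into the D800..DBFF range. */
    return (uint16_t)(0xD800 + ((ch - 0x10000) >> 10));
}

static inline uint16_t low_surrogate(uint32_t ch)
{
    assert(ch > 0xFFFF && ch <= 0x10FFFF);
    /* Bottom 10 bits of (ch - 0x10000), offset into the DC00..DFFF range. */
    return (uint16_t)(0xDC00 + ((ch - 0x10000) & 0x3FF));
}
```

For example, U+1F600 splits into 0xD83D followed by 0xDE00. Such a pair encodes a valid character; it is unrelated to the lone surrogates produced by the surrogateescape error handler.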

@serhiy-storchaka
Member

Could you show an example please?

@vstinner
Member Author

> Could you show an example please?

I spotted the issue while reading the code; I haven't tried to trigger it with real code yet.

@serhiy-storchaka
Member

I don't see anything wrong.

@vstinner
Member Author

> I don't see anything wrong.

I wrote a C function to test _Py_DecodeUTF8Ex():

  • surrogateescape=0 fails with a decoding error, as expected
  • surrogateescape=1 escapes the bytes, as expected: '\udced\udcb2\udc80'

Ok, I just misunderstood the code: the decoder is fine!
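
The escaped result above can be reproduced with a small sketch. It assumes the test input was the three bytes 0xED 0xB2 0x80; this is inferred from the escaped output and is not stated in the thread. The helpers `escape_bytes` and `decode3` are hypothetical names, not CPython API. `escape_bytes` applies the surrogateescape rule (each undecodable byte b becomes the lone surrogate U+DC00 + b), and `decode3` shows why the sequence is rejected: its payload bits spell U+DC80, a surrogate, which UTF-8 forbids.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not CPython API): escape each byte as the lone
   surrogate U+DC00 + byte, following the surrogateescape rule.
   Assumes the caller has already decided that every byte in `in` is
   undecodable; such bytes are always >= 0x80. */
static inline size_t escape_bytes(const unsigned char *in, size_t n,
                                  uint16_t *out)
{
    for (size_t i = 0; i < n; i++) {
        assert(in[i] >= 0x80);
        out[i] = (uint16_t)(0xDC00 + in[i]);
    }
    return n;
}

/* Hypothetical helper: combine the payload bits of a 3-byte UTF-8-style
   sequence without validating it, to show which code point it spells. */
static inline uint32_t decode3(const unsigned char *s)
{
    return ((uint32_t)(s[0] & 0x0F) << 12)
         | ((uint32_t)(s[1] & 0x3F) << 6)
         |  (uint32_t)(s[2] & 0x3F);
}
```

Passing the three assumed bytes to `escape_bytes` yields 0xDCED, 0xDCB2, 0xDC80, matching the string reported above, while `decode3` on the same bytes gives 0xDC80.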

ezio-melotti transferred this issue from another repository on Apr 10, 2022