classification
Title: ascii() does not always join surrogate pairs
Type: behavior
Stage: resolved
Components: Interpreter Core, Unicode
Versions: Python 3.1, Python 3.2
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: amaury.forgeotdarc, pitrou, vstinner
Priority: normal
Keywords: patch

Created on 2010-09-08 21:47 by pitrou, last changed 2010-09-09 20:34 by pitrou. This issue is now closed.

Files
File name Uploaded Description
backslashsurrogates.patch pitrou, 2010-09-08 23:34
backslashsurrogates2.patch pitrou, 2010-09-09 11:18
Messages (10)
msg115905 - Author: Antoine Pitrou (pitrou) (Python committer) Date: 2010-09-08 21:47
This is on a UCS-2 py3k build:

>>> ascii('\U00012FFF')
"'\\U00012fff'"
>>> ascii('\U0001D121')
"'\\ud834\\udd21'"
msg115914 - Author: STINNER Victor (vstinner) (Python committer) Date: 2010-09-08 22:50
For unicode, ascii(x) is implemented as repr(x).encode('ascii', 'backslashreplace').decode('ascii').

repr(x) is "'" + x + "'" for printable characters (e.g. U+1D121), and "'\\U%08x'" % ord(x) for non-printable characters (e.g. U+12FFF).

About the unexpected output, the problem is that ascii+backslashreplace encodes non-BMP printable characters as b'\\uXXXX\\uXXXX' in narrow builds.

I don't see a simple way to encode non-BMP characters as b'\\UXXXXXXXX', because the principle of an error handler is that it escapes non-encodable characters one by one.
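
The equivalence described above can be checked with a small sketch. The helper name `ascii_via_backslashreplace` is hypothetical (it is not part of CPython), and the check assumes Python 3.3 or later, where str no longer stores surrogate pairs internally, so the narrow-build behaviour reported in this issue does not reproduce there.

```python
def ascii_via_backslashreplace(obj):
    """Pure-Python approximation of ascii(), following the description above.

    Hypothetical helper for illustration only; the real implementation is in C.
    """
    return repr(obj).encode('ascii', 'backslashreplace').decode('ascii')


# Printable BMP, printable non-BMP, and a lone surrogate.
samples = ['abc', '\xe9', '\u20ac', '\U0001D121', '\ud800']
for s in samples:
    # On Python 3.3+ the two spellings are expected to agree.
    assert ascii_via_backslashreplace(s) == ascii(s)
    print(ascii(s))
```

On such a build the non-BMP character comes out as a single \U escape, which is the result this issue asks for on narrow builds as well.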
msg115915 - Author: Antoine Pitrou (pitrou) (Python committer) Date: 2010-09-08 23:04
How about the following solution:

>>> def a(s):
...    s = s.encode('unicode-escape').decode('ascii')
...    s = s.replace("'", r"\'")
...    return "'" + s + "'"
... 
>>> s = "'\0\"\n\r\t abcd\x85é\U00012fff\U0001D121xxx\uD800."
>>> print(ascii(s)); print(a(s)); print(repr(s))
'\'\x00"\n\r\t abcd\x85\xe9\U00012fff\ud834\udd21xxx\ud800.'
'\'\x00"\n\r\t abcd\x85\xe9\U00012fff\U0001d121xxx\ud800.'
'\'\x00"\n\r\t abcd\x85é\U00012fff𝄡xxx\ud800.'


(I think I've included everything:
- normal chars
- control chars
- one-byte non-ASCII
- two-byte non-ASCII (and lone surrogate)
- printable and non-printable surrogate pairs
- single and double quotes)
msg115916 - Author: Antoine Pitrou (pitrou) (Python committer) Date: 2010-09-08 23:12
Actually, it would probably be simpler to export a _PyUnicode_Repr(PyUnicodeObject *, int only_ascii) function since all the code is already there in unicodeobject.c.
msg115917 - Author: Antoine Pitrou (pitrou) (Python committer) Date: 2010-09-08 23:16
Or perhaps not, since we would like surrogate pairs to be fused in other cases (ascii() of other types) as well.

So "backslashreplace" would need to be changed instead:

>>> print("\U00012345".encode('ascii', 'backslashreplace'))
b'\\ud808\\udf45'

Expected result (already works in UCS4 builds):

>>> print("\U00012345".encode('ascii', 'backslashreplace'))
b'\\U00012345'
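
The fusing behaviour expected here can be sketched in pure Python over a list of UTF-16 code units. This is only an illustration of the idea, not the committed C patch: the function name `escape_utf16_units` and the list-of-integers input are assumptions made for the example.

```python
def escape_utf16_units(units):
    """Backslash-escape UTF-16 code units, fusing valid surrogate pairs.

    Hypothetical sketch: `units` is a list of integers in range(0x10000),
    standing in for the buffer of a narrow (UCS-2) build.
    """
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        # A high surrogate immediately followed by a low surrogate
        # is combined into one code point and emitted as \UXXXXXXXX.
        if (0xD800 <= u <= 0xDBFF and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            code_point = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            out.append('\\U%08x' % code_point)
            i += 2
            continue
        if u < 0x80:
            out.append(chr(u))            # ASCII passes through unchanged
        elif u < 0x100:
            out.append('\\x%02x' % u)     # one-byte non-ASCII
        else:
            out.append('\\u%04x' % u)     # BMP, including lone surrogates
        i += 1
    return ''.join(out)


print(escape_utf16_units([0xD808, 0xDF45]))
```

A lone surrogate that is not followed by a low surrogate falls through to the \uXXXX branch, so it stays escaped on its own.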
msg115918 - Author: Antoine Pitrou (pitrou) (Python committer) Date: 2010-09-08 23:34
Here is a patch (lacking tests for now).
msg115920 - Author: STINNER Victor (vstinner) (Python committer) Date: 2010-09-09 00:12
> >>> s = "'\0\"\n\r\t abcd\x85é\U00012fff\U0001D121xxx\uD800."
> (...)
> (I think I've included everything:
> - normal chars
> - control chars
> - one-byte non-ASCII
> - two-byte non-ASCII (and lone surrogate)
> - printable and non-printable surrogate pairs
> - single and double quotes)

Maybe also add a lone surrogate followed directly by a surrogate pair, e.g. '\uD800\U0001D121'.
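
A minimal check of that case, assuming Python 3.3 or later (where the string holds two code points: the lone surrogate and U+1D121). It is a sketch of what such a test could assert, not the test from the attached patch.

```python
# Lone high surrogate followed directly by a non-BMP character.
s = '\ud800\U0001D121'

encoded = s.encode('ascii', 'backslashreplace')
print(encoded)
print(ascii(s))

# The lone surrogate keeps its own \uXXXX escape and is not merged
# with the character that follows it.
assert encoded == b'\\ud800\\U0001d121'
```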
msg115938 - Author: Antoine Pitrou (pitrou) (Python committer) Date: 2010-09-09 11:18
New patch with tests.
msg115959 - Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) (Python committer) Date: 2010-09-09 19:11
I agree with the feature and the patch, with two minor nits:
- Py_UCS4 should be used in place of "unsigned long"
- "*p >= 0xD800" is the most selective test and should be the first
msg115971 - Author: Antoine Pitrou (pitrou) (Python committer) Date: 2010-09-09 20:34
Modified patch committed in r84655 (3.x) and r84656 (3.1). Thanks!
History
Date User Action Args
2010-09-09 20:34:39 pitrou set status: open -> closed; resolution: fixed; messages: + msg115971; stage: needs patch -> resolved
2010-09-09 19:11:25 amaury.forgeotdarc set nosy: + amaury.forgeotdarc; messages: + msg115959
2010-09-09 11:18:02 pitrou set files: + backslashsurrogates2.patch; messages: + msg115938
2010-09-09 00:12:51 vstinner set messages: + msg115920
2010-09-08 23:34:37 pitrou set files: + backslashsurrogates.patch; keywords: + patch; messages: + msg115918
2010-09-08 23:16:33 pitrou set messages: + msg115917
2010-09-08 23:12:00 pitrou set messages: + msg115916
2010-09-08 23:04:52 pitrou set messages: + msg115915
2010-09-08 22:50:56 vstinner set messages: + msg115914
2010-09-08 21:47:47 pitrou create