classification
Title: PyUnicode_FromFormat is broken on python 2
Type: behavior Stage: resolved
Components: Interpreter Core, Unicode Versions: Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: alex, ezio.melotti, haypo, python-dev
Priority: normal Keywords: patch

Created on 2014-07-21 17:38 by alex, last changed 2014-08-04 23:32 by ezio.melotti. This issue is now closed.

Files
File name Uploaded Description Edit
unicode_fromformat.patch haypo, 2014-07-21 22:18 review
Messages (9)
msg223594 - (view) Author: Alex Gaynor (alex) * (Python committer) Date: 2014-07-21 17:38
http://hg.python.org/cpython/file/2.7/Objects/unicodeobject.c#l840

Specifically it calls "PyObject_Str", which will return a PyStringObject * (cast to a PyObject *), and then calls "PyUnicode_GET_SIZE", which is of course totally incorrect.

This code was originally back-ported from 3.0 -> 2.6, so I imagine no one caught the bug then.
msg223596 - (view) Author: Alex Gaynor (alex) * (Python committer) Date: 2014-07-21 18:02
It's worth noting that this function is replete with incorrect assumptions about unicode vs. strings that came from the backport, the one I initially pointed out was merely the first.

The motivation for this issue is the SSL module backport (issue21308) for the record :-)
msg223601 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-07-21 19:34
Python 3 has many unit tests for PyUnicode_FromFormat(): see test_unicode.test_from_format().
msg223612 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-07-21 22:18
Here is a patch fixing %S and %R formats and supporting %li and %zi. It fixes also %S, %R and %V for non-ASCII characters.

PyUnicode_FromFormat() of Python 2 doesn't support width, precision and padding. For example, "%100i" does crash. For %S, %R and %V, the function decodes byte strings from ISO-8859-1 in Python 2, whereas it decodes from UTF-8 in Python 3.
msg223613 - (view) Author: Alex Gaynor (alex) * (Python committer) Date: 2014-07-21 22:22
Hi Victor, thanks for working on this. I don't know the Unicode codebase that well, but this looks like an obvious improvement to me (much less broken :-)).
msg223670 - (view) Author: Alex Gaynor (alex) * (Python committer) Date: 2014-07-22 15:54
Reviewed more closely today, I think the docs probably need updating, but otherwise this LGTM, massive improvement! Thanks!
msg223679 - (view) Author: Alex Gaynor (alex) * (Python committer) Date: 2014-07-22 18:28
I've tested in my local dev with my SSL patches applied, and I've confirmed that it fixes the segfaults.
msg224267 - (view) Author: Roundup Robot (python-dev) Date: 2014-07-29 22:39
New changeset 263701e0b77e by Victor Stinner in branch '2.7':
Issue #22023: Fix %S, %R and %V formats of PyUnicode_FromFormat().
http://hg.python.org/cpython/rev/263701e0b77e
msg224269 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-07-29 22:43
Oh, supporting %li and %zi is a new feature, it's too late to add this in Python 2.7. I removed this part of my patch. I commited the other part.

PyUnicode_FromFormat() decodes byte strings from ISO 8859-1 for %S, %R and %V formats. I don't like this choice, we should use the default encoding or UTF-8. But it's also probably too late to change that. At least, b'\xff' is decoded to '\xe4' instead of '\xffe4' (signed/unsigned integer issue, also fixed by my patch). It's a little bit better than before.
History
Date User Action Args
2014-08-04 23:32:52ezio.melottisettype: behavior
stage: resolved
2014-07-29 22:43:25hayposetstatus: open -> closed
resolution: fixed
messages: + msg224269
2014-07-29 22:39:22python-devsetnosy: + python-dev
messages: + msg224267
2014-07-22 18:28:29alexsetmessages: + msg223679
2014-07-22 15:54:57alexsetmessages: + msg223670
2014-07-21 22:22:38alexsetmessages: + msg223613
2014-07-21 22:18:19hayposetfiles: + unicode_fromformat.patch
keywords: + patch
messages: + msg223612
2014-07-21 19:34:08hayposetmessages: + msg223601
2014-07-21 18:02:16alexsetmessages: + msg223596
2014-07-21 17:38:29alexcreate