Title: UTF-7 decoder can produce inconsistent Unicode string
Type: security Stage: resolved
Components: Library (Lib), Unicode Versions: Python 3.3, Python 3.4, Python 2.7
Status: closed Resolution: fixed
Assigned To: Nosy List: BreamoreBoy, barry, benjamin.peterson, ezio.melotti, georg.brandl, glebourgeois, larry, mcepl, mrabarnett, ncoghlan, piotr.dobrogost, python-dev, serhiy.storchaka, vstinner
msg200117 - (view) Author: Guillaume Lebourgeois (glebourgeois) Date: 2013-10-17 09:55
After fetching a web page with a wrongly declared encoding, using the codecs module to convert the content crashes.

The issue can be reproduced this way:

>>> content = b"+1911\' rel=\'stylesheet\' type=\'text/css\' />\n<link rel="alternate" type="application/rss+xml"
>>> codecs.utf_7_decode(content, "replace", True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
SystemError: invalid maximum character passed to PyUnicode_New

Original issue here  :
msg200132 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2013-10-17 14:54
The bytestring literal isn't valid. It starts with b" and later on has an unescaped " followed by more characters.

Also, the usual way to decode is by using the .decode method.

I get this:

>>> content = b"+1911\' rel=\'stylesheet\' type=\'text/css\' />\n<link rel=\"alternate\" type=\"application/rss+xml\""
>>> content.decode("utf-7", "strict")
Traceback (most recent call last):
  File "<pyshell#10>", line 1, in <module>
    content.decode("utf-7", "strict")
  File "C:\Python33\lib\encodings\", line 12, in decode
    return codecs.utf_7_decode(input, errors, True)
UnicodeDecodeError: 'utf7' codec can't decode bytes in position 0-5: partial character in shift sequence
msg200133 - (view) Author: Guillaume Lebourgeois (glebourgeois) Date: 2013-10-17 15:07
My fault, bad paste. I should have written:

>>> content = b'+1911\' rel=\'stylesheet\' type=\'text/css\' />\n<link rel="alternate" type="application/rss+xml'
>>> codecs.utf_7_decode(content, "replace", True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
SystemError: invalid maximum character passed to PyUnicode_New
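
The reproduction above can also be wrapped in a small script to check an interpreter that carries the fix. This is only a minimal sketch, not part of the original report: the helper name `decode_utf7_replace` is hypothetical, and it assumes a Python 3 build where the call returns a string instead of raising SystemError.

```python
import codecs

# Byte string from the report; it contains malformed UTF-7 shift sequences.
content = b'+1911\' rel=\'stylesheet\' type=\'text/css\' />\n<link rel="alternate" type="application/rss+xml'


def decode_utf7_replace(data):
    """Decode with the 'replace' handler (hypothetical helper for illustration)."""
    # codecs.utf_7_decode returns a (text, bytes_consumed) pair.
    text, consumed = codecs.utf_7_decode(data, "replace", True)
    return text, consumed


text, consumed = decode_utf7_replace(content)
print(repr(text), consumed)
```

On an affected 3.3 build the call inside the helper is where the SystemError shown above would be raised.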
msg200134 - (view) Author: Guillaume Lebourgeois (glebourgeois) Date: 2013-10-17 15:13
"Also, the usual way to decode is by using the .decode method."

The original bug occurred while using the requests library, so I have no control over which method is used for decoding.

But if you had used the "replace" error handler with your approach, you would have gotten the same exception:

>>> content = b'+1911\' rel=\'stylesheet\' type=\'text/css\' />\n<link rel="alternate" type="application/rss+xml'
>>> content.decode("utf-7", "replace")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/lib/python3.3/encodings/", line 12, in decode
    return codecs.utf_7_decode(input, errors, True)
SystemError: invalid maximum character passed to PyUnicode_New
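
For code that does control the decoding step, one possible stopgap on an unfixed interpreter is to treat the SystemError as a decoding failure. The sketch below is an assumption for illustration, not something proposed in this thread: `safe_decode` and its `fallback` parameter are hypothetical names, and falling back to latin-1 is just one arbitrary choice that cannot fail.

```python
def safe_decode(data, declared_encoding, fallback="latin-1"):
    """Decode untrusted bytes (hypothetical helper, not from the thread)."""
    try:
        return data.decode(declared_encoding, "replace")
    except (SystemError, LookupError):
        # SystemError: the decoder bug discussed in this issue.
        # LookupError: the declared encoding name is unknown.
        return data.decode(fallback, "replace")


content = b'+1911\' rel=\'stylesheet\' type=\'text/css\' />\n<link rel="alternate" type="application/rss+xml'
print(repr(safe_decode(content, "utf-7")))
```

Catching SystemError is normally a bad idea because it hides interpreter bugs; it is shown here only as a temporary guard around input whose declared encoding comes from an external source.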
msg200135 - (view) Author: Alyssa Coghlan (ncoghlan) * (Python committer) Date: 2013-10-17 15:41
Indeed, 'utf-7' and the 'replace' error handler don't get along in this case.
msg200136 - (view) Author: Alyssa Coghlan (ncoghlan) * (Python committer) Date: 2013-10-17 15:41
That is, I can locally reproduce the behaviour Guillaume describes on the latest tip build.
msg200144 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-10-17 16:29
Here is a patch for 3.3+.

Other versions are affected too. They don't raise SystemError, but they produce an illegal Unicode string on wide builds.

E.g. in Python 2.7:

>>> 'a+/,+IKw-b'.decode('utf-7', 'replace')

\U003f20ac is an illegal code point.

Since both the encoding and the encoded data can come from an external source, this can be used in security attacks.
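
Why that value is illegal can be shown with a simple range check, since Unicode code points end at U+10FFFF. This is a minimal sketch assuming Python 3; `has_only_valid_code_points` is a hypothetical helper and not part of the patch. On Python 3.3+ a str cannot hold an out-of-range value at all (which is why the decoder fails with SystemError there instead), so the check is only meaningful as an illustration of the 2.7 wide-build case.

```python
MAX_CODE_POINT = 0x10FFFF  # highest code point defined by Unicode


def has_only_valid_code_points(text):
    """Return True if every character of text is within the Unicode range."""
    return all(ord(ch) <= MAX_CODE_POINT for ch in text)


# The value reported above lies outside the Unicode range.
print(0x3F20AC > MAX_CODE_POINT)

# On an interpreter with the fix, the same input decodes without error.
decoded = b'a+/,+IKw-b'.decode('utf-7', 'replace')
print(repr(decoded), has_only_valid_code_points(decoded))
```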
msg200253 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-10-18 13:47
And here is a patch for 2.7.
msg200263 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2013-10-18 14:33
2.6.9 doesn't produce a SystemError afaict:

Python 2.6.9rc1+ (unknown, Oct 18 2013, 10:29:22) 
[GCC 4.4.3] on linux3
Type "help", "copyright", "credits" or "license" for more information.
>>> content = b'+1911\' rel=\'stylesheet\' type=\'text/css\' />\n<link rel="alternate" type="application/rss+xml'
>>> content.decode("utf-7", "replace")
u'\ud7dd\ufffd rel=\'stylesheet\' type=\'text\ufffdcss\' \ufffd>\n<link rel="alternate" type="application\ufffdrss\uc669\ufffd'
msg200264 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2013-10-18 14:36
On Oct 18, 2013, at 02:33 PM, Barry A. Warsaw wrote:

>2.6.9 doesn't produce a SystemError afaict:

Please note that 2.6.9 is security only, so the threshold for worrying about
things is a remotely exploitable security vulnerability that cannot be
reasonably worked around in Python code.
msg200353 - (view) Author: Larry Hastings (larry) * (Python committer) Date: 2013-10-19 01:24
Ping.  Please fix before "beta 1".
msg200450 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2013-10-19 17:39
New changeset 214c0aac7540 by Serhiy Storchaka in branch '2.7':
Issue #19279: UTF-7 decoder no more produces illegal unicode strings.

New changeset f471f2f05621 by Serhiy Storchaka in branch '3.3':
Issue #19279: UTF-7 decoder no more produces illegal strings.

New changeset 7dde9c553f16 by Serhiy Storchaka in branch 'default':
Issue #19279: UTF-7 decoder no more produces illegal strings.
msg200465 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2013-10-19 18:17
New changeset 73ab6aba24e5 by Serhiy Storchaka in branch '3.3':
Fixed tests for issue #19279.
msg201508 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-10-27 23:26
@Serhiy: What is the status of the issue?
msg201515 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-10-28 06:27
The bug is fixed in the maintenance releases. The maintainer of 3.2 can backport the fix to 3.2 if it is worth it.
msg207788 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-01-09 19:39
Georg, is this issue worth fixing in 3.2? If so, use the patch against 2.7.
msg215458 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-04-03 17:00
> Georg, is this issue worth fixing in 3.2? If so, use the patch against 2.7.

msg222203 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2014-07-03 17:51
To repeat the question: do we or don't we fix this in 3.2?
msg222223 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-07-03 21:41
I suggest closing the issue. It's "just" another way to crash Python 3.2, like any other bug. Python 3.2 no longer accepts bug fixes.
