This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer Guide.

classification
Title: UTF-7 decoder can produce inconsistent Unicode string
Type: security
Stage: resolved
Components: Library (Lib), Unicode
Versions: Python 3.3, Python 3.4, Python 2.7
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: BreamoreBoy, barry, benjamin.peterson, ezio.melotti, georg.brandl, glebourgeois, larry, mcepl, mrabarnett, ncoghlan, piotr.dobrogost, python-dev, serhiy.storchaka, vstinner
Priority: release blocker
Keywords: patch

Created on 2013-10-17 09:55 by glebourgeois, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
utf7_errors.patch serhiy.storchaka, 2013-10-17 16:29 review
utf7_errors-2.7.patch serhiy.storchaka, 2013-10-18 13:47 review
Messages (19)
msg200117 - (view) Author: Guillaume Lebourgeois (glebourgeois) Date: 2013-10-17 09:55
After fetching a web page with a wrongly declared encoding, converting its content with the codecs module crashes.

The issue can be reproduced this way:

>>> content = b"+1911\' rel=\'stylesheet\' type=\'text/css\' />\n<link rel="alternate" type="application/rss+xml"
>>> codecs.utf_7_decode(content, "replace", True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
SystemError: invalid maximum character passed to PyUnicode_New

Original issue here: https://github.com/kennethreitz/requests/issues/1682
msg200132 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2013-10-17 14:54
The bytestring literal isn't valid. It starts with b" and later on has an unescaped " followed by more characters.

Also, the usual way to decode is by using the .decode method.

I get this:

>>> content = b"+1911\' rel=\'stylesheet\' type=\'text/css\' />\n<link rel=\"alternate\" type=\"application/rss+xml\""
>>> content.decode("utf-7", "strict")
Traceback (most recent call last):
  File "<pyshell#10>", line 1, in <module>
    content.decode("utf-7", "strict")
  File "C:\Python33\lib\encodings\utf_7.py", line 12, in decode
    return codecs.utf_7_decode(input, errors, True)
UnicodeDecodeError: 'utf7' codec can't decode bytes in position 0-5: partial character in shift sequence
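
For readers comparing the two call styles used in this thread, here is a minimal sketch, assuming a current Python 3 interpreter, of how `bytes.decode` and the lower-level `codecs.utf_7_decode` relate. The sample text is made up for illustration and does not come from the report.

```python
import codecs

# Illustrative sample (an assumption, not data from the report):
# ASCII text plus one non-ASCII character.
text = "Hi Mom \u263a"
data = text.encode("utf-7")

# High-level API: returns the decoded str directly.
via_method = data.decode("utf-7", "strict")

# Low-level codec function: returns a (str, bytes_consumed) tuple.
# The third argument (final=True) signals that no more input will follow.
via_codec, consumed = codecs.utf_7_decode(data, "strict", True)

print(via_method == via_codec)
print(consumed == len(data))
```

Both calls go through the same decoder, which is why the choice of call style does not affect whether the bug is triggered.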
msg200133 - (view) Author: Guillaume Lebourgeois (glebourgeois) Date: 2013-10-17 15:07
My fault, bad paste. I should have written:

>>> content = b'+1911\' rel=\'stylesheet\' type=\'text/css\' />\n<link rel="alternate" type="application/rss+xml'
>>> codecs.utf_7_decode(content, "replace", True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
SystemError: invalid maximum character passed to PyUnicode_New
msg200134 - (view) Author: Guillaume Lebourgeois (glebourgeois) Date: 2013-10-17 15:13
"Also, the usual way to decode is by using the .decode method."

The original bug happened inside the requests library, so I have no control over which decoding method is used.

But if you had used the "replace" error handler with your approach, you would have gotten the same exception:

>>> content = b'+1911\' rel=\'stylesheet\' type=\'text/css\' />\n<link rel="alternate" type="application/rss+xml'
>>> content.decode("utf-7", "replace")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/lib/python3.3/encodings/utf_7.py", line 12, in decode
    return codecs.utf_7_decode(input, errors, True)
SystemError: invalid maximum character passed to PyUnicode_New
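
As a point of comparison, the following sketch shows what the same input is expected to do on an interpreter that includes the fix for this issue. It assumes a current Python 3; the variable names `strict_failed` and `replaced` are only for illustration.

```python
# The corrected byte string from the report.
content = b'+1911\' rel=\'stylesheet\' type=\'text/css\' />\n<link rel="alternate" type="application/rss+xml'

# "strict" should reject the malformed shift sequence with a regular
# UnicodeDecodeError, as in the traceback shown earlier in this thread.
try:
    content.decode("utf-7", "strict")
    strict_failed = False
except UnicodeDecodeError as exc:
    strict_failed = True
    print("strict:", exc.reason)

# "replace" should return a str instead of raising SystemError.
replaced = content.decode("utf-7", "replace")
print(type(replaced).__name__)

# Every character of a well-formed str lies within the Unicode range.
print(all(ord(ch) <= 0x10FFFF for ch in replaced))
```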
msg200135 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-10-17 15:41
Indeed, 'utf-7' and the 'replace' error handler don't get along in this case.
msg200136 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-10-17 15:41
That is, I can locally reproduce the behaviour Guillaume describes on the latest tip build.
msg200144 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-10-17 16:29
Here is a patch for 3.3+.

Other versions are affected too. They don't raise SystemError, but they produce an illegal Unicode string on wide builds.

E.g. in Python 2.7:

>>> 'a+/,+IKw-b'.decode('utf-7', 'replace')
u'a\ufffd\U003f20acb'

\U003f20ac is not a legal code point (the maximum is U+10FFFF).

Since both the encoding and the encoded data can come from an external source, this can be used in security attacks.
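
A small sketch, assuming Python 3.3 or later, of why the value 0x3F20AC from the 2.7 example cannot be a code point. The last part is only an observation about the numbers (63 is the base64 value of `/`, and `+IKw-` is the UTF-7 form of U+20AC); it is not a statement about the decoder's internals.

```python
import sys

bad = 0x3F20AC  # the value printed as \U003f20ac in the 2.7 example

# The largest valid code point; 0x10ffff on Python 3.3 and later.
print(hex(sys.maxunicode))
print(bad > sys.maxunicode)

# chr() refuses values outside the Unicode range.
try:
    chr(bad)
    rejected = False
except ValueError:
    rejected = True
print(rejected)

# Observation: the illegal value equals the base64 value of "/" (63)
# placed above the 16 bits of the euro sign that "+IKw-" encodes.
euro = b"+IKw-".decode("utf-7")
combined = (63 << 16) | ord(euro)
print(hex(combined))
```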
msg200253 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-10-18 13:47
And here is a patch for 2.7.
msg200263 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2013-10-18 14:33
2.6.9 doesn't produce a SystemError afaict:

Python 2.6.9rc1+ (unknown, Oct 18 2013, 10:29:22) 
[GCC 4.4.3] on linux3
Type "help", "copyright", "credits" or "license" for more information.
>>> content = b'+1911\' rel=\'stylesheet\' type=\'text/css\' />\n<link rel="alternate" type="application/rss+xml'
>>> content.decode("utf-7", "replace")
u'\ud7dd\ufffd rel=\'stylesheet\' type=\'text\ufffdcss\' \ufffd>\n<link rel="alternate" type="application\ufffdrss\uc669\ufffd'
msg200264 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2013-10-18 14:36
On Oct 18, 2013, at 02:33 PM, Barry A. Warsaw wrote:

>2.6.9 doesn't produce a SystemError afaict:

Please note that 2.6.9 is security only, so the threshold for worrying about
things is a remotely exploitable security vulnerability that cannot be
reasonably worked around in Python code.
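
To make the "worked around in Python code" criterion concrete, here is one possible sketch of such a workaround for callers that decode data with an untrusted declared encoding. The helper `decode_untrusted` and its Latin-1 fallback are hypothetical and appear in neither the patches nor the requests library. It only helps where the decoder raises an exception (the 3.3 behaviour reported here), not on builds that silently return an illegal string.

```python
def decode_untrusted(data, encoding, errors="replace", fallback="latin-1"):
    """Hypothetical helper: decode bytes whose declared encoding is not trusted."""
    try:
        return data.decode(encoding, errors)
    except (UnicodeDecodeError, SystemError, LookupError):
        # UnicodeDecodeError: malformed input under a strict handler.
        # SystemError: the crash reported in this issue on unfixed builds.
        # LookupError: the declared encoding name is unknown.
        # Latin-1 maps every byte to a character, so this cannot fail.
        return data.decode(fallback, "replace")


print(decode_untrusted(b"+1911'", "utf-7", errors="strict"))
print(decode_untrusted(b"abc", "no-such-codec"))
```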
msg200353 - (view) Author: Larry Hastings (larry) * (Python committer) Date: 2013-10-19 01:24
Ping.  Please fix before "beta 1".
msg200450 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2013-10-19 17:39
New changeset 214c0aac7540 by Serhiy Storchaka in branch '2.7':
Issue #19279: UTF-7 decoder no more produces illegal unicode strings.
http://hg.python.org/cpython/rev/214c0aac7540

New changeset f471f2f05621 by Serhiy Storchaka in branch '3.3':
Issue #19279: UTF-7 decoder no more produces illegal strings.
http://hg.python.org/cpython/rev/f471f2f05621

New changeset 7dde9c553f16 by Serhiy Storchaka in branch 'default':
Issue #19279: UTF-7 decoder no more produces illegal strings.
http://hg.python.org/cpython/rev/7dde9c553f16
msg200465 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2013-10-19 18:17
New changeset 73ab6aba24e5 by Serhiy Storchaka in branch '3.3':
Fixed tests for issue #19279.
http://hg.python.org/cpython/rev/73ab6aba24e5
msg201508 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-10-27 23:26
@Serhiy: What is the status of the issue?
msg201515 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-10-28 06:27
The bug is fixed in the maintenance releases. The maintainer of 3.2 can backport the fix to 3.2 if it is worth it.
msg207788 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-01-09 19:39
Georg, is this issue worth fixing in 3.2? If yes, use the patch against 2.7.
msg215458 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-04-03 17:00
> Georg, is this issue worth fixing in 3.2? If yes, use the patch against 2.7.

Ping?
msg222203 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2014-07-03 17:51
To repeat the question: do we or don't we fix this in 3.2?
msg222223 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-07-03 21:41
I suggest closing the issue. It's "just" another way to crash Python 3.2, and fixing it would be an ordinary bug fix. Python 3.2 does not accept bug fixes anymore.
History
Date                 User              Action  Args
2022-04-11 14:57:52  admin             set     github: 63478
2014-07-04 18:39:11  serhiy.storchaka  set     title: UTF-7 can produce inconsistent Unicode string -> UTF-7 decoder can produce inconsistent Unicode string
2014-07-04 18:38:35  serhiy.storchaka  set     status: open -> closed
                                               title: UTF-7 to UTF-8 decoding crash -> UTF-7 can produce inconsistent Unicode string
                                               stage: patch review -> resolved
                                               resolution: fixed
                                               versions: + Python 2.7, Python 3.3, Python 3.4, - Python 3.2
2014-07-03 21:41:45  vstinner          set     messages: + msg222223
2014-07-03 17:51:26  BreamoreBoy       set     nosy: + BreamoreBoy
                                               messages: + msg222203
2014-04-03 17:00:34  vstinner          set     messages: + msg215458
2014-01-09 19:39:08  serhiy.storchaka  set     messages: + msg207788
2013-11-22 07:09:49  mcepl             set     nosy: + mcepl
2013-10-28 06:27:12  serhiy.storchaka  set     messages: + msg201515
2013-10-27 23:26:38  vstinner          set     messages: + msg201508
2013-10-22 17:31:20  serhiy.storchaka  set     assignee: serhiy.storchaka ->
                                               versions: - Python 2.7, Python 3.3, Python 3.4
2013-10-19 18:17:20  python-dev        set     messages: + msg200465
2013-10-19 17:39:55  python-dev        set     nosy: + python-dev
                                               messages: + msg200450
2013-10-19 01:24:47  larry             set     messages: + msg200353
2013-10-18 14:40:57  barry             set     versions: - Python 2.6
2013-10-18 14:36:25  barry             set     messages: + msg200264
2013-10-18 14:33:18  barry             set     messages: + msg200263
2013-10-18 13:47:06  serhiy.storchaka  set     files: + utf7_errors-2.7.patch
                                               messages: + msg200253
2013-10-18 10:32:53  piotr.dobrogost   set     nosy: + piotr.dobrogost
2013-10-17 16:29:57  serhiy.storchaka  set     files: + utf7_errors.patch
                                               priority: normal -> release blocker
                                               type: crash -> security
                                               versions: + Python 2.6, Python 2.7, Python 3.2
                                               keywords: + patch
                                               nosy: + larry, benjamin.peterson, barry, georg.brandl
                                               messages: + msg200144
                                               stage: needs patch -> patch review
2013-10-17 15:41:54  ncoghlan          set     messages: + msg200136
2013-10-17 15:41:05  ncoghlan          set     nosy: + ncoghlan
                                               messages: + msg200135
2013-10-17 15:13:00  glebourgeois      set     messages: + msg200134
2013-10-17 15:07:30  glebourgeois      set     messages: + msg200133
2013-10-17 14:54:02  mrabarnett        set     nosy: + mrabarnett
                                               messages: + msg200132
2013-10-17 10:02:11  vstinner          set     nosy: + vstinner
2013-10-17 09:57:27  serhiy.storchaka  set     versions: + Python 3.4
                                               nosy: + ezio.melotti, serhiy.storchaka
                                               assignee: serhiy.storchaka
                                               components: + Unicode
                                               stage: needs patch
2013-10-17 09:55:36  glebourgeois      create