classification
Title: UTF7 decoding is far too strict
Type: behavior Stage: patch review
Components: Unicode Versions: Python 3.1, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: pitrou Nosy List: Nick Barnes, ggenellina, lemburg, pitrou, vstinner
Priority: critical Keywords: needs review, patch

Created on 2008-11-25 11:11 by Nick Barnes, last changed 2009-05-04 18:59 by pitrou. This issue is now closed.

Files
File name Uploaded Description Edit
utf7patch Nick Barnes, 2008-12-01 18:07 Patch to Python 2.6 to fix problems with UTF7 Unicode codec.
issue4426.patch vstinner, 2009-05-04 12:02
python-unicode-trunk-patch Nick Barnes, 2009-05-04 16:34 New trunk patch
Messages (18)
msg76405 - (view) Author: Nick Barnes (Nick Barnes) Date: 2008-11-25 11:11
UTF-7 decoding raises an exception for any character not in the RFC2152
"Set D" (directly encoded characters).  In particular, it raises an
exception for characters in "Set O" (optional direct characters), such
as < = > [ ] @ etc.  These characters can legitimately appear in
UTF-7-encoded text, and should be decoded (as themselves).  As it is,
the UTF-7 decoder can't reliably be used to decode any UTF-7 text other
than that encoded by Python's own UTF-7 encoder.

Looking at the source of unicodeobject.c, the call to the SPECIAL macro
on line 1009 has hardcoded second and third arguments of zero.  Maybe
changing the second argument to 1 would fix this.  Maybe.
msg76409 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2008-11-25 12:38
Can you write some tests to help fixing this issue? Stupid example (I 
don't know UTF-8 encoding):
>>> all((byte.encode("utf-7") == byte) for byte in '<=>[]@')
>>> all((byte.decode("utf-7") == byte) for byte in '<=>[]@')
msg76414 - (view) Author: Nick Barnes (Nick Barnes) Date: 2008-11-25 13:09
# Note, this test covers issues 4425 and 4426

# Direct encoded characters:
set_d =
"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789'(),-./:?"

# Optional direct characters:
set_o = '!"#$%&*;<=>@[]^_`{|}'

all((c.encode('utf7') == c) for c in set_d)
all((c.decode('utf7') == c) for c in set_d)
all((c.decode('utf7') == c) for c in set_o)
msg76415 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-11-25 13:30
On 2008-11-25 12:11, Nick Barnes wrote:
> New submission from Nick Barnes <Nick.Barnes@pobox.com>:
> 
> UTF-7 decoding raises an exception for any character not in the RFC2152
> "Set D" (directly encoded characters).  In particular, it raises an
> exception for characters in "Set O" (optional direct characters), such
> as < = > [ ] @ etc.  These characters can legitimately appear in
> UTF-7-encoded text, and should be decoded (as themselves).  As it is,
> the UTF-7 decoder can't reliably be used to decode any UTF-7 text other
> than that encoded by Python's own UTF-7 encoder.

Thanks for noticing this. Apparently, the UTF-7 codec is not used
a lot by Python users, since it's been like this for years.

The tests we have do check round-trip safety, but not the special
characteristics of the UTF-7 codec.

Also note that the code for the codec was contributed and is, AFAIK,
not maintained by any of the Python developers.
msg76419 - (view) Author: Nick Barnes (Nick Barnes) Date: 2008-11-25 18:56
Well, I could submit a diff for unicodeobject.c, but I have never
contributed to Python (or used this particular tracking system) before.
 Is there a standard form for contributing changes?  Unified diff?
msg76423 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2008-11-25 19:35
> Is there a standard form for contributing changes?  Unified diff?

Attach a patch file (unified diff, yes).
msg76453 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-11-26 09:11
On 2008-11-25 19:56, Nick Barnes wrote:
> Nick Barnes <Nick.Barnes@pobox.com> added the comment:
> 
> Well, I could submit a diff for unicodeobject.c, but I have never
> contributed to Python (or used this particular tracking system) before.
>  Is there a standard form for contributing changes?  Unified diff?

Please send unified diffs and attach them to the ticket.

While you're at it: perhaps you could try to refactor the code a bit
in order to get rid off the many macros use throughout the codec.
They make the code a lot less readable.

Thanks,
-- 
Marc-Andre Lemburg
eGenix.com

________________________________________________________________________
2008-11-12: Released mxODBC.Connect 0.9.3      http://python.egenix.com/

:::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,MacOSX for free ! ::::

   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
    D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
           Registered at Amtsgericht Duesseldorf: HRB 46611
msg76496 - (view) Author: Nick Barnes (Nick Barnes) Date: 2008-11-27 12:18
I'll try to get to this next week.  Right now I'm snowed under.  I don't
promise to do any refactoring.
msg76696 - (view) Author: Nick Barnes (Nick Barnes) Date: 2008-12-01 14:02
My original defect report here was incorrect, or possibly only relates
to a particular older Python installation.  It is still the case that
UTF-7 decoding is fussier than it need be (decoding should be
permissive), and is broken specifically for the '/' character (ASCII
47).  I'm probably going to rewrite the whole codec for greater clarity
and decoding permissiveness.

Any UTF-7 encoder has two boolean parameters: whether to base-64 encode
whitespace (sp, ht, nl, cr), and whether to base-64 encode "set O"
characters.  The existing Python UTF-7 encoder says "no" to both of
these.  It would be useful to have them as options.  How should encoding
parameters such as these be passed?  As setstate() methods on the
IncrementalEncoder and StreamWriter objects?  Or should I provide four
separate codecs (retaining the existing behaviour in the 'utf7' codec,
of course).
msg76704 - (view) Author: Nick Barnes (Nick Barnes) Date: 2008-12-01 18:07
Here is my patch.  This is a rewrite of the UTF7 encoder and decoder. 
It now handles surrogate pairs correctly, so non-BMP characters work
with this codec.  And my motivating example ('/'.decode('utf7')) works
OK.  I'm not totally confident of the error-handling code here, but in
'strict' mode this is definitely a better codec than the one it
replaces.  This patch is based on the Python 2.6 distribution.
msg77933 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-12-16 21:03
I'm not in a position to comment on the encoding algorithm itself but I
have a couple of comments:

* I get the following compilation warning:
Objects/unicodeobject.c: In function ‘PyUnicode_DecodeUTF7Stateful’:
Objects/unicodeobject.c:1531: attention : ‘shiftOutStart’ may be used
uninitialized in this function

* shouldn't there be an additional test for the '/' behaviour (unless it
is already there and I have overlooked it)?

* your patch fails on another test:
Traceback (most recent call last):
  File "Lib/test/test_unicode.py", line 538, in test_codecs_utf7
    self.assertRaises(UnicodeDecodeError, '+\xc1'.decode, 'utf-7')
AssertionError: UnicodeDecodeError not raised
msg87113 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2009-05-04 11:07
Copy of msg76404 from duplicate issue (#4425):
----
'/'.encode('utf7') returns '+AC8-'.  It should return '/'.  See RFC 
2152.

'/'.decode('utf7') raises an exception (this is a special case of a
general problem with UTF-7 decoding, which I will report as a separate 
bug).
----
msg87117 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2009-05-04 12:00
I updated the patch to python trunk. I was hard because "patch -p1" 
failed at many places, so please double check the updated patch.

Changes from utf7patch:
 * I removed the test "c >=0" in DECODE_DIRECT(c) because c type 
is "Py_UNICODE" and the type is always unsigned (right?)
 * I added "shiftOutStart = p;" at the beginning of 
PyUnicode_DecodeUTF7Stateful() to fix the gcc warning reported by 
pitrou.
 * I fixed the failing test in test_unicode.py (also reported by 
pitrou)
 * I added the tests listed in msg76414 (by nick)

I didn't read the UTF-7 encoder or decoder code. I just updated the 
patch.
msg87119 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2009-05-04 12:02
(oops, i stripped spaces in my last patch)
msg87124 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-05-04 13:17
A quick comment on the patch: it seems to invert (quite futily I'd say)
the meaning of the arguments given to PyUnicode_EncodeUTF7, which is an
incompatible API change.
I'm in favour of reworking this patch in order to keep the original API.
If I'm not mistaken, it's just a matter of using the boolean complement
of these two arguments (encodeSetO, encodeWhiteSpace) inside the body of
PyUnicode_EncodeUTF7.
msg87137 - (view) Author: Nick Barnes (Nick Barnes) Date: 2009-05-04 16:34
This was my first contribution to Python.  I don't know what the rules
are on changing the arguments of an internal function such as
PyUnicode_EncodeUTF7().  Since I was rewriting the whole function
anyway, I tried to give it arguments which made more sense with respect
to the defining RFC.  If you want us to revert to the original meanings
of these arguments (so the third argument means "use base-64 encoding
for characters in set O", and not "use direct encoding for characters in
set O"), please can we have better names for them?

The name "encodeSetO" was meaningless: the function encodes *all* the
characters in the string.  What the argument specifies is whether the
"set O" characters are self-encoded or base-64 encoded.  So maybe it
should be called "base64SetO".

Ditto for the "encodeWhiteSpace" name.

Here's a trunk patch with the meaning of those parameters reverted, and
better names.
msg87147 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-05-04 18:25
Thanks for the update. Functions like PyUnicode_EncodeUTF7() are part of
the public C API, therefore their semantics can't be changed lightly.
The patch looks ok to me, apart from minor style issues.
msg87150 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-05-04 18:59
Committed in r72283 and r72285. Thanks!
History
Date User Action Args
2009-05-04 18:59:11pitrousetstatus: open -> closed
resolution: accepted -> fixed
messages: + msg87150
2009-05-04 18:25:18pitrousetassignee: pitrou
resolution: accepted
messages: + msg87147
versions: + Python 3.1, Python 2.7, - Python 2.6
2009-05-04 16:34:35Nick Barnessetfiles: + python-unicode-trunk-patch

messages: + msg87137
2009-05-04 13:17:34pitrousetmessages: + msg87124
2009-05-04 12:02:48vstinnersetfiles: + issue4426.patch

messages: + msg87119
2009-05-04 12:02:11vstinnersetfiles: - issue4426.patch
2009-05-04 12:00:50vstinnersetfiles: + issue4426.patch

messages: + msg87117
2009-05-04 11:07:13vstinnersetmessages: + msg87113
2008-12-16 21:03:35pitrousetpriority: critical
keywords: + patch, needs review
messages: + msg77933
stage: patch review
2008-12-06 00:54:49ggenellinasetnosy: + ggenellina
2008-12-01 18:07:58Nick Barnessetfiles: + utf7patch
messages: + msg76704
2008-12-01 14:02:09Nick Barnessetmessages: + msg76696
2008-11-29 22:33:43pitrousetnosy: + pitrou
2008-11-27 12:18:24Nick Barnessetmessages: + msg76496
2008-11-26 09:11:23lemburgsetmessages: + msg76453
2008-11-25 19:35:14vstinnersetmessages: + msg76423
2008-11-25 18:56:45Nick Barnessetmessages: + msg76419
2008-11-25 13:30:08lemburgsetnosy: + lemburg
messages: + msg76415
2008-11-25 13:09:21Nick Barnessetmessages: + msg76414
2008-11-25 12:38:45vstinnersetnosy: + vstinner
messages: + msg76409
2008-11-25 11:11:55Nick Barnescreate