classification
Title: Sending binary data with a POST request in httplib can cause Unicode exceptions
Type: behavior Stage: test needed
Components: Versions: Python 2.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: orsenthil Nosy List: Adam.Cohen, aronacher, eric.araujo, gregory.p.smith, orsenthil, r.david.murray, ssbarnea, terry.reedy, thijs, vstinner
Priority: normal Keywords: patch

Created on 2011-06-24 15:39 by ssbarnea, last changed 2012-11-18 15:23 by eric.araujo.

Files
File name Uploaded Description Edit
urllib2.patch vstinner, 2011-09-22 23:54
Messages (18)
msg138953 - (view) Author: Sorin Sbarnea (ssbarnea) * Date: 2011-06-24 15:39
It looks like changes in Python 2.7 introduced some important bugs into httplib due to implicit str-unicode encoding/decoding.

One clear example is that PyAMF library doesn't work with Python 2.7 because it is not able to generate binary data POST responses.

Please check http://dev.pyamf.org/ticket/823

(partial traceback, full in above bug)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 937, in endheaders
    self._send_output(message_body)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 795, in _send_output
    msg += message_body
msg138971 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-06-24 18:18
If this worked in 2.6 and fails in 2.7, it would probably be helpful if we can determine what change broke it.  I believe hg has some sort of 'bisect' support that might make this not too onerous to do.  Senthil (or someone) will eventually either figure out the problem or do the bisect, but if you want to speed things along you could do the bisect.
msg138975 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2011-06-24 19:07
A crash is a segfault or equivalent.
Python 2.6 only gets security fixes.
PyAMF does not run on Python 3. Hence a problem with PyAMF is no evidence of a problem with 3.x. Separate tests/examples would be needed.

Changes are not bugs unless they introduce a discrepancy between code and doc. Please post a self-contained example that exhibits the behavior that you consider a problem. It should not just be a repeat of #11898. Then quote the section of the docs that says (or suggests) that the behavior should be different from what it is.

The PyAMF site says "PyAMF requires Python 2.4 or newer. Python 3.0 isn’t supported yet." Since 3.0 was deprecated 2 years ago with the release of 3.1, I strongly suspect that the statement was written before 2.7 was released a year ago. Library developers should not make open ended promises like 'or newer' -- certainly not without testing and revising as necessary with each new Python version.

If PyAMF was broken by planned, announced, and documented changes in 2.7, that is too bad, but it is a year too late to change 2.7. Like all new versions, it had public beta and release candidate phases when people could test their packages and make comments.

I believe what David is getting at is finding out for sure whether the change was intended or not.

The quote from the link you provide
  >msg += message_body
appears to be the programming error, already explained in #11898,
where msg is unicode and message_body is bytes with non-ascii bytes.

>>> u'a'+'\xf0'
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf0 in position 0: ordinal not in range(128)
This is exactly the same error message that followed in the link, except for the position of the non-ascii byte. The fix is to not do the above.
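To make the mechanism concrete: when Python 2 concatenates a unicode object with a byte string, it first decodes the byte string with the default codec (ASCII unless reconfigured). The sketch below writes that implicit step out explicitly so that it behaves the same on 2.7 and 3.x. `concat_like_py2` is a hypothetical helper for illustration only and is not part of httplib.

```python
def concat_like_py2(text, raw):
    """Mimic Python 2's implicit coercion: decode the bytes as ASCII, then concatenate."""
    return text + raw.decode("ascii")

# Pure-ASCII bytes decode cleanly, so the concatenation succeeds.
ok = concat_like_py2(u"a", b"b")

# A non-ASCII byte cannot be decoded, which is the failure seen in the traceback.
try:
    concat_like_py2(u"a", b"\xf0")
    failed = False
    error_text = ""
except UnicodeDecodeError as exc:
    failed = True
    error_text = str(exc)
```

Whether the concatenation works therefore depends on the content of the byte string, not on its type, which is why the error can appear only for some request bodies.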
msg138977 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2011-06-24 19:47
Did things like "u'a'+'\xf0'" work in 2.6- (with implicit latin-1 decoding)? (I do not have 2.6 loaded.)

The doc for seq+seq (concatenation) in the language reference section 5.6. Binary arithmetic operations says that both sequences must be the same type. In the Library manual, 5.6. Sequence Types, the footnote for seq+seq makes no mention of a special exception for (some) mixed unicode/byte concatenations. I think footnote 6 about string+string should both note the exception and its limitation (and if the limitation was changed in 2.7, say so). (In any case, the exception was removed in Py3, so *this* is not a Py3 issue.)
msg138989 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-06-24 21:41
Many applications and libraries say "Python X.Y or newer", and it is one of the strengths of Python that this will often be true.  That's what our backward compatibility policy is about, and that's why the fact that it isn't true for 2.x->3.x is such a big deal.  As far as I can see there was no deprecation involved here, so "announced" is not a factor, I think.  We won't be sure until we know what changed.

All that said, it is quite possible (even likely, given #11898) that the pyamf code contains a bug and only worked by accident, and is now failing because some other bug in Python was fixed.  Again, we won't know until we have a complete diagnosis of the cause of the change in behavior.
msg139103 - (view) Author: Sorin Sbarnea (ssbarnea) * Date: 2011-06-25 17:10
You are right, I debugged the problem a little more and discovered at least one bug in PyAMF.

Still, I was surprised to find something very strange: it looks like BytesIO.getvalue() returns `str` even though the documentation says it returns `bytes`. Should I file another bug?

Python 2.7.1 (r271:86832, Jun 13 2011, 14:28:51) 
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import io
>>> a = io.BytesIO()
>>> a
<_io.BytesIO object at 0x10f9453b0>
>>> a.getvalue()
''
>>> print type(a.getvalue())
<type 'str'>
>>>
msg139107 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-06-25 18:26
No, that's correct.  In python 2.x the 'bytes' stuff is just a portability aid.  In 2.x, bytes and string are the same type.  In Python 3 they aren't, so by using the 'fake' classes in python2 you can often make your code work correctly on both python2 and python3.

So, can this issue be closed, or do you think there still might be a valid backward compatibility issue?
msg139108 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2011-06-25 18:31
In 2.7, bytes is an alias for str to aid porting to 3.x.
>>> bytes is str
True
>>> type(bytes())
<type 'str'>

I suspect the doc uses 'bytes' rather than 'str' because it was backported from 3.x. Perhaps it should be changed but I do not know the policy on using the alias in 2.6/7 docs.

I presume in 2.7 io.BytesIO is similar, if not equivalent to io.StringIO, but it is not an alias. Again, it was added so 2.7 code could use a bytes memory buffer that would remain bytes in 3.x and not become unicode text, like StringIO does.
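A small version-neutral check of the points above, offered as a sketch rather than as anything from the docs: `io.BytesIO.getvalue()` returns an instance of `bytes` on both lines of Python, and only the relationship between `bytes` and `str` differs.

```python
import io
import sys

buf = io.BytesIO()
buf.write(b"\xff\x00")
value = buf.getvalue()

# True on 2.7 and 3.x alike: the buffer always hands back a bytes object.
is_bytes = isinstance(value, bytes)

# True on 2.x (bytes is an alias for str), False on 3.x (distinct types).
bytes_is_str = bytes is str

running_py2 = sys.version_info[0] == 2
```

So seeing `<type 'str'>` from `getvalue()` on 2.7 is consistent with the documentation's use of `bytes`.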
msg139265 - (view) Author: Sorin Sbarnea (ssbarnea) * Date: 2011-06-27 12:59
Here is a test file that will replicate the problem. I added it as a gist so it could support contributions ;)

Py <2.7 works
Py ==2.7 fails
Py >=3.0 works after minor changes required by py3k

https://gist.github.com/1047551
msg139268 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-06-27 13:37
rdmurray>python2.6 py27-str-unicode-bytes.py 
type(b)=<type 'str'>
Traceback (most recent call last):
  File "py27-str-unicode-bytes.py", line 17, in <module>
    unicode_str += b # this line will throw UnicodeDecodeError on Python 2.7
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 4: ordinal not in range(128)

And of course it doesn't work earlier than 2.6 since the b'' notation isn't supported before 2.6.
msg139269 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-06-27 13:41
To clarify: if I convert your program to using strings pre2.6, it still fails with a UnicodeDecodeError, as one would expect.  bytes are strings in 2.x.
msg139271 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-06-27 13:48
And finally, your program does *not* succeed on Python3, except in the trivial sense that on python3 you never attempt to add the string and bytes data.  It is exactly this kind of programming error that Python3 is designed to avoid: instead of sometimes getting a UnicodeDecodeError depending on what is in the "bytes" string, you *always* get a "Can't convert 'bytes' object to str implicitly" error when you attempt to add string and bytes.
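A minimal sketch of that Python 3 behavior, assuming a Python 3 interpreter; `try_concat` is a hypothetical helper. The exact wording of the error message varies between 3.x releases, so only the exception type is recorded here.

```python
def try_concat(text, raw):
    """Return text + raw, or the name of the exception raised by the addition."""
    try:
        return text + raw
    except TypeError as exc:
        return type(exc).__name__

# On Python 3 both calls fail the same way, whatever the bytes contain.
ascii_result = try_concat("abc", b"def")
binary_result = try_concat("abc", b"\xff\xfe")
```

The failure no longer depends on the data, so the mistake shows up on the first test run instead of only when a non-ASCII byte happens to appear.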
msg139272 - (view) Author: Sorin Sbarnea (ssbarnea) * Date: 2011-06-27 13:53
Right, so you have some binary data and you want to send it to `httplib`. This worked in the past when `msg` was a non-unicode string, but starting with Python 2.7 this became a unicode string, so when you try to append the `message` it will fail because it will try to decode it.
msg139283 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-06-27 14:36
But senthil already demonstrated in the previous issue that it does not become a unicode string unless you use unicode input.

You also claimed that your test program here succeeded in python2.6, but it does not.  This casts a little bit of doubt on your claim that there is a regression.

Can you produce a minimal example of using httplib that demonstrates the regression?
msg139304 - (view) Author: Sorin Sbarnea (ssbarnea) * Date: 2011-06-27 15:54
I updated the gist and made a minimal test
https://gist.github.com/1047551
msg144427 - (view) Author: Adam Cohen (Adam.Cohen) Date: 2011-09-22 22:11
I encountered this issue as well. "params" is simply a bytestring, with no encoding. Workaround/proper solution is to cast the string as a bytearray with bytearray(params).
msg144433 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-09-22 23:54
Here is a patch for httplib that encodes HTTP headers to ISO-8859-1, as done in Python 3 (see HTTPConnection.putheader() from http.client). urllib is not affected by this issue because it already encodes Unicode, although to ASCII rather than ISO-8859-1.

Related commit in Python 3:

changeset:   67720:b3cadf5cf742
user:        Armin Ronacher <armin.ronacher@active-4.com>
date:        Sat Jan 22 13:44:22 2011 +0000
files:       Lib/http/client.py Lib/test/test_httpservers.py Misc/NEWS
description:

To match the behaviour of HTTP server, the HTTP client library now also encodes headers with iso-8859-1 (latin1) encoding.  It was already doing that for incoming headers which makes this behaviour now consistent in both incoming and outgoing direction.
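The idea of encoding headers before they are joined with the body can be sketched as follows. This is an illustration under stated assumptions and not the contents of urllib2.patch: `encode_header` and `build_request` are hypothetical helpers, and the host name, path, and header values are made up.

```python
def encode_header(name, value):
    """Return one header line as bytes; text values are encoded to ISO-8859-1."""
    if not isinstance(name, bytes):
        name = name.encode("ascii")
    if not isinstance(value, bytes):
        value = value.encode("iso-8859-1")
    return name + b": " + value


def build_request(request_line, headers, body):
    """Join the request line, encoded headers, and a binary body into one buffer."""
    lines = [request_line] + [encode_header(k, v) for k, v in headers]
    return b"\r\n".join(lines) + b"\r\n\r\n" + body


# Unicode header values no longer turn the buffer into text,
# so a body containing non-ASCII bytes can be appended safely.
request = build_request(
    b"POST /gateway HTTP/1.1",
    [(u"Host", u"example.org:8080"), (u"Content-Length", u"2")],
    b"\xff\x00",
)
```

Because every header is already a byte string when the buffer is assembled, the final concatenation with the body never triggers an implicit decode.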
msg175727 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2012-11-17 08:09
I'm running into this on 2.7.3 with code that worked fine on 2.6.5.

The problem appears to be caused by a 'Host' http header that has a unicode type for the hostname:port value.

Encoding header values makes sense though I haven't yet examined the patch in detail.
History
Date | User | Action | Args
2012-11-18 15:23:13 | eric.araujo | set | nosy: + aronacher
2012-11-17 08:09:11 | gregory.p.smith | set | nosy: + gregory.p.smith; messages: + msg175727
2011-09-22 23:54:34 | vstinner | set | files: + urllib2.patch; keywords: + patch; messages: + msg144433
2011-09-22 22:11:22 | Adam.Cohen | set | nosy: + Adam.Cohen; messages: + msg144427
2011-08-07 06:07:40 | orsenthil | set | assignee: orsenthil; nosy: + orsenthil
2011-07-04 16:15:31 | eric.araujo | set | nosy: + eric.araujo
2011-07-03 21:04:48 | thijs | set | nosy: + thijs
2011-06-27 15:54:00 | ssbarnea | set | messages: + msg139304
2011-06-27 14:36:25 | r.david.murray | set | messages: + msg139283
2011-06-27 13:53:51 | ssbarnea | set | messages: + msg139272
2011-06-27 13:48:16 | r.david.murray | set | messages: + msg139271
2011-06-27 13:41:29 | r.david.murray | set | messages: + msg139269
2011-06-27 13:38:59 | vstinner | set | nosy: + vstinner
2011-06-27 13:37:24 | r.david.murray | set | messages: + msg139268
2011-06-27 12:59:56 | ssbarnea | set | messages: + msg139265
2011-06-25 18:31:39 | terry.reedy | set | messages: + msg139108
2011-06-25 18:26:37 | r.david.murray | set | messages: + msg139107
2011-06-25 17:10:24 | ssbarnea | set | messages: + msg139103
2011-06-24 21:41:59 | r.david.murray | set | messages: + msg138989
2011-06-24 19:47:35 | terry.reedy | set | messages: + msg138977
2011-06-24 19:07:58 | terry.reedy | set | stage: test needed; type: crash -> behavior; versions: - Python 3.1, Python 3.2, Python 3.3, Python 3.4
2011-06-24 19:07:26 | terry.reedy | set | nosy: + terry.reedy; messages: + msg138975
2011-06-24 18:18:03 | r.david.murray | set | nosy: + r.david.murray; messages: + msg138971
2011-06-24 15:39:53 | ssbarnea | create |