classification
Title: urllib.urlencode provides two features in one param
Type: enhancement Stage: resolved
Components: Documentation, Library (Lib) Versions: Python 2.7, Python 2.6
process
Status: closed Resolution: duplicate
Dependencies: Superseder:
Assigned To: docs@python Nosy List: docs@python, georg.brandl, mike_j_brown, orsenthil, salty-horse, terry.reedy
Priority: normal Keywords: easy

Created on 2005-11-06 21:58 by salty-horse, last changed 2010-07-14 18:53 by orsenthil. This issue is now closed.

Messages (5)
msg60833 - Author: Ori Avtalion (salty-horse) Date: 2005-11-06 21:58
Using the 2.4 distribution.

It seems that urlencode knows how to handle unicode
input with quote_plus and ascii encoding, but it only
does that when doseq is True.

1) There's no mention of that useful feature in the
documentation.
2) If I want to encode unicode data without doseq's
feature, there's no way to do so. Although it's rare to
use doseq's intended function, they shouldn't be connected.

Shouldn't values be checked with _is_unicode and
handled correctly in both modes of doseq?
One reason I see that *might* make the unicode check
cause problems is the comment says "preserve old
behavior" when doseq is False. Could such a check
affect the behaviour of old code?
If it can, the unicode handling could be another
optional parameter.

Also, the docstring is really unclear as to the purpose
of doseq.
Can a small example be added? (I saw no PEP guidelines
for how examples should be given in docstrings, or if
they're even allowed, so perhaps this fits just the
regular documentation)

With query={"key": ("val1", "val2")}:
doseq=1 yields: key=val1&key=val2
doseq=0 yields: key=%28%27val1%27%2C+%27val2%27%29
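
For reference, the two outputs above can be reproduced with a minimal sketch. It assumes Python 3, where the function lives at urllib.parse.urlencode rather than the Python 2 urllib.urlencode discussed in this report; the variable names are illustrative only.

```python
# Assumption: Python 3, where urlencode is in urllib.parse
# (the report itself concerns Python 2's urllib.urlencode).
from urllib.parse import urlencode

query = {"key": ("val1", "val2")}

# doseq=True: each element of the tuple becomes its own key=value pair.
with_doseq = urlencode(query, doseq=True)

# doseq=False (the default): the whole tuple goes through str() and is
# percent-encoded as a single value.
without_doseq = urlencode(query, doseq=False)

print(with_doseq)     # key=val1&key=val2
print(without_doseq)  # key=%28%27val1%27%2C+%27val2%27%29
```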

After the correct solution is settled, I'll gladly
submit a patch with the fixes.
msg60834 - Author: Mike Brown (mike_j_brown) Date: 2005-12-29 23:32
I understand why the implementation is the way it is. I
agree that it is not documented as ideally as it could be. I
also agree with your implication that ASCII-range unicode
input should be acceptable (and converted to ASCII bytes
internally before percent-encoding), regardless of doseq. I
would not go so far as to say non-ASCII-range unicode should
be accepted, since safe conversion to bytes before
percent-encoding would not be possible.

However, I was unable to reproduce your observation that
doseq=0 results in urlencode not knowing how to handle
unicode. The object is just passed to str(). Granted, that's
not *quite* the same as when doseq=1, where unicode objects
are specifically run through .encode('us-ascii','replace'),
but I wouldn't characterize it as not knowing how to handle
ASCII-range unicode. The results for ASCII-range unicode are
the same.

If you're going to make things more consistent, I would
actually tighten up the doseq=1 behavior, replacing

v = quote_plus(v.encode("ASCII","replace"))

with

v = quote_plus(v.encode("ASCII","strict"))

and then mention in the docs that any object type is
acceptable as a key or value, but if unicode is passed, it
must be all ASCII-range characters; if there is a risk of
characters above \u007f being passed, then the caller should
convert the unicode to str beforehand.
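
The difference between the "replace" and "strict" error handlers can be sketched as follows. This is an illustration under stated assumptions, not the library's code: it uses Python 3's urllib.parse.quote_plus, and quote_ascii is a hypothetical helper that mimics the encode-then-quote step quoted above.

```python
# Assumption: Python 3; quote_ascii is a hypothetical helper, not part of urllib.
from urllib.parse import quote_plus


def quote_ascii(value, errors):
    """Encode text to ASCII bytes with the given error handler, then percent-encode."""
    return quote_plus(value.encode("ascii", errors))


ascii_text = "val 1"
hebrew_text = "\u05d1\u05d3\u05d9\u05e7\u05d4"

# For ASCII-range input the two handlers give identical results.
same_either_way = quote_ascii(ascii_text, "replace") == quote_ascii(ascii_text, "strict")

# "replace" silently turns every non-ASCII character into "?" (quoted as %3F).
lossy = quote_ascii(hebrew_text, "replace")

# "strict" refuses the input instead of corrupting it.
try:
    quote_ascii(hebrew_text, "strict")
    strict_raised = False
except UnicodeEncodeError:
    strict_raised = True

print(same_either_way, lossy, strict_raised)
```

With "strict", the caller finds out about unencodable characters immediately, which is the tightening proposed here.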

As for doseq's purpose and documentation, the doseq=1
behavior is ideal for almost all situations, since it takes
care not to treat str or unicode as a sequence of separate
1-character values. AFAIK, the only reason it isn't the
default is for backward compatibility. It was introduced in
Python 2.0.1 and was trying to retain compatibility with
code written for Python 1.5.2 through 2.0.0. I suggest
deprecating it and making doseq=1 behavior the default, if
others (MvL?) approve.
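
The point about strings not being split into 1-character values can be checked with a short sketch, again assuming Python 3's urllib.parse.urlencode rather than the Python 2 function under discussion.

```python
# Assumption: Python 3's urllib.parse.urlencode.
from urllib.parse import urlencode

# A plain string stays a single value under doseq=True ...
single = urlencode({"key": "abc"}, doseq=True)

# ... while a real sequence such as a list is expanded element by element.
multiple = urlencode({"key": ["a", "b", "c"]}, doseq=True)

print(single)    # key=abc
print(multiple)  # key=a&key=b&key=c
```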
msg60835 - Author: Ori Avtalion (salty-horse) Date: 2005-12-30 16:10
> However, I was unable to reproduce your observation that
> doseq=0 results in urlencode not knowing how to handle
> unicode.
I had given urlencode a Hebrew unicode string, and
"".encode() could not convert it to ascii:

s_unicode = u'\u05d1\u05d3\u05d9\u05e7\u05d4'
print urllib.urlencode({"key":s_unicode}, 0)

As I notice now, the line:
>>> urllib.urlencode({"key":s_unicode}, 1)
key=%3F%3F%3F%3F%3F

does not raise an exception but produces an incorrect result.

The correct way to call it is like this:
>>> urllib.urlencode({"key":s_unicode.encode("iso8859_8")}, 1)
key=%E1%E3%E9%F7%E4
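
A sketch of the same workaround, assuming Python 3's urllib.parse.urlencode. The encoding keyword used below is a Python 3 addition and does not exist in the Python 2 function this report is about.

```python
# Assumption: Python 3's urllib.parse.urlencode; its `encoding` keyword
# is not available in Python 2's urllib.urlencode.
from urllib.parse import urlencode

s_unicode = "\u05d1\u05d3\u05d9\u05e7\u05d4"

# Encode to bytes yourself before calling urlencode, as in the call above.
explicit_bytes = urlencode({"key": s_unicode.encode("iso8859_8")}, doseq=True)

# Or let urlencode encode the text with a named codec.
via_parameter = urlencode({"key": s_unicode}, encoding="iso8859_8")

# With no encoding given, Python 3 encodes text as UTF-8.
default_utf8 = urlencode({"key": s_unicode})

print(explicit_bytes)  # key=%E1%E3%E9%F7%E4
print(via_parameter)   # key=%E1%E3%E9%F7%E4
print(default_utf8)    # key=%D7%91%D7%93%D7%99%D7%A7%D7%94
```

Either way, the byte encoding is chosen explicitly, so no character is replaced with "?".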


So, in addition to your suggestion, I think the
documentation should explicitly state that unicode strings
will be treated as us-ascii.

What about my suggestion of an example for doseq's behaviour
in the docstring?
msg109824 - Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2010-07-10 06:35
"put something somewhere" will not get action.
Please suggest specific wording and a specific place to put it and mark it TEXT or PATCH or something so a doc person can find it.

I am assuming that this does not apply to 3.x.
msg110311 - Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2010-07-14 18:53
This was fixed as part of Issue8788. Closing this.
History
Date                 User         Action  Args
2010-07-14 18:53:56  orsenthil    set     status: open -> closed; resolution: duplicate; messages: + msg110311; stage: test needed -> resolved
2010-07-10 06:35:34  terry.reedy  set     versions: + Python 2.7; nosy: + terry.reedy, docs@python; messages: + msg109824; assignee: georg.brandl -> docs@python
2009-04-22 18:48:01  ajaksu2      set     keywords: + easy; stage: test needed
2009-02-12 18:25:59  ajaksu2      set     nosy: + orsenthil; type: enhancement; versions: + Python 2.6, - Python 2.7
2009-02-09 00:36:07  ajaksu2      set     nosy: + georg.brandl; assignee: georg.brandl; components: + Documentation; versions: + Python 2.7, - Python 2.4
2005-11-06 21:58:35  salty-horse  create