This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: urllib.parse: Allow bytes in some APIs that use string literals internally
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.2
process
Status: closed Resolution: accepted
Dependencies: Superseder:
Assigned To: ncoghlan Nosy List: eric.araujo, eric.smith, hfuru, ncoghlan, orsenthil, pitrou, r.david.murray, sjt, vstinner
Priority: high Keywords: patch

Created on 2010-09-16 13:25 by ncoghlan, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
issue9873_initial_attempt.diff ncoghlan, 2010-09-28 14:00 Very rough first cut at polymorphic urllib.parse API review
issue9873_coercion_to_str.diff ncoghlan, 2010-10-24 13:42 Second attempt, using a coercion based approach review
issue9873_just_needs_docs.diff ncoghlan, 2010-11-30 07:32 Code and tests complete
Messages (29)
msg116543 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2010-09-16 13:25
As per python-dev discussion in June, many Py3k APIs currently gratuitously prevent the use of bytes and bytearray objects as arguments due to their use of string literals internally.

Examples:
urllib.parse.urlparse
urllib.parse.urlunparse
urllib.parse.urljoin
urllib.parse.urlsplit
(and plenty of other examples in urllib.parse)

While a strict reading of the relevant RFCs suggests that strings are the more appropriate type for these APIs, as a practical matter, protocol developers want to be able to operate on ASCII supersets as raw bytes.

The proposal is to modify the implementation of these functions such that string literals are used only with string arguments, and bytes literals otherwise. If a function accepts multiple values and a mixture of strings and bytes is passed in then the operation will still fail (as it should).
msg116663 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2010-09-17 13:43
From the python-dev thread (http://mail.python.org/pipermail/python-dev/2010-September/103780.html):
==============
So the domain of any polymorphic text manipulation functions we define would be:
 - Unicode strings
 - byte sequences where the encoding is either:
   - a single byte ASCII superset (e.g. iso-8859-*, cp1252, koi8*, mac*)
   - an ASCII compatible multibyte encoding (e.g. UTF-8, EUC-JP)

Passing in byte sequences that are encoded using an ASCII incompatible
multibyte encoding (e.g. CP932, UTF-7, UTF-16, UTF-32, shift-JIS,
big5, iso-2022-*, EUC-CN/KR/TW) or a single byte encoding that is not
an ASCII superset (e.g. EBCDIC) will have undefined results.
==================
msg116927 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2010-09-20 11:42
The design approach (at least for urllib.parse) is to add separate *b APIs that operate on bytes rather than implicitly allowing bytes in the ordinary versions of the function.

Allowing programmers to manipulate correctly encoded (and hence ASCII compatible) bytes to avoid decode/encode overhead when manipulating URLs is fine (and the whole point of adding the new functions). Allowing them to *not know* whether they have encoded data or text suitable for display to the user isn't necessary (and is easy to add later if we decide we want it, while taking it away is far more difficult).

More detail at http://mail.python.org/pipermail/python-dev/2010-September/103828.html
msg117520 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2010-09-28 14:00
Attached patch is a very rough first cut at this. I've gone with the basic approach of simply assigning the literals to local variables in each function that uses them. My rationale for that is:
1. Every function has to have some kind of boilerplate to switch based on the type of the argument
2. Some functions need to switch other values (such as the list of scheme_chars or the unparse function), so a separate object with literal attributes won't be enough
3. Given 1 and 2, the overhead of a separate namespace for the literal references isn't justified

I've also gone with a philosophy that only str objects are treated as strings and everything else is implicitly assumed to be bytes. This is in keeping with the general interpreter behaviour where we *really* don't support duck-typing when it comes to strings.

The test updates aren't comprehensive yet, but they were enough to uncover quite a few things I had missed.

quoting is still a bit ugly, so I may still add a byte->bytes/str->str variant of those APIs.
msg117556 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2010-09-28 21:56
A possible duck-typing approach here would be to replace the "instance(x, str)" tests with "hasattr(x, 'encode')" checks instead.

Thoughts?
msg117588 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-09-29 09:57
> A possible duck-typing approach here would be to replace the
> "instance(x, str)" tests with "hasattr(x, 'encode')" checks instead.

Looks more ugly than useful to me. People wanting to emulate str had better subclass it anyway...
msg117653 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2010-09-29 20:49
Agreed - I think there's a non-zero chance of triggering the str-path by mistake if we try to duck-type it (I just added a similar comment to #9969 regarding a possible convenience API for tokenisation)
msg117654 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2010-09-29 20:56
Added to Reitveld: http://codereview.appspot.com/2318041/
msg117661 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2010-09-29 21:55
One of Antoine's review comments made me realise I hadn't explicitly noted the "why not decode with latin-1?" rationale for the bytes handling. (It did come up in one or more of the myriad python-dev discussions on the topic, I just hadn't noted it here)

The primary reason for supporting ASCII compatible bytes directly is specifically to avoid the encoding and decoding overhead associated with the translation to unicode text. Making that overhead implicit rather than avoiding it altogether would be to quite miss the point of the API change.
msg117666 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-09-29 22:09
> The primary reason for supporting ASCII compatible bytes directly is
> specifically to avoid the encoding and decoding overhead associated
> with the translation to unicode text.

I think it's quite misguided. latin1 encoding and decoding is blindingly
fast (on the order of 1GB/s. here). Unless you have multi-megabyte URLs,
you won't notice any overhead.
msg117668 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2010-09-29 22:23
> I think it's quite misguided. latin1 encoding and decoding is blindingly
> fast (on the order of 1GB/s. here). Unless you have multi-megabyte URLs,
> you won't notice any overhead.

Ah, I didn't know that (although it makes sense now I think about it).
I'll start exploring ideas along those lines then. Having to name all
the literals as I do in the patch is really quite ugly.

A general sketch of such a strategy would be to stick the following
near the start of affected functions:

encode_result = not isinstance(url, str) # or whatever the main
parameter is called
if encode_result:
    url = url.decode('latin-1')
    # decode any other arguments that need it
    # Select the bytes versions of any relevant globals
else:
    # Select the str versions of any relevant globals

Then, at the end, do an encoding step. However, the encoding step may
get a little messy when it comes to the structured data types. For
that, I'll probably take a leaf out of the email6 book and create a
parallel bytes API, with appropriate encode/decode methods to
transform one into the other.
msg117670 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-09-29 22:48
I don't understand why you would like to implicitly convert bytes to str (which is one of the worse design choice of Python2). If you don't want to care about encodings, use bytes is fine. Decode bytes using an arbitrary encoding is the fastest way to mojibake.

So You have two choices: create new functions with bytes as input and output (like os.getcwd() and os.getcwdb()), or the output type will depend on the input type (solution choosen by os.path). Example of ther later:

>>> os.path.expanduser('~')
'/home/haypo'
>>> os.path.expanduser(b'~')
b'/home/haypo'
msg117718 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2010-09-30 10:54
From a function *user* perspective, the latter API (bytes->bytes, str->str) is exactly what I'm doing.

Antoine's point is that there are two ways to achieve that:

Option 1 (what my patch currently does):
- provide bytes and str variants of all constants
- choose which set to use at the start of each function
- be careful never to index, only slice (even for single characters)
- a few other traps that I don't remember off the top of my head

Option 2 (the alternative Antoine suggested and I'm considering):
- "decode" the ASCII compatible bytes to str objects by treating them as nominally latin-1
- use the same str-based constants as are used to handle actual str inputs
- be able to index to your heart's content inside the algorithm
- *ensure* that any bytes-as-pseudo-str objects are "encoded" back to actual bytes before they are returned

From outside the function, a user shouldn't be able to tell which approach we're using internally.

The nice thing about option 2 is to make sure you're doing it correctly, you only need to check three kinds of location:
- the initial parameter handling in each function
- any return statements, raise statements that allow a value to leave the function
- any yield expressions (both input and output)

The effects of option 1 are scattered all over your algorithms, so it's hard to be sure you've caught everything.

The downside of option 2 is if you make a mistake and let your bytes-as-pseudo-str objects escape from the confines of your function, you're going to see some very strange behaviour.
msg117719 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-09-30 10:59
> Option 2 (the alternative Antoine suggested and I'm considering):
> - "decode" ... to str ...
> - ... objects are "encoded" back to actual bytes before 
>   they are returned

In this case, you have to be very careful to not mix str and bytes decoded to str using a pseudo-encoding. Dummy example: urljoin('unicode', b'bytes') should raise an error.

I don't care of the internals if you write tests to ensure that it is not possible to mix str and bytes with the public API.
msg117724 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2010-09-30 11:40
Yeah, the general implementation concept I'm thinking of going with for option 2 will use a few helper functions:

url, coerced_to_str = _coerce_to_str(url)
if coerced_to_str:
    param = _force_to_str(param) # as appropriate
...
return _undo_coercion(result, coerced_to_str)

The first helper function standardises the typecheck, the second one complains if it is given something that is already a string.

The last one just standardises the check to see if the coercion needs to be undone, and actually undoing the coercion.
msg117904 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2010-10-03 02:36
As per RDM's email to python-dev, a better way to create the pseudo_str values would be by decoding as ascii with a surrogate escape error handler rather than by decoding as latin-1.
msg117909 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-10-03 12:44
> As per RDM's email to python-dev, a better way to create the
> pseudo_str values would be by decoding as ascii with a surrogate
> escape error handler rather than by decoding as latin-1.

If you were worried about performance, then surrogateescape is certainly
much slower than latin1.
msg117945 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2010-10-04 12:01
Yeah, I'll have to time it to see how much difference latin-1 vs surrogateescape makes when the MSB is set in any bytes.
msg117993 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-10-05 07:32
> If you were worried about performance, then surrogateescape is certainly
> much slower than latin1.

If you were really worried about performance, the bytes type is maybe faster 
than: decode bytes to str using latin-1, process str strings, encode str to 
bytes using latin-1.
msg118000 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2010-10-05 10:32
On Tue, Oct 5, 2010 at 5:32 PM, STINNER Victor <report@bugs.python.org> wrote:
>
> STINNER Victor <victor.stinner@haypocalc.com> added the comment:
>
>> If you were worried about performance, then surrogateescape is certainly
>> much slower than latin1.
>
> If you were really worried about performance, the bytes type is maybe faster
> than: decode bytes to str using latin-1, process str strings, encode str to
> bytes using latin-1.

I'm fairly resigned to the fact that I'm going to need some kind of
micro-benchmark to compare the different approaches. For example, the
bytes based approach has a lot of extra assignments to local variables
that the str based approach doesn't need.

The first step is to actually have a str-based patch to compare to the
existing bytes based patch. If the code ends up significantly clearer
(as I expect it will), we can probably sacrifice a certain amount of
speed for that benefit.
msg118001 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2010-10-05 11:01
I wonder if Option2 (ascii+surrogateescape vs latin1) is only about
performance. How about escapes that might occur if the Option2 is
adopted. That might take higher priority than performance.
Do we know 'how tight' that approach is?
msg118180 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2010-10-08 11:10
I've been pondering the idea of adopting a more conservative approach here, since there are actually two issues:

1. Properly quoted URLs are transferred as pure 7-bit ASCII (due to percent-encoding of everything else). However, most of the manipulation functions in urllib.parse can't handle bytes at all, even data that is 7-bit clean.

2. In the real world, just like email, URLs will often contain unescaped (or incorrectly escaped) characters. So assuming the input is actually pure ASCII isn't necessarily a valid assumption.

I'm wondering, since encoding (aside from quoting) isn't urllib.parse's problem, maybe what I should be looking at doing is just handling bytes input via an implicit ascii conversion in strict mode (and then conversion back when the processing is complete).

Then bytes manipulation of properly quoted URLs will "just work", while improperly quoted URLs will fail noisily. This isn't like email or http where the protocol contains encoding information that the library should be trying to interpret - we're just being given raw bytes without any context information.

If any application wants to be more permissive than that, it can do its own conversion to a string and then use the text-based processing. I'll add "encode" methods to the result objects to make it easy to convert their contents from str to bytes and vice-versa.

I'll factor out the implicit encoding/decoding such that if we decide to change the model later (ASCII-strict, ASCII-escape, latin-1) it shouldn't be too difficult.
msg119512 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2010-10-24 13:42
Attached a second version of the patch. Notable features:
- uses a coercion-to-str-and-back strategy (using ascii-strict)
- a significantly more comprehensive pass through the urlparse test suite. I'm happy that the test suite mods are complete with this pass.

The actual coercion-to-str technique I used standardises the type consistency check for the attributes and also returns a callable that handles the necessary coercion of any results. The parsed/split result objects gain encode/decode methods to allow that all to be handled polymorphically (although I think the decode methods may actually be redundant in practice).

There's a deliberate loophole in the type consistency checking to allow the empty string to be type-neutral. Without that, the scheme='' default argument to urlsplit and urlparse becomes painful (as do the urljoin shortcuts for base or url being the empty string).

Implementing this was night and day when compared to the initial attempt that tried to actually manipulate bytes input as bytes. With that patch, it took me multiple runs of the test suite to get everything working. This time, the only things I had to fix were typos and bugs in the additional test suite enhancements. The actual code logic for the type coercions worked first time.
msg119513 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2010-10-24 13:47
Unless I hear some reasonable objections within the next week or so, I'm going to document and commit the ascii-strict coercion approach for beta 1.

The difference in code clarity is such that I'm not even going to try to benchmark the two approaches against each other.
msg120673 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2010-11-07 12:17
Just a note for myself when I next update the patch: the 2-tuple returned by defrag needs to be turned into a real result type of its own, and the decode/encode methods on result objects should be tested explicitly.
msg120749 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2010-11-08 15:07
Related issue in msg120647.
msg120857 - (view) Author: Hallvard B Furuseth (hfuru) Date: 2010-11-09 10:37
urlunparse(url or params = bytes object) produces a result
with the repr of the bytes object if params is set.

urllib.parse.urlunparse(['http', 'host', '/dir', b'params', '', ''])
--> "http://host/dir;b'params'"

That's confusing since urllib/parse.py goes to a lot of trouble to
support both bytes and str.  Simplest fix is to only accept str:

Index: Lib/urllib/parse.py
@@ -219,5 +219,5 @@ def urlunparse(components):
     scheme, netloc, url, params, query, fragment = components
     if params:
-        url = "%s;%s" % (url, params)
+        url = ';'.join((url, params))
     return urlunsplit((scheme, netloc, url, query, fragment))
 
Some people at comp.lang.python tell me code shouldn't anyway do str()
just in case it is needed like urllib does, not that I can make much
sense of that discussion.  (Subject: harmful str(bytes)).

BTW, the str vs bytes code doesn't have to be quite as painful as in
urllib.parse.  Here is a patch which just rearranges and factors out
some code.
   http://bugs.python.org/file19525/parse.diff
msg122890 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2010-11-30 07:32
New patch which addresses my last two comments (i.e. some basic explicit tests of the encode/decode methods on result objects, and urldefrag returns a named tuple like urlsplit and urlparse already did).

A natural consequence of this patch is that mixed arguments (as in the message above) will be rejected with TypeError.

Once I figure out what the docs changes are going to look like, I'll wrap this all up and commit it.
msg122908 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2010-11-30 15:53
Committed in r86889

The docs changes should soon be live at:
http://docs.python.org/dev/library/urllib.parse.html

If anyone would like to suggest changes to the wording of the docs for post beta1, or finds additional corner cases that the new bytes handling can't cope with, feel free to create a new issue.
History
Date User Action Args
2022-04-11 14:57:06adminsetgithub: 54082
2010-11-30 15:53:40ncoghlansetstatus: open -> closed
resolution: accepted
messages: + msg122908

stage: needs patch -> resolved
2010-11-30 07:32:09ncoghlansetfiles: + issue9873_just_needs_docs.diff

messages: + msg122890
2010-11-09 10:37:27hfurusetnosy: + hfuru
messages: + msg120857
2010-11-08 15:07:13eric.araujosetmessages: + msg120749
2010-11-07 12:17:38ncoghlansetmessages: + msg120673
2010-11-06 20:30:03r.david.murraylinkissue10343 superseder
2010-10-24 13:47:59ncoghlansetmessages: + msg119513
2010-10-24 13:42:51ncoghlansetfiles: + issue9873_coercion_to_str.diff

messages: + msg119512
2010-10-08 18:22:04sjtsetnosy: + sjt
2010-10-08 11:10:41ncoghlansetmessages: + msg118180
2010-10-05 11:01:10orsenthilsetmessages: + msg118001
2010-10-05 10:32:18ncoghlansetmessages: + msg118000
2010-10-05 07:32:03vstinnersetmessages: + msg117993
2010-10-04 12:01:49ncoghlansetmessages: + msg117945
2010-10-03 12:44:49pitrousetmessages: + msg117909
2010-10-03 02:36:09ncoghlansetmessages: + msg117904
2010-10-01 23:04:26vstinnersettitle: Allow bytes in some APIs that use string literals internally -> urllib.parse: Allow bytes in some APIs that use string literals internally
2010-09-30 11:57:36orsenthilsetnosy: + orsenthil
2010-09-30 11:40:33ncoghlansetmessages: + msg117724
2010-09-30 10:59:30vstinnersetmessages: + msg117719
2010-09-30 10:54:31ncoghlansetmessages: + msg117718
2010-09-29 22:48:53vstinnersetnosy: + vstinner
messages: + msg117670
2010-09-29 22:23:39ncoghlansetmessages: + msg117668
2010-09-29 22:09:30pitrousetmessages: + msg117666
2010-09-29 21:55:57ncoghlansetmessages: + msg117661
2010-09-29 20:56:14ncoghlansetmessages: + msg117654
2010-09-29 20:49:11ncoghlansetmessages: + msg117653
2010-09-29 09:57:39pitrousetnosy: + pitrou
messages: + msg117588
2010-09-28 21:56:31ncoghlansetmessages: + msg117556
2010-09-28 14:00:57ncoghlansetfiles: + issue9873_initial_attempt.diff
keywords: + patch
messages: + msg117520
2010-09-20 11:42:32ncoghlansetmessages: + msg116927
2010-09-17 13:43:12ncoghlansetmessages: + msg116663
2010-09-16 20:10:54eric.araujosetnosy: + eric.araujo
2010-09-16 14:08:18eric.smithsetnosy: + eric.smith
2010-09-16 14:01:58r.david.murraysetnosy: + r.david.murray
2010-09-16 13:25:10ncoghlancreate