classification
Title: WebSocket schemes in urllib.parse
Type: enhancement
Stage: test needed
Components: Library (Lib)
Versions: Python 3.4
process
Status: closed
Resolution: works for me
Dependencies:
Superseder:
Assigned To: orsenthil
Nosy List: eric.araujo, ezio.melotti, martin.panter, nailor, oberstet, orsenthil, r.david.murray
Priority: normal
Keywords: easy, patch

Created on 2011-10-22 11:07 by oberstet, last changed 2014-04-14 22:42 by orsenthil. This issue is now closed.

Files
File name: issue13244.patch
Uploaded: nailor, 2011-10-22 20:32
Messages (25)
msg146167 - (view) Author: Tobias Oberstein (oberstet) Date: 2011-10-22 11:07
The urlparse module currently does not support the new "ws" and "wss" schemes used for the WebSocket protocol.

As a workaround, we currently use the following code (which is a hack of course):

import urlparse
wsschemes = ["ws", "wss"]
urlparse.uses_relative.extend(wsschemes)
urlparse.uses_netloc.extend(wsschemes)
urlparse.uses_params.extend(wsschemes)
urlparse.uses_query.extend(wsschemes)
urlparse.uses_fragment.extend(wsschemes)

===

A WebSocket URL has scheme "ws" or "wss", MUST have a network location and MAY have a resource part with path and query components, but MUST NOT have a fragment component.
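
The three rules in the sentence above can be sketched as a small check on top of urllib.parse (Python 3). This is only an illustration: the function name check_websocket_url is hypothetical and is not part of urllib.parse or of the attached patch, and rejecting any literal "#" is an assumption about how strictly the "MUST NOT have a fragment" rule should be enforced.

```python
from urllib.parse import urlsplit


def check_websocket_url(url):
    """Split a WebSocket URL, raising ValueError if it breaks the rules.

    Hypothetical helper for illustration; not a urllib.parse API.
    """
    parts = urlsplit(url)
    if parts.scheme not in ("ws", "wss"):
        raise ValueError("not a WebSocket scheme: %r" % parts.scheme)
    if not parts.netloc:
        raise ValueError("a WebSocket URL must have a network location")
    # Assumption: any literal "#" counts as a fragment delimiter,
    # so even an empty fragment ("ws://host/path#") is rejected.
    if "#" in url:
        raise ValueError("a WebSocket URL must not have a fragment")
    return parts


parts = check_websocket_url("ws://example.com/chat?room=1")
print(parts.netloc, parts.path, parts.query)
```

Path and query stay optional here, matching the "MAY have a resource part" wording.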
msg146188 - (view) Author: Jyrki Pulliainen (nailor) * Date: 2011-10-22 20:32
I added a patch that adds support for WebSocket URL protocol.

However, a few pointers (and questions):

- The patch is now implemented according to Draft 17[1] of WebSocket protocol

- Draft 17 does not support fragments; it states that a fragment should be treated as part of the URL, so I didn't add ws and wss to uses_fragment. However, Draft 17 also states that number signs should always be encoded. Should there be a special case for ws/wss URLs that contain non-encoded number signs? For example, should we raise some sort of exception?

[1] http://tools.ietf.org/html/draft-ietf-hybi-thewebsocketprotocol-17
msg146189 - (view) Author: Tobias Oberstein (oberstet) Date: 2011-10-22 20:47
fragment identifiers:

the spec says:

"Fragment identifiers are meaningless in the context of WebSocket
URIs, and MUST NOT be used on these URIs.  The character "#" in URIs
MUST be escaped as %23 if used as part of the query component."

[see last line of my initial comment]

I nevertheless added the ws/wss schemes to urlparse.uses_fragment so that I can detect them being used and throw.

Does urllib throw when a URL contains a fragment identifier, but the scheme of the URL is not in urlparse.uses_fragment?

If so, that's fine and of course better than putting the burden of checking on the user.

==

Further, when "#" is to be used in a WS URL, it MUST be encoded, and if so, it's interpreted as part of the query component.

So in summary, I think the best would be:

urllib throws upon a non-encoded "#", and interprets it as part of the query component when encoded.
msg146190 - (view) Author: Tobias Oberstein (oberstet) Date: 2011-10-22 20:54
Well, thinking about it, %23 can also appear in a percent encoded path component.

I don't get the conditional "..if used as part of the query component" in the spec.
msg146192 - (view) Author: Jyrki Pulliainen (nailor) * Date: 2011-10-22 21:03
Actually, if I get it right, it means that following url is valid:

  ws://example.com/something#somewhere/

and the # should be considered part of the path. The spec does not say whether a # in the path component should be encoded, so I think it's safe to assume it can be left unencoded. However, the following URL

  ws://example.com/something?query=foo#bar

Is not considered to be valid, as the # is in the query part and is not escaped. So the valid would be:

  ws://example.com/something?query=foo%23bar

I think the motivation behind this is to reduce possible conflicts with browsers that might take the #-part as a fragment when it should be part of the query parameters. However, the confusion is still possible with # in path part.

My take on this would be to omit fragments and just parse the url as is without fragments. Encoding could be left to user, even in the case # is in query part.
msg146193 - (view) Author: Tobias Oberstein (oberstet) Date: 2011-10-22 21:18
I see how you interpret that sentence in the spec, but I would have read it differently:

invalid:

1. ws://example.com/something#somewhere
2. ws://example.com/something#somewhere/
3. ws://example.com/something#somewhere/foo
4. ws://example.com/something?query=foo#bar

valid:

5. ws://example.com/something%23somewhere
6. ws://example.com/something%23somewhere/
7. ws://example.com/something%23somewhere/foo
8. ws://example.com/something?query=foo%23bar

You would take 2. and 3. as valid, but 1. and 4. as invalid, right?
 
But you are right, the spec does not talk about # in path.

If the above is a valid summary of the question, I'd better take that to the Hybi list to get feedback before rushing into anything with urllib ..
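
For reference, the escaped forms 5. and 8. in the list above can be produced from raw text with the standard percent-encoding helpers in urllib.parse (Python 3). This is a minimal sketch; the variable names are illustrative only, and it says nothing about which of the unescaped forms is valid.

```python
from urllib.parse import quote, urlencode

base = "ws://example.com/"

# quote() leaves "/" alone by default but escapes "#" as %23.
path_url = base + quote("something#somewhere")

# urlencode() escapes "#" inside query values in the same way.
query_url = base + "something?" + urlencode({"query": "foo#bar"})

print(path_url)
print(query_url)
```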
msg146194 - (view) Author: Jyrki Pulliainen (nailor) * Date: 2011-10-22 21:25
I'd take only 4. as invalid, as WebSocket URLs (in my interpretation) do not have fragments, so the # is assumed to be part of the path in the other cases.

But yeah, a confirmation from HyBi would be great. Can you link to the discussion from here, if you ask them (in case it's possible)?
msg146195 - (view) Author: Tobias Oberstein (oberstet) Date: 2011-10-22 21:30
I'll ask (to be sure) and link.

However, after rereading the Hybi 17 section, it says

"""
path = <path-abempty, defined in [RFC3986], Section 3.3>
"""

And http://tools.ietf.org/html/rfc3986 says:

"""
The path is terminated by the first question mark ("?") or number sign ("#") character, or by the end of the URI.
"""

So my reading would be: non-escaped # can never be part of path for a WebSocket URL by reference of RFC3986.
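
The RFC 3986 rule quoted above can be tried out with the regular expression given in Appendix B of that RFC, which splits any URI reference the same way regardless of scheme. The sketch below wraps it in a helper; the name split_generic and the dictionary keys are choices made for this illustration, not something urlparse provides.

```python
import re

# Regular expression from RFC 3986, Appendix B.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def split_generic(uri):
    """Split a URI reference into its five generic components."""
    m = URI_RE.match(uri)  # every group is optional, so this always matches
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


result = split_generic("ws://example.com/something#somewhere/")
print(result["path"])      # /something
print(result["fragment"])  # somewhere/
```

Under this generic split the path ends at the first "#", so the unescaped "#" can only ever introduce a fragment.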
msg146197 - (view) Author: Tobias Oberstein (oberstet) Date: 2011-10-22 22:34
Here are the links to the question on the Hybi list:

http://www.ietf.org/mail-archive/web/hybi/current/msg09257.html

and

http://www.ietf.org/mail-archive/web/hybi/current/msg09258.html
http://www.ietf.org/mail-archive/web/hybi/current/msg09243.html

==

I'll track those and come back when there is a conclusion ..
msg146202 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2011-10-22 23:22
> Actually, if I get it right, it means that following url is valid:
>  ws://example.com/something#somewhere/
I don’t think so.  The URI syntax RFC is generic, so a scheme spec cannot redefine the parsing to mean that in your above example, there is no fragment and the path is /something#somewhere/ .  I believe the spec means that a # in any component must be %-escaped.

> Does urllib throw when an URL contains a fragment identifier, but the scheme of the URL is not
> in urlparse.uses_fragment?
Could you try it in a shell and tell us?  BTW, please don’t use “throw” in a code or doc patch: exceptions are raised; throw is a related but different generator method.
msg146203 - (view) Author: Tobias Oberstein (oberstet) Date: 2011-10-22 23:44
sorry for "throw" .. somewhat bad habit (stemming from wandering between languages).

uses_fragment extended:

[autobahn@autobahnhub ~/Autobahn]$ python
Python 2.7.1 (r271:86832, Dec 13 2010, 15:52:15)
[GCC 4.2.1 20070719  [FreeBSD]] on freebsd8
Type "help", "copyright", "credits" or "license" for more information.
>>> import urlparse
>>> wsschemes = ["ws", "wss"]
>>> urlparse.uses_relative.extend(wsschemes)
>>> urlparse.uses_netloc.extend(wsschemes)
>>> urlparse.uses_params.extend(wsschemes)
>>> urlparse.uses_query.extend(wsschemes)
>>> urlparse.uses_fragment.extend(wsschemes)
>>> urlparse.urlparse("ws://example.com/something#somewhere/")
ParseResult(scheme='ws', netloc='example.com', path='/something', params='', query='', fragment='somewhere/')
>>> urlparse.urlparse("ws://example.com/something#somewhere")
ParseResult(scheme='ws', netloc='example.com', path='/something', params='', query='', fragment='somewhere')
>>>

=> fragment extracted


uses_fragment not extended:

[autobahn@autobahnhub ~/Autobahn]$ python
Python 2.7.1 (r271:86832, Dec 13 2010, 15:52:15)
[GCC 4.2.1 20070719  [FreeBSD]] on freebsd8
Type "help", "copyright", "credits" or "license" for more information.
>>> import urlparse
>>> wsschemes = ["ws", "wss"]
>>> urlparse.uses_relative.extend(wsschemes)
>>> urlparse.uses_netloc.extend(wsschemes)
>>> urlparse.uses_params.extend(wsschemes)
>>> urlparse.uses_query.extend(wsschemes)
>>> urlparse.urlparse("ws://example.com/something#somewhere/")
ParseResult(scheme='ws', netloc='example.com', path='/something#somewhere/', params='', query='', fragment='')
>>> urlparse.urlparse("ws://example.com/something#somewhere")
ParseResult(scheme='ws', netloc='example.com', path='/something#somewhere', params='', query='', fragment='')
>>>

=> no fragment extracted, but interpreted as part of path component
=> no exception raised

The answer on Hybi is still outstanding, but I would interpret Hybi-17 as follows: # must always be escaped, both in path and query components. Fragment components are not allowed. Thus, unescaped # can never appear in a WS URL. Further, it must not be ignored; instead, the WS handshake must be failed.

If this should indeed be the correct reading of the WS spec, then I think urlparse should raise an exception upon unescaped # within URLs from ws/wss schemes.
msg146205 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2011-10-22 23:45
>>> urlparse.urlparse("ws://example.com/something#somewhere")
ParseResult(scheme='ws', netloc='example.com', path='/something#somewhere', params='', query='', fragment='')

This makes me sad.  I thought we had fixed urllib months ago to follow the damn rules that have been in a bunch of RFCs for years.
msg146304 - (view) Author: Tobias Oberstein (oberstet) Date: 2011-10-24 15:33
ok, there was feedback on Hybi list:

http://www.ietf.org/mail-archive/web/hybi/current/msg09270.html

"""
    1. ws://example.com/something#somewhere
    2. ws://example.com/something#somewhere/
    3. ws://example.com/something#somewhere/foo
    4. ws://example.com/something?query=foo#bar

I think all of these are invalid. 
"""
Alexey Melnikov, Co-author of the WS spec.

And Julian Reschke:

http://www.ietf.org/mail-archive/web/hybi/current/msg09277.html

==

Thus, I would uphold my comment:

"# must always be escaped, both in path and query components. Fragment components are not allowed. Thus, unescaped # can never appear in a WS URL. Further, it must not be ignored; instead, the WS handshake must be failed."

And further: urlparse should raise an exception upon unescaped # within URLs from ws/wss schemes.
msg146377 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2011-10-25 16:07
> # must always be escaped, both in path and query components.

Agreed.  This just follows from the Generic URI Syntax RFC, it’s not specific to WebSockets.

> And further: urlparse should raise an exception upon unescaped # within URLs
> from ws/wss schemes.

I’d say that urlparse should raise an exception when a ws/wss URI contains a fragment part.  I’m not sure this will be possible; from a glance at the source and a quick test, urlparse will happily break the Generic URI Syntax RFC and return a path including a # character!
msg146401 - (view) Author: Tobias Oberstein (oberstet) Date: 2011-10-25 21:00
> I’d say that urlparse should raise an exception when a ws/wss URI contains a fragment part.

Yep, better.

> I’m not sure this will be possible; from a glance at the source and a quick test, urlparse will happily break the Generic URI Syntax RFC and return a path including a # character!

That's unfortunate.

In that case I'd probably prefer the lesser evil, namely that urlparse be set up (falsely) such that ws/wss scheme would falsely allow fragments, so I get back the non-empty fragment as a separate component, and check myself.

If urlparse returns the fragment (falsely) within path, then a user could check only by searching for # in the path. Also hacky .. even worse than comparing the fragment against "".

Essentially, this would be exactly "the hack" that I posted in my very first comment:

urlparse.uses_fragment.extend(wsschemes)

===

Alternative: make this bug dependent on fixing urlparse for fragment rules in generic URI RFC and don't do anything until then?
msg146474 - (view) Author: Jyrki Pulliainen (nailor) * Date: 2011-10-27 07:16
> Alternative: make this bug dependent on fixing urlparse for fragment rules in generic URI RFC and don't do anything until then?

I'd go with this, even though it would probably be a lot more work
than this patch. What's Éric's take on this approach?
msg146477 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2011-10-27 07:51
This kind of suggestion has come up before, and the easy fix is to add individual schemes as the patch does. There are a number of limitations if we want to make the parser generic for any scheme. The difficult thing is the parsing behavior and requirements as defined by each scheme.

The generic parsing rule that was added previously addressed what the default "fall-back parsing rule" should be when someone comes up with a new scheme.

In this report, I see that ws and wss have some requirements, which need to be codified in the parsing rules followed by the urlparse module. To start with, going with the patch is a good way.

If you find any other library (I look at libcurl) handling it differently, please point it out here, so that it can be useful.
msg146482 - (view) Author: Tobias Oberstein (oberstet) Date: 2011-10-27 08:36
The patch as it stands will result in wrong behavior:

+        self.assertEqual(urllib.parse.urlparse("ws://example.com/stuff#ff"),
+                         ('ws', 'example.com', '/stuff#ff', '', '', ''))

The path component returned is invalid for ws/wss and is invalid for any scheme following the generic URI RFC, since # must be always escaped in path components.

Is urlparse meant to follow the generic URI RFC?

IMHO, the patch at least should do the equivalent of

urlparse.uses_fragment.extend(wsschemes)

so users of urlparse can do the checking for fragment != "", required for ws/wss on their own.
msg146489 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2011-10-27 14:49
> Is urlparse meant to follow the generic URI RFC?
No, it predates it.

> IMHO, the patch at least should do the equivalent of
> urlparse.uses_fragment.extend(wsschemes)
> so users of urlparse can do the checking for fragment != "", required for ws/wss on their own.
That’s probably the most urllib.parse can do, sadly.

> Alternative: make this bug dependent on fixing urlparse for fragment rules in generic
> URI RFC and don't do anything until then?
I’m not sure we can fix urllib.parse that way, because of backward compatibility concerns.  We might change the default handling (i.e. when parsing an unknown scheme) to comply with the RFC, but I’d much rather have a new, clean module.

> This kind of suggestion has come up before
I recall some discussion on that too.  Maybe we should bring it up again on python-dev?  I think I read a discussion from years ago where Guido learned that the URI syntax was now generic and that urlparse’s design was obsolete.  There was also someone else who had a new module (was it Nick?) implementing the RFC.  IIRC this module was not discussed for inclusion because urllib gained many tests for RFC compliance and was thought Good Enough™.

> There is a number of limitations if want to make the parser generic for any scheme. The
> difficult thing being the parsing behavior and requirements as defined by scheme. 
I don’t understand what you need.  If I get the RFC correctly, the point is that parsing rules are the same for any and all schemes, then it’s up to the application to refuse some component or do any other scheme-specific handling of the components.  But the parsing of a URI into components is the same.
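
The division of labour described in the last paragraph (one generic split for every scheme, then scheme-specific refusal of components in the application) might look roughly like the sketch below on Python 3. The FORBIDDEN_COMPONENTS table and the parse_then_check name are hypothetical and exist only for this illustration; nothing like them is part of urllib.parse.

```python
from urllib.parse import urlsplit

# Hypothetical application-level policy: components each scheme refuses.
FORBIDDEN_COMPONENTS = {
    "ws": ("fragment",),
    "wss": ("fragment",),
}


def parse_then_check(url):
    """Split generically, then apply scheme-specific restrictions."""
    parts = urlsplit(url)  # the same splitting rules for every scheme
    for name in FORBIDDEN_COMPONENTS.get(parts.scheme, ()):
        if getattr(parts, name):
            raise ValueError(
                "%s URLs must not have a %s component" % (parts.scheme, name)
            )
    return parts


print(parse_then_check("http://example.com/doc#section").fragment)
```

Keeping the policy in a table outside the parser means a new scheme needs a new table entry, not a change to the splitting code.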
msg146491 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2011-10-27 15:11
A similar issue is Issue7904, with changeset 7da7f9bfdaac, wherein the accepted way to parse x-newscheme://foo.com/stuff was added. Does the new ws:// scheme not fall under that?
msg146559 - (view) Author: Tobias Oberstein (oberstet) Date: 2011-10-28 13:51
Is that patch supposed to be in Python 2.7.2?

If so, it doesn't work for "ws":

"ws://example.com/somewhere?foo=bar#dgdg"

F:\scm\Autobahn\testsuite\websockets\servers>python
Python 2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from urlparse import urlparse
>>> urlparse("ws://example.com/somewhere?foo=bar#dgdg")
ParseResult(scheme='ws', netloc='example.com', path='/somewhere?foo=bar#dgdg', params='', query='', fragment='')
>>> urlparse("ws://example.com/somewhere?foo=bar#dgdg", allow_fragments = True)
ParseResult(scheme='ws', netloc='example.com', path='/somewhere?foo=bar#dgdg', params='', query='', fragment='')
>>> urlparse("ws://example.com/somewhere?foo=bar#dgdg", allow_fragments = False)
ParseResult(scheme='ws', netloc='example.com', path='/somewhere?foo=bar#dgdg', params='', query='', fragment='')
>>>

urlparse will neither parse the query nor the (invalid) fragment component for the "ws" scheme

I would have expected

ParseResult(scheme='ws', netloc='example.com', path='/somewhere', params='', query='foo=bar', fragment='dgdg')
msg185121 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2013-03-24 07:51
This is not committed to any branch yet.
msg204160 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2013-11-24 02:08
Suspect this is now fixed in a generic way by Issue 9374. The fix seems to be in 2.7, 3.2 and 3.3.

$ python3.3
Python 3.3.2 (default, May 16 2013, 23:40:52) 
[GCC 4.6.3] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from urllib.parse import urlparse
>>> urlparse("ws://example.com/somewhere?foo=bar#dgdg")
ParseResult(scheme='ws', netloc='example.com', path='/somewhere', params='', query='foo=bar', fragment='dgdg')
msg204191 - (view) Author: Tobias Oberstein (oberstet) Date: 2013-11-24 09:07
FWIW, WebSocket URL parsing is still wrong on Python 2.7.6 - in fact, it's broken in multiple ways:

>>> from urlparse import urlparse
>>> urlparse("ws://example.com/somewhere?foo=bar#dgdg")
ParseResult(scheme='ws', netloc='example.com', path='/somewhere', params='', query='foo=bar', fragment='dgdg')
>>> urlparse("ws://example.com/somewhere?foo=bar%23dgdg")
ParseResult(scheme='ws', netloc='example.com', path='/somewhere', params='', query='foo=bar%23dgdg', fragment='')
>>> urlparse("ws://example.com/somewhere?foo#=bar")
ParseResult(scheme='ws', netloc='example.com', path='/somewhere', params='', query='foo', fragment='=bar')
>>> urlparse("ws://example.com/somewhere?foo%23=bar")
ParseResult(scheme='ws', netloc='example.com', path='/somewhere', params='', query='foo%23=bar', fragment='')
>>>
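
For comparison, the percent-encoded query from the session above can be decoded one step further with parse_qs. This is a minimal Python 3 sketch (urllib.parse rather than the Python 2 urlparse module used above) showing that the escaped %23 ends up as a literal "#" inside the parameter value, not as a fragment.

```python
from urllib.parse import parse_qs, urlsplit

parts = urlsplit("ws://example.com/somewhere?foo=bar%23dgdg")

# The query is still percent-encoded at this point; parse_qs decodes it.
params = parse_qs(parts.query)

print(repr(parts.fragment))
print(params)
```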
msg216243 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2014-04-14 22:42
Reading both the RFCs and the requirements, I see that this is already taken care of.

Note: we are actually handling unencoded fragments introduced by #, and the RFCs talk about fragments with the # character only. If you want the parsing of a URL-encoded %23 to match that of an unencoded #, that's a different requirement and not in scope here.

Here is 3.5 output

$ ./python.exe
Python 3.5.0a0 (default:528234542ff0, Apr 14 2014, 18:25:27)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib.parse
>>> p = urllib.parse.urlparse('ws://example.com/something?query=foo#bar')
>>> p
ParseResult(scheme='ws', netloc='example.com', path='/something', params='', query='query=foo', fragment='bar')
>>> p = urllib.parse.urlparse('ws://example.com/something#bar')
>>> p
ParseResult(scheme='ws', netloc='example.com', path='/something', params='', query='', fragment='bar')


Here is 2.7.6 output


[localhost 2.7]$ ./python.exe
Python 2.7.6+ (2.7:7dab4feec126+, Jan 11 2014, 15:25:20)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from urlparse import urlparse
>>> urlparse('ws://example.com/something?query=foo#bar')
ParseResult(scheme='ws', netloc='example.com', path='/something', params='', query='query=foo', fragment='bar')
>>> urlparse('ws://example.com/something#bar')
ParseResult(scheme='ws', netloc='example.com', path='/something', params='', query='', fragment='bar')
>>>


I find it satisfactory and I think this bug should be closed. Thank you!
History
Date User Action Args
2014-04-14 22:42:51 orsenthil set status: open -> closed
assignee: orsenthil
resolution: works for me
messages: + msg216243
2013-11-24 09:07:21 oberstet set messages: + msg204191
2013-11-24 02:08:27 martin.panter set nosy: + martin.panter
messages: + msg204160
2013-03-24 07:51:54 eric.araujo set messages: + msg185121
versions: + Python 3.4, - Python 3.3
2011-10-28 13:51:20 oberstet set messages: + msg146559
2011-10-27 15:15:03 eric.araujo set nosy: + r.david.murray
2011-10-27 15:11:32 orsenthil set messages: + msg146491
2011-10-27 14:49:39 eric.araujo set messages: + msg146489
2011-10-27 08:36:20 oberstet set messages: + msg146482
2011-10-27 07:51:28 orsenthil set messages: + msg146477
2011-10-27 07:16:33 nailor set messages: + msg146474
2011-10-25 21:00:56 oberstet set messages: + msg146401
2011-10-25 16:07:25 eric.araujo set messages: + msg146377
title: WebSocket schemes in urlparse -> WebSocket schemes in urllib.parse
2011-10-24 15:33:03 oberstet set messages: + msg146304
2011-10-22 23:45:49 eric.araujo set messages: + msg146205
2011-10-22 23:44:21 oberstet set messages: + msg146203
2011-10-22 23:22:23 eric.araujo set nosy: + eric.araujo
messages: + msg146202
2011-10-22 22:34:27 oberstet set messages: + msg146197
2011-10-22 21:30:25 oberstet set messages: + msg146195
2011-10-22 21:25:31 nailor set messages: + msg146194
2011-10-22 21:18:32 oberstet set messages: + msg146193
2011-10-22 21:03:53 nailor set messages: + msg146192
2011-10-22 20:54:35 oberstet set messages: + msg146190
2011-10-22 20:47:33 oberstet set messages: + msg146189
2011-10-22 20:32:33 nailor set files: + issue13244.patch
nosy: + nailor
messages: + msg146188
keywords: + patch
2011-10-22 11:15:46 ezio.melotti set keywords: + easy
nosy: + orsenthil, ezio.melotti
stage: test needed
versions: + Python 3.3, - Python 2.7
2011-10-22 11:07:42 oberstet create