This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: urllib.FancyURLopener does not treat URL fragments correctly
Type: behavior Stage:
Components: Library (Lib) Versions: Python 2.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: ezio.melotti, karlcow, mher, orsenthil, takahashi.shuhei
Priority: normal Keywords: patch

Created on 2013-06-02 10:22 by takahashi.shuhei, last changed 2022-04-11 14:57 by admin.

Files
File name Uploaded Description Edit
issue18119-code-only.patch karlcow, 2014-10-07 07:45 review
Messages (7)
msg190480 - (view) Author: Shuhei Takahashi (takahashi.shuhei) Date: 2013-06-02 10:22
When urllib.FancyURLopener encounters 302 redirection to a URL with fragments, it sends wrong URL to servers.

For example, if we run:
  urllib.urlopen('http://example.com/foo')
and the server responds like following.
  HTTP/1.1 302 Found
  Location: /bar#test
Then urllib tries next to fetch http://example.com/bar#test, instead of http://example.com/bar.

Correctly, urllib should strip fragment part of URL before issuing requests.
msg228417 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2014-10-04 00:26
Slipped under the radar?
msg228660 - (view) Author: karl (karlcow) * Date: 2014-10-06 11:29
This is the correct behavior
  GET http://example.com/foo
with a response containing 
  302 and Location: /bar#test
must trigger
  http://example.com/bar#test



Location is defined in 
http://tools.ietf.org/html/rfc7231#section-7.1.2

7.1.2. Location


   The "Location" header field is used in some responses to refer to a
   specific resource in relation to the response.  The type of
   relationship is defined by the combination of request method and
   status code semantics.

     Location = URI-reference

   The field value consists of a single URI-reference.  When it has the
   form of a relative reference ([RFC3986], Section 4.2), the final
   value is computed by resolving it against the effective request URI
   ([RFC3986], Section 5).

A bit after in the spec.


   If the Location value provided in a 3xx (Redirection) response does
   not have a fragment component, a user agent MUST process the
   redirection as if the value inherits the fragment component of the
   URI reference used to generate the request target (i.e., the
   redirection inherits the original reference's fragment, if any).

   For example, a GET request generated for the URI reference
   "http://www.example.org/~tim" might result in a 303 (See Other)
   response containing the header field:

     Location: /People.html#tim

   which suggests that the user agent redirect to
   "http://www.example.org/People.html#tim"

   Likewise, a GET request generated for the URI reference
   "http://www.example.org/index.html#larry" might result in a 301
   (Moved Permanently) response containing the header field:

     Location: http://www.example.net/index.html

   which suggests that the user agent redirect to
   "http://www.example.net/index.html#larry", preserving the original
   fragment identifier.
msg228720 - (view) Author: Shuhei Takahashi (takahashi.shuhei) Date: 2014-10-06 16:29
Hi karl,

Of course it is correct that the user agent is redirected to http://example.com/bar#test when it got such response. However, it never means UA can send an HTTP request containing fragment part.

In RFC7230 section 3.1.1, HTTP request line is defined as:
    request-line   = method SP request-target SP HTTP-version CRLF
where request-target is defined in section 5.3. In usual case, it is origin-form:
    origin-form    = absolute-path [ "?" query ]
which never includes fragment part.

This means current urllib behavior violates HTTP/1.1 spec.
msg228744 - (view) Author: karl (karlcow) * Date: 2014-10-06 23:03
Takahashi-san,

Ah sorry misunderstood which part your were talking about. I assume wrongly you were talking about navigation. 

Yes for the request which is sent to the server it should be
http://tools.ietf.org/html/rfc7230#section-5.3.1
So refactoring your example.

1st request:

    GET /foo HTTP/1.1
    Accept: text/html
    Host: example.com

server response
    HTTP/1.1 302 Found
    Location: /bar#test

second request must be
    GET /bar HTTP/1.1
    Accept: text/html
    Host: example.com

As for the navigation context is indeed part of the piece of code taking in charge the document after being parsed and not the one doing the HTTP request. (putting it here just that people understand)


(to be tested)
For server side receiving invalid Request-line 
http://tools.ietf.org/html/rfc7230#section-3.1.1

   Recipients of an invalid request-line SHOULD respond with either a
   400 (Bad Request) error or a 301 (Moved Permanently) redirect with
   the request-target properly encoded.  A recipient SHOULD NOT attempt
   to autocorrect and then process the request without a redirect, since
   the invalid request-line might be deliberately crafted to bypass
   security filters along the request chain.
msg228748 - (view) Author: karl (karlcow) * Date: 2014-10-06 23:30
In class urlopen_HttpTests
https://hg.python.org/cpython/file/4f314dedb84f/Lib/test/test_urllib.py#l191

there is a test for invalid redirects

def test_invalid_redirect(self):
https://hg.python.org/cpython/file/4f314dedb84f/Lib/test/test_urllib.py#l247

And one for fragments
def test_url_fragment(self):
https://hg.python.org/cpython/file/4f314dedb84f/Lib/test/test_urllib.py#l205

which refers to http://bugs.python.org/issue11703

code in 
https://hg.python.org/cpython/file/d5688a94a56c/Lib/urllib.py
msg228755 - (view) Author: karl (karlcow) * Date: 2014-10-07 07:45
OK I fixed the code. The issue is here
https://hg.python.org/cpython/file/1e1c6e306eb4/Lib/urllib/request.py#l656

        newurl = urlunparse(urlparts)

Basically it reinjects the fragment in the new url. The fix is easy.

        if urlparts.fragment:
            urlparts = list(urlparts)
            urlparts[5] = ""
        newurl = urlunparse(urlparts)

I was trying to make a test for it, but failed. Could someone help me for the test so I can complete the patch?

Added the code patch only.
History
Date User Action Args
2022-04-11 14:57:46adminsetgithub: 62319
2019-04-26 17:49:04BreamoreBoysetnosy: - BreamoreBoy
2014-10-07 07:45:19karlcowsetfiles: + issue18119-code-only.patch
keywords: + patch
messages: + msg228755
2014-10-06 23:30:55karlcowsetmessages: + msg228748
2014-10-06 23:03:04karlcowsetmessages: + msg228744
2014-10-06 16:29:00takahashi.shuheisetmessages: + msg228720
2014-10-06 11:29:02karlcowsetnosy: + karlcow
messages: + msg228660
2014-10-04 00:26:37BreamoreBoysetnosy: + BreamoreBoy
messages: + msg228417
2013-06-08 17:42:30ezio.melottisetnosy: + orsenthil, ezio.melotti
2013-06-04 14:58:45mhersetnosy: + mher
2013-06-02 10:22:10takahashi.shuheicreate