classification
Title: urlparse fails if the path is numeric
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.8, Python 3.7, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Björn.Lindqvist, Tim.Graham, martin.panter, miss-islington, orsenthil, r.david.murray, vstinner
Priority: normal Keywords: patch

Created on 2016-07-30 19:57 by Björn.Lindqvist, last changed 2019-10-24 10:31 by vstinner. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 661 merged Tim.Graham, 2017-03-13 17:39
PR 16837 merged miss-islington, 2019-10-18 13:07
PR 16839 merged orsenthil, 2019-10-18 13:51
Messages (12)
msg271702 - (view) Author: Björn Lindqvist (Björn.Lindqvist) Date: 2016-07-30 19:57
This affects both Python 2 and 3. This is as expected:

>>> urlparse('abc:123.html')
ParseResult(scheme='abc', netloc='', path='123.html', params='', query='', fragment='')
>>> urlparse('123.html:abc')
ParseResult(scheme='123.html', netloc='', path='abc', params='', query='', fragment='')
>>> urlparse('abc:123/')
ParseResult(scheme='abc', netloc='', path='123/', params='', query='', fragment='')

This is NOT:

>>> urlparse('abc:123')
ParseResult(scheme='', netloc='', path='abc:123', params='', query='', fragment='')

Expected is path='123' and scheme='abc'. At least according to my reading of the RFC (https://tools.ietf.org/html/rfc1808.html), that is what should happen.
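For anyone wanting to re-check the report, here is a minimal sketch that replays the inputs above. It assumes the Python 3 import path (urllib.parse). The final call prints the result rather than asserting it, because the outcome depends on whether the interpreter includes the fix for this issue.

```python
from urllib.parse import urlparse

# Inputs from the report that already parsed as expected.
for url, scheme, path in [
    ("abc:123.html", "abc", "123.html"),
    ("abc:123/", "abc", "123/"),
]:
    result = urlparse(url)
    assert (result.scheme, result.path) == (scheme, path)

# The reported case: a purely numeric path after the scheme.
# With the fix applied, this should show scheme='abc' and path='123'.
numeric = urlparse("abc:123")
print(numeric)
```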
msg271703 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2016-07-30 21:12
See issue 14072.  It may be time to look at this again, but we may still be constrained by backward compatibility.
msg271719 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-07-31 02:37
The main backward compatibility consideration would be Issue 754016, but I don’t agree with the changes made, and would support reverting them. The original bug reporter wanted urlparse("1.2.3.4:80", "http") to be treated as the URL http://1.2.3.4:80, but the IP address was being parsed as a scheme, so the default “http” scheme was ignored.

The original fix (r83701) affected any URL that had a digit 0–9 immediately after the “scheme:” prefix. In such URLs, the scheme component was no longer parsed. A test case for “path:80” was added, and a demonstration of not parsing any scheme from www.cwi.nl:80/%7Eguido/Python.html was added in the documentation.

Later, the logic was altered to test if the URL looked like an integer (revision 495d12196487, Issue 11467). This restored proper parsing of clsid:85bbd92o-42a0-1o69-a2e4-08002b30309d and mailto:1337@example.org, although another URL given, javascript:123, remains misparsed. The documentation was subsequently adjusted in Issue 16932 to just demonstrate www.cwi.nl/%7Eguido/Python.html being parsed as a path.

The logic was watered down to its current form by revision 9f6b7576c08c, Issue 14072. Now it tests for a non-digit anywhere after the scheme, so that tel:+31641044153 is again parsed properly. But it was pointed out that tel:1234 remains misparsed.

What’s the next step in the watering-down process? All the attempts so far break valid URLs in favour of special-casing inputs that are not valid URLs.
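The "non-digit anywhere after the scheme" rule described above can be sketched roughly as follows. This is a simplified illustration, not the library source: old_heuristic_accepts_scheme and SCHEME_CHARS are hypothetical names, and the set of allowed scheme characters is an assumption.

```python
import string

# Assumed set of characters permitted in a scheme (letters, digits, "+", "-", ".").
SCHEME_CHARS = set(string.ascii_letters + string.digits + "+-.")


def old_heuristic_accepts_scheme(url):
    """Return True if the port-number heuristic would keep the scheme.

    Hypothetical helper that mimics the behaviour described in this
    message: the text before the first ':' is treated as a scheme only
    if the remainder is empty or contains at least one non-digit.
    """
    i = url.find(":")
    if i <= 0 or any(c not in SCHEME_CHARS for c in url[:i]):
        return False
    rest = url[i + 1:]
    # An all-digit remainder is assumed to be a port number,
    # so the would-be scheme is left in the path.
    return not rest or any(c not in string.digits for c in rest)


for url in ["tel:+31641044153", "mailto:1337@example.org",
            "tel:1234", "javascript:123", "path:80"]:
    print(url, old_heuristic_accepts_scheme(url))
```

Under this sketch the first two inputs keep their scheme, while tel:1234, javascript:123 and path:80 lose it, which matches the misparsed cases listed above.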
msg271738 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2016-07-31 14:02
I hate to say it, but this may require a python-dev discussion.  We probably ought to be parsing valid urls correctly as our top priority, but if that breaks our parsing of "reasonable" non-valid URLs (that existing code is depending on), it's going to be a backward compatibility problem.
msg271739 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2016-07-31 14:04
On second thought, what are the chances that special casing something that looks like an IP address in the scheme position would maintain backward compatibility?
msg271823 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-08-02 13:55
Depends on how you define “looks like an IP address”. Does the www.cwi.nl:80 case look like an IP address? What about “path:80” or “localhost:80”? If there is any code relying on the bug, it may just as easily involve a host name as a numeric IP address.
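As an aside, calling code that knows its input is a host:port string can avoid the ambiguity entirely by introducing the netloc with "//", which is how the urllib.parse documentation says a netloc is recognized. The sketch below is only an illustration of that approach, not something proposed in this thread; the default scheme "http" is an assumption taken from the original 1.2.3.4:80 example.

```python
from urllib.parse import urlsplit

results = {}
for text in ["www.cwi.nl:80", "path:80", "localhost:80", "1.2.3.4:80"]:
    # Prefixing "//" marks the text as a network location, so no part
    # of it can be taken for a scheme and the default scheme applies.
    parts = urlsplit("//" + text, scheme="http")
    results[text] = (parts.scheme, parts.hostname, parts.port)
    print(text, results[text])
```

This treats host names and numeric addresses alike, so the caller does not need to decide what "looks like an IP address".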
msg271824 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2016-08-02 14:07
Ah, good point, I misread the scope of the problem.
msg289557 - (view) Author: Tim Graham (Tim.Graham) * Date: 2017-03-14 01:34
Based on discussion in issue 16932, I agree that reverting the parsing decisions from issue 754016 (as Martin suggested in msg271719) seems appropriate. I created a pull request that does that.
msg354889 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2019-10-18 13:07
New changeset 5a88d50ff013a64fbdb25b877c87644a9034c969 by Senthil Kumaran (Tim Graham) in branch 'master':
bpo-27657: Fix urlparse() with numeric paths (#661)
https://github.com/python/cpython/commit/5a88d50ff013a64fbdb25b877c87644a9034c969
msg354894 - (view) Author: miss-islington (miss-islington) Date: 2019-10-18 13:24
New changeset 82b5f6b16e051f8a2ac6e87ba86b082fa1c4a77f by Miss Islington (bot) in branch '3.7':
bpo-27657: Fix urlparse() with numeric paths (GH-661)
https://github.com/python/cpython/commit/82b5f6b16e051f8a2ac6e87ba86b082fa1c4a77f
msg354903 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2019-10-18 15:23
New changeset 0f3187c1ce3b3ace60f6c1691dfa3d4e744f0384 by Senthil Kumaran in branch '3.8':
[3.8] bpo-27657: Fix urlparse() with numeric paths (GH-661) (#16839)
https://github.com/python/cpython/commit/0f3187c1ce3b3ace60f6c1691dfa3d4e744f0384
msg355320 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-10-24 10:31
This issue has been fixed, so I am closing it.
History
Date User Action Args
2019-10-24 10:31:31 vstinner set status: open -> closed
nosy: + vstinner
messages: + msg355320
resolution: fixed
stage: patch review -> resolved
2019-10-18 15:23:21 orsenthil set messages: + msg354903
2019-10-18 13:51:49 orsenthil set pull_requests: + pull_request16388
2019-10-18 13:24:31 miss-islington set nosy: + miss-islington
messages: + msg354894
2019-10-18 13:07:37 miss-islington set keywords: + patch
pull_requests: + pull_request16382
2019-10-18 13:07:36 orsenthil set messages: + msg354889
2018-03-15 18:57:46 cheryl.sabella set stage: patch review
versions: + Python 3.7, Python 3.8, - Python 3.5, Python 3.6
2017-03-14 01:34:28 Tim.Graham set nosy: + Tim.Graham
messages: + msg289557
2017-03-13 17:39:32 Tim.Graham set pull_requests: + pull_request543
2016-08-02 14:07:03 r.david.murray set messages: + msg271824
2016-08-02 13:55:05 martin.panter set messages: + msg271823
2016-07-31 14:04:36 r.david.murray set messages: + msg271739
2016-07-31 14:02:56 r.david.murray set messages: + msg271738
2016-07-31 02:37:12 martin.panter set nosy: + martin.panter, orsenthil
messages: + msg271719
versions: + Python 2.7, Python 3.5, Python 3.6
2016-07-30 23:52:19 martin.panter link issue22891 dependencies
2016-07-30 21:12:06 r.david.murray set nosy: + r.david.murray
messages: + msg271703
2016-07-30 19:57:17 Björn.Lindqvist create