classification
Title: urlparse fails if the path is numeric
Type: behavior Stage: needs patch
Components: Library (Lib) Versions: Python 3.10, Python 3.9
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: orsenthil Nosy List: Björn.Lindqvist, Chris Dent, Tim.Graham, benjamin.peterson, lukasz.langa, martin.panter, mgorny, miss-islington, ned.deily, orsenthil, r.david.murray, roguelazer
Priority: critical Keywords: 3.7regression, 3.8regression, patch

Created on 2016-07-30 19:57 by Björn.Lindqvist, last changed 2020-05-27 19:14 by mgorny.

Pull Requests
URL Status Linked Edit
PR 661 merged Tim.Graham, 2017-03-13 17:39
PR 16837 merged miss-islington, 2019-10-18 13:07
PR 16839 merged orsenthil, 2019-10-18 13:51
PR 18525 merged orsenthil, 2020-02-16 18:17
PR 18526 merged orsenthil, 2020-02-16 18:19
Messages (23)
msg271702 - (view) Author: Björn Lindqvist (Björn.Lindqvist) Date: 2016-07-30 19:57
This affects both Python 2 and 3. This is as expected:

>>> urlparse('abc:123.html')
ParseResult(scheme='abc', netloc='', path='123.html', params='', query='', fragment='')
>>> urlparse('123.html:abc')
ParseResult(scheme='123.html', netloc='', path='abc', params='', query='', fragment='')
>>> urlparse('abc:123/')
ParseResult(scheme='abc', netloc='', path='123/', params='', query='', fragment='')

This is NOT:

>>> urlparse('abc:123')
ParseResult(scheme='', netloc='', path='abc:123', params='', query='', fragment='')

Expected is path='123' and scheme='abc'. At least according to my reading of the RFC (https://tools.ietf.org/html/rfc1808.html), that is what should happen.
msg271703 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2016-07-30 21:12
See issue 14072.  It may be time to look at this again, but we may still be constrained by backward compatibility.
msg271719 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-07-31 02:37
The main backward compatibility consideration would be Issue 754016, but I don’t agree with the changes made, and would support reverting them. The original bug reporter wanted urlparse("1.2.3.4:80", "http") to be treated as the URL http://1.2.3.4:80, but the IP address was being parsed as a scheme, so the default “http” scheme was ignored.

The original fix (r83701) affected any URL that had a digit 0–9 immediately after the “scheme:” prefix. In such URLs, the scheme component was no longer parsed. A test case for “path:80” was added, and a demonstration of not parsing any scheme from www.cwi.nl:80/%7Eguido/Python.html was added in the documentation.

Later, the logic was altered to test if the URL looked like an integer (revision 495d12196487, Issue 11467). This restored proper parsing of clsid:85bbd92o-42a0-1o69-a2e4-08002b30309d and mailto:1337@example.org, although another URL given, javascript:123, remains misparsed. The documentation was subsequently adjusted in Issue 16932 to just demonstrate www.cwi.nl/%7Eguido/Python.html being parsed as a path.

The logic was watered down to its current form by revision 9f6b7576c08c, Issue 14072. Now it tests for a non-digit anywhere after the scheme, so that tel:+31641044153 is again parsed properly. But it was pointed out that tel:1234 remains misparsed.

What’s the next step in the watering-down process? All the attempts so far break valid URLs in favour of special-casing inputs that are not valid URLs.
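The three successive heuristics described above can be sketched as follows. This is a simplified model built only from the description in this message, not the code from the revisions themselves: the function names are hypothetical, the split is done at the first colon, and validation of the scheme characters is ignored.

```python
ASCII_DIGITS = set("0123456789")


def split_candidate(url):
    """Split at the first colon; return (None, url) if there is no colon."""
    scheme, sep, rest = url.partition(":")
    return (scheme, rest) if sep else (None, url)


def keeps_scheme_first_digit(url):
    # Model of the original fix: no scheme is parsed when a digit
    # immediately follows the "scheme:" prefix.
    scheme, rest = split_candidate(url)
    return scheme is not None and rest[:1] not in ASCII_DIGITS


def keeps_scheme_int_test(url):
    # Model of the second version: no scheme is parsed when everything
    # after the colon looks like an integer.
    scheme, rest = split_candidate(url)
    if scheme is None:
        return False
    try:
        int(rest)
    except ValueError:
        return True
    return False


def keeps_scheme_any_non_digit(url):
    # Model of the current form: a scheme is parsed only when a
    # non-digit appears somewhere after the colon.
    scheme, rest = split_candidate(url)
    return scheme is not None and any(c not in ASCII_DIGITS for c in rest)


for url in ("path:80", "mailto:1337@example.org", "javascript:123",
            "tel:+31641044153", "tel:1234"):
    print(url,
          keeps_scheme_first_digit(url),
          keeps_scheme_int_test(url),
          keeps_scheme_any_non_digit(url))
```

Under this model each step rescues some of the valid URLs listed above, while a purely numeric path such as tel:1234 stays misparsed in all three.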
msg271738 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2016-07-31 14:02
I hate to say it, but this may require a python-dev discussion.  We probably ought to be parsing valid urls correctly as our top priority, but if that breaks our parsing of "reasonable" non-valid URLs (that existing code is depending on), it's going to be a backward compatibility problem.
msg271739 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2016-07-31 14:04
On second thought, what are the chances that special casing something that looks like an IP address in the scheme position would maintain backward compatibility?
msg271823 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-08-02 13:55
Depends on how you define “looks like an IP address”. Does the www.cwi.nl:80 case look like an IP address? What about “path:80” or “localhost:80”? If there is any code relying on the bug, it may just as easily involve a host name as a numeric IP address.
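To make the question concrete, here is a minimal sketch of one possible definition, assuming “looks like an IP address” means that the text before the first colon parses with the standard ipaddress module. The helper name is hypothetical, and IPv6 literals are out of scope because they contain colons themselves.

```python
import ipaddress


def scheme_position_is_ip(url):
    """Return True if the text before the first colon is an IP address literal."""
    candidate, sep, _ = url.partition(":")
    if not sep:
        return False
    try:
        ipaddress.ip_address(candidate)
    except ValueError:
        return False
    return True


for url in ("1.2.3.4:80", "www.cwi.nl:80", "path:80", "localhost:80"):
    print(url, scheme_position_is_ip(url))
```

With this definition only the 1.2.3.4:80 input is special-cased, so code that relies on host names such as localhost:80 would not be covered.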
msg271824 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2016-08-02 14:07
Ah, good point, I misread the scope of the problem.
msg289557 - (view) Author: Tim Graham (Tim.Graham) * Date: 2017-03-14 01:34
Based on discussion in issue 16932, I agree that reverting the parsing decisions from issue 754016 (as Martin suggested in msg271719) seems appropriate. I created a pull request that does that.
msg354889 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2019-10-18 13:07
New changeset 5a88d50ff013a64fbdb25b877c87644a9034c969 by Senthil Kumaran (Tim Graham) in branch 'master':
bpo-27657: Fix urlparse() with numeric paths (#661)
https://github.com/python/cpython/commit/5a88d50ff013a64fbdb25b877c87644a9034c969
msg354894 - (view) Author: miss-islington (miss-islington) Date: 2019-10-18 13:24
New changeset 82b5f6b16e051f8a2ac6e87ba86b082fa1c4a77f by Miss Islington (bot) in branch '3.7':
bpo-27657: Fix urlparse() with numeric paths (GH-661)
https://github.com/python/cpython/commit/82b5f6b16e051f8a2ac6e87ba86b082fa1c4a77f
msg354903 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2019-10-18 15:23
New changeset 0f3187c1ce3b3ace60f6c1691dfa3d4e744f0384 by Senthil Kumaran in branch '3.8':
[3.8] bpo-27657: Fix urlparse() with numeric paths (GH-661) (#16839)
https://github.com/python/cpython/commit/0f3187c1ce3b3ace60f6c1691dfa3d4e744f0384
msg355320 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-10-24 10:31
This issue has been fixed, so I am closing it.
msg359273 - (view) Author: James Brown (roguelazer) Date: 2020-01-04 02:37
This is a surprising change to put in a minor release. This change totally changes the semantics of parsing scheme-less URLs with ports in them and ended up breaking a significant amount of my software. It turns out that urls like `example.com:80` are more common than one might hope, and a lot of software has always assumed that `example.com:80` would get parsed as the netloc and the software can guess the scheme based on the port...
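For callers who want a result that does not depend on how a scheme-less `host:port` string is interpreted, one option is to add the `//` network-location marker before parsing. The sketch below is only an illustration under that assumption; `split_host_port` and its `default_scheme` parameter are hypothetical names, not part of urllib.

```python
from urllib.parse import urlsplit


def split_host_port(target, default_scheme="http"):
    """Parse 'host:port' or a full URL into (scheme, hostname, port)."""
    # Without "//", the text before the colon may be taken as a scheme or as
    # part of the path, depending on the parsing rules in effect.
    if "://" not in target:
        target = "//" + target
    parts = urlsplit(target, scheme=default_scheme)
    return parts.scheme, parts.hostname, parts.port


print(split_host_port("example.com:80"))
print(split_host_port("https://example.com:8443/status"))
```

Because the input then always carries an explicit netloc, the parser never has to guess whether `example.com` is a scheme.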
msg359277 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2020-01-04 05:26
@James - Originally the issue was considered a revert and the versions were set for the merge, but I certainly recognize the problem when parsing can fail for simple URLs like `localhost:8000` which is very common.

Another developer had raised the concerns with the change in this PR: https://github.com/python/cpython/pull/16839#issuecomment-570758153 

I am reopening this issue and will re-read the arguments to understand the problem and propose the next steps.
msg360196 - (view) Author: Chris Dent (Chris Dent) Date: 2020-01-17 15:21
Just to add to the list of places this is causing a regression. This has broken the target host determination routines in gabbi: https://github.com/cdent/gabbi/issues/277

While the original fix may have been strictly correct in some ways, it results in a terrible UX and, as several others have noted, violates backwards compatibility.
msg361815 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2020-02-11 13:20
Hi Łukasz / Ned:

I would like to revert the backports done in 3.8 and 3.7.

Preferably in 3.8.2 and 3.7.7, so that this undesirable behavior exists only for a single release.

I have set this as a release blocker to catch your attention.
msg362103 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2020-02-16 21:07
New changeset 505b6015a1579fc50d9697e4a285ecc64976397a by Senthil Kumaran in branch '3.7':
Revert "bpo-27657: Fix urlparse() with numeric paths (GH-661)" (#18526)
https://github.com/python/cpython/commit/505b6015a1579fc50d9697e4a285ecc64976397a
msg362107 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2020-02-16 21:47
New changeset ea316fd21527dec53e704a5b04833ac462ce3863 by Senthil Kumaran in branch '3.8':
Revert "[3.8] bpo-27657: Fix urlparse() with numeric paths (GH-16839)" (GH-18525)
https://github.com/python/cpython/commit/ea316fd21527dec53e704a5b04833ac462ce3863
msg362632 - (view) Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2020-02-25 12:03
Can this be closed? Downgrading priority since the fix was released as part of 3.8.2rc2 and 3.8.2 final.
msg362675 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2020-02-26 01:36
Hi Łukasz, there was a concern raised by Python core devs about the behavior in 3.9. I plan to address the point raised in this issue and then close this ticket.
msg362742 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2020-02-26 21:26
FYI, for those following along, now that 3.8.2 has been released with the revert of the regression, we are planning to accelerate the schedule for 3.7.7, the next 3.7.x bugfix release, in part to get the revert out to 3.7.x users sooner (https://discuss.python.org/t/3-7-7-schedule-accelerated-cutoff-now-2020-03-02/3511).
msg370048 - (view) Author: Michał Górny (mgorny) * Date: 2020-05-27 05:16
Do I understand correctly that the new behavior is intentional in 3.9, or is that still being discussed?
msg370120 - (view) Author: Michał Górny (mgorny) * Date: 2020-05-27 19:14
I'm sorry but does this change mean that it's not final or...?

My main concern is whether we should be adjusting our packages to the new behavior in py3.9, or wait for further changes.
History
Date User Action Args
2020-05-27 19:14:43 mgorny set messages: + msg370120
2020-05-27 09:07:16 ned.deily set stage: patch review -> needs patch; versions: + Python 3.9, Python 3.10, - Python 2.7, Python 3.7, Python 3.8
2020-05-27 05:16:28 mgorny set nosy: + mgorny; messages: + msg370048
2020-02-26 21:26:29 ned.deily set messages: + msg362742
2020-02-26 01:36:17 orsenthil set messages: + msg362675
2020-02-25 12:03:05 lukasz.langa set priority: release blocker -> critical; messages: + msg362632
2020-02-16 21:47:25 orsenthil set messages: + msg362107
2020-02-16 21:07:29 orsenthil set messages: + msg362103
2020-02-16 18:19:45 orsenthil set pull_requests: + pull_request17903
2020-02-16 18:17:09 orsenthil set keywords: + patch; stage: commit review -> patch review; pull_requests: + pull_request17902
2020-02-11 13:20:49 orsenthil set priority: deferred blocker -> release blocker; nosy: + lukasz.langa, benjamin.peterson, ned.deily; messages: + msg361815
2020-01-17 16:14:32 vstinner set nosy: - vstinner
2020-01-17 15:21:32 Chris Dent set nosy: + Chris Dent; messages: + msg360196
2020-01-04 17:49:08 ned.deily set keywords: + 3.7regression, 3.8regression, - patch; priority: normal -> deferred blocker
2020-01-04 05:26:14 orsenthil set status: closed -> open; messages: + msg359277; assignee: orsenthil; resolution: fixed ->; stage: resolved -> commit review
2020-01-04 02:37:16 roguelazer set nosy: + roguelazer; messages: + msg359273
2019-10-24 10:31:31 vstinner set status: open -> closed; nosy: + vstinner; messages: + msg355320; resolution: fixed; stage: patch review -> resolved
2019-10-18 15:23:21 orsenthil set messages: + msg354903
2019-10-18 13:51:49 orsenthil set pull_requests: + pull_request16388
2019-10-18 13:24:31 miss-islington set nosy: + miss-islington; messages: + msg354894
2019-10-18 13:07:37 miss-islington set keywords: + patch; pull_requests: + pull_request16382
2019-10-18 13:07:36 orsenthil set messages: + msg354889
2018-03-15 18:57:46 cheryl.sabella set stage: patch review; versions: + Python 3.7, Python 3.8, - Python 3.5, Python 3.6
2017-03-14 01:34:28 Tim.Graham set nosy: + Tim.Graham; messages: + msg289557
2017-03-13 17:39:32 Tim.Graham set pull_requests: + pull_request543
2016-08-02 14:07:03 r.david.murray set messages: + msg271824
2016-08-02 13:55:05 martin.panter set messages: + msg271823
2016-07-31 14:04:36 r.david.murray set messages: + msg271739
2016-07-31 14:02:56 r.david.murray set messages: + msg271738
2016-07-31 02:37:12 martin.panter set nosy: + martin.panter, orsenthil; messages: + msg271719; versions: + Python 2.7, Python 3.5, Python 3.6
2016-07-30 23:52:19 martin.panter link issue22891 dependencies
2016-07-30 21:12:06 r.david.murray set nosy: + r.david.murray; messages: + msg271703
2016-07-30 19:57:17 Björn.Lindqvist create