urlparse fails if the path is numeric #71844

BjrnLindqvist · 2016-07-30T19:57:18Z

BPO	27657
Nosy	@orsenthil, @benjaminp, @ned-deily, @leifwalsh, @bitdancer, @ambv, @mgorny, @vadmium, @timgraham, @miss-islington, @Roguelazer
PRs	bpo-27657: Fix urlparse() with numeric paths #661 [3.7] bpo-27657: Fix urlparse() with numeric paths (GH-661) #16837 [3.8] bpo-27657: Fix urlparse() with numeric paths (GH-661) #16839 [3.8] bpo-27657: Revert "Fix urlparse() with numeric paths (GH-661) #18525 [3.7] bpo-27657: Revert - Fix urlparse() with numeric paths (GH-661)" #18526

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = 'https://github.com/orsenthil'
closed_at = <Date 2020-06-08.17:16:41.595>
created_at = <Date 2016-07-30.19:57:17.789>
labels = ['type-bug', 'library', '3.9']
title = 'urlparse fails if the path is numeric'
updated_at = <Date 2021-07-02.00:01:49.848>
user = 'https://bugs.python.org/BjrnLindqvist'

bugs.python.org fields:

activity = <Date 2021-07-02.00:01:49.848>
actor = 'leif.walsh'
assignee = 'orsenthil'
closed = True
closed_date = <Date 2020-06-08.17:16:41.595>
closer = 'orsenthil'
components = ['Library (Lib)']
creation = <Date 2016-07-30.19:57:17.789>
creator = 'Bj\xc3\xb6rn.Lindqvist'
dependencies = []
files = []
hgrepos = []
issue_num = 27657
keywords = ['3.7regression']
message_count = 31.0
messages = ['271702', '271703', '271719', '271738', '271739', '271823', '271824', '289557', '354889', '354894', '354903', '355320', '359273', '359277', '360196', '361815', '362103', '362107', '362632', '362675', '362742', '370048', '370120', '371015', '371024', '371040', '371042', '371043', '371044', '371046', '396833']
nosy_count = 13.0
nosy_names = ['orsenthil', 'benjamin.peterson', 'ned.deily', 'leif.walsh', 'r.david.murray', 'lukasz.langa', 'mgorny', 'martin.panter', 'Bj\xc3\xb6rn.Lindqvist', 'Tim.Graham', 'miss-islington', 'roguelazer', 'Chris Dent']
pr_nums = ['661', '16837', '16839', '18525', '18526']
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue27657'
versions = ['Python 3.9']

BjrnLindqvist · 2016-07-30T19:57:17Z

This affects both Python 2 and 3. This is as expected:

>>> urlparse('abc:123.html')
ParseResult(scheme='abc', netloc='', path='123.html', params='', query='', fragment='')
>>> urlparse('123.html:abc')
ParseResult(scheme='123.html', netloc='', path='abc', params='', query='', fragment='')
>>> urlparse('abc:123/')
ParseResult(scheme='abc', netloc='', path='123/', params='', query='', fragment='')

This is NOT:

>>> urlparse('abc:123')
ParseResult(scheme='', netloc='', path='abc:123', params='', query='', fragment='')

Expected is path='123' and scheme='abc'. At least according to my reading of the rfc (https://tools.ietf.org/html/rfc1808.html) that is what should happen.

bitdancer · 2016-07-30T21:12:07Z

See bpo-14072. It may be time to look at this again, but we may still be constrained by backward compatibility.

vadmium · 2016-07-31T02:37:10Z

The main backward compatibility consideration would be bpo-754016, but don’t agree with the changes made, and would support reverting them. The original bug reporter wanted urlparse("1.2.3.4:80", "http") to be treated as the URL http://1.2.3.4:80, but the IP address was being parsed as a scheme, so the default “http” scheme was ignored.

The original fix (r83701) affected any URL that had a digit 0–9 immediately after the “scheme:” prefix. In such URLs, the scheme component was no longer parsed. A test case for “path:80” was added, and a demonstration of not parsing any scheme from www.cwi.nl:80/%7Eguido/Python.html was added in the documentation.

Later, the logic was altered to test if the URL looked like an integer (revision 495d12196487, bpo-11467). This restored proper parsing of clsid:85bbd92o-42a0-1o69-a2e4-08002b30309d and mailto:1337@example.org, although another URL given, javascript:123, remains misparsed. The documentation was subsequently adjusted in bpo-16932 to just demonstrate www.cwi.nl/%7Eguido/Python.html being parsed as a path.

The logic was watered down to its current form by revision 9f6b7576c08c, bpo-14072. Now it tests for a non-digit anywhere after the scheme, so that tel:+31641044153 is again parsed properly. But it was pointed out that tel:1234 remains misparsed.

What’s the next step in the watering-down process? All the attempts so far break valid URLs in favour of special-casing inputs that are not valid URLs.

bitdancer · 2016-07-31T14:02:56Z

I hate to say it, but this may require a python-dev discussion. We probably ought to be parsing valid urls correctly as our top priority, but if that breaks our parsing of "reasonable" non-valid URLs (that existing code is depending on), it's going to be a backward compatibility problem.

bitdancer · 2016-07-31T14:04:36Z

On second thought, what are the chances that special casing something that looks like an IP address in the scheme position would maintain backward compatibility?

vadmium · 2016-08-02T13:55:05Z

Depends on how you define “looks like an IP address”. Does the www.cwi.nl:80 case look like an IP address? What about “path:80” or “localhost:80”? If there is any code relying on the bug, it may just as easily involve host name as a numeric IP address.

bitdancer · 2016-08-02T14:07:03Z

Ah, good point, I misread the scope of the problem.

timgraham · 2017-03-14T01:34:28Z

Based on discussion in bpo-16932, I agree that reverting the parsing decisions from bpo-754016 (as Martin suggested in msg271719) seems appropriate. I created a pull request that does that.

orsenthil · 2019-10-18T13:07:36Z

New changeset 5a88d50 by Senthil Kumaran (Tim Graham) in branch 'master':
bpo-27657: Fix urlparse() with numeric paths (#661)
5a88d50

miss-islington · 2019-10-18T13:24:32Z

New changeset 82b5f6b by Miss Islington (bot) in branch '3.7':
bpo-27657: Fix urlparse() with numeric paths (GH-661)
82b5f6b

orsenthil · 2019-10-18T15:23:21Z

New changeset 0f3187c by Senthil Kumaran in branch '3.8':
[3.8] bpo-27657: Fix urlparse() with numeric paths (GH-661) (bpo-16839)
0f3187c

vstinner · 2019-10-24T10:31:31Z

This issue got fixes, so I close it.

Roguelazer · 2020-01-04T02:37:16Z

This is a surprising change to put in a minor release. This change totally changes the semantics of parsing scheme-less URLs with ports in them and ended up breaking a significant amount of my software. It turns out that urls like example.com:80 are more common than one might hope, and a lot of software has always assumed that example.com:80 would get parsed as the netloc and the software can guess the scheme based on the port...

orsenthil · 2020-01-04T05:26:14Z

@james - Originally the issue was considered a revert and the versions were set for the merge, but I certainly recognize the problem when parsing can fail for simple URLs like localhost:8000 which is very common.

Another developer had raised the concerns with the change in this PR: #16839 (comment)

I am reopening this issue, and re-read the arguments again to understand and propose the next steps.

ChrisDent · 2020-01-17T15:21:33Z

Just to add to the list of places this is causing a regression. This has broken the target host determination routines in gabbi: cdent/gabbi#277

While the original fix may have been strictly correct in some ways, it results in a terrible UX, and as several others have noted violated backwards compatibility.

orsenthil · 2020-02-11T13:20:50Z

Hi Lukaz / Ned:

I will like to revert the backports done in 3.8 and 3.7.

Preferably in 3.8.2 and 3.7.7, so that this undesirable behavior exists only for a single release.

I have set this is a release blocker to catch your attention.

orsenthil · 2020-02-16T21:07:29Z

New changeset 505b601 by Senthil Kumaran in branch '3.7':
Revert "bpo-27657: Fix urlparse() with numeric paths (GH-661)" (bpo-18526)
505b601

orsenthil · 2020-02-16T21:47:25Z

New changeset ea316fd by Senthil Kumaran in branch '3.8':
Revert "[3.8] bpo-27657: Fix urlparse() with numeric paths (GH-16839)" (GH-18525)
ea316fd

ambv · 2020-02-25T12:03:05Z

Can this be closed? Downgrading priority since the fix was released as part of 3.8.2rc2 and 3.8.2 final.

orsenthil · 2020-02-26T01:36:18Z

Hi Łukasz, There was a concern raised by python core-devs about behavior in 3.9. I plan to address that point raised in this issue and close this ticket.

ned-deily · 2020-02-26T21:26:29Z

FYI, for those following along, now that 3.8.2 has been released with the revert of the regression, we are planning to accelerate the schedule for 3.7.7, the next 3.7.x bugfix release, in part to get the revert out to 3.7.x users sooner (https://discuss.python.org/t/3-7-7-schedule-accelerated-cutoff-now-2020-03-02/3511).

mgorny · 2020-05-27T05:16:28Z

Do I understand correctly that the new behavior is intentional in 3.9, or is that still being discussed?

mgorny · 2020-05-27T19:14:43Z

I'm sorry but does this change mean that it's not final or...?

My main concern is whether we should be adjusting our packages to the new behavior in py3.9, or wait for further changes.

ambv · 2020-06-08T16:57:08Z

Michał, this change has been reverted in 3.7 and 3.8. Senthil says this needs further work in 3.9. Senthil, please prioritize this. We're already releasing 3.9.0b2.

orsenthil · 2020-06-08T17:16:41Z

Hi All,

On the previous message.

I plan to address that point raised in this issue and close this ticket.

My intention was to close the bug.

The Issue was originally a regression in 3.8, and 3.7, And this been be resolved.

The change introduced did alter the behavior, but it was actually the correct one as I wrote

"For 3.9. - I am ready to defend the patch even at the cost of the breaking of the parsing of undefined behavior. We should keep it. The patch simplifies a lot of corner cases and fixes the reported bugs. We don't guarantee backward compatibility between major versions, so I assume users will be careful when relying upon this undefined behavior and will take corrective action on their side before upgrading to 3.9.

We want patch releases to be backward compatible. That was the
user-complaint."

I agree with Ned Deily, that there are plenty of corner cases with urlparse APIs (Ref: https://bugs.python.org/issue36338#msg355322)

That making any change is difficult (especially if not caught by regression suite).

*And if we simplify the code, it will be welcome, and will help us continue with next changes* - This was the original motivation of this code acceptance (Here 5a88d50 and https://bugs.python.org/issue27657#msg289557)

I was not very explicit on stance proposal previously, and assumed if anything else can be done. But No. I think, this behavior we want in Python3.9 (as discussed in the initial messages of this ticket).

I apologize, Michał that I missed responding to your question quickly.
I hope this clarifies the stance.

mgorny · 2020-06-08T19:43:09Z

I wonder if it would be feasible to support new behavior in earlier versions of Python via __future__. I suppose that could help software avoid having Python version-dependent behavior in the long run.

orsenthil · 2020-06-08T20:14:34Z

Michał, for this particular issue, to rely on the expected and consistent parsing behavior, it might easier to add "scheme" to URLs in the client code.

That will be less confusing IMO. Not sure if __future__ is a good idea. Personally, I am -1 at this point.

ChrisDent · 2020-06-08T20:22:08Z

I just wanted to reiterate what I said at https://bugs.python.org/issue27657#msg360196

The supposed fix provides terribly UX and violates what I think for many people is the path of least surprise.

orsenthil · 2020-06-08T21:10:11Z

Chris, my understanding of the primary objection in your previous post was with a violation of backward compatibility. It was resolved by reverting.

The 3.9+ changes are useful ones, particularly when well-defined behaviors are listed for URLs with the scheme.

---

Let's please assume that the behavior will be consistent for

a) Scheme URLs for all supported versions of Python for numeric paths.

b) Consistent with other widely observed behavior (or supported by RFC) in 3.9+ onwards for this particular case.

If there is any violation of b), please point out, I am certainly open to treat this as a bug (again) and revisit the 3+ issues across which this discussion has spanned.

ChrisDent · 2020-06-08T21:23:18Z

I don't want to belabour the point. If there's general agreement that the new behaviour is the best result that's fine, especially if it is well announced. I agree that it has been inconsistent and weird for a long time. I think my objection simply comes down to "we all got used to it being weird over the years, and now you want to change it!?"

Followed by table flips and all the rest. But it really isn't that much of a big deal, so I won't persist. Long term having it be its most correct is probably the best thing.

leifwalsh · 2021-07-02T00:01:50Z

I don't mean to reopen a can of worms, but this affects requests, which I reported to them here: psf/requests#5855

BjrnLindqvist mannequin added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Jul 30, 2016

csabella added 3.7 (EOL) end of life 3.8 only security fixes labels Mar 15, 2018

vstinner closed this as completed Oct 24, 2019

orsenthil reopened this Jan 4, 2020

orsenthil self-assigned this Jan 4, 2020

ned-deily added the deferred-blocker label Jan 4, 2020

orsenthil added release-blocker and removed deferred-blocker labels Feb 11, 2020

ambv removed the release-blocker label Feb 25, 2020

ned-deily added 3.9 only security fixes 3.10 only security fixes and removed 3.7 (EOL) end of life 3.8 only security fixes labels May 27, 2020

orsenthil removed the 3.10 only security fixes label Jun 8, 2020

orsenthil closed this as completed Jun 8, 2020

ezio-melotti transferred this issue from another repository Apr 10, 2022

bityob mentioned this issue Jun 19, 2022

'127.0.0.1:8329' parsed wrong in Python 3.9+ gruns/furl#155

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

urlparse fails if the path is numeric #71844

urlparse fails if the path is numeric #71844

BjrnLindqvist mannequin commented Jul 30, 2016

BjrnLindqvist mannequin commented Jul 30, 2016

bitdancer commented Jul 30, 2016

vadmium commented Jul 31, 2016

bitdancer commented Jul 31, 2016

bitdancer commented Jul 31, 2016

vadmium commented Aug 2, 2016

bitdancer commented Aug 2, 2016

timgraham mannequin commented Mar 14, 2017

orsenthil commented Oct 18, 2019

miss-islington commented Oct 18, 2019

orsenthil commented Oct 18, 2019

vstinner commented Oct 24, 2019

Roguelazer mannequin commented Jan 4, 2020

orsenthil commented Jan 4, 2020

ChrisDent mannequin commented Jan 17, 2020

orsenthil commented Feb 11, 2020

orsenthil commented Feb 16, 2020

orsenthil commented Feb 16, 2020

ambv commented Feb 25, 2020

orsenthil commented Feb 26, 2020

ned-deily commented Feb 26, 2020

mgorny mannequin commented May 27, 2020

mgorny mannequin commented May 27, 2020

ambv commented Jun 8, 2020

orsenthil commented Jun 8, 2020

mgorny mannequin commented Jun 8, 2020

orsenthil commented Jun 8, 2020

ChrisDent mannequin commented Jun 8, 2020

orsenthil commented Jun 8, 2020

ChrisDent mannequin commented Jun 8, 2020

leifwalsh mannequin commented Jul 2, 2021

urlparse fails if the path is numeric #71844

urlparse fails if the path is numeric #71844

Comments

BjrnLindqvist mannequin commented Jul 30, 2016

BjrnLindqvist mannequin commented Jul 30, 2016

bitdancer commented Jul 30, 2016

vadmium commented Jul 31, 2016

bitdancer commented Jul 31, 2016

bitdancer commented Jul 31, 2016

vadmium commented Aug 2, 2016

bitdancer commented Aug 2, 2016

timgraham mannequin commented Mar 14, 2017

orsenthil commented Oct 18, 2019

miss-islington commented Oct 18, 2019

orsenthil commented Oct 18, 2019

vstinner commented Oct 24, 2019

Roguelazer mannequin commented Jan 4, 2020

orsenthil commented Jan 4, 2020

ChrisDent mannequin commented Jan 17, 2020

orsenthil commented Feb 11, 2020

orsenthil commented Feb 16, 2020

orsenthil commented Feb 16, 2020

ambv commented Feb 25, 2020

orsenthil commented Feb 26, 2020

ned-deily commented Feb 26, 2020

mgorny mannequin commented May 27, 2020

mgorny mannequin commented May 27, 2020

ambv commented Jun 8, 2020

orsenthil commented Jun 8, 2020

mgorny mannequin commented Jun 8, 2020

orsenthil commented Jun 8, 2020

ChrisDent mannequin commented Jun 8, 2020

orsenthil commented Jun 8, 2020

ChrisDent mannequin commented Jun 8, 2020

leifwalsh mannequin commented Jul 2, 2021