Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

urllib.parse.urlparse and urlsplit not raising ValueError for bad port #77215

Closed
jonathan-lp mannequin opened this issue Mar 9, 2018 · 19 comments
Closed

urllib.parse.urlparse and urlsplit not raising ValueError for bad port #77215

jonathan-lp mannequin opened this issue Mar 9, 2018 · 19 comments
Labels
3.8 only security fixes stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@jonathan-lp
Copy link
Mannequin

jonathan-lp mannequin commented Mar 9, 2018

BPO 33034
Nosy @ericvsmith, @berkerpeksag, @bbayles, @agnosticdev
PRs
  • [3.6.3] bpo-33034: Added explicit exception message with cast failed.  #6077
  • bpo-33034: Added explicit exception message when cast fails #6078
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2018-03-20.13:00:01.478>
    created_at = <Date 2018-03-09.08:24:01.554>
    labels = ['3.8', 'type-feature', 'library']
    title = 'urllib.parse.urlparse and urlsplit not raising ValueError for bad port'
    updated_at = <Date 2018-04-03.11:02:55.207>
    user = 'https://bugs.python.org/jonathan-lp'

    bugs.python.org fields:

    activity = <Date 2018-04-03.11:02:55.207>
    actor = 'agnosticdev'
    assignee = 'none'
    closed = True
    closed_date = <Date 2018-03-20.13:00:01.478>
    closer = 'berker.peksag'
    components = ['Library (Lib)']
    creation = <Date 2018-03-09.08:24:01.554>
    creator = 'jonathan-lp'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 33034
    keywords = ['patch']
    message_count = 19.0
    messages = ['313475', '313477', '313600', '313928', '313929', '313930', '313931', '313944', '313947', '313996', '313998', '313999', '314002', '314005', '314133', '314139', '314140', '314725', '314877']
    nosy_count = 5.0
    nosy_names = ['eric.smith', 'berker.peksag', 'bbayles', 'agnosticdev', 'jonathan-lp']
    pr_nums = ['6077', '6078']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue33034'
    versions = ['Python 3.8']

    @jonathan-lp
    Copy link
    Mannequin Author

    jonathan-lp mannequin commented Mar 9, 2018

    (Confirmed in 2.7.14, 3.5.4, and 3.6.3)

    I have this really bad URL from a crawl:
    "http://Server=sde; Service=sde:oracle$sde:oracle11g:geopp; User=bodem; Version=SDE.DEFAULT"

    if I try and parse it with wither urlparse or urlsplit it works - no errors. But when I try and get the port, I get a ValueError.

    from urllib.parse import urlparse
    r = urlparse('http://Server=sde; Service=sde:oracle$sde:oracle11g:geopp; User=bodem; Version=SDE.DEFAULT')
    ParseResult(scheme='http', netloc='Server=sde; Service=sde:oracle$sde:oracle11g:geopp; User=bodem; Version=SDE.DEFAULT', path='', params='', query='', fragment='')

    Ok, great, now to use the result:
    > print(r.port)
    Traceback (most recent call last):
      File "<input>", line 1, in <module>
      File "E:\Software\_libs\Python36\lib\urllib\parse.py", line 167, in port
        port = int(port, 10)
    ValueError: invalid literal for int() with base 10: 'oracle$sde:oracle11g:geopp; User=bodem; Version=SDE.DEFAULT'

    I'm not a Python Guru, but to me at least it's inconsistent with how every other Python Function works. In all other builtin functions I've used it would fail with the exception when I ran the function, not when I try and get the results. This caused a good few minutes of head-scratching while I tried to debug why my try/except wasn't catching it.

    This inconsistency makes the results more difficult to use. Now a user needs to wrap all calls to the *results* in a try/except, or write an entire function just to "read" the results into a won't-except tuple/dict. Seems sub-optimal.

    (May relate to: https://bugs.python.org/issue20059)

    @jonathan-lp
    Copy link
    Mannequin Author

    jonathan-lp mannequin commented Mar 9, 2018

    Arguably the same logic applies to out-of-range ports:

    > a = urlsplit('http://example.com:4444444')
    > a.port
    Traceback (most recent call last):
      File "<input>", line 1, in <module>
      File "E:\Software\_libs\Python36\lib\urllib\parse.py", line 169, in port
        raise ValueError("Port out of range 0-65535")
    ValueError: Port out of range 0-65535

    @agnosticdev
    Copy link
    Mannequin

    agnosticdev mannequin commented Mar 11, 2018

    I agree. I think an explicit exception message would be appropriate when the cast fails to cast from string to int in int(post, 10).

    Putting in a PR to fix this now.
    #6078

    @ericvsmith
    Copy link
    Member

    Wouldn't we be better off to catch this error at parse time, instead of just improve the error message when .port is called?

    I'm not sure why port is a property and not just computed. Maybe the least intrusive way of doing this would be to just evaluate self.port after the parsing is completed?

    @ericvsmith
    Copy link
    Member

    Although having said that, we probably can't make that change for 2.7. And maybe it's even too big a change for 3.8.

    @agnosticdev
    Copy link
    Mannequin

    agnosticdev mannequin commented Mar 16, 2018

    "Wouldn't we be better off to catch this error at parse time, instead of just improve the error message when .port is called?"

    I think there could be a case to be made about catching and dealing with this error in urlparse() / urlsplit() instead of displaying an error when port property is used. I think that approaching it this way would cut right to the problem and alleviate carrying around a potentially bad port value. However, if the port error was caught during parsing but the url, scheme, etc.. values were still valid, are we taking away something from the user by raising the error too soon? Just a thought.

    @ericvsmith
    Copy link
    Member

    Yes, you're right, it can't be changed. And .port is documented as raising ValueError, so it's all working correctly. I think improving the ValueError message, as the PR does, is as good as we can do.

    @jonathan-lp
    Copy link
    Mannequin Author

    jonathan-lp mannequin commented Mar 16, 2018

    Interesting conversation

    As I see it, there are two ways to solve this, both discussed above:
    A) Python can compute and ValueError at parse-time
    B) Python can ValueError at property-call time. (Current method)

    But both have Advantages/Disadvantages.
    A - Pros) - The function is more consistent with all the other Python builtins (well, the one's I've dealt with).
    A - Pros) - Not carrying around a "bad" port.
    A - Cons) - As Matt suggests, we could be "taking something from the user" because they may want the other values. (although in that case, the current semi-related behaviour: "Unmatched square brackets in the netloc attribute will raise a ValueError" is also potentially taking something from the user).

    B - Pros) - User gets the other values even if they don't get the port.
    B - Cons) - User needs to have more Try/Excepts in the code (whereever you may access the property), or to write their own wrapper function.

    Given the above, of the those options I still think option (A) is better. The other values may have a reduced worth if the port is invalid (depending on the user's goal).

    May I suggest a third option:
    C) A flag for urlsplit/urlparse to indicate we don't want to do port validation, and to just return whatever it thinks is there. (example.com:3293849038 would return 3293849038. example.com:gibberish would return "gibberish" etc).

    This way the user can choose what behaviour they want with the bad port and test the value of the port themselves if they care (something I was going to do anyway before I discovered it was included in the builtin). Some of the url quoting functions have a similar flag ("errors") for handling bad data transparently or not.

    @agnosticdev
    Copy link
    Mannequin

    agnosticdev mannequin commented Mar 16, 2018

    Jonathan, thank you very much for your thoughts I appreciate the pros and cons of each option.

    In regards to your option C, if we provided a flag to optionally raise the error in urlsplit and urlparse were you thinking the default flag to be set to True?

    For example:

    def urlparse(url, scheme='', allow_fragments=True, port_validation=True):
    
    
    def urlsplit(url, scheme='', allow_fragments=True, port_validation=True):

    @agnosticdev
    Copy link
    Mannequin

    agnosticdev mannequin commented Mar 17, 2018

    One more question I would raise based upon a point made earlier by Eric is if option C would be too large of a change for 3.8? Any thoughts on this?

    @ericvsmith
    Copy link
    Member

    For backward comparability, option C would have to default to not raising an exception. Which would limit it's usefulness, but that's the way it goes.

    @agnosticdev
    Copy link
    Mannequin

    agnosticdev mannequin commented Mar 17, 2018

    This is a very good point, Eric. For backwards compatibility we would have to set the default parameter to false, so we be in the same state we are today by default. Knowing this my vote would be to go with the improvements to the ValueError message when using .port as the current PR does now.

    @jonathan-lp
    Copy link
    Mannequin Author

    jonathan-lp mannequin commented Mar 17, 2018

    Thanks for the thoughts.
    If only the exception message changes, that doesn't really address the issue that caused me to raise this - namely that it seems to be inconsistent with how almost every other Python builtin function I've used works. I have to defer to you folks who know how feasible changing any of that is - I can see your reasoning.

    One quick observation - on glancing over the patch, it seems to only be for urlparse, but this happens to urlsplit too. Or does urlsplit import from that function (as I said, I only glanced).

    @agnosticdev
    Copy link
    Mannequin

    agnosticdev mannequin commented Mar 17, 2018

    Yes, my goal for the patch was to provide a more explicit error message for this situation and to provide a low surface area change to the overall source, knowing that there are future development goals and backward compatibility to keep in mind. That way the patch would be a first step in providing more explicit information when a developer would run into this situation and hopefully improving the current situation at hand.

    The new ValueError message raised in this patch does work for both urlparse and urlsplit.

    @berkerpeksag
    Copy link
    Member

    New changeset 2cb4661 by Berker Peksag (Matt Eaton) in branch 'master':
    bpo-33034: Improve exception message when cast fails for {Parse,Split}Result.port (GH-6078)
    2cb4661

    @berkerpeksag
    Copy link
    Member

    The problem with adding a port_validation argument is that the port attribute is not the only thing that is computed lazily. There is also hostname, username, password attributes etc.

    I think the best way would be introducing a new API with more strict parsing rules. For example:

    from urllib.parse URL
    
        url = URL('http://Server=sde; Service=sde:oracle').parse()

    would raise a ValueError earlier than urlparse() and return a immutable namedtuple.

    Such an API can be designed as a standalone module first and then can be added into the existing urllib.parse module. I'd personally happy to review and discuss such a modern and user friendly API in the standard library.

    I think Eric has explained the current situation perfectly and we can now close this issue. We can create a new issue or a new python-ideas thread when or if we have a prototype of a new API.

    Thank you, everyone!

    @berkerpeksag berkerpeksag added 3.8 only security fixes stdlib Python modules in the Lib dir labels Mar 20, 2018
    @berkerpeksag berkerpeksag added the type-feature A feature request or enhancement label Mar 20, 2018
    @agnosticdev
    Copy link
    Mannequin

    agnosticdev mannequin commented Mar 20, 2018

    Berker and Eric, thank you very much. I really like the idea of introducing a new API with more strict parsing rules for this situation. I would be willing to put some ideas down on a first pass at this.

    @agnosticdev
    Copy link
    Mannequin

    agnosticdev mannequin commented Mar 30, 2018

    I was able to get some time together today and created a rough draft for the idea that you had Berker on introducing a new API with more strict parsing rules.

    This will allow the ValueError to be raised when the URL is parsed rather than when the computed property is used.

    Jonathan, this will help address your concern on the consistency of when the error message is raised.

    I added the rough draft on my Github account here:
    https://github.com/agnosticdev/cpython-apis/tree/master/url
    https://github.com/agnosticdev/cpython-apis/blob/master/url/url.py

    Please take a look and let me know what you think.
    You can contact me directly to continue the conversation at my email: agnosticdev@gmail.com

    @agnosticdev
    Copy link
    Mannequin

    agnosticdev mannequin commented Apr 3, 2018

    Wanted to check in on this to see if there was any feedback on this topic?

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.8 only security fixes stdlib Python modules in the Lib dir type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    2 participants