Alternate RFC 3986 compliant URI parsing module #43453

Open
ncoghlan opened this issue Jun 4, 2006 · 16 comments
Labels: stdlib (Python modules in the Lib dir), type-feature (A feature request or enhancement)

Comments

@ncoghlan
Contributor

ncoghlan commented Jun 4, 2006

BPO 1500504
Nosy @ncoghlan, @orsenthil, @devdanzin, @merwok, @ambv, @berkerpeksag, @vadmium
Files
  • urischemes.py: v 0.4 of the urischemes module
  • urischemes.py

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/orsenthil'
    closed_at = None
    created_at = <Date 2006-06-04.14:50:18.000>
    labels = ['type-feature', 'library']
    title = 'Alternate RFC 3986 compliant URI parsing module'
    updated_at = <Date 2015-01-01.01:25:35.044>
    user = 'https://github.com/ncoghlan'

    bugs.python.org fields:

    activity = <Date 2015-01-01.01:25:35.044>
    actor = 'berker.peksag'
    assignee = 'orsenthil'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)']
    creation = <Date 2006-06-04.14:50:18.000>
    creator = 'ncoghlan'
    dependencies = []
    files = ['7307', '32591']
    hgrepos = []
    issue_num = 1500504
    keywords = ['patch']
    message_count = 16.0
    messages = ['50411', '50412', '50413', '50414', '50415', '83920', '86301', '86308', '86348', '109963', '109966', '120245', '120264', '120344', '120349', '202728']
    nosy_count = 7.0
    nosy_names = ['ncoghlan', 'orsenthil', 'ajaksu2', 'eric.araujo', 'lukasz.langa', 'berker.peksag', 'martin.panter']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue1500504'
    versions = ['Python 3.5']

    @ncoghlan
    Contributor Author

    ncoghlan commented Jun 4, 2006

    Inspired by (and based on) Paul Jimenez's uriparse
    module (http://python.org/sf/1462525), urischemes tries
    to put a cleaner interface in front of the URI parsing
    engine.

    Most of the module works with a URI subclass of tuple
    that is always a 5-tuple (scheme, authority, path,
    query, fragment).

    The authority component is either None, or a
    URIAuthority subclass of tuple that is always a 4-tuple
    (user, password, host, port).
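
As a rough illustration of the two tuple shapes described above, here is a minimal sketch. It uses `collections.namedtuple` rather than the module's own tuple subclasses, so the class definitions and the example values are assumptions, not the attached code.

```python
from collections import namedtuple

# Hypothetical stand-ins for the module's tuple subclasses.
URI = namedtuple("URI", "scheme authority path query fragment")
URIAuthority = namedtuple("URIAuthority", "user password host port")

# The authority is either None or a 4-tuple.
authority = URIAuthority(user=None, password=None, host="example.com", port=8080)
uri = URI("http", authority, "/index.html", None, None)

# Both are still plain tuples, so positional unpacking keeps working.
scheme, auth, path, query, fragment = uri
```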

    The function make_uri will create a URI string from the
    5 constituent components of a URI. The components do
    not need to be strings - if they are not strings, str()
    will be invoked on them (this allows the URIAuthority
    tuple subclass to be used transparently instead of a
    string for the authority component). The result is
    checked to ensure it is an RFC-compliant URI.
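
A minimal sketch of this kind of recomposition, following the component order in RFC 3986 section 5.3. The signature and the `HostPort` helper are assumptions for illustration, and the RFC-compliance check mentioned above is left out.

```python
def make_uri(scheme, authority, path, query, fragment):
    """Recompose a URI string; None means the component is absent."""
    result = ""
    if scheme is not None:
        result += f"{scheme}:"
    if authority is not None:
        # f-strings fall back to str(), so non-string objects work here.
        result += f"//{authority}"
    result += str(path)
    if query is not None:
        result += f"?{query}"
    if fragment is not None:
        result += f"#{fragment}"
    return result


class HostPort:
    """Hypothetical non-string authority that renders itself via __str__."""

    def __init__(self, host, port):
        self.host = host
        self.port = port

    def __str__(self):
        return f"{self.host}:{self.port}"
```

For example, `make_uri("http", HostPort("example.com", 8080), "/", None, None)` produces `http://example.com:8080/` without the caller converting the authority first.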

    The function split_uri accepts a string and returns a
    URI object with strings as the individual elements.
    Invoking str() on this object will recreate a URI
    string using make_uri(). The regex underlying this
    operation is now broken out and available as module
    level attributes like URI_PATTERN.
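
A sketch of regex-based splitting, assuming the pattern from RFC 3986 Appendix B. The name `URI_PATTERN` comes from the description above, but the pattern in the attached module may differ, and the `URI` tuple here is a namedtuple stand-in.

```python
import re
from collections import namedtuple

URI = namedtuple("URI", "scheme authority path query fragment")

# Appendix B regex from RFC 3986, with the delimiter groups made non-capturing.
URI_PATTERN = re.compile(
    r"^(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?$"
)


def split_uri(text):
    """Split a URI string into its five components (None when absent)."""
    match = URI_PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a URI reference: {text!r}")
    return URI(*match.groups())
```

Absent components come back as `None` while an empty path comes back as `""`, which keeps "no query" distinguishable from "empty query".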

    The functions split_authority and make_authority are
    similar, only working solely on the authority component
    rather than the whole URI.
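
One possible way to split an authority string into the 4-tuple, shown only as a sketch. The parsing rules below (last `@` separates userinfo, bracketed IPv6 literals are kept whole, the port stays a string) are assumptions rather than the module's documented behaviour.

```python
from collections import namedtuple

URIAuthority = namedtuple("URIAuthority", "user password host port")


def split_authority(text):
    """Split 'user:password@host:port' into its four parts."""
    userinfo, at, hostport = text.rpartition("@")
    user = password = None
    if at:
        user, colon, password = userinfo.partition(":")
        if not colon:
            password = None
    host, colon, port = hostport.rpartition(":")
    # No colon at all, or the last colon sits inside an IPv6 literal.
    if not colon or "]" in port:
        host, port = hostport, None
    return URIAuthority(user, password, host, port or None)
```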

    The function parse_uri digs into the internal structure
    of a URI, also parsing the components. This will
    replace a non-empty URI authority component string with
    a URIAuthority tuple subclass. Depending on the scheme,
    it may also replace other components (e.g. for mailto
    links, the path is replaced with a (user, host) tuple
    subclass).

    The main parsing engine is still URIParser (much the
    same as Paul's), but the root of the internal parser
    hierarchy is now SchemeParser. This has two subclasses,
    URLParser and MailtoParser. The various URL flavours
    are now different instances of URLParser rather than
    subclasses. All of the actual parsers are available as
    module level attributes with the same name as the
    scheme they parse. Additionally, each parser knows the
    name of the scheme it is intended to parse.

    The parse() methods of the individual parsers are now
    expected to return a URI object (SchemeParser actually
    takes care of this). The parse() method also takes a
    dictionary of defaults, which can override the defaults
    supplied by the parser instance. The unparse() method
    is gone - instead, the scheme parser should ensure that
    all components returned are either strings or produce
    the right thing when __str__ is invoked (e.g. see
    _MailtoURIPath).

    The module level 'schemes' attribute is a mapping from
    scheme names to parsers that is automatically populated
    with all instances of SchemeParser that are found in
    the module globals().
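
The registry idea can be sketched as follows. The class bodies are empty placeholders and `find_parsers` is a hypothetical helper; only the class names and the "scan globals() for SchemeParser instances" approach come from the description above.

```python
class SchemeParser:
    """Placeholder root class; each parser knows the scheme it handles."""

    def __init__(self, name):
        self.name = name


class URLParser(SchemeParser):
    pass


class MailtoParser(SchemeParser):
    pass


# Parsers are module-level attributes named after their scheme.
http = URLParser("http")
ftp = URLParser("ftp")
mailto = MailtoParser("mailto")


def find_parsers(namespace):
    """Map scheme names to every SchemeParser instance in a namespace."""
    return {
        obj.name: obj
        for obj in list(namespace.values())
        if isinstance(obj, SchemeParser)
    }


schemes = find_parsers(globals())
```

Copying the values into a list first avoids iterating over `globals()` while the module is still adding names to it.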

    urljoin has been renamed to join_uri to match the style
    of the other names in the module.

    @ncoghlan
    Contributor Author

    ncoghlan commented Jun 5, 2006

    Logged In: YES
    user_id=1038590

    Updated version attached which addresses some issues raised
    by Mike Brown in private mail (the difference between a URI
    and a URI reference and some major differences between URI
    paths and posix paths).

    Also settled on split/join for the component separation and
    recombination operations and made the join methods all take
    a tuple so that join_x(split_x(uri)) round trips.
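
The round-trip property can be illustrated with the standard library's existing split/unsplit pair. This uses `urllib.parse`, not the urischemes functions, and the example URI is made up.

```python
from urllib.parse import urlsplit, urlunsplit

uri = "http://user@example.com:8080/a/b?x=1#top"

# Split into a 5-tuple, then hand the same tuple back to the join function.
parts = urlsplit(uri)
rebuilt = urlunsplit(parts)
```

Here `rebuilt == uri`, which is the same shape of guarantee that `join_x(split_x(uri))` is meant to give.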

    Based on the terminology in the RFC, the function to combine
    a URI reference with a base URI is now called "resolve_uriref".
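
For readers unfamiliar with reference resolution, the sketch below resolves a few of the reference examples from RFC 3986 section 5.4.1 against the RFC's base URI. It uses the standard library's `urllib.parse.urljoin` as a stand-in, since the signature of `resolve_uriref` is not shown in this thread.

```python
from urllib.parse import urljoin

BASE = "http://a/b/c/d;p?q"  # base URI used by the RFC 3986 examples

resolved = {
    ref: urljoin(BASE, ref)
    for ref in ("g", "../g", "?y", "#s")
}
```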

    @ncoghlan
    Contributor Author

    ncoghlan commented Jun 6, 2006

    Logged In: YES
    user_id=1038590

    Uploaded version 0.3 which passes all the RFC tests, as well
    as the failing 4Suite tests Mike sent me based on version
    0.1 and 0.2.

    The last 4Suite failure went away when I realised those
    tests expected to operate in strict mode :)

    @ncoghlan
    Contributor Author

    ncoghlan commented Jun 8, 2006

    Logged In: YES
    user_id=1038590

    Uploaded version 0.4

    This version cleans up the logic in resolve_uripath a bit
    (use a separate loop to strip the leading dot segments, add
    comments explaining the meaning of the if statements when dealing
    with dot segments).
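
For context, dot-segment removal as specified in RFC 3986 section 5.2.4 can be sketched like this. It is a direct transcription of the RFC's steps, not the logic of `resolve_uripath` in the attached module (which, per the comment, strips leading dot segments in a separate loop).

```python
def remove_dot_segments(path):
    """Remove '.' and '..' segments following RFC 3986 section 5.2.4."""
    output = []
    while path:
        if path.startswith("../"):
            path = path[3:]              # step A: drop leading "../"
        elif path.startswith("./"):
            path = path[2:]              # step A: drop leading "./"
        elif path.startswith("/./"):
            path = path[2:]              # step B: "/./" becomes "/"
        elif path == "/.":
            path = "/"                   # step B: trailing "/." becomes "/"
        elif path.startswith("/../"):
            path = path[3:]              # step C: "/../" becomes "/" ...
            if output:
                output.pop()             # ... and the last segment is dropped
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""                    # step D: lone dot segment
        else:
            # step E: move the first segment (with its leading "/") to output
            end = path.find("/", 1)
            if end == -1:
                end = len(path)
            output.append(path[:end])
            path = path[end:]
    return "".join(output)
```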

    It also exposes EmailPath (along with split_emailpath and
    join_emailpath) as public objects, rather than treating them
    as internal to the MailtoSchemeParser.
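
A minimal sketch of what such a public trio could look like, assuming a single address of the form `user@host`. The names match the comment above, but the bodies and error handling are guesses, and comma-separated recipient lists are not handled.

```python
from collections import namedtuple


class EmailPath(namedtuple("EmailPath", "user host")):
    """(user, host) tuple whose str() recreates the mailto path."""

    __slots__ = ()

    def __str__(self):
        return join_emailpath(self)


def split_emailpath(path):
    """Split 'user@host' at the last '@'."""
    user, at, host = path.rpartition("@")
    if not at:
        raise ValueError(f"no '@' in mailto path: {path!r}")
    return EmailPath(user, host)


def join_emailpath(parts):
    """Inverse of split_emailpath; accepts any (user, host) pair."""
    user, host = parts
    return f"{user}@{host}"
```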

    @ncoghlan
    Contributor Author

    Removed all versions prior to 0.4

    @devdanzin
    Mannequin

    devdanzin mannequin commented Mar 21, 2009

    I'll collect open issues that would be solved by this.

    @devdanzin (mannequin) added the stdlib and type-feature labels on Mar 21, 2009
    @devdanzin (mannequin) added the easy label on Apr 22, 2009
    @ncoghlan
    Contributor Author

    The code itself is no longer the hard part here (hence the easy tag).

    The problem is the fact that getting something like this into the
    standard library is a tough sell on python-dev because it isn't really a
    field tested module, but once people start downloading things from PyPI,
    they're more likely to go for something like 4Suite rather than a mere
    URI parsing module.

    What the issue really needs is someone to champion the benefits of
    having this in the standard library.

    Now that it is available, it would also be worth looking at updating the
    module to use collections.namedtuple instead of creating its own variant
    of the same thing.

    @devdanzin
    Mannequin

    devdanzin mannequin commented Apr 22, 2009

    ISTM that gathering the issues where this would help is a good start,
    but I haven't had the time to do it yet.

    @orsenthil
    Member

    I am willing to review this/work on it. But I wonder if this can be
    categorized as an easy task.

    1. Integration into the standard library will involve compatibility with
      existing parsing, which will invariably involve certain tweaks (with
      discussions/buy-in from others).

    2. There are other patches which try to achieve this purpose;
      consolidation is required.

    @orsenthil
    Member

    A new way of parsing URIs. I have not reviewed it even after saying I would like to, but with the dependency issue resolved, I think it is good to look at it again, especially if it leads to some helpful approaches to parsing IRIs.

    @ncoghlan
    Contributor Author

    "accepted" is a little too strong for the current status of this :)

    I've removed the easy tag as well (making the case for this or something like it in the standard library is going to involve a fair bit of effort - the coding was actually the comparatively easy part).

    @merwok
    Member

    merwok commented Nov 2, 2010

    Is this still relevant? Can’t the improvements make it into urllib.parse?

    @ncoghlan
    Contributor Author

    ncoghlan commented Nov 2, 2010

    I still like the higher level API concept, although I might not do it exactly as presented here any more.

    Independently of introducing a new parsing API, it would be worthwhile extracting the parsing tests from the attached module to make sure the *existing* parser can handle them all correctly.

    @merwok
    Member

    merwok commented Nov 3, 2010

    Sure, adding tests is a no-brainer.

    Regarding the module, I’m a bit reluctant. I see value in providing low-level building blocks (think OS calls) and high-level utilities for regular use, but here it seems that urllib.parse and urischemes are at the same level. I’m not opposed to the functionality itself—I would like to use a class simply named “URI” (and generally get better names, that is RFC names instead of specific inventions), have components normalization and such goodies—but I think the existing module can get fixes and improvements. I fear the confusion that could be caused by having two modules for the same task, unless you want to propose that the new module deprecate urllib.parse.

    Senthil, what is your opinion?

    @ncoghlan
    Contributor Author

    ncoghlan commented Nov 3, 2010

    Just to be clear, even *I* don't think adding urischemes as it stands is a particularly great idea, and I wrote it. The only reason I haven't closed the issue is because I'd like to see it mined for additional tests in test_urlparse and perhaps even implementation or API enhancements in urllib.parse first.

    (The latter becomes a lot more likely if the urischemes implementation passes tests that urllib.parse fails)

    I also think, since I wrote this, the various urllib parsing methods were updated to return named tuple instances with properties, so a lot of the awkwardness of extracting partial values went away. (i.e. returning structured objects already raised the level of the urllib APIs from the "tuple-of-strings" level they used to be sitting at)
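
A short illustration of the structured results that `urllib.parse` returns today. The URL is a made-up example; the attribute names are the ones provided by the standard library's `SplitResult`.

```python
from urllib.parse import urlsplit

parts = urlsplit("http://user:secret@example.com:8080/a/b?x=1#top")

# Named fields of the 5-tuple...
scheme, netloc, path = parts.scheme, parts.netloc, parts.path

# ...plus properties that dig into the network location.
host = parts.hostname
port = parts.port          # parsed to an int
user = parts.username
password = parts.password
```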

    I do still assert that urischemes is slightly "higher level" than the current incarnation of similar functionality in urllib.parse. Uniform Resource Identifiers are more encompassing than Uniform Resource Locators and Uniform Resource Names, and the new APIs explicitly deal with both kinds of URI. There are subtle differences in the assumptions you're allowed to make when you may have a URN rather than a URL, so I believe the current module sometimes does the wrong thing when given one of the former.

    That said, it's been a long time since I've needed to remember the details, so I don't recall exactly where the current module gets URI handling wrong (or at least, did back in 2006). The intro to RFC 3986 is a good place to start in learning the differences though - Sir Tim writes good docs :)

    @akuchling
    Member

    Here's a slightly modified version of urischemes.py that can be run under Python 3 and compares its results with urllib.parse, printing out the mismatches.

    The major differences seem to be:

    1. urischemes fills in the default port if it's not explicitly provided, e.g. http URLs have the port set to 80.
    2. The path is returned as '/', not the empty string, for the URL http://host.
    3. urllib.parse.urljoin() doesn't get rid of ./ and ../ in URLs.

    Item 1 seems like something worth fixing in urllib.parse. The others probably present some backward-compatibility issues.
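
The first two differences are easy to see from the `urllib.parse` side. The sketch below shows what `urlsplit` reports for `http://host`, together with a hypothetical `effective_port` helper; the `DEFAULT_PORTS` table is an assumption for illustration and is not taken from urischemes.

```python
from urllib.parse import urlsplit

# Assumed defaults for illustration only.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}


def effective_port(url):
    """Return the explicit port, else an assumed default for the scheme."""
    parts = urlsplit(url)
    if parts.port is not None:
        return parts.port
    return DEFAULT_PORTS.get(parts.scheme)


parts = urlsplit("http://host")
# urllib.parse leaves the port unset and the path empty for this URL.
port_is_unset = parts.port is None
path_is_empty = parts.path == ""
```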

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022