Alternate RFC 3986 compliant URI parsing module #43453
Comments
Inspired by (and based on) Paul Jimenez's uriparse
Most of the module works with a URI subclass of tuple
The authority component is either None, or a
The function make_uri will create a URI string from the
The function split_uri accepts a string and returns a
The functions split_authority and make_authority are
The function parse_uri digs into the internal structure
The main parsing engine is still URIParser (much the
The parse() methods of the individual parsers are now
The module level 'schemes' attribute is a mapping from
urljoin has been renamed to join_uri to match the style
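The comment above is truncated, but the split/make naming suggests a pairing along these lines. This is a minimal sketch for illustration only: the regular expression is the one from RFC 3986 appendix B, while the function bodies are assumptions, not the attached module's actual code.

```python
import re

# RFC 3986 appendix B: split a URI reference into its five components
_URI_RE = re.compile(
    r'^(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?')

def split_uri(uri):
    """Return (scheme, authority, path, query, fragment); absent parts are None."""
    return _URI_RE.match(uri).groups()

def make_uri(scheme, authority, path, query, fragment):
    """Recombine components per RFC 3986 section 5.3."""
    result = ''
    if scheme is not None:
        result += scheme + ':'
    if authority is not None:
        result += '//' + authority
    result += path or ''
    if query is not None:
        result += '?' + query
    if fragment is not None:
        result += '#' + fragment
    return result

print(split_uri("http://example.com/a?b#c"))
print(make_uri(*split_uri("mailto:x@y")))
```

The pairing round-trips: make_uri applied to the output of split_uri reproduces the original string, which is the property RFC 3986 section 5.3 guarantees for its recomposition algorithm.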
Logged In: YES
Updated version attached which addresses some issues raised
Also settled on split/join for the component separation and
Based on the terminology in the RFC, the function to combine
Uploaded version 0.3 which passes all the RFC tests, as well
The last 4suite failure went away when I realised those
Uploaded version 0.4
This version cleans up the logic in resolve_uripath a bit
It also exposes EmailPath (along with split_emailpath and
Removed all versions prior to 0.4
I'll collect open issues that would be solved by this.
The code itself is no longer the hard part here (hence the easy tag).
The problem is the fact that getting something like this into the
What the issue really needs is someone to champion the benefits of
Now that it is available, it would also be worth looking at updating the
ISTM that gathering the issues where this would help is a good start,
I am willing to review this/work on it. But I wonder if this can be
A new way of parsing URIs. I have not reviewed it even after saying I would like to, but with the dependency issue resolved, I think it is good to look at it again, especially if it leads to some helpful approaches to parsing IRIs.
"accepted" is a little too strong for the current status of this :) I've removed the easy tag as well (making the case for this or something like it in the standard library is going to involve a fair bit of effort - the coding was actually the comparatively easy part).
Is this still relevant? Can’t the improvements make it into urllib.parse? |
I still like the higher level API concept, although I might not do it exactly as presented here any more. Independently of introducing a new parsing API, it would be worthwhile extracting the parsing tests from the attached module to make sure the *existing* parser can handle them all correctly. |
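One concrete way to start on that test extraction: the reference-resolution examples in RFC 3986 section 5.4.1 can be run directly against the existing urljoin. A minimal sketch of such a harness (the expected values are the ones listed in the RFC, not taken from the attached module):

```python
from urllib.parse import urljoin

# Normal reference-resolution examples from RFC 3986, section 5.4.1
BASE = "http://a/b/c/d;p?q"
CASES = [
    ("g",       "http://a/b/c/g"),
    ("./g",     "http://a/b/c/g"),
    ("/g",      "http://a/g"),
    ("?y",      "http://a/b/c/d;p?y"),
    ("#s",      "http://a/b/c/d;p?q#s"),
    ("../g",    "http://a/b/g"),
    ("../../g", "http://a/g"),
]

for ref, expected in CASES:
    actual = urljoin(BASE, ref)
    status = "ok" if actual == expected else "MISMATCH"
    print(f"{status}: urljoin({BASE!r}, {ref!r}) = {actual!r}")
```

Any MISMATCH line is a candidate test case to add to test_urlparse, whether or not the new module ever lands.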
Sure, adding tests is a no-brainer. Regarding the module, I’m a bit reluctant. I see value in providing low-level building blocks (think OS calls) and high-level utilities for regular use, but here it seems that urllib.parse and urischemes are at the same level. I’m not opposed to the functionality itself—I would like to use a class simply named “URI” (and generally get better names, that is, RFC names instead of specific inventions), have component normalization and other such goodies—but I think the existing module can get fixes and improvements. I fear the confusion that could be caused by having two modules for the same task, unless you want to propose that the new module deprecate urllib.parse. Senthil, what is your opinion?
Just to be clear, even *I* don't think adding urischemes as it stands is a particularly great idea, and I wrote it. The only reason I haven't closed the issue is because I'd like to see it mined for additional tests in test_urlparse and perhaps even implementation or API enhancements in urllib.parse first. (The latter becomes a lot more likely if the urischemes implementation passes tests that urllib.parse fails.)

I also think, since I wrote this, the various urllib parsing methods were updated to return named tuple instances with properties, so a lot of the awkwardness of extracting partial values went away. (i.e. returning structured objects already raised the level of the urllib APIs from the "tuple-of-strings" level they used to be sitting at)

I do still assert that urischemes is slightly "higher level" than the current incarnation of similar functionality in urllib.parse. Uniform Resource Identifiers are more encompassing than Uniform Resource Locators and Uniform Resource Names, and the new APIs explicitly deal with both kinds of URI. There are subtle differences in the assumptions you're allowed to make when you may have a URN rather than a URL, so I believe the current module sometimes does the wrong thing when given one of the former. That said, it's been a long time since I've needed to remember the details, so I don't recall exactly where the current module gets URI handling wrong (or at least, did back in 2006). The intro to RFC 3986 is a good place to start in learning the differences though - Sir Tim writes good docs :)
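As a small illustration of the URN/URL distinction mentioned above (an example added for clarity, not taken from the attached module): urlsplit will happily split a URN, but everything after the scheme lands in the path component, so URL-shaped assumptions about authority and hierarchical paths don't apply.

```python
from urllib.parse import urlsplit

# A URN has a scheme but no authority and no hierarchical path;
# urlsplit puts the entire namespace-specific string in .path
urn = urlsplit("urn:isbn:0451450523")
print(urn.scheme)   # 'urn'
print(urn.netloc)   # ''  (no authority component at all)
print(urn.path)     # 'isbn:0451450523'

# A URL of the same book has a real authority and a '/'-delimited path
url = urlsplit("http://example.com/books/0451450523")
print(url.netloc)   # 'example.com'
print(url.path)     # '/books/0451450523'
```

Code that unconditionally treats .path as a '/'-separated hierarchy, or assumes a non-empty .netloc, will misbehave on the URN case.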
Here's a slightly modified version of urischeme.py that can be run under Python 3 and compares its results with urllib.parse, printing out the mismatches. The major differences seem to be:
1) urischeme fills in the default port if it's not explicitly provided, e.g. http URLs have the port set to 80;
2) the path is returned as '/', not the empty string, for the URL http://host;
3) urllib.parse.urljoin() doesn't get rid of ./ and ../ in URLs.
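The first two differences are easy to see from the interpreter (the behavior shown is current urllib.parse; the urischeme side is as described above). Note that on modern Python 3 urljoin does remove dot segments per RFC 3986 section 5.2.4, so the third mismatch may no longer reproduce.

```python
from urllib.parse import urlsplit, urljoin

# 1) and 2): urllib.parse leaves defaults unfilled -- no port, empty path
parts = urlsplit("http://host")
print(parts.port)   # None  (urischeme would fill in 80 for http)
print(parts.path)   # ''    (urischeme would return '/')

# 3): dot-segment handling in urljoin; current Python collapses
# './' and '../' when resolving a relative reference
print(urljoin("http://host/a/b/", "../c"))   # http://host/a/c
```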