New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
regression - mimetypes guess_type is confused by ; in the filename #82630
Comments
Our tests in DataLad started to fail while building on Debian with Python 3.7.5rc1 whenever they passed just fine previously with 3.7.3rc1. Analysis boiled down to mimetypes
Ref:
|
FWIW, our more complete test filename is # python3 -c 'import patoolib.util as ut; print(ut.guess_mime(r" \"\`;b&b&c |.tar.gz"))' which works fine with older versions |
I've performed a bisect the issue with the following script:
That points to 87bd207 (bpo-22347: Update mimetypes.guess_type to allow proper parsing of URLs (GH-15522), 2019-09-05). That commit was included in 3.7.5rc1 when it was cherry picked by 8873bff. |
Marking as regression release blocker for 3.7.5 final and 3.8.0 final. |
I am looking into the issue. |
The bug is interesting due to some of the implementation details of "guess_type". The documentation says that it can parse either a URL or a filename. Switching from urllib.parse._splittype to urllib.parse.urlparse changed what a valid "path" is. _splittype doesn't care about the rest of the URL except the scheme, but urlparse does. Previously, we used to split things like: >>> print(urllib.parse._splittype(';1.tar.gz')
(None, ';1.tar.gz') Then, we'd just treat the 2nd part as a filesystem path, which would rightfully guess the extension as .tar.gz However, switching to using parsing via urllib.parse.urlparse, we get: >>> print(urllib.parse.urlparse(';1.tar.gz')
ParseResult(scheme='', netloc='', path='', params='1.tar.gz', query='', fragment='') And then we get the ".path" attribute for further processing, which being empty, returns (None, None). The format of all these parts is:
A simple fix would be to just merge path, parameters, query and fragment together (with appropriate delimiters) and the proceed with further processing. That would fix parsing of Filesystem paths but would break (again) parsing of URLs like: >>> mimetypes.guess_type('http://example.com/index.html;1.tar.gz')
('application/x-tar', 'gzip') It should return 'text/html' as the type, since this is a URL and everything after the ';' should not be used to determine the mimetype. But, if there is no scheme provided, we should treat it as a filesystem path and in that case 'application/x-tar' is the right type. I hope I am not confusing everyone here. The right fix IMO would be to make "guess_type" not treat URLs and filesytem paths alike. |
Thanks for looking into this, @maxking. With both 3.8.0 final and 3.7.5 final scheduled for just a few days away, I wonder if the best thing to do at this point is to revert them and work on a more robust fix targeted for the next maintenance releases since the original issue was not identified as being a security issue or otherwise critical. |
Yeah, I agree. I'll submit a PR for reverting the commits. |
I'd like to suggest add unit test for the report case. |
And I aplogize for my patch which makes regrssion issue. |
corona10: That's okay, it happens. I missed it too. There was really no way to foresee all the use cases, which is why we have beta and rc period to catch bugs. Yes, we should add a test case definitely, do you want to work on a PR? |
Sure, I want to finalize this issue :) |
Thanks everyone for the quick action on this! |
(On second thought, I'll leave this open as a release blocker until we've cherry-picked the fixes for 3.8.0 final and 3.7.5 final.) |
(3.8.0 is released with this fix) |
(fix also released in 3.7.5) |
@ned.deily @maxking Have a warm and happy holiday and a hopeful new year. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: