classification
Title: regression - mimetypes guess_type is confused by ; in the filename
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.9, Python 3.8, Python 3.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Yaroslav.Halchenko, corona10, kyleam, lukasz.langa, martin.panter, maxking, miss-islington, ned.deily
Priority: normal Keywords: 3.7regression, patch

Created on 2019-10-11 14:06 by Yaroslav.Halchenko, last changed 2019-12-12 15:24 by corona10. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 16724 merged maxking, 2019-10-12 00:42
PR 16725 closed miss-islington, 2019-10-12 05:41
PR 16727 merged maxking, 2019-10-12 15:56
PR 16728 merged maxking, 2019-10-12 16:04
PR 16729 merged corona10, 2019-10-12 17:04
PR 17431 merged miss-islington, 2019-12-01 23:06
PR 17432 merged miss-islington, 2019-12-01 23:07
Messages (24)
msg354455 - (view) Author: Yaroslav Halchenko (Yaroslav.Halchenko) Date: 2019-10-11 14:06
Our tests in DataLad started to fail while building on Debian with Python 3.7.5rc1 whenever they passed just fine previously with 3.7.3rc1. Analysis boiled down to mimetypes

    $> ./python3.9 -c 'import mimetypes; mimedb = mimetypes.MimeTypes(strict=False); print(mimedb.guess_type(";1.tar.gz"))'                           
    (None, None)
    
    $> ./python3.9 -c 'import mimetypes; mimedb = mimetypes.MimeTypes(strict=False); print(mimedb.guess_type("1.tar.gz"))' 
    ('application/x-tar', 'gzip')
    
    $> git describe
    v3.8.0b1-1174-g2b7dc40b2af


Ref: 

- original issue in DataLad: https://github.com/datalad/datalad/issues/3769
msg354456 - (view) Author: Yaroslav Halchenko (Yaroslav.Halchenko) Date: 2019-10-11 14:08
FWIW, our more complete test filename is 

# python3 -c 'import patoolib.util as ut; print(ut.guess_mime(r" \"\`;b&b&c |.tar.gz"))'
(None, None)

which works fine with older versions
msg354465 - (view) Author: Kyle Meyer (kyleam) Date: 2019-10-11 16:13
I've performed a bisect the issue with the following script:

    #!/bin/sh
    make -j3 || exit 125
    ./python <<\EOF || exit 1
    import sys
    import mimetypes
    res = mimetypes.MimeTypes(strict=False).guess_type(";1.tar.gz")
    if res[0] is None:
        sys.exit(1)
    EOF

That points to 87bd2071c7 (bpo-22347: Update mimetypes.guess_type to allow proper parsing of URLs (GH-15522), 2019-09-05).  That commit was included in 3.7.5rc1 when it was cherry picked by 8873bff287.
msg354470 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2019-10-11 17:17
Marking as regression release blocker for 3.7.5 final and 3.8.0 final.
msg354488 - (view) Author: Abhilash Raj (maxking) * (Python committer) Date: 2019-10-11 20:11
I am looking into the issue.
msg354500 - (view) Author: Abhilash Raj (maxking) * (Python committer) Date: 2019-10-11 21:26
The bug is interesting due to some of the implementation details of "guess_type". The documentation says that it can parse either a URL or a filename.

Switching from urllib.parse._splittype to urllib.parse.urlparse changed what a valid "path" is. _splittype doesn't care about the rest of the URL except the scheme, but urlparse does. Previously, we used to split things like:

   >>> print(urllib.parse._splittype(';1.tar.gz')
   (None, ';1.tar.gz')

Then, we'd just treat the 2nd part as a filesystem path, which would rightfully guess the extension as .tar.gz

However, switching to using parsing via urllib.parse.urlparse, we get:

    >>> print(urllib.parse.urlparse(';1.tar.gz')
    ParseResult(scheme='', netloc='', path='', params='1.tar.gz', query='', fragment='')

And then we get the ".path" attribute for further processing, which being empty, returns (None, None).

The format of all these parts is:

    scheme://netloc/path;parameters?query#fragment

A simple fix would be to just merge path, parameters, query and fragment together (with appropriate delimiters) and the proceed with further processing. That would fix parsing of Filesystem paths but would break (again) parsing of URLs like:

    >>> mimetypes.guess_type('http://example.com/index.html;1.tar.gz')
    ('application/x-tar', 'gzip')

It should return 'text/html' as the type, since this is a URL and everything after the ';' should not be used to determine the mimetype. But, if there is no scheme provided, we should treat it as a filesystem path and in that case 'application/x-tar' is the right type.

I hope I am not confusing everyone here. 

The right fix IMO would be to make "guess_type" not treat URLs and filesytem paths alike.
msg354505 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2019-10-11 21:38
Thanks for looking into this, @maxking. With both 3.8.0 final and 3.7.5 final scheduled for just a few days away, I wonder if the best thing to do at this point is to revert them and work on a more robust fix targeted for the next maintenance releases since the original issue was not identified as being a security issue or otherwise critical.
msg354513 - (view) Author: Abhilash Raj (maxking) * (Python committer) Date: 2019-10-12 00:38
Yeah, I agree. I'll submit a PR for reverting the commits.
msg354522 - (view) Author: miss-islington (miss-islington) Date: 2019-10-12 05:41
New changeset 19a3d873005e5730eeabdc394c961e93f2ec02f0 by Miss Islington (bot) (Abhilash Raj) in branch 'master':
bpo-38449: Revert "bpo-22347: Update mimetypes.guess_type to allow oper parsing of URLs (GH-15522)" (GH-16724)
https://github.com/python/cpython/commit/19a3d873005e5730eeabdc394c961e93f2ec02f0
msg354523 - (view) Author: Dong-hee Na (corona10) * (Python triager) Date: 2019-10-12 06:04
I'd like to suggest add unit test for the report case.
So that we can detect future regression issue :)
msg354524 - (view) Author: Dong-hee Na (corona10) * (Python triager) Date: 2019-10-12 06:13
And I aplogize for my patch which makes regrssion issue.
msg354536 - (view) Author: Abhilash Raj (maxking) * (Python committer) Date: 2019-10-12 16:42
corona10: That's okay, it happens. I missed it too. There was really no way to foresee all the use cases, which is why we have beta and rc period to catch bugs.

Yes, we should add a test case definitely, do you want to work on a PR?
msg354537 - (view) Author: Dong-hee Na (corona10) * (Python triager) Date: 2019-10-12 16:49
> Yes, we should add a test case definitely, do you want to work on a PR?

Sure, I want to finalize this issue :)
msg354539 - (view) Author: Abhilash Raj (maxking) * (Python committer) Date: 2019-10-12 16:58
New changeset 5a638a805503131f4a9cc2bbc5944611295c1500 by Abhilash Raj in branch '3.8':
[3.8] bpo-38449: Revert "bpo-22347: Update mimetypes.guess_type to allow oper parsing of URLs" (GH-16724) (GH-16728)
https://github.com/python/cpython/commit/5a638a805503131f4a9cc2bbc5944611295c1500
msg354544 - (view) Author: miss-islington (miss-islington) Date: 2019-10-12 18:50
New changeset 164bee296ab1f87cc05566b39ee8fb9fb64b3e5a by Miss Islington (bot) (Abhilash Raj) in branch '3.7':
[3.7] bpo-38449: Revert "bpo-22347: Update mimetypes.guess_type to allow oper parsing of URLs (GH-15685)" (GH-16724) (GH-16727)
https://github.com/python/cpython/commit/164bee296ab1f87cc05566b39ee8fb9fb64b3e5a
msg354545 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2019-10-12 18:52
Thanks everyone for the quick action on this!
msg354546 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2019-10-12 18:52
(On second thought, I'll leave this open as a release blocker until we've cherry-picked the fixes for 3.8.0 final and 3.7.5 final.)
msg354665 - (view) Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2019-10-14 21:45
(3.8.0 is released with this fix)
msg354696 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2019-10-15 07:30
New changeset 2a405598bbccbc42710dc5ecf3d44c8de4c16582 by Ned Deily (Abhilash Raj) in branch '3.7':
[3.7] bpo-38449: Revert "bpo-22347: Update mimetypes.guess_type to allow oper parsing of URLs (GH-15685)" (GH-16724) (GH-16727)
https://github.com/python/cpython/commit/2a405598bbccbc42710dc5ecf3d44c8de4c16582
msg354705 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2019-10-15 07:45
(fix also released in 3.7.5)
msg357694 - (view) Author: Abhilash Raj (maxking) * (Python committer) Date: 2019-12-01 23:06
New changeset 2fe4c48917c2d1b40cf063c6ed22ae2e71f4cb62 by Abhilash Raj (Dong-hee Na) in branch 'master':
bpo-38449: Add URL delimiters test cases (#16729)
https://github.com/python/cpython/commit/2fe4c48917c2d1b40cf063c6ed22ae2e71f4cb62
msg357695 - (view) Author: miss-islington (miss-islington) Date: 2019-12-01 23:23
New changeset 926eabb6b46106e677d5e1ea25b7bab918da4110 by Miss Islington (bot) in branch '3.7':
bpo-38449: Add URL delimiters test cases (GH-16729)
https://github.com/python/cpython/commit/926eabb6b46106e677d5e1ea25b7bab918da4110
msg357696 - (view) Author: miss-islington (miss-islington) Date: 2019-12-01 23:24
New changeset 4f1eaf028058cc357030dfaa5e611c90662539f0 by Miss Islington (bot) in branch '3.8':
bpo-38449: Add URL delimiters test cases (GH-16729)
https://github.com/python/cpython/commit/4f1eaf028058cc357030dfaa5e611c90662539f0
msg358302 - (view) Author: Dong-hee Na (corona10) * (Python triager) Date: 2019-12-12 15:24
@ned.deily @maxking
I close this issue since all PRs were merged.
Thanks, everyone for actions for this issue :)

Have a warm and happy holiday and a hopeful new year.
History
Date User Action Args
2019-12-12 15:24:35corona10setstatus: open -> closed
resolution: fixed
messages: + msg358302

stage: patch review -> resolved
2019-12-01 23:24:21miss-islingtonsetmessages: + msg357696
2019-12-01 23:23:36miss-islingtonsetmessages: + msg357695
2019-12-01 23:07:11miss-islingtonsetpull_requests: + pull_request16911
2019-12-01 23:06:42miss-islingtonsetpull_requests: + pull_request16910
2019-12-01 23:06:39maxkingsetmessages: + msg357694
2019-10-15 07:45:01ned.deilysetmessages: + msg354705
2019-10-15 07:30:25ned.deilysetmessages: + msg354696
2019-10-14 21:45:09lukasz.langasetpriority: release blocker -> normal

messages: + msg354665
2019-10-14 12:43:44vstinnersetnosy: - vstinner
2019-10-12 18:52:59ned.deilysetpriority: normal -> release blocker

messages: + msg354546
2019-10-12 18:52:00ned.deilysetpriority: release blocker -> normal

messages: + msg354545
2019-10-12 18:50:07miss-islingtonsetmessages: + msg354544
2019-10-12 17:04:04corona10setpull_requests: + pull_request16311
2019-10-12 16:58:15maxkingsetmessages: + msg354539
2019-10-12 16:49:08corona10setmessages: + msg354537
2019-10-12 16:42:59maxkingsetmessages: + msg354536
2019-10-12 16:04:59maxkingsetpull_requests: + pull_request16309
2019-10-12 15:56:15maxkingsetpull_requests: + pull_request16308
2019-10-12 06:13:56corona10setmessages: + msg354524
2019-10-12 06:04:23corona10setnosy: + corona10
messages: + msg354523
2019-10-12 05:41:53miss-islingtonsetpull_requests: + pull_request16304
2019-10-12 05:41:50miss-islingtonsetnosy: + miss-islington
messages: + msg354522
2019-10-12 00:42:58maxkingsetkeywords: + patch
stage: patch review
pull_requests: + pull_request16302
2019-10-12 00:38:51maxkingsetmessages: + msg354513
2019-10-11 21:38:52ned.deilysetmessages: + msg354505
2019-10-11 21:26:30maxkingsetmessages: + msg354500
2019-10-11 20:11:09maxkingsetnosy: + maxking
messages: + msg354488
2019-10-11 17:17:31ned.deilysetpriority: normal -> release blocker

nosy: + ned.deily, martin.panter, lukasz.langa, vstinner
messages: + msg354470

keywords: + 3.7regression
2019-10-11 16:13:34kyleamsetnosy: + kyleam
messages: + msg354465
2019-10-11 14:08:39Yaroslav.Halchenkosetmessages: + msg354456
2019-10-11 14:06:40Yaroslav.Halchenkocreate