Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

regression - mimetypes guess_type is confused by ; in the filename #82630

Closed
YaroslavHalchenko mannequin opened this issue Oct 11, 2019 · 24 comments
Closed

regression - mimetypes guess_type is confused by ; in the filename #82630

YaroslavHalchenko mannequin opened this issue Oct 11, 2019 · 24 comments
Labels
3.7 (EOL) end of life 3.8 only security fixes 3.9 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@YaroslavHalchenko
Copy link
Mannequin

YaroslavHalchenko mannequin commented Oct 11, 2019

BPO 38449
Nosy @ned-deily, @ambv, @vadmium, @maxking, @corona10, @miss-islington, @kyleam
PRs
  • bpo-38449: Revert "bpo-22347: Update mimetypes.guess_type to allow oper parsing of URLs (GH-15522)" #16724
  • [3.7] bpo-38449: Revert "bpo-22347: Update mimetypes.guess_type to allow oper parsing of URLs (GH-15522)" (GH-16724) #16725
  • [3.7] bpo-38449: Revert "bpo-22347: Update mimetypes.guess_type to allow oper parsing of URLs (GH-15685)" (GH-16724) #16727
  • [3.8] bpo-38449: Revert "bpo-22347: Update mimetypes.guess_type to allow oper parsing of URLs (GH-15685)" (GH-16724) #16728
  • bpo-38449: Add URL delimiters test cases #16729
  • [3.7] bpo-38449: Add URL delimiters test cases (GH-16729) #17431
  • [3.8] bpo-38449: Add URL delimiters test cases (GH-16729) #17432
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2019-12-12.15:24:35.429>
    created_at = <Date 2019-10-11.14:06:40.678>
    labels = ['3.7', '3.8', 'type-bug', 'library', '3.9']
    title = 'regression - mimetypes guess_type is confused by ; in the filename'
    updated_at = <Date 2019-12-12.15:24:35.339>
    user = 'https://bugs.python.org/YaroslavHalchenko'

    bugs.python.org fields:

    activity = <Date 2019-12-12.15:24:35.339>
    actor = 'corona10'
    assignee = 'none'
    closed = True
    closed_date = <Date 2019-12-12.15:24:35.429>
    closer = 'corona10'
    components = ['Library (Lib)']
    creation = <Date 2019-10-11.14:06:40.678>
    creator = 'Yaroslav.Halchenko'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 38449
    keywords = ['patch', '3.7regression']
    message_count = 24.0
    messages = ['354455', '354456', '354465', '354470', '354488', '354500', '354505', '354513', '354522', '354523', '354524', '354536', '354537', '354539', '354544', '354545', '354546', '354665', '354696', '354705', '357694', '357695', '357696', '358302']
    nosy_count = 8.0
    nosy_names = ['ned.deily', 'Yaroslav.Halchenko', 'lukasz.langa', 'martin.panter', 'maxking', 'corona10', 'miss-islington', 'kyleam']
    pr_nums = ['16724', '16725', '16727', '16728', '16729', '17431', '17432']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue38449'
    versions = ['Python 3.7', 'Python 3.8', 'Python 3.9']

    @YaroslavHalchenko
    Copy link
    Mannequin Author

    YaroslavHalchenko mannequin commented Oct 11, 2019

    Our tests in DataLad started to fail while building on Debian with Python 3.7.5rc1 whenever they passed just fine previously with 3.7.3rc1. Analysis boiled down to mimetypes

    $> ./python3.9 -c 'import mimetypes; mimedb = mimetypes.MimeTypes(strict=False); print(mimedb.guess_type(";1.tar.gz"))'                           
    (None, None)
    
    $> ./python3.9 -c 'import mimetypes; mimedb = mimetypes.MimeTypes(strict=False); print(mimedb.guess_type("1.tar.gz"))' 
    ('application/x-tar', 'gzip')
    
    $> git describe
    v3.8.0b1-1174-g2b7dc40b2af
    

    Ref:

    @YaroslavHalchenko YaroslavHalchenko mannequin added 3.7 (EOL) end of life 3.8 only security fixes 3.9 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Oct 11, 2019
    @YaroslavHalchenko
    Copy link
    Mannequin Author

    YaroslavHalchenko mannequin commented Oct 11, 2019

    FWIW, our more complete test filename is

    # python3 -c 'import patoolib.util as ut; print(ut.guess_mime(r" \"\`;b&b&c |.tar.gz"))'
    (None, None)

    which works fine with older versions

    @kyleam
    Copy link
    Mannequin

    kyleam mannequin commented Oct 11, 2019

    I've performed a bisect the issue with the following script:

    #!/bin/sh
    make -j3 || exit 125
    ./python <<\EOF || exit 1
    import sys
    import mimetypes
    res = mimetypes.MimeTypes(strict=False).guess_type(";1.tar.gz")
    if res[0] is None:
        sys.exit(1)
    EOF
    

    That points to 87bd207 (bpo-22347: Update mimetypes.guess_type to allow proper parsing of URLs (GH-15522), 2019-09-05). That commit was included in 3.7.5rc1 when it was cherry picked by 8873bff.

    @ned-deily
    Copy link
    Member

    Marking as regression release blocker for 3.7.5 final and 3.8.0 final.

    @maxking
    Copy link
    Contributor

    maxking commented Oct 11, 2019

    I am looking into the issue.

    @maxking
    Copy link
    Contributor

    maxking commented Oct 11, 2019

    The bug is interesting due to some of the implementation details of "guess_type". The documentation says that it can parse either a URL or a filename.

    Switching from urllib.parse._splittype to urllib.parse.urlparse changed what a valid "path" is. _splittype doesn't care about the rest of the URL except the scheme, but urlparse does. Previously, we used to split things like:

       >>> print(urllib.parse._splittype(';1.tar.gz')
       (None, ';1.tar.gz')

    Then, we'd just treat the 2nd part as a filesystem path, which would rightfully guess the extension as .tar.gz

    However, switching to using parsing via urllib.parse.urlparse, we get:

        >>> print(urllib.parse.urlparse(';1.tar.gz')
        ParseResult(scheme='', netloc='', path='', params='1.tar.gz', query='', fragment='')

    And then we get the ".path" attribute for further processing, which being empty, returns (None, None).

    The format of all these parts is:

    scheme://netloc/path;parameters?query#fragment
    

    A simple fix would be to just merge path, parameters, query and fragment together (with appropriate delimiters) and the proceed with further processing. That would fix parsing of Filesystem paths but would break (again) parsing of URLs like:

        >>> mimetypes.guess_type('http://example.com/index.html;1.tar.gz')
        ('application/x-tar', 'gzip')

    It should return 'text/html' as the type, since this is a URL and everything after the ';' should not be used to determine the mimetype. But, if there is no scheme provided, we should treat it as a filesystem path and in that case 'application/x-tar' is the right type.

    I hope I am not confusing everyone here.

    The right fix IMO would be to make "guess_type" not treat URLs and filesytem paths alike.

    @ned-deily
    Copy link
    Member

    Thanks for looking into this, @maxking. With both 3.8.0 final and 3.7.5 final scheduled for just a few days away, I wonder if the best thing to do at this point is to revert them and work on a more robust fix targeted for the next maintenance releases since the original issue was not identified as being a security issue or otherwise critical.

    @maxking
    Copy link
    Contributor

    maxking commented Oct 12, 2019

    Yeah, I agree. I'll submit a PR for reverting the commits.

    @miss-islington
    Copy link
    Contributor

    New changeset 19a3d87 by Miss Islington (bot) (Abhilash Raj) in branch 'master':
    bpo-38449: Revert "bpo-22347: Update mimetypes.guess_type to allow oper parsing of URLs (GH-15522)" (GH-16724)
    19a3d87

    @corona10
    Copy link
    Member

    I'd like to suggest add unit test for the report case.
    So that we can detect future regression issue :)

    @corona10
    Copy link
    Member

    And I aplogize for my patch which makes regrssion issue.

    @maxking
    Copy link
    Contributor

    maxking commented Oct 12, 2019

    corona10: That's okay, it happens. I missed it too. There was really no way to foresee all the use cases, which is why we have beta and rc period to catch bugs.

    Yes, we should add a test case definitely, do you want to work on a PR?

    @corona10
    Copy link
    Member

    Yes, we should add a test case definitely, do you want to work on a PR?

    Sure, I want to finalize this issue :)

    @maxking
    Copy link
    Contributor

    maxking commented Oct 12, 2019

    New changeset 5a638a8 by Abhilash Raj in branch '3.8':
    [3.8] bpo-38449: Revert "bpo-22347: Update mimetypes.guess_type to allow oper parsing of URLs" (GH-16724) (GH-16728)
    5a638a8

    @miss-islington
    Copy link
    Contributor

    New changeset 164bee2 by Miss Islington (bot) (Abhilash Raj) in branch '3.7':
    [3.7] bpo-38449: Revert "bpo-22347: Update mimetypes.guess_type to allow oper parsing of URLs (GH-15685)" (GH-16724) (GH-16727)
    164bee2

    @ned-deily
    Copy link
    Member

    Thanks everyone for the quick action on this!

    @ned-deily
    Copy link
    Member

    (On second thought, I'll leave this open as a release blocker until we've cherry-picked the fixes for 3.8.0 final and 3.7.5 final.)

    @ambv
    Copy link
    Contributor

    ambv commented Oct 14, 2019

    (3.8.0 is released with this fix)

    @ned-deily
    Copy link
    Member

    New changeset 2a40559 by Ned Deily (Abhilash Raj) in branch '3.7':
    [3.7] bpo-38449: Revert "bpo-22347: Update mimetypes.guess_type to allow oper parsing of URLs (GH-15685)" (GH-16724) (GH-16727)
    2a40559

    @ned-deily
    Copy link
    Member

    (fix also released in 3.7.5)

    @maxking
    Copy link
    Contributor

    maxking commented Dec 1, 2019

    New changeset 2fe4c48 by Abhilash Raj (Dong-hee Na) in branch 'master':
    bpo-38449: Add URL delimiters test cases (bpo-16729)
    2fe4c48

    @miss-islington
    Copy link
    Contributor

    New changeset 926eabb by Miss Islington (bot) in branch '3.7':
    bpo-38449: Add URL delimiters test cases (GH-16729)
    926eabb

    @miss-islington
    Copy link
    Contributor

    New changeset 4f1eaf0 by Miss Islington (bot) in branch '3.8':
    bpo-38449: Add URL delimiters test cases (GH-16729)
    4f1eaf0

    @corona10
    Copy link
    Member

    @ned.deily @maxking
    I close this issue since all PRs were merged.
    Thanks, everyone for actions for this issue :)

    Have a warm and happy holiday and a hopeful new year.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 (EOL) end of life 3.8 only security fixes 3.9 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    5 participants