classification
Title: mimetypes for python 3.7.5 fails to detect matroska video
Type: behavior Stage:
Components: Library (Lib) Versions: Python 3.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: The Compiler, ammar2, dhess, michael-lazar, r.david.murray, steve.dower, toonn, xtreak
Priority: normal Keywords:

Created on 2019-10-31 19:05 by toonn, last changed 2020-07-17 13:04 by dhess.

Messages (13)
msg355763 - (view) Author: (toonn) Date: 2019-10-31 19:05
A user reported an error to us which seems to derive from the ``mimetypes`` library failing to guess the mime type for ``.mkv`` matroska video files:
https://github.com/ranger/ranger/issues/1744#issuecomment-548514373

This is a regression: in earlier releases the same query successfully identifies the filename as being of the ``video/x-matroska`` mime type.
msg355781 - (view) Author: Karthikeyan Singaravelan (xtreak) * (Python committer) Date: 2019-11-01 02:23
I couldn't find mkv in mimetypes with search. Can you please post the output of the mimetypes query in 3.7.4 and 3.7.5 for the regression? In the attached GitHub issue the user reports mkv returns None and mp4 is detected.
msg355782 - (view) Author: Ammar Askar (ammar2) * (Python triager) Date: 2019-11-01 02:38
This is what I get on master; I will try 3.7.5+ as noted in the GitHub issue:


Python 3.9.0a0 (heads/noopt-dirty:f3b170812d, Oct  1 2019, 20:15:53) [MSC v.1916 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import mimetypes
>>> print(mimetypes.guess_type('E01.mkv'))
('video/x-matroska', None)
msg355841 - (view) Author: (toonn) Date: 2019-11-01 20:44
The result is the same for 3.7.4, on my mac.
msg356106 - (view) Author: Karthikeyan Singaravelan (xtreak) * (Python committer) Date: 2019-11-06 09:13
It seems that there is a list of files from which the mime types are also added at https://github.com/python/cpython/blob/5c0c325453a175350e3c18ebb10cc10c37f9595c/Lib/mimetypes.py#L42. "video/x-matroska" is not present in CPython repo's list of suffixes so it should be inferring from the list of known files. Can you please run the below script on 3.7.4 and 3.7.5 on the same machine? I am using Mac and 3.7.4 and 3.7.5 report video/x-matroska correctly.

import mimetypes
print(mimetypes.guess_type('E01.mkv'))
print(mimetypes.types_map['.mkv'])
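
To narrow down where an extension such as ``.mkv`` comes from, it may also help to check which of the files in mimetypes.knownfiles exist on the machine and which of them define it. A minimal sketch using only the standard library; the helper name files_defining is made up for illustration and is not part of the module:

```python
import mimetypes
import os


def files_defining(ext):
    """Return the existing mimetypes.knownfiles entries that map `ext`."""
    hits = []
    for fn in mimetypes.knownfiles:
        if not os.path.isfile(fn):
            continue
        mt = mimetypes.MimeTypes()
        # Drop the built-in defaults so only entries from this file remain.
        for mapping in mt.types_map:
            mapping.clear()
        mt.read(fn)
        if any(ext in mapping for mapping in mt.types_map):
            hits.append(fn)
    return hits


print(files_defining('.mkv'))
```

An empty list would suggest that no mime.types file on the host defines the extension, in which case the module can only know it from its built-in table (or, on Windows, the registry).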
msg356853 - (view) Author: Florian Bruhin (The Compiler) * Date: 2019-11-18 09:47
I'm seeing the same in ranger and I'm currently trying to debug this - I'm still
not quite sure what I'm seeing as there seem to be various issues/weirdnesses
which overlap each other.

This strikes me as odd:

>>> import mimetypes
>>> mimetypes.guess_type('E01.mkv')
('video/x-matroska', None)
>>> mimetypes.types_map['.mkv']
'video/x-matroska'

>>> mt = mimetypes.MimeTypes()
>>> mt.guess_type('E01.mkv')
(None, None)
>>> mt.types_map
({'.rtf': 'application/rtf', [redacted for brevity]}, {'.js': 'application/javascript', [redacted for brevity]})
>>> mt.types_map[0]['.mkv']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: '.mkv'
>>> mt.types_map[1]['.mkv']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: '.mkv'

The Python documentation claims: "This class represents a MIME-types database.
By default, it provides access to the same database as the rest of this module.
The initial database is a copy of that provided by the module" - yet that
apparently isn't the case.

I see this with both Python 3.7.5 and 3.8.0, but with 3.6.9 I get the correct
output for both module- and class-level access.
msg356861 - (view) Author: Florian Bruhin (The Compiler) * Date: 2019-11-18 11:45
I now bisected this with the following script:

#!/bin/bash
git clean -dxf
./configure || exit 125
make -j2 || exit 125
output=$(./python -c "import mimetypes; mt = mimetypes.MimeTypes(); print(mt.guess_type('E01.mkv')[0])")
echo "$output"
echo "$(git describe) $output" >> ../bisect-results.txt
[[ $output == None ]] && exit 1 || exit 0

This shows 9fc720e5e4f772598013ea48a3f0d22b2b6b04fa as the commit which broke this (bpo-4963, GH-3062).
msg356865 - (view) Author: David K. Hess (dhess) * Date: 2019-11-18 13:23
Hi, I'm the author of the commit that's been fingered. Some comments about the behavior being reported....

First, as pointed out by @xtreak, the mimetypes module does use mimetypes files present on the platform to add to the built-in list of mimetypes. In this case, "video/x-matroska" and ".mkv" are not found in the mimetypes module and were never there - they are coming from the host OS.

Also, for better or worse, the mimetypes module has an internal "init" method that does more than just instantiate a MimeTypes instance for default use:

https://github.com/python/cpython/blob/5c0c325453a175350e3c18ebb10cc10c37f9595c/Lib/mimetypes.py#L345

It also loads in these system files (and also Windows Registry entries on Win32) into a fresh MimeTypes instance. So, addressing what @The Compiler is seeing, properly resetting the mimetypes module really involves calling mimetypes.init(). By historical design, instantiating a MimeTypes class instance directly will not use host OS system mime type files.

As to why this commit is causing a change in the observed behavior, the problem that was corrected in this commit was that the mimetypes module had non-deterministic behavior related to initialization. In the original init code, the module level mime types tables are changed (really corrupted) after first load and you can never reinitialize the module back to a known good state (i.e. to original module defaults without information from the host OS system).

So, realistically, the behavior currently observed is the correct behavior given the presence and historical nature of the init function. The fact that a fresh MimeTypes instance, without having been init()'d or with no filenames provided, returned an OS entry prior to this commit was really part of the initialization bug which was fixed.

Regarding the ranger bug, the main thing is you should not use a MimeTypes instance directly unless you run it through the same initializations that the init code does.
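
As a rough sketch of what "the same initializations" could look like for a standalone instance, based on the init code linked above: read the Windows registry where applicable, then read every existing file in mimetypes.knownfiles. The helper name make_system_mimetypes is hypothetical and not part of the module.

```python
import mimetypes
import os
import sys


def make_system_mimetypes():
    """Build a MimeTypes instance that also includes host OS definitions."""
    mt = mimetypes.MimeTypes()
    # On Windows, init() also pulls entries from the registry.
    if sys.platform == "win32":
        mt.read_windows_registry()
    # Then it reads every known mime.types file that exists on the host.
    for fn in mimetypes.knownfiles:
        if os.path.isfile(fn):
            mt.read(fn)
    return mt


mt = make_system_mimetypes()
print(mt.guess_type('E01.mkv'))
```

Whether this prints a Matroska type still depends on the host actually providing a definition for ".mkv"; the sketch only makes the instance see the same sources as the module-level functions.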

Anyway, that's my perspective having waded through all of that during the original BPO. I don't claim it's the correct one but that's where we are at.
msg356866 - (view) Author: Florian Bruhin (The Compiler) * Date: 2019-11-18 13:29
Ah, I think I see what's happening now.

Before that commit, when doing "mt = mimetypes.MimeTypes()", its self.types_map is populated as follows:

- Its __init__ method calls the mimetypes.init() function.
- That then reads all the files in mimetypes.knownfiles into a temporary MimeTypes object
- The resulting types_map is saved as a module global (mimetypes.types_map).
- The __init__ of our "mt" object continues and picks up all the types from that global types_map.

After the change, instead this happens:

- Its __init__ method calls the mimetypes.init() function.
- Like above, mimetypes.init() populates mimetypes.types_map
- However, MimeTypes.__init__ now uses _types_map_default instead of the (now reassigned) types_map, i.e. it never reads the entries from knownfiles.

In other words, it only picks up the hardcoded types in the module, but never reads the files it's (according to the documentation) supposed to read - thus the difference between using "mimetypes.guess_type('E01.mkv')" (which uses the correctly initialized global object) and using "mimetypes.MimeTypes().guess_type('E01.mkv')" (which doesn't know about mkv, as it's defined in one of the mime.types files, not hardcoded in the module).
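
One way to observe this difference on a given machine is to compare the module-level table with that of a fresh instance: extensions known only to the module are the ones that came from knownfiles (or, on Windows, the registry). A small sketch; the result depends on the Python version and on which mime.types files are present, so no output is shown:

```python
import mimetypes

mimetypes.init()  # make sure the module-level database is loaded

# Strict table behind the module-level guess_type().
module_map = mimetypes.types_map
# Strict table of a freshly created instance (index 1 is the strict dict).
instance_map = mimetypes.MimeTypes().types_map[1]

only_in_module = sorted(set(module_map) - set(instance_map))
print(only_in_module)
```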

As a workaround, this results in the same behavior as before:

import mimetypes
import os

# Read the system mime.types files into the instance explicitly.
mt = mimetypes.MimeTypes()
for fn in mimetypes.knownfiles:
    if os.path.isfile(fn):
        mt.read(fn)
msg356867 - (view) Author: Florian Bruhin (The Compiler) * Date: 2019-11-18 13:34
Ah, I only saw dhess' comment after already submitting mine.

> By historical design, instantiating a MimeTypes class instance directly will not use host OS system mime type files.

Yet that wasn't what happened before that commit, and it's also not the behaviour which was (and is) documented - from https://docs.python.org/3.6/library/mimetypes.html#mimetypes.MimeTypes

    By default, it provides access to the same database as the rest of this module. The initial database is a copy of that provided by the module, and may be extended by loading additional mime.types-style files into the database using the read() or readfp() methods. The mapping dictionaries may also be cleared before loading additional data if the default data is not desired.

    The optional filenames parameter can be used to cause additional files to be loaded “on top” of the default database.

You might be right in that the new behaviour is in some way more correct - but it's wildly backwards-incompatible, and it's contrary to everything the documentation says.

I've only skimmed over bpo-4963 though - maybe I'm missing something?
msg356868 - (view) Author: David K. Hess (dhess) * Date: 2019-11-18 14:28
The documentation you quoted does read to me as compatible? The database it is referring to is the one hardcoded in the module – not the one assembled from that and the host OS. But, maybe this is just the vagaries of language and perspective at play.

Anyway I do agree it is an unexpected behavior change from the perspective of a user of the MimeTypes class directly. To get the best context for this change, it's useful to run through the long history of the issue that drove it:

https://bugs.python.org/issue4963

Note that the discussion there never touched on the use case of instantiating a MimeTypes class directly, and there are apparently no test cases covering this particular scenario either. With no awareness of this perspective/use case, it didn't get directly addressed.

Perhaps all MimeTypes instances should auto-load system files unless a new __init__ param selects for this new "clean" behavior?
msg373795 - (view) Author: Michael Lazar (michael-lazar) * Date: 2020-07-17 03:14
Greetings,

I just encountered this issue [0] and I agree with the sentiment that the documentation is currently misleading.

Particularly,

> By default, it provides access to the same database as the rest of this module. The initial database is a copy of that provided by the module, and may be extended by loading additional mime.types-style files into the database using the read() or readfp() methods. The mapping dictionaries may also be cleared before loading additional data if the default data is not desired.

“as the rest of the module” implies to me that it should behave the same way as mimetypes.guess_type() does. The documentation only has one other reference to this built-in list of mimetypes, and the default values are hidden behind underscored variable names. I would re-word this as

"By default, it provides access to a database of well-known values defined internally by the python module. Unlike the other mimetypes convenience functions, it does not include definitions from the list of mimetypes.knownfiles. The initial database may be extended by loading additional mime.types-style files into the database using the read() or readfp() methods. The mapping dictionaries may also be cleared before loading additional data if the default data is not desired."

I would be happy to submit a PR if others agree.

[0] https://github.com/michael-lazar/jetforce/issues/38
msg373832 - (view) Author: David K. Hess (dhess) * Date: 2020-07-17 13:04
@michael-lazar a documentation change seems the path of least resistance given the complicated history of this module. +1 from me.
History
Date                 User           Action  Args
2020-07-17 13:04:21  dhess          set     messages: + msg373832
2020-07-17 03:14:50  michael-lazar  set     nosy: + michael-lazar; messages: + msg373795
2019-11-18 14:28:07  dhess          set     messages: + msg356868
2019-11-18 13:37:51  The Compiler   set     nosy: + r.david.murray
2019-11-18 13:34:34  The Compiler   set     messages: + msg356867
2019-11-18 13:29:10  The Compiler   set     messages: + msg356866
2019-11-18 13:23:28  dhess          set     messages: + msg356865
2019-11-18 11:45:25  The Compiler   set     nosy: + steve.dower, dhess; messages: + msg356861
2019-11-18 09:47:12  The Compiler   set     nosy: + The Compiler; messages: + msg356853
2019-11-06 09:13:22  xtreak         set     messages: + msg356106
2019-11-01 20:44:53  toonn          set     messages: + msg355841
2019-11-01 02:38:42  ammar2         set     nosy: + ammar2; messages: + msg355782
2019-11-01 02:23:32  xtreak         set     messages: + msg355781
2019-11-01 02:18:12  xtreak         set     nosy: + xtreak
2019-10-31 19:05:11  toonn          create