Title: mimetypes.guess_extension result changes after mimetypes.init()
Type: behavior Stage: needs patch
Components: email, Library (Lib) Versions: Python 3.7, Python 3.6
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: a.badger, barry, dhess, l0nwlf, martin.panter, pitrou, r.david.murray, siona, sivert, terry.reedy, wichert, wodny
Priority: normal Keywords: easy, patch

Created on 2009-01-16 12:04 by siona, last changed 2017-05-14 01:29 by dhess.

File name                  Uploaded                          Description
mimetypes-init-test.patch  r.david.murray, 2010-06-29 19:04
issue4963.patch            a.badger, 2014-04-14 16:08        stable guess_extension patch and test
Messages (24)
msg79955 - (view) Author: S Arrowsmith (siona) Date: 2009-01-16 12:04
Asking mimetypes to reload mime.types can cause guess_extension() to
return a different result if multiple extensions are mapped to that mime
type:

>>> import mimetypes
>>> mimetypes.guess_extension('image/jpeg')
>>> mimetypes.init()
>>> mimetypes.guess_extension('image/jpeg')

This is because both the forward (extension to type) and inverse (type
to extension) type mapping dicts are populated by iterating through the
existing forward (extension to type) dict (types_map), then supplemented
by reading from mime.types (or any other files given to init()). The
fully populated forward dict becomes the new types_map. Initially,
types_map is hard-coded, but when the type mapping dicts are
repopulated, by explicitly or implicitly calling init() again, it is
done by iterating over the types_map created by the first init() call,
not the hard-coded one. If the iteration order for a set of extensions
with the same type is different in these two versions of the forward
dict, the order of extensions appearing for that type in the inverse
dict will change. And so the behavior of guess_all_extensions() and
hence guess_extension() will change.
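
The mechanism described above can be illustrated with a minimal sketch. This is not the module's actual code: build_inverse is a hypothetical helper that mimics deriving the inverse map by iterating a forward dict, and the two forward dicts hold the same made-up entries in different orders to stand in for the hard-coded types_map and the one left behind by the first init(). It assumes Python 3.7+, where dicts keep insertion order, so the differing order can be written out explicitly.

```python
def build_inverse(forward):
    """Derive a type -> [extensions] map by iterating an ext -> type map."""
    inverse = {}
    for ext, mime_type in forward.items():
        inverse.setdefault(mime_type, []).append(ext)
    return inverse

# Same entries, different iteration order (illustrative data only).
hard_coded = {'.jpe': 'image/jpeg', '.jpeg': 'image/jpeg', '.jpg': 'image/jpeg'}
after_init = {'.jpeg': 'image/jpeg', '.jpg': 'image/jpeg', '.jpe': 'image/jpeg'}

# The "guessed" extension is simply the first entry of the inverse list.
first = build_inverse(hard_coded)['image/jpeg'][0]
second = build_inverse(after_init)['image/jpeg'][0]
print(first, second)
```

The two results differ even though both forward dicts describe exactly the same mappings.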
msg79961 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2009-01-16 16:40
3.0, WinXP
import mimetypes

I wonder at this answer since .jpg and occasionally .jpeg is standard in
Windows usage, but the doc is unclear to me as to the actual intent of
the function.
msg79962 - (view) Author: S Arrowsmith (siona) Date: 2009-01-16 17:13
Ah, yes, forgot to mention this is on Debian 4.0. I doubt you're going
to run into it on a Windows system unless you explicitly give init() a
mime.types file, looking at the knownfiles list used by default.
msg108650 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-06-25 23:35
Can't reproduce under Mandriva Linux:

>>> import mimetypes
>>> print(mimetypes.guess_extension('image/jpeg'))
>>> mimetypes.init()
>>> print(mimetypes.guess_extension('image/jpeg'))

The fact that it returns ".jpe" rather than ".jpg", however, could be a bug in itself (since the latter will really be expected by everyone, not the former).
msg108674 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-06-26 01:25
I can't reproduce this either, and without a reproducer we might as well close it.

Antoine, it is possible that your fix for #5853 inadvertently fixed this, but I don't feel like untangling the logic of the module enough to figure it out :)  So I'm going to close it as 'works for me'.  If S Arrowsmith can reproduce it with 2.7rc2, we can reopen.

By the way, it produced 'jpe' for me, too...but, then, my system (Gentoo) /etc/mime.types file has 'jpe' as the first filetype for jpeg, so I don't think that association is Python's bug, per se. Though I may eventually have to address it in email6.  (Also by the way, I tried switching the order and passing in the modified file explicitly on the explicit init, but that didn't change the behavior).
msg108704 - (view) Author: S Arrowsmith (siona) Date: 2010-06-26 10:53
Sorry, still there:

Python 2.7rc2 (r27rc2:82137, Jun 26 2010, 11:27:59) 
[GCC 4.3.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import mimetypes
>>> mimetypes.guess_extension('image/jpeg')
>>> mimetypes.init()
>>> mimetypes.guess_extension('image/jpeg')

The fact that it's not reproducible on other Linux systems (I can't reproduce on the RedHat box I have to hand) might suggest there's something odd about Debian's mime.types. But I've just tried it passing init() the mime.types from the (working) RedHat box, and it's still producing the odd behaviour. (And I'm now on Debian 5.0, so it's not a Debian 4.0-specific issue either.) Wish I had a convenient Ubuntu install to try it on.

msg108707 - (view) Author: Shashwat Anand (l0nwlf) Date: 2010-06-26 11:09
Can't reproduce.

16:36:36 l0nwlf-MBP:~$ python2.7
Python 2.7rc2+ (trunk:82148M, Jun 22 2010, 10:32:46) 
[GCC 4.2.1 (Apple Inc. build 5646) (dot 1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import mimetypes
>>> mimetypes.guess_extension('image/jpeg')
>>> mimetypes.init()
>>> mimetypes.guess_extension('image/jpeg')

Results were the same in Python 2.5 and 2.6 too. I wonder whether this is machine specific or distro specific.
msg108746 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-06-26 20:06
S Arrowsmith: can you put a print statement into mimetypes.init, find out what files are loading, and paste the image/jpeg lines from each of those files here?  That might provide a clue.
msg108755 - (view) Author: S Arrowsmith (siona) Date: 2010-06-26 22:42
>>> import mimetypes 
>>> mimetypes.guess_extension('image/jpeg')
>>> mimetypes.init()
>>> mimetypes.guess_extension('image/jpeg')

$ grep jpeg /etc/mime.types
image/jpeg					jpeg jpg jpe

That big chunk of whitespace is 5 tabs. Not very helpful, I fear.
msg108853 - (view) Author: S Arrowsmith (siona) Date: 2010-06-28 18:45
I've dug into it -- again -- and my original analysis still holds. Getting consistent guess_extension() results across an explicit init() call depends on dict.items() returning keys in the same order on two different dictionaries (the original, hard-coded types_map and the one created by the first, implicit init() call).

Why should this be different on Debian to other Linuxes, even given same data as a "working" distribution? Is there something in the implementation details of dict.items() which is that distribution dependent?

(A "fix", BTW, is to insert a call to _default_mime_types() either in init() or in MimeTypes.__init__ before it calls init().)
msg108934 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-06-29 19:04
It must be that the different key order only happens on the one platform because of the quirky nature of dictionary construction.  That is, there is *something* on that platform that is changing where things get hashed when the dictionary is recreated.

The problem with fixing this is that any fix is going to change the behavior, unless we go to the lengths of recording the order of the initializations in add_type and replay it when init is called a second time.  That solution is pretty much a non-starter :)

The mimetypes docs say that init can be called more than once.  They say that a MimeTypes object starts out "with the same database as provided by the rest of the module".  The docs explain how the initial database state is created.

What the docs don't do is say what *happens* when you call init more than once.  There are two possibilities: either we (1) restart from the initial state, or we (2) start from the current (possibly modified) state of the database and then add whatever is specified in the init call.  (Actually, there's a third possibility: we could also add back in anything from the default init that was deleted; but this halfway version is unlikely to be anyone's intent or expectation.)

The actual implementation of the mimetypes module does (2) if and only if you pass init a list of files.  If you don't then it does something that isn't even the third way above: it reloads *just* the data from the system files it managed to find, without reloading the data from the internal tables.

Clearly this behavior is....odd.  When no files are passed, init should do one of two things: either nothing, or reset the global db state to its initial value.

It's not so clear what the behavior should be when you pass init one or more files.  It is possible, even highly probable, that there is code out there that depends on the fact that doing so is additive.

Given this analysis, I think that the best fix would be to implement (and document) the following behavior for init:

  If called with no arguments, it rebuilds the module database from scratch

  If called with a list of files, it adds the contents of those files to the module database

The second is a backward compatibility hack.  Ideally it would be deprecated in favor of some sort of load_mime_files method.
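
The two proposed behaviors can be sketched as follows. This is only an illustration of the semantics, not a patch: ToyDatabase, DEFAULT_TYPES and the idea of passing already-parsed dicts in place of file names are all assumptions made to keep the example self-contained.

```python
# Stand-in for the built-in table (made-up excerpt).
DEFAULT_TYPES = {'.txt': 'text/plain', '.jpg': 'image/jpeg'}

class ToyDatabase:
    def __init__(self):
        self.types_map = dict(DEFAULT_TYPES)

    def init(self, files=None):
        if files is None:
            # No arguments: rebuild the database from scratch.
            self.types_map = dict(DEFAULT_TYPES)
        else:
            # A list of "files" (here: parsed dicts): add to the current state.
            for entries in files:
                self.types_map.update(entries)

db = ToyDatabase()
db.init([{'.foo': 'application/x-foo'}])   # additive
db.init([{'.bar': 'application/x-bar'}])   # still additive
additive_state = dict(db.types_map)
db.init()                                  # full reset
```

After the last call the database is back to its initial contents, while the two calls with arguments accumulate.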

It is possible that the first will also break code, but I think it is less likely, and probably an acceptable risk in a new major release.  But I'd be prepared to change it to 'init does nothing' if breakage showed up during RC testing.

The problem with this "fix" is that it does not, in fact, address the root cause of the OP's bug report.  The specific behavior he observes when calling init() would be fixed, but the underlying problem remains.  If he were to instead instantiate a new MimeTypes db, then when it "copies" the module database, it will build its own database by running the old database in key order, and once again the results returned by guess_extension might mutate.  This means that the new db is *not* a copy of the old db when it starts.

That problem could be fixed by having MimeTypes.__init__ do a copy of the types_map and types_map_inv data structures instead of rebuilding them from scratch.  This would mean shifting the initialization of these structures out of MimeTypes and in to init (in the 'reinitialize' code path) or perhaps into _default_mime_types, but I don't see that as a big problem, once init is doing a full reinitialization by default.  (There is also the question of whether it should be a 'deep copy', but I don't think that is needed since a user would need to be doing something pretty hackish to run afoul of a shallow-copy-induced problem.)
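
A minimal sketch of the copy-instead-of-rebuild idea, under stated assumptions: copy_database is a hypothetical helper, and the two dicts are made-up stand-ins that only borrow the names types_map and types_map_inv. The sketch also copies each extension list, which goes one level further than the shallow copy argued above to be sufficient.

```python
# Made-up excerpt standing in for the module-level maps.
types_map = {'.jpg': 'image/jpeg', '.jpeg': 'image/jpeg', '.jpe': 'image/jpeg'}
types_map_inv = {'image/jpeg': ['.jpg', '.jpeg', '.jpe']}

def copy_database(forward, inverse):
    """Copy both maps as they are instead of re-deriving the inverse map."""
    forward_copy = dict(forward)
    # Copy each extension list as well, so additions to the copy stay local.
    inverse_copy = {mime_type: list(exts) for mime_type, exts in inverse.items()}
    return forward_copy, inverse_copy

new_map, new_inv = copy_database(types_map, types_map_inv)
new_inv['image/jpeg'].append('.jfif')   # only the copy changes
```

Because the inverse lists are copied rather than rebuilt, the new database starts with the same extension order as the old one.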

Can anyone see flaws in this analysis and proposed solution?  I've marked the fix as easy since a python hacker should be able to knock out a solution in a day, but it isn't trivial.  And I have no clue how to write a unit test for the MimeTypes.__init__ order-shifting bug.

I'm also resetting the priority to normal since I consider the ambiguity of what calling init twice actually does to be a bigger issue than it sometimes changing the results of a function with 'guess' in its name :)

I've attached a patch with a unit test for the 'init doesn't re-init' behavior.

(By the way, it also appears to me from reading the code that read_mime_types is buggy in that it actually returns a merge of the loaded file with the current module DB state, but I haven't checked that observation.)
msg108967 - (view) Author: S Arrowsmith (siona) Date: 2010-06-30 09:50
That solution looks sound to me, in particular documenting the semantics of repeated init() calls!

As for the underlying problem, it seems to me that an alternative to copying the existing structures rather than rebuilding them would be to use OrderedDicts. Although I can't think why it might be a preferable alternative, other than being a bit clearer that order of insertion can affect behaviour.
msg182465 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-02-20 03:38
I'd forgotten about this issue.  I wonder if the dictionary randomization makes the problem worse.
msg214948 - (view) Author: Wichert Akkerman (wichert) Date: 2014-03-27 12:40
I can reproduce this on both OS X 10.9 and Ubuntu 12.04:

>>> import mimetypes
>>> mimetypes.guess_extension('image/jpeg')
>>> mimetypes.init()
>>> mimetypes.guess_extension('image/jpeg')

The same thing happens for Python 3.4:

Python 3.4.0rc3 (default, Mar 13 2014, 10:48:59) 
[GCC 4.2.1 Compatible Apple LLVM 5.1 (clang-503.0.38)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import mimetypes
>>> mimetypes.guess_all_extensions('image/jpeg')
['.jpg', '.jpeg', '.jpe']
>>> mimetypes.init()
>>> mimetypes.guess_all_extensions('image/jpeg')
['.jpeg', '.jpe', '.jpg']

This also looks related to Issue1043134
msg216104 - (view) Author: Toshio Kuratomi (a.badger) * Date: 2014-04-14 16:08
Took a look at this and was able to reproduce it on Fedora Linux 20 and current cpython head.  It is somewhat random though.  I'm able to get reasonably consistent failures using image/jpeg and iterating the test case about 20 times.

Additionally, it looks like the data structure that mimetypes.guess_extension() reads its extensions from is a list, so it doesn't have to do with dictionary sort order.  It has something to do with the way the extensions are read in from the files and then given to add_type().

Talking to r.david.murray I think that this particular problem can be solved by simply sorting the list of extensions prior to guess_extension taking the first extension off of the list.
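
The sorting idea can be tried from user code without touching the module. The wrapper below is a sketch, not the attached patch: stable_guess_extension is a hypothetical name, and it relies only on the public mimetypes.guess_all_extensions().

```python
import mimetypes

def stable_guess_extension(mime_type, strict=True):
    """Return the alphabetically first known extension, or None."""
    extensions = sorted(mimetypes.guess_all_extensions(mime_type, strict))
    return extensions[0] if extensions else None

# The result depends on which mime.types files the system provides, but it
# no longer depends on the order in which the extensions were registered.
print(stable_guess_extension('image/jpeg'))
```

Sorting makes the answer repeatable; it does not make it the preferred extension, which is the separate question raised in Issue1043134.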

The question of what to do when the first extension in the list isn't the best extension should be resolved in Issue1043134.

I'll attach a patch with test case for this problem.
msg216764 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-04-17 23:28
OK, it is great having a test that makes this at least mostly reproducible :)

Having reloaded my brain on this thing, I'm thinking that the best solution may be indeed to switch to ordered dicts.  If we then reorder the hardcoded lists to be in "preferred" order, that should then also solve issue 1043134.
msg246649 - (view) Author: Julian Sivertsen (sivert) Date: 2015-07-12 13:22
I bumped into a similar issue with mimetypes.guess_extension on Arch Linux 64-bit in February.  The behavior is still present in python 3.4.3.

$ python
$ python
$ cat
from mimetypes import guess_extension

$ python
Python 3.4.3 (default, Mar 25 2015, 17:13:50)
[GCC 4.9.2 20150304 (prerelease)] on linux
Type "help", "copyright", "credits" or "license" for more information.
msg293248 - (view) Author: David K. Hess (dhess) * Date: 2017-05-08 19:15
Concur with @sivert: the result of guess_extension() is non-deterministic across mimetypes module initializations.

$ python
Python 3.4.3 (default, Nov 17 2016, 01:08:31) 
[GCC 4.8.4] on linux
Type "help", "copyright", "credits" or "license" for more information.
$ python -c 'import mimetypes;print(mimetypes.guess_extension("image/jpeg"))'
$ python -c 'import mimetypes;print(mimetypes.guess_extension("image/jpeg"))'
$ python -c 'import mimetypes;print(mimetypes.guess_extension("image/jpeg"))'
$ python -c 'import mimetypes;print(mimetypes.guess_extension("image/jpeg"))'
$ python -c 'import mimetypes;print(mimetypes.guess_extension("image/jpeg"))'
$ python -c 'import mimetypes;print(mimetypes.guess_extension("image/jpeg"))'
$ python -c 'import mimetypes;print(mimetypes.guess_extension("image/jpeg"))'
msg293249 - (view) Author: David K. Hess (dhess) * Date: 2017-05-08 19:35
And the underlying problem causing this:

$ python -c 'import mimetypes;print(mimetypes.guess_all_extensions("image/jpeg"))'
['.jpeg', '.jpg', '.jpe']
$ python -c 'import mimetypes;print(mimetypes.guess_all_extensions("image/jpeg"))'
['.jpg', '.jpe', '.jpeg']
$ python -c 'import mimetypes;print(mimetypes.guess_all_extensions("image/jpeg"))'
['.jpg', '.jpeg', '.jpe']
$ python -c 'import mimetypes;print(mimetypes.guess_all_extensions("image/jpeg"))'
['.jpe', '.jpg', '.jpeg']
$ python -c 'import mimetypes;print(mimetypes.guess_all_extensions("image/jpeg"))'
['.jpeg', '.jpg', '.jpe']

If the module can't know which extension is preferred, perhaps guess_extension should just be deprecated and the results of guess_all_extensions sorted on return?

At least that would give us some determinism to work with.
msg293258 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2017-05-08 22:34
@dhess: do you want to work on the OrderedDict + correctly ordered hardcoded lists solution?
msg293261 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2017-05-08 23:47
I suggest discussing the non-determinism problem in Issue 1043134 (about determining a canonical extension for each content type). I understood this bug (Issue 4963) is about the behaviour of repeated initialization of the same instance of mimetypes.

BTW an ordered dictionary wouldn’t help with duplicate dictionary keys; see guess_extension("application/excel").
msg293262 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2017-05-09 00:16
I understand hash randomization was added after this bug was opened. Here is a demonstration with “video/mp4”, which only has the extension “.mp4” built in. But my /etc/mime.types file lists “mp4 mp4v mpg4”, so after the second initialization the behaviour changes:

PYTHONHASHSEED=0 python3.5 -c 'from mimetypes import *; print(guess_all_extensions("video/mp4")); init(); print(guess_all_extensions("video/mp4"))'
['.mp4', '.mp4v', '.mpg4']
['.mpg4', '.mp4', '.mp4v']

The first extension is always “.mp4”, regardless of hash randomization, due to the built-in list. But after re-initialization, the first extension depends on the order in the internal dictionary.
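
To inspect how extra mime.types entries end up in the extension list without touching the module-level state, a separate MimeTypes instance can be fed the same kind of line. This is a sketch under assumptions: SYSTEM_FILE is a made-up one-line stand-in for the /etc/mime.types entry quoted above, and the printed order depends on the Python version and on which system files were already loaded.

```python
import io
import mimetypes

# Hypothetical one-line mime.types content mirroring the entry quoted above.
SYSTEM_FILE = "video/mp4\t\t\t\tmp4 mp4v mpg4\n"

db = mimetypes.MimeTypes()                 # private database instance
before = db.guess_all_extensions('video/mp4')
db.readfp(io.StringIO(SYSTEM_FILE))        # merge the extra extensions
after = db.guess_all_extensions('video/mp4')

print(before)
print(after)
```

Comparing the two lists across interpreter versions or PYTHONHASHSEED values shows whether the first entry is stable on a given setup.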

Using an ordered dictionary may work as a bug fix, but the whole initialization logic is so complex that it would be good to simplify it in the long term.
msg293626 - (view) Author: David K. Hess (dhess) * Date: 2017-05-13 22:39
Ok, I followed @r.david.murray's advice and decided to take a shot at this.

First, I noticed that I couldn't reproduce the non-deterministic behavior that I reported above on the latest code (i.e. pre-3.7). After doing some research it appears this was the sequence of events:

1) Pre-3.3, hashing was stable and this wasn't a problem.
2) Hash randomization became the default in version 3.3 and this non-determinism showed up.
3) A new dict implementation was introduced in 3.6 and key orders became stable between runs and this non-determinism was gone. However, as the notes on the new dict implementation indicate, this ordering should not be relied upon.

I also looked at some other issues:

* 6626 - The patch here basically rewrote the module. I agreed with the last comment on that issue that it probably doesn't need that.
* 24527 - Related to the .init() problems discussed here in r.david.murray's excellent analysis of the init behavior.
* 1043134 - Where the preferred extension issue was addressed via a proposed new map.

My approach with this patch is to address the init problem, the non-determinism and the preferred extension issue.

For the init, I made two changes:

1) I added new references to the initial values of the maps so they could be retained between init() calls. I also modified MimeTypes.__init__ to refer to these.

2) I modified the init() function to check the files argument as r.david.murray suggested. If it is supplied, then the existing database is used and the files are added to it. If it is not supplied, then the module reinitializes from scratch. I'll update the documentation to reflect this if the commit passes muster.

For the non-determinism and preferred extension, I changed the two extension type maps to be OrderedDicts. I then sorted the entries to the OrderedDict constructor by mime type and then placed the preferred extension as the first extension to be processed. This guarantees that it will be the extension returned by guess_extension. The OrderedDict also guarantees that guess_all_extensions will always build and return the same value.
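
The ordering idea can be sketched in a few lines. This is an illustration only, not the commit under review: ENTRIES is a made-up excerpt, and preferred_extension is a hypothetical stand-in for the lookup that guess_extension performs.

```python
from collections import OrderedDict

# Hypothetical excerpt, sorted by type, preferred extension listed first.
ENTRIES = [
    ('image/jpeg', '.jpg'),
    ('image/jpeg', '.jpeg'),
    ('image/jpeg', '.jpe'),
    ('video/mp4', '.mp4'),
]

# Forward map: extension -> type, in the order given above.
types_map = OrderedDict((ext, mime_type) for mime_type, ext in ENTRIES)

# Inverse map: type -> [extensions], derived in that same stable order.
types_map_inv = OrderedDict()
for ext, mime_type in types_map.items():
    types_map_inv.setdefault(mime_type, []).append(ext)

def preferred_extension(mime_type):
    """Return the first (preferred) extension for a type, or None."""
    extensions = types_map_inv.get(mime_type)
    return extensions[0] if extensions else None

print(preferred_extension('image/jpeg'))
```

Because the forward map is iterated in insertion order, rebuilding the inverse map always yields the same lists, and the first entry is the one deliberately placed first.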

The commit can be reviewed here:

I'll open a PR if and when this approach gets enough positive feedback.
msg293628 - (view) Author: David K. Hess (dhess) * Date: 2017-05-14 01:29
Pushed more commits so here's a branch compare:
Date                 User            Action  Args
2017-05-14 01:29:51  dhess           set     messages: + msg293628
2017-05-13 22:39:36  dhess           set     messages: + msg293626
2017-05-09 00:16:15  martin.panter   set     messages: + msg293262
2017-05-08 23:47:04  martin.panter   set     nosy: + martin.panter; messages: + msg293261
2017-05-08 22:35:03  r.david.murray  set     stage: test needed -> needs patch; versions: + Python 3.6, Python 3.7, - Python 2.7, Python 3.2, Python 3.3, Python 3.4
2017-05-08 22:34:41  r.david.murray  set     messages: + msg293258
2017-05-08 19:35:57  dhess           set     messages: + msg293249
2017-05-08 19:15:23  dhess           set     nosy: + dhess; messages: + msg293248
2015-10-23 11:52:38  wodny           set     nosy: + wodny
2015-07-12 13:22:10  sivert          set     nosy: + sivert; messages: + msg246649
2015-05-20 13:37:29  r.david.murray  link    issue24246 superseder
2014-04-17 23:28:07  r.david.murray  set     messages: + msg216764
2014-04-14 16:08:18  a.badger        set     files: + issue4963.patch; nosy: + a.badger; messages: + msg216104
2014-03-27 12:40:07  wichert         set     nosy: + wichert; messages: + msg214948
2013-02-20 03:38:28  r.david.murray  set     versions: + Python 3.2, Python 3.3, Python 3.4; nosy: + barry; messages: + msg182465; components: + email
2010-06-30 09:50:09  siona           set     messages: + msg108967
2010-06-29 19:04:01  r.david.murray  set     files: + mimetypes-init-test.patch; priority: low -> normal; messages: + msg108934; keywords: + easy, patch; resolution: works for me ->
2010-06-28 18:45:20  siona           set     messages: + msg108853
2010-06-26 22:42:31  siona           set     messages: + msg108755
2010-06-26 20:06:09  r.david.murray  set     status: closed -> open; priority: normal -> low; messages: + msg108746; stage: test needed
2010-06-26 11:09:39  l0nwlf          set     nosy: + l0nwlf; messages: + msg108707
2010-06-26 10:53:20  siona           set     messages: + msg108704
2010-06-26 01:26:01  r.david.murray  set     status: open -> closed; nosy: + r.david.murray; messages: + msg108674; resolution: works for me
2010-06-25 23:35:31  pitrou          set     nosy: + pitrou; messages: + msg108650
2010-06-25 23:29:30  terry.reedy     set     versions: + Python 2.7, - Python 2.5, Python 2.4
2009-01-16 17:13:16  siona           set     messages: + msg79962
2009-01-16 16:40:59  terry.reedy     set     nosy: + terry.reedy; messages: + msg79961
2009-01-16 12:04:54  siona           create