classification
Title: mimetypes.guess_extension result changes after mimetypes.init()
Type: behavior
Stage: test needed
Components: email, Library (Lib)
Versions: Python 3.4, Python 3.3, Python 3.2, Python 2.7
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: a.badger, barry, l0nwlf, pitrou, r.david.murray, siona, terry.reedy, wichert
Priority: normal
Keywords: easy, patch

Created on 2009-01-16 12:04 by siona, last changed 2014-04-14 16:08 by a.badger.

Files
File name: mimetypes-init-test.patch; Uploaded: r.david.murray, 2010-06-29 19:04
File name: issue4963.patch; Uploaded: a.badger, 2014-04-14 16:08; Description: stable guess_extension patch and test
Messages (15)
msg79955 - (view) Author: S Arrowsmith (siona) Date: 2009-01-16 12:04
Asking mimetypes to reload mime.types can cause guess_extension() to
return a different result if multiple extensions are mapped to that mime
type:

>>> import mimetypes
>>> mimetypes.guess_extension('image/jpeg')
'.jpe'
>>> mimetypes.init()
>>> mimetypes.guess_extension('image/jpeg')
'.jpeg'
>>>

This is because both the forward (extension to type) and inverse (type
to extension) type mapping dicts are populated by iterating through the
existing forward (extension to type) dict (types_map), then supplemented
by reading from mime.types (or any other files given to init()). The
fully populated forward dict becomes the new types_map. Initially,
types_map is hard-coded, but when the type mapping dicts are
repopulated, by explicitly or implicitly calling init() again, it is
done by iterating over the types_map created by the first init() call,
not the hard-coded one. If the iteration order for a set of extensions
with the same type is different in these two versions of the forward
dict, the order of extensions appearing for that type in the inverse
dict will change. And so the behavior of guess_all_extensions() and
hence guess_extension() will change.
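
The effect can be illustrated without the mimetypes module at all. What follows is only a minimal sketch, not the module's code: build_inverse is a hypothetical helper, and the two orderings are written out by hand to stand in for two dicts that hold the same entries but iterate differently.

```python
def build_inverse(forward_items):
    """Build a type -> [extensions] mapping from (extension, type) pairs.

    The position of each extension in its list depends only on the
    order in which the pairs are visited.
    """
    inverse = {}
    for ext, mime_type in forward_items:
        inverse.setdefault(mime_type, []).append(ext)
    return inverse


# Same contents, two different iteration orders (hand-picked for illustration).
first_pass = [(".jpe", "image/jpeg"), (".jpg", "image/jpeg"), (".jpeg", "image/jpeg")]
second_pass = [(".jpeg", "image/jpeg"), (".jpe", "image/jpeg"), (".jpg", "image/jpeg")]

inv_a = build_inverse(first_pass)
inv_b = build_inverse(second_pass)

# A "first entry wins" lookup differs although the data is identical.
print(inv_a["image/jpeg"][0])
print(inv_b["image/jpeg"][0])
```

Both inverse mappings hold the same three extensions; only the list order, and therefore the first element, differs.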
msg79961 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2009-01-16 16:40
3.0, WinXP
import mimetypes
print(mimetypes.guess_extension('image/jpeg'))
mimetypes.init()
print(mimetypes.guess_extension('image/jpeg'))
gives
.jpe
.jpe

I wonder at this answer since .jpg and occasionally .jpeg is standard in
Windows usage, but the doc is unclear to me as to the actual intent of
the function.
msg79962 - (view) Author: S Arrowsmith (siona) Date: 2009-01-16 17:13
Ah, yes, forgot to mention this is on Debian 4.0. I doubt you're going
to run into it on a Windows system unless you explicitly give init() a
mime.types file, looking at the knownfiles list used by default.
msg108650 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-06-25 23:35
Can't reproduce under Mandriva Linux:

>>> import mimetypes
>>> print(mimetypes.guess_extension('image/jpeg'))
.jpe
>>> mimetypes.init()
>>> print(mimetypes.guess_extension('image/jpeg'))
.jpe

The fact that it returns ".jpe" rather than ".jpg", however, could be a bug in itself (since the latter will really be expected by everyone, not the former).
msg108674 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-06-26 01:25
I can't reproduce this either, and without a reproducer we might as well close it.

Antoine, it is possible that your fix for #5853 inadvertently fixed this, but I don't feel like untangling the logic of the module enough to figure it out :)  So I'm going to close it 'works for me'.  If S Arrowsmith can reproduce it with 2.7rc2, we can reopen.

By the way, it produced 'jpe' for me, too...but, then, my system (Gentoo) /etc/mime.types file has 'jpe' as the first filetype for jpeg, so I don't think that association is Python's bug, per se. Though I may eventually have to address it in email6.  (Also by the way, I tried switching the order and passing in the modified file explicitly on the explicit init, but that didn't change the behavior).
msg108704 - (view) Author: S Arrowsmith (siona) Date: 2010-06-26 10:53
Sorry, still there:

Python 2.7rc2 (r27rc2:82137, Jun 26 2010, 11:27:59) 
[GCC 4.3.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import mimetypes
>>> mimetypes.guess_extension('image/jpeg')
'.jpe'
>>> mimetypes.init()
>>> mimetypes.guess_extension('image/jpeg')
'.jpeg'

The fact that it's not reproducible on other Linux systems (I can't reproduce on the RedHat box I have to hand) might suggest there's something odd about Debian's mime.types. But I've just tried passing init() the mime.types from the (working) RedHat box, and it's still producing the odd behaviour. (And I'm now on Debian 5.0, so it's not a Debian 4.0-specific issue either.) Wish I had a convenient Ubuntu install to try it on.

Bizarre.
msg108707 - (view) Author: Shashwat Anand (l0nwlf) Date: 2010-06-26 11:09
Can't reproduce.

16:36:36 l0nwlf-MBP:~$ python2.7
Python 2.7rc2+ (trunk:82148M, Jun 22 2010, 10:32:46) 
[GCC 4.2.1 (Apple Inc. build 5646) (dot 1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import mimetypes
>>> mimetypes.guess_extension('image/jpeg')
'.jpe'
>>> mimetypes.init()
>>> mimetypes.guess_extension('image/jpeg')
'.jpe'
>>> 

Results were same in python2.5, 2.6 too. I wonder whether this is machine specific or distro specific.
msg108746 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-06-26 20:06
S Arrowsmith: can you put a print statement into mimetypes.init, find out what files are loading, and paste the image/jpeg lines from each of those files here?  That might provide a clue.
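
A rough alternative that avoids editing the module is sketched below. It assumes that init(), when called without arguments, only consults the paths listed in mimetypes.knownfiles; existing_known_files and lines_for_type are hypothetical helper names introduced here for illustration.

```python
import mimetypes
import os


def existing_known_files():
    """Return the entries of mimetypes.knownfiles that exist on this machine."""
    return [path for path in mimetypes.knownfiles if os.path.isfile(path)]


def lines_for_type(path, mime_type):
    """Return the lines of a mime.types-style file that mention mime_type."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return [line.rstrip("\n") for line in handle if mime_type in line]


for path in existing_known_files():
    print(path)
    for line in lines_for_type(path, "image/jpeg"):
        print("   ", line)
```

On a system with none of the known files present (typical on Windows) this prints nothing.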
msg108755 - (view) Author: S Arrowsmith (siona) Date: 2010-06-26 22:42
>>> import mimetypes 
>>> mimetypes.guess_extension('image/jpeg')
/etc/mime.types
'.jpe'
>>> mimetypes.init()
/etc/mime.types
>>> mimetypes.guess_extension('image/jpeg')
'.jpeg'
>>> 

$ grep jpeg /etc/mime.types
image/jpeg					jpeg jpg jpe
$

That big chunk of whitespace is 5 tabs. Not very helpful, I fear.
msg108853 - (view) Author: S Arrowsmith (siona) Date: 2010-06-28 18:45
I've dug into it -- again -- and my original analysis still holds. Getting consistent guess_extension() results across an explicit init() call depends on dict.items() returning keys in the same order on two different dictionaries (the original, hard-coded types_map and the one created by the first, implicit init() call).

Why should this be different on Debian to other Linuxes, even given same data as a "working" distribution? Is there something in the implementation details of dict.items() which is that distribution dependent?

(A "fix", BTW, is to insert a call to _default_mime_types() either in init() or in MimeTypes.__init__ before it calls init().)
msg108934 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-06-29 19:04
It must be that the different key order only happens on the one platform because of the quirky nature of dictionary construction.  That is, there is *something* on that platform that is changing where things get hashed when the dictionary is recreated.

The problem with fixing this is that any fix is going to change the behavior, unless we go to the lengths of recording the order of the initializations in add_type and replay it when init is called a second time.  That solution is pretty much a non-starter :)

The mimetypes docs say that init can be called more than once.  They say that a MimeTypes object starts out "with the same database as provided by the rest of the module".  The docs explain how the initial database state is created.

What the docs don't do is say what *happens* when you call init more than once.  There are two possibilities: either we (1) restart from the initial state, or we (2) start from the current (possibly modified) state of the database and then add whatever is specified in the init call.  (Actually, there's a third possibility: we could also add back in anything from the default init that was deleted; but this halfway version is unlikely to be anyone's intent or expectation.)

The actual implementation of the mimetypes module does (2) if and only if you pass init a list of files.  If you don't then it does something that isn't even the third way above: it reloads *just* the data from the system files it managed to find, without reloading the data from the internal tables.

Clearly this behavior is....odd.  When no files are passed, init should do one of two things: either nothing, or reset the global db state to its initial value.

It's not so clear what the behavior should be when you pass init one or more files.  It is possible, even highly probable, that there is code out there that depends on the fact that doing so is additive.

Given this analysis, I think that the best fix would be to implement (and document) the following behavior for init:

  If called with no arguments, it rebuilds the module database from scratch

  If called with a list of files, it adds the contents of those files to the module database

The second is a backward compatibility hack.  Ideally it would be deprecated in favor of some sort of load_mime_files method.

It is possible that the first will also break code, but I think it is less likely, and probably an acceptable risk in a new major release.  But I'd be prepared to change it to 'init does nothing' if breakage showed up during RC testing.
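
The two proposed behaviors can be sketched with a toy database. This is only an illustration under stated assumptions: ToyMimeDB and its DEFAULTS table are hypothetical and are not mimetypes code, and plain dicts stand in for the parsed contents of mime.types files.

```python
class ToyMimeDB:
    """Hypothetical stand-in for the module-level database (not mimetypes code)."""

    DEFAULTS = {".jpg": "image/jpeg", ".txt": "text/plain"}

    def __init__(self):
        self.types_map = dict(self.DEFAULTS)

    def init(self, files=None):
        if files is None:
            # No arguments: rebuild from scratch, discarding later additions.
            self.types_map = dict(self.DEFAULTS)
            return
        # With files: additive, the current state is kept and extended.
        for entries in files:
            self.types_map.update(entries)


db = ToyMimeDB()
db.init(files=[{".foo": "application/x-foo"}])  # additive call
print(".foo" in db.types_map)
db.init()  # full reset
print(".foo" in db.types_map)
```

After the additive call the extra entry is present; after the bare init() it is gone again.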

The problem with this "fix" is that it does not, in fact, address the root cause of the OP's bug report.  The specific behavior he observes when calling init() would be fixed, but the underlying problem remains.  If he were to instead instantiate a new MimeTypes db, then when it "copies" the module database, it will build its own database by running the old database in key order, and once again the results returned by guess_extension might mutate.  This means that the new db is *not* a copy of the old db when it starts.

That problem could be fixed by having MimeTypes.__init__ do a copy of the types_map and types_map_inv data structures instead of rebuilding them from scratch.  This would mean shifting the initialization of these structures out of MimeTypes and into init (in the 'reinitialize' code path) or perhaps into _default_mime_types, but I don't see that as a big problem, once init is doing a full reinitialization by default.  (There is also the question of whether it should be a 'deep copy', but I don't think that is needed since a user would need to be doing something pretty hackish to run afoul of a shallow-copy-induced problem.)
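
A minimal sketch of the copy idea follows, assuming an inverse mapping shaped as type -> list of extensions. copy_inverse is a hypothetical helper, and it copies the lists one level down, which is slightly more than the plain shallow copy discussed above.

```python
def copy_inverse(types_map_inv):
    """Copy a type -> [extensions] mapping, duplicating each list.

    Copying the lists keeps every extension in its original position
    and keeps the copy independent of later appends.
    """
    return {mime_type: list(exts) for mime_type, exts in types_map_inv.items()}


module_inv = {"image/jpeg": [".jpg", ".jpeg", ".jpe"]}
instance_inv = copy_inverse(module_inv)

instance_inv["image/jpeg"].append(".jfif")  # does not leak into module_inv
print(module_inv["image/jpeg"])
print(instance_inv["image/jpeg"][:3] == module_inv["image/jpeg"])
```

Because nothing is rebuilt by iterating over a forward mapping, the order of extensions in the copy cannot drift from the original.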

Can anyone see flaws in this analysis and proposed solution?  I've marked the fix as easy since a python hacker should be able to knock out a solution in a day, but it isn't trivial.  And I have no clue how to write a unit test for the MimeTypes.__init__ order-shifting bug.

I'm also resetting the priority to normal since I consider the ambiguity of what calling init twice actually does to be a bigger issue than it sometimes changing the results of a function with 'guess' in its name :)

I've attached a patch with a unit test for the 'init doesn't re-init' behavior.

(By the way, it also appears to me from reading the code that read_mime_types is buggy in that it actually returns a merge of the loaded file with the current module DB state, but I haven't checked that observation.)
msg108967 - (view) Author: S Arrowsmith (siona) Date: 2010-06-30 09:50
That solution looks sound to me, in particular documenting the semantics of repeated init() calls!

As for the underlying problem, it seems to me that an alternative to copying the existing structures rather than rebuilding them would be to use OrderedDicts. Although I can't think why it might be a preferable alternative, other than being a bit clearer that order of insertion can affect behaviour.
msg182465 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-02-20 03:38
I'd forgotten about this issue.  I wonder if the dictionary randomization makes the problem worse.
msg214948 - (view) Author: Wichert Akkerman (wichert) Date: 2014-03-27 12:40
I can reproduce this on both OS X 10.9 and Ubuntu 12.04:

>>> import mimetypes
>>> mimetypes.guess_extension('image/jpeg')
'.jpe'
>>> mimetypes.init()
>>> mimetypes.guess_extension('image/jpeg')
'.jpeg'

The same thing happens for Python 3.4:

Python 3.4.0rc3 (default, Mar 13 2014, 10:48:59) 
[GCC 4.2.1 Compatible Apple LLVM 5.1 (clang-503.0.38)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import mimetypes
>>> mimetypes.guess_all_extensions('image/jpeg')
['.jpg', '.jpeg', '.jpe']
>>> mimetypes.init()
>>> mimetypes.guess_all_extensions('image/jpeg')
['.jpeg', '.jpe', '.jpg']

This also looks related to Issue1043134
msg216104 - (view) Author: Toshio Kuratomi (a.badger) * Date: 2014-04-14 16:08
Took a look at this and was able to reproduce it on Fedora Linux 20 and current cpython head.  It is somewhat random though.  I'm able to get reasonably consistent failures using image/jpeg and iterating the test case about 20 times.
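
One way to iterate the test case is sketched below. It assumes that each trial should run in a fresh interpreter; SNIPPET and run_trials are names made up for this sketch, and the result depends entirely on the Python version and platform, so no particular outcome is implied.

```python
import subprocess
import sys
from collections import Counter

# One trial: compare guess_extension() before and after an explicit init().
SNIPPET = (
    "import mimetypes;"
    "before = mimetypes.guess_extension('image/jpeg');"
    "mimetypes.init();"
    "after = mimetypes.guess_extension('image/jpeg');"
    "print(before == after)"
)


def run_trials(count=20):
    """Run the trial in `count` fresh interpreters and tally the printed results."""
    results = Counter()
    for _ in range(count):
        proc = subprocess.run(
            [sys.executable, "-c", SNIPPET],
            capture_output=True,
            text=True,
            check=True,
        )
        results[proc.stdout.strip()] += 1
    return results


if __name__ == "__main__":
    print(run_trials())
```

A tally with any "False" entries means the guess changed across init() in at least one run.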

Additionally, it looks like the data structure that mimetypes.guess_extension() is reading its extensions from is a list, so it doesn't have to do with dictionary sort order.  It has something to do with the way the extensions are read in from the files and then given to add_type().

Talking to r.david.murray I think that this particular problem can be solved by simply sorting the list of extensions prior to guess_extension taking the first extension off of the list.
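
The sorting idea can be sketched as follows. This is an illustration only and not the attached patch: stable_first_extension is a hypothetical helper that takes the list of candidate extensions directly.

```python
def stable_first_extension(extensions):
    """Return the alphabetically first extension, or None for an empty list."""
    return sorted(extensions)[0] if extensions else None


# The two orderings reported earlier in this issue for image/jpeg.
before_init = [".jpg", ".jpeg", ".jpe"]
after_init = [".jpeg", ".jpe", ".jpg"]

print(stable_first_extension(before_init))
print(stable_first_extension(after_init))
```

Both calls give the same answer regardless of list order. The answer is stable but not necessarily the extension most people would expect.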

The question of what to do when the first extension in the list isn't the best extension should be resolved in Issue1043134.

I'll attach a patch with test case for this problem.
History
Date User Action Args
2014-04-14 16:08:18 a.badger set files: + issue4963.patch; nosy: + a.badger; messages: + msg216104
2014-03-27 12:40:07 wichert set nosy: + wichert; messages: + msg214948
2013-02-20 03:38:28 r.david.murray set versions: + Python 3.2, Python 3.3, Python 3.4; nosy: + barry; messages: + msg182465; components: + email
2010-06-30 09:50:09 siona set messages: + msg108967
2010-06-29 19:04:01 r.david.murray set files: + mimetypes-init-test.patch; priority: low -> normal; messages: + msg108934; keywords: + easy, patch; resolution: works for me ->
2010-06-28 18:45:20 siona set messages: + msg108853
2010-06-26 22:42:31 siona set messages: + msg108755
2010-06-26 20:06:09 r.david.murray set status: closed -> open; priority: normal -> low; messages: + msg108746; stage: test needed
2010-06-26 11:09:39 l0nwlf set nosy: + l0nwlf; messages: + msg108707
2010-06-26 10:53:20 siona set messages: + msg108704
2010-06-26 01:26:01 r.david.murray set status: open -> closed; nosy: + r.david.murray; messages: + msg108674; resolution: works for me
2010-06-25 23:35:31 pitrou set nosy: + pitrou; messages: + msg108650
2010-06-25 23:29:30 terry.reedy set versions: + Python 2.7, - Python 2.5, Python 2.4
2009-01-16 17:13:16 siona set messages: + msg79962
2009-01-16 16:40:59 terry.reedy set nosy: + terry.reedy; messages: + msg79961
2009-01-16 12:04:54 siona create