Author r.david.murray
Recipients l0nwlf, pitrou, r.david.murray, siona, terry.reedy
Date 2010-06-29.19:04:00
Message-id <1277838248.92.0.300476634689.issue4963@psf.upfronthosting.co.za>
In-reply-to
Content
It must be that the different key order only happens on the one platform because of the quirky nature of dictionary construction.  That is, there is *something* on that platform that is changing where things get hashed when the dictionary is recreated.

The problem with fixing this is that any fix is going to change the behavior, unless we go to the lengths of recording the order of the initializations in add_type and replaying it when init is called a second time.  That solution is pretty much a non-starter :)

The mimetypes docs say that init can be called more than once.  They also say that a MimeTypes object starts out "with the same database as provided by the rest of the module".  The docs explain how the initial database state is created.

What the docs don't do is say what *happens* when you call init more than once.  There are two possibilities: either we (1) restart from the initial state, or we (2) start from the current (possibly modified) state of the database and then add whatever is specified in the init call.  (Actually, there's a third possibility: we could also add back in anything from the default init that was deleted; but this halfway version is unlikely to be anyone's intent or expectation.)

The actual implementation of the mimetypes module does (2) if and only if you pass init a list of files.  If you don't, then it does something that isn't even the third way above: it reloads *just* the data from the system files it managed to find, without reloading the data from the internal tables.
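A quick way to see what a given interpreter does on a second init() call is a probe like the one below.  This is only a sketch: the type and extension names are made up for the test, and the answer it prints depends on the Python version it runs under, so no particular result is claimed here.

```python
import mimetypes

# Hypothetical type and extension, chosen so that no real database defines them.
FAKE_TYPE = "application/x-hypothetical-demo"
FAKE_EXT = ".xhypodemo"


def custom_type_survives_reinit():
    """Return True if a type added with add_type is still known after init()."""
    mimetypes.init()                          # make sure the module database exists
    mimetypes.add_type(FAKE_TYPE, FAKE_EXT)   # modify the module database
    before = mimetypes.guess_type("sample" + FAKE_EXT)[0]
    mimetypes.init()                          # second call, no file list
    after = mimetypes.guess_type("sample" + FAKE_EXT)[0]
    return before == FAKE_TYPE and after == FAKE_TYPE


if __name__ == "__main__":
    print("custom type survives a second init():", custom_type_survives_reinit())
```

If the function returns True, the module kept its modified state across the second call; if it returns False, the database was rebuilt without the added entry.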

Clearly this behavior is... odd.  When no files are passed, init should do one of two things: either nothing, or reset the global db state to its initial value.

It's not so clear what the behavior should be when you pass init one or more files.  It is possible, even highly probable, that there is code out there that depends on the fact that doing so is additive.

Given this analysis, I think that the best fix would be to implement (and document) the following behavior for init:

  If called with no arguments, it rebuilds the module database from scratch

  If called with a list of files, it adds the contents of those files to the module database

The second is a backward compatibility hack.  Ideally it would be deprecated in favor of some sort of load_mime_files method.

It is possible that the first will also break code, but I think it is less likely, and probably an acceptable risk in a new major release.  But I'd be prepared to change it to 'init does nothing' if breakage showed up during RC testing.

The problem with this "fix" is that it does not, in fact, address the root cause of the OP's bug report.  The specific behavior he observes when calling init() would be fixed, but the underlying problem remains.  If he were to instead instantiate a new MimeTypes db, then when it "copies" the module database, it will build its own database by running the old database in key order, and once again the results returned by guess_extension might mutate.  This means that the new db is *not* a copy of the old db when it starts.
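The order dependence can be shown with a small model.  The sketch assumes, as a simplification, that the inverse map stores a list of extensions per type and that the guess is the first entry in that list; build_inverse and guess_extension here are illustrative helpers, not the module's functions.

```python
def build_inverse(pairs):
    """Build a type -> [extensions] map, appending in the order pairs arrive."""
    inverse = {}
    for ext, mime_type in pairs:
        inverse.setdefault(mime_type, []).append(ext)
    return inverse


def guess_extension(inverse, mime_type):
    """Return the first recorded extension, or None if the type is unknown."""
    exts = inverse.get(mime_type)
    return exts[0] if exts else None


# The same three entries, fed in two different orders.
pairs = [(".jpe", "image/jpeg"), (".jpg", "image/jpeg"), (".jpeg", "image/jpeg")]
original = build_inverse(pairs)
rebuilt = build_inverse(reversed(pairs))

if __name__ == "__main__":
    print(guess_extension(original, "image/jpeg"))  # .jpe
    print(guess_extension(rebuilt, "image/jpeg"))   # .jpeg
```

Both maps hold exactly the same entries, yet the guess differs, which is why a db rebuilt by iterating the old one in key order is not guaranteed to behave like the original.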

That problem could be fixed by having MimeTypes.__init__ copy the types_map and types_map_inv data structures instead of rebuilding them from scratch.  This would mean shifting the initialization of these structures out of MimeTypes and into init (in the 'reinitialize' code path) or perhaps into _default_mime_types, but I don't see that as a big problem, once init is doing a full reinitialization by default.  (There is also the question of whether it should be a 'deep copy', but I don't think that is needed since a user would need to be doing something pretty hackish to run afoul of a shallow-copy-induced problem.)
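As a rough sketch of the copy approach, assuming simplified flat tables (one dict of extension to type, one dict of type to list of extensions) rather than the module's actual layout; copy_tables is a hypothetical helper name:

```python
def copy_tables(types_map, types_map_inv):
    """Copy both tables so the new db starts out identical to the old one."""
    new_types_map = dict(types_map)
    # Copy each extension list too, so the order is preserved exactly and
    # later appends to the copy do not show up in the original.
    new_types_map_inv = {t: list(exts) for t, exts in types_map_inv.items()}
    return new_types_map, new_types_map_inv


if __name__ == "__main__":
    types_map = {".jpe": "image/jpeg", ".jpg": "image/jpeg"}
    types_map_inv = {"image/jpeg": [".jpe", ".jpg"]}
    new_map, new_inv = copy_tables(types_map, types_map_inv)
    print(new_inv["image/jpeg"][0])  # .jpe
```

Copying the per-type lists is one level more than a plain dict() copy; a strictly shallow copy would share those lists between the old and new db, which is the hackish corner case mentioned above.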

Can anyone see flaws in this analysis and proposed solution?  I've marked the fix as easy since a python hacker should be able to knock out a solution in a day, but it isn't trivial.  And I have no clue how to write a unit test for the MimeTypes.__init__ order-shifting bug.

I'm also resetting the priority to normal since I consider the ambiguity of what calling init twice actually does to be a bigger issue than it sometimes changing the results of a function with 'guess' in its name :)

I've attached a patch with a unit test for the 'init doesn't re-init' behavior.

(By the way, it also appears to me from reading the code that read_mime_types is buggy in that it actually returns a merge of the loaded file with the current module DB state, but I haven't checked that observation.)
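One way to check that observation is to load a file with a single made-up entry and look at what comes back.  The type and extension below are invented for the probe, and the printed counts depend on the interpreter and platform, so none are stated here.

```python
import mimetypes
import os
import tempfile

# Hypothetical type and extension, chosen so that no real database defines them.
FAKE_TYPE = "application/x-hypothetical-demo"
FAKE_EXT = ".xhypodemo"


def probe_read_mime_types():
    """Load a one-entry file and return the mapping read_mime_types gives back."""
    with tempfile.NamedTemporaryFile("w", suffix=".types", delete=False) as f:
        # mime.types format: the type followed by extensions without the dot.
        f.write(f"{FAKE_TYPE} {FAKE_EXT[1:]}\n")
        path = f.name
    try:
        return mimetypes.read_mime_types(path)
    finally:
        os.unlink(path)


if __name__ == "__main__":
    result = probe_read_mime_types()
    print("entries returned:", len(result))
    print("contains only the file's entry:", set(result) == {FAKE_EXT})
```

If the second line prints False, the returned mapping holds more than the file's contents, which would be consistent with the merge described above.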
History
Date                 User            Action  Args
2010-06-29 19:04:09  r.david.murray  set     recipients: + r.david.murray, terry.reedy, pitrou, siona, l0nwlf
2010-06-29 19:04:08  r.david.murray  set     messageid: <1277838248.92.0.300476634689.issue4963@psf.upfronthosting.co.za>
2010-06-29 19:04:01  r.david.murray  link    issue4963 messages
2010-06-29 19:04:00  r.david.murray  create