Author dhess
Recipients a.badger, barry, dhess, l0nwlf, martin.panter, pitrou, r.david.murray, siona, sivert, terry.reedy, wichert, wodny
Date 2017-05-13.22:39:35
Message-id <1494715176.03.0.499792216014.issue4963@psf.upfronthosting.co.za>
Content
Ok, I followed @r.david.murray's advice and decided to take a shot at this.

First, I noticed that I couldn't reproduce the non-deterministic behavior I reported above on the latest code (i.e. pre-3.7). After some research, this appears to have been the sequence of events:

1) Pre-3.3, hashing was stable and this wasn't a problem.
2) Hash randomization became the default in version 3.3 and this non-determinism showed up.
3) A new dict implementation was introduced in 3.6. Key order became stable between runs, so the non-determinism went away. However, as the notes on the new dict implementation indicate, this ordering should not be relied upon.
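
To make step 2 concrete, here is a minimal sketch (not part of the patch) showing that string hashes change with the hash seed. Before 3.6, dict iteration order followed from those hashes, so any "first extension" picked by iterating a dict could change between runs. The helper name hash_in_child is hypothetical; PYTHONHASHSEED is the standard environment variable.

```python
import os
import subprocess
import sys


def hash_in_child(seed: str) -> str:
    """Return hash('.jpeg') as computed by a child interpreter run with the given seed."""
    # Copy the current environment and pin the hash seed for the child only.
    env = dict(os.environ, PYTHONHASHSEED=seed)
    result = subprocess.run(
        [sys.executable, "-c", "print(hash('.jpeg'))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Same seed: same hash. Different seeds: (almost certainly) different hashes.
    print(hash_in_child("1"), hash_in_child("1"), hash_in_child("2"))
```

Running this with several seeds shows why the behavior was only reproducible on interpreters where dict order depended on the hash values.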

I also looked at some other issues:

* 6626 - The patch here basically rewrote the module. I agree with the last comment on that issue that it probably doesn't need that.
* 24527 - Related to the .init() problems discussed here in r.david.murray's excellent analysis of the init behavior.
* 1043134 - Where the preferred extension issue was addressed via a proposed new map.

My approach with this patch is to address the init problem, the non-determinism and the preferred extension issue.

For the init, I made two changes:

1) I added new references to the initial values of the maps so they could be retained between init() calls. I also modified MimeTypes.__init__ to refer to these.

2) I modified the init() function to check the files argument as r.david.murray suggested. If it is supplied, then the existing database is used and the files are added to it. If it is not supplied, then the module reinitializes from scratch. I'll update the documentation to reflect this if the commit passes muster.

For the non-determinism and the preferred extension, I changed the two extension type maps to be OrderedDicts. I sorted the entries passed to the OrderedDict constructor by MIME type and placed the preferred extension first among each type's extensions, so that it is the first one processed. This guarantees that it will be the extension returned by guess_extension. The OrderedDict also guarantees that guess_all_extensions will always build and return the same value.
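
As a rough illustration of the ordering idea (again a sketch, not the code in the commit): the sample entries, the PREFERRED table and the helper names below are all hypothetical. The point is only that when the forward map is an OrderedDict with the preferred extension first within each type, the derived type-to-extensions lists come out in a fixed order.

```python
from collections import OrderedDict

# Hypothetical sample data: (extension, MIME type) pairs in arbitrary order.
RAW_ENTRIES = [
    (".jpe", "image/jpeg"),
    (".jpeg", "image/jpeg"),
    (".jpg", "image/jpeg"),
    (".htm", "text/html"),
    (".html", "text/html"),
]

# Hypothetical table naming the preferred extension for each type.
PREFERRED = {"image/jpeg": ".jpg", "text/html": ".html"}


def build_types_map(entries, preferred):
    """Order entries by MIME type, with the preferred extension first per type."""
    def sort_key(item):
        ext, mime = item
        # False sorts before True, so the preferred extension leads its group.
        return (mime, ext != preferred.get(mime), ext)

    return OrderedDict(sorted(entries, key=sort_key))


def build_inverse(types_map):
    """Build type -> [extensions] by walking the forward map in order."""
    inverse = OrderedDict()
    for ext, mime in types_map.items():
        inverse.setdefault(mime, []).append(ext)
    return inverse


types_map = build_types_map(RAW_ENTRIES, PREFERRED)
inverse = build_inverse(types_map)

# The first entry of each list plays the role of guess_extension's result;
# the whole list plays the role of guess_all_extensions's result.
print(inverse["image/jpeg"][0], inverse["text/html"][0])
```

Because the order is fixed at construction time, the result no longer depends on hash values or on the dict implementation of a particular interpreter version.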

The commit can be reviewed here:

https://github.com/davidkhess/cpython/commit/ecabb1cb57e7e066a693653f485f2f687dcc7f6b

I'll open a PR if and when this approach gets enough positive feedback.