Created on 2012-06-27 16:03 by dlchambers, last changed 2013-04-18 00:27 by benhoyt.
|mimetype_flaw_demo.zip||dlchambers, 2012-06-27 16:03||Demo of bug and proposed fix|
|mimetypes.py.diff||dlchambers, 2012-06-27 17:54||diff file|
|mimetypes.py.diff.u||dlchambers, 2012-06-27 19:54||diff with universal format|
|issue15207.patch||ishimoto, 2012-07-25 03:06||review|
|mt.py||tim.golden, 2013-04-17 12:05||Show mimetype mappings|
|mt-tip.txt||tim.golden, 2013-04-17 12:07|
|mt-tip-newregistry.txt||tim.golden, 2013-04-17 12:07|
|mt-tip-noregistry.txt||tim.golden, 2013-04-17 12:07|
|different.txt||benhoyt, 2013-04-18 00:27|
|msg164167 - (view)||Author: Dave Chambers (dlchambers)||Date: 2012-06-27 16:03|
The current mimetypes.read_windows_registry() enums the values under HKCR\MIME\Database\Content Type However, this is the key for mimetype to extension lookups, NOT for extension to mimetype lookups. As a result, when >1 MIME types are mapped to a particular extension, the last-found entry is used. For example, both "image/png" and "image/x-png" map to the ".png" file extension. Unfortunately, what happens is this code finds "image/png", then later finds "image/x-png" and this steals the ".png" extension. The solution is to use the correct regkey, which is the HKCR root. This is the correct location for extension-to-mimetype lookups. What we should do is enum the HKCR root, find all subkeys that start with a dot (i.e. file extensions), then inspect those for a 'Content Type' value. The attached ZIP contains: mimetype_flaw_demo.py - this demonstrates the error (due to wrong regkey) and my fix (uses the correct regkey) mimetypes_fixed.py - My suggested fix to the standard mimetypes.py module.
|msg164176 - (view)||Author: R. David Murray (r.david.murray) *||Date: 2012-06-27 17:39|
Thanks for working on this. Could you please post the fix as a patch file? If you don't have mercurial, you can generate the diff on windows using the python diff module (scripts/diff.py -u <yourfile> <origfile>). Actually, I'm not sure exactly where diff is in the windows install, but I know it is there. Do you know if image/x-png and image/png are included in the registry on all windows versions? If so we could use that key for a unit test.
|msg164177 - (view)||Author: Dave Chambers (dlchambers)||Date: 2012-06-27 17:54|
My first diff file... I hope I did it right :)
|msg164180 - (view)||Author: Dave Chambers (dlchambers)||Date: 2012-06-27 18:06|
I added a diff file to the bug. Dunno if that's the same as a patch file, or how to create a patchfile if it's not. >Do you know if image/x-png and image/png are included in the registry on all > windows versions? I think your question is reversed, in the same way that the code was reversed. You're not looking for image/png and/or image/x-png. You're looking for .png in order to retrieve its mimetype (aka Content Type). While nothing is 100% certain on Windows :), I'm quite confident that every copy will have an HKCR\.png regkey, and that regkey will have a Content Type value, and that value's setting will be the appropriate mometype, which I'd expect to be image/png. I was kinda surprised to find this bug as it's so obvious I started chasing it because Chrome kept complaining that pngs were being served as image/x-png (by CherryPy). There are other bugs (eg: 15199, 10551) that my patch should fix. -Dave
|msg164191 - (view)||Author: R. David Murray (r.david.murray) *||Date: 2012-06-27 19:30|
Well, I had no involvement in the windows registry reading stuff, and it is relatively new. And, as issue 10551 indicates, a bit controversial. (issue 15199 is a different bug, in python's own internal table). Can you run that diff again and use the '-u' flag? The -u (universal) format is the one we are used to working with. The one you posted still lets us read the changes, though, which is very helpful.
|msg166351 - (view)||Author: Atsuo Ishimoto (ishimoto) *||Date: 2012-07-25 03:06|
This patch looks good to me. I generated a patch for current trunk, with some cosmetic changes.
|msg168833 - (view)||Author: Yap Sok Ann (sayap)||Date: 2012-08-22 01:17|
On Python 2.7, I need to add this to the original diff by Dave, in the same try-except block: mimetype = mimetype.encode(default_encoding) # omit in 3.x!
|msg168918 - (view)||Author: R. David Murray (r.david.murray) *||Date: 2012-08-23 01:47|
Unfortunately I don't feel qualified to review the patch itself since I'm not a windows user and don't currently even have a windows box to test on. Hopefully one of the windows devs will take a look; the patch looks to be fairly straightforward to evaluate if one understands _winreg.
|msg177030 - (view)||Author: Ben Hoyt (benhoyt)||Date: 2012-12-06 07:57|
Ah, thanks for making this an issue of its own! As I commented over at Issue10551, it's a serious problem, and makes mimetypes.guess_type() unusable out of the box on Windows. Yes, the fix in Issue4969 uses "MIME\Database\Content Type", which is a mime type -> file extension mapping, *not the other way around*. So while this patch is definitely an improvement (for the most part it doesn't produce wrong values!), but I'm not sure it's the way to go, for a few reasons: 1) Many of the most important keys aren't in the Windows registry (in HKEY_CLASSES_ROOT, where this patch looks). This includes .png, .jpg, and .gif. All of these important types fall back to the hard-coded "types_map" in mimetypes.py anyway. 2) Some that do exist are wrong in the registry (or at the least, different from the built-in "types_map"). This includes .zip, which is "application/x-zip-compressed" (at least in my registry) but should be "application/zip". 3) It's slowish, as it has to load about 6000 registry keys (and more as you install more stuff on your system), but only about 200 of those have the "Content Type" subkey. On my machine (Windows 7, 64 bit CPython) this adds over 100ms to the startup time even on subsequent runs when cached -- and I think 100ms is pretty significant. Issue4969's version takes about 25ms, and reverting this completely would of course take 0ms. 4) Users and other programs can (and sometimes do!) change the Content Type keys in the registry -- whereas one wants mime type mappings to be consistent across systems. This last point is debatable for various reasons, and I think the above three points should carry the day, but I include it here for completeness. ;-) For these reasons, I think we should revert the fix for Issue4969 and leave Windows users to get the default types_map as before, which is at least consisent -- and for mimetypes.guess_type(), you want consistency.
|msg177085 - (view)||Author: Dave Chambers (dlchambers)||Date: 2012-12-07 13:01|
Disappointing that "faster but broken" is preferable to "slower but fixed"
|msg177089 - (view)||Author: R. David Murray (r.david.murray) *||Date: 2012-12-07 13:33|
I will note that on unix the user is also free to update the machine's mime types registry (that's more than half the point of the mimetypes module). Usually this is only done by installed software...as I believe is the case on Windows as well. That said, there should be a way to explicitly bypass this loading of local data for a program that wishes to use only the Python supplied types. And indeed, this is possible: just pass an empty list of filenames to init. This bypasses the windows registry lookup. (Note that this could be better documented...it is not made explicit that an empty list is different from not specifying a list or specifying it as None, but it is).
|msg177091 - (view)||Author: R. David Murray (r.david.murray) *||Date: 2012-12-07 13:37|
That said, the fact that windows is just *wrong* about some mimetypes is definitely an issue. We could call it a platform bug, but that would be a disservice to the user community.
|msg177092 - (view)||Author: Dave Chambers (dlchambers)||Date: 2012-12-07 13:46|
Seems to me that some hybrid would be a good solution: Hardcode the known types (which solves the "windows is just wrong" case) then as a default look in the registry for those that aren't hardcoded. Therefore the hit of additional time would only be for lesser-known types. In any case, it's pretty bad that python allows the wrong mimetype for PNG , even if it is a Windows registry issue.
|msg177096 - (view)||Author: R. David Murray (r.david.murray) *||Date: 2012-12-07 14:51|
To be consistent with the overall philosophy of the mimetypes module, it should be instead a list of "windows fixes" which are applied if the broken mimetype is found in the windows registry. If you want to avoid the overhead, pass an empty list to init. A note about the overhead and fixes should be added to the docs.
|msg177257 - (view)||Author: Ben Hoyt (benhoyt)||Date: 2012-12-10 03:18|
Either way -- this needs to be reverted or fixed. It's a nasty gotcha for folks writing Python web services at the moment. I'm still for reverting, per my reasons above. Dave Chambers, I'm not for "faster but broken" but for "faster and fixed" -- from what I've shown above, it's the Windows registry that's broken, so removing read_windows_registry() entirely would fix this (and as a bonus, be faster and simplify the code :-). Per your suggestion http://bugs.python.org/issue15207#msg177092 -- I don't understand how mimetypes.py would know the types "that aren't hardcoded". R. David Murray, I don't understand the advantage of trying to maintain a list of "Windows fixes". What if this list was wrong, or there was a Windows update which broke more mime types? Why can't we just avoid the complication and go back to the hardcoded types for Windows?
|msg177259 - (view)||Author: Dave Chambers (dlchambers)||Date: 2012-12-10 03:31|
> removing read_windows_registry() If you're suggesting hardcoding *ALL* the mimetypes for *ALL* OSes, I think that's probably the best overall solution. No variability, as fast as can be. The downside is that there would occasionally be an unrecognized type, thus there'd need to be diligence to keep the hardcoded list up to date, but overall I think Ben Hoyt's suggestion is best.
|msg177260 - (view)||Author: Ben Hoyt (benhoyt)||Date: 2012-12-10 03:35|
Actually, I was suggesting using the hardcoded types for Windows only (i.e., only removing read_windows_registry). Several bugs have been opened on problems with the Windows registry mimetypes, but as far as I know this isn't an issue on Linux -- in other words, if Linux/other systems ain't broke, no need to fix them.
|msg177280 - (view)||Author: R. David Murray (r.david.murray) *||Date: 2012-12-10 12:19|
I'm personally OK with the option of removing the registry support (or making it optional-by-default), but I'm not going to make that call, we need a windows dev opinion. Maintaining the list of windows exceptions shouldn't be much worse than maintaining the list of mime types. I can't imagine that Microsoft changes it all that often, given that you say they haven't bothered to update the zip type yet.
|msg177286 - (view)||Author: Dave Chambers (dlchambers)||Date: 2012-12-10 13:09|
(I'm a windows dev type) I would say that there are 2 issues with relying on the registry: 1) Default values (ie. set by Windows upon OS install) are broken and MS never fixes them. 2) The values can be changed at any time, by any app. Thus the values are unreliable. If I were to code it from scratch today, I'd create a three-pronged approach: a) Hardcode a list of known types (fast & reliable). b) Have a default case where unknown types are pulled from the registry. Whatever value is retrieved is likely better than returning e.g. "application/octet-stream". c) When we neither find it in hardcoded list or in the registry, return a default value (e.g. "application/octet-stream") For what it's worth, my workaround will be to have my app delete the HKCR\MIME\Database\Content Type\image/x-png regkey, thus forcing the original braindead mimetypes.py code to use HKCR\MIME\Database\Content Type\image/png And, for what it's worth, my patch is actually faster than the current mimetypes.py code because I'm not doing reverse lookups. Thus any argument about a difference in speed is moot. Arguments about the speed of pulling mimetypes from registry are valid. Another registry based approach would be to build a dictionary of mimetypes on demand. In this scenario, at startup, the dictionary would be empty. When python needs the mimetype for ".png", on the 1st request it would cause a "slow" registry lookup for only that type but on all subsequent requests for the type it would use the "fast" value from the dictionary. Given that an app will probably use only a handful of mimetypes but will use that same handful over and over, such a solution would have the benefits of (a) not using hardcoded values (thus no ongoing maintenance), (b) performing slow stuff only on demand, (c) optimizing repeat calls, and (d) consuming zero startup time. I'll code his up & run some timing tests if anyone thinks it's worthwhile. BTW, who makes the final determination as to if/when any such changes would be incorporated?
|msg177289 - (view)||Author: R. David Murray (r.david.murray) *||Date: 2012-12-10 13:35|
I would say Brian Curtin, Tim Golden, and/or Martin von Löwis, as they are the currently active committers with significant Windows expertise. Other committers may have opinions as well. If you don't get an answer here in a reasonable amount of time, please post a discussion of the issue to python-dev (it may end up there anyway).
|msg177311 - (view)||Author: Terry J. Reedy (terry.reedy) *||Date: 2012-12-10 20:09|
Gabriel and Antoine: As I understand it, the claim in this issue is that the patch in #4969 (G. wrote, A. committed) is unsatisfactory. I think it would help if either of you commented.
|msg177313 - (view)||Author: Antoine Pitrou (pitrou) *||Date: 2012-12-10 20:21|
I'll leave it to a Windows expert.
|msg177314 - (view)||Author: Tim Golden (tim.golden)||Date: 2012-12-10 20:22|
Sorry; late to the party. I'll try to take a look at the patches. Basically I'm sympathetic to the problem (which seems quite straightforwardly buggish) but I want to take a look around the issue first. TJG
|msg181006 - (view)||Author: Ben Hoyt (benhoyt)||Date: 2013-01-31 03:36|
Any update on this, Tim or other Windows developers?
|msg181007 - (view)||Author: Brian Curtin (brian.curtin) *||Date: 2013-01-31 03:45|
I can't comment on what the change should be or how it should be done as I don't do anything with mimetypes, but nothing about how the patch was written jumps out at me for being incorrect (except I would not include ishimoto's name changes). If there's a consensus that this is the appropriate change to be made, the patch still needs tests.
|msg187156 - (view)||Author: Tim Golden (tim.golden)||Date: 2013-04-17 12:05|
Attached is a q&d script to produce the list of extension -> mimetype maps for a version of the mimetypes module.
|msg187157 - (view)||Author: Tim Golden (tim.golden)||Date: 2013-04-17 12:07|
Three outputs produced by mt.py: tip as-is; tip without registry; tip with new approach to registry. The results for 2.7 are near-enough identical. Likewise the results for an elevated prompt.
|msg187158 - (view)||Author: Tim Golden (tim.golden)||Date: 2013-04-17 12:23|
There seems to be a consensus that the current behaviour is undesirable, indeed "broken" for any meaningful use. The critical argument against the current Registry approach is that it returns unexpected (or outright incorrect) mimetypes for very standard extensions. The arguments against reading the Registry at all are: * That it requires some extra level of privilege to read the appropriate keys. * That there is a startup cost to reading the Registry * That it can be and is updated by arbitrary programs (typically during installation) and therefore its values cannot be relied upon. We have 3.5 proposals on the table: 1) Don't read the registry at all, ie revert issue4969 (this is what Ben Hoyt is advocating) [noregistry] 2) Read the registry *before* reading the standard types (this is not strongly advocated by anyone). 3) Read the registry but in a different way, mapping from extension to mimetype rather than vice versa. (This is Dave Chambers' patch from issue15207). [newregistry] 3a) Lookup as per (3) but only on demand. This eliminates any startup cost. I've produced three output files from a simple dump of the mimetypes database. For the purposes of taking this forward, we're really comparing the noregistry and the newregistry variants. One key issue is what to do when the same key occurs in both sets but with a different value. (Examples include .avi -> video/x-msvideo vs video/avi; and .zip -> application/zip vs application/x-zip-compressed). And the other key issue is whether the overheads (security, speed) of using the registry outweigh its usefulness in any case. Could I ask those who would remove the registry use altogether to comment on the newregistry output (generating your own if it helps) to see whether it changes your views at all. Either approach -- no-registry or new-registry -- feasible and the code churn is about equal. I'm unsure about compatibility issues: it's hard to see anyone relying on the incorrect mimetypes; but it's certainly possible to see someone relying on the additional (correct) mimetypes.
|msg187159 - (view)||Author: Marc-Andre Lemburg (lemburg) *||Date: 2013-04-17 12:45|
I think it's important to stick to established standards for MIME types and to make sure that Python returns the same values on all platforms using the default settings. Apache comes with a mime.types file which includes both the official IANA types and many common, but unregistered types: http://svn.apache.org/viewvc/httpd/httpd/trunk/docs/conf/mime.types?view=markup This can be used as reference (much like we use the X.org locale database as reference for the locale module). If an application wants to use the Windows registry settings instead, that's fine, but it should not be the default if there's a difference in output compared to the hard-coded list in mimetypes. Note that this would probably require a redesign of the internals of the mimetypes module. It currently provides only a small subset as default mapping and then reads the full set from one of the mime.types files it can find on the system. Such a redesign would only be possible for Python 3.4, not Python 2.7.
|msg187167 - (view)||Author: Dave Chambers (dlchambers)||Date: 2013-04-17 14:29|
Enough with the bikeshedding... it's been 10 months... fix the bug.
|msg187168 - (view)||Author: Brian Curtin (brian.curtin) *||Date: 2013-04-17 14:43|
Just an FYI, but if it takes 10 more months to get it right, we'll do that.
|msg187216 - (view)||Author: Ben Hoyt (benhoyt)||Date: 2013-04-18 00:27|
Okay, I'm looking at the diff between mt-tip-noregistry.txt and mt-tip-newregistry.txt, and I've attached a file showing the lines that are *different* between the two, as well as the Apache mime.types value for that file extension. In most cases, noregistry gives the right mime type, and newregistry is wrong. However, in a few cases, the registry value is right (i.e., according to Apache's mime.types). However, I think that's a totally separate issue, and maybe we should probably open a bug to update a few of the hard-coded mappings in mimetypes.py. The cases where noregistry is right (according to Apache): * .aif * .aifc * .aiff * .avi * .sh * .wav * .xsl *. zip The cases where noregistry is wrong (according to Apache): * .bmp is hard-coded as "image/x-ms-bmp", but it should be image/bmp * .dll and .exe are hard-coded as "application/octet-stream", but should be "application/x-msdownload" * .ico is hard-coded as "image/vnd.microsoft.icon" but should be "image/x-icon" * .m3u is hard-coded as "application/vnd.apple.mpegurl" but should be "audio/x-mpegurl" None of these are standardized IANA mime types, and they're not particularly common for web servers to be serving, so it probably doesn't matter too much that the current hard-coded values are wrong. Also, I'm guessing web browsers will interpret the older type image/x-ms-bmp as image/bmp anyway. So maybe we should open another issue to fix the hard-coded types in mimetypes.py shown above, but again, that's another issue. The other thing here is all the *new types* that the registry adds, such as ".acrobatsecuritysettings". I don't see that these add much value -- just a bunch of types that depend on the programs you have installed. And in my mind at least, the behaviour of mimetypes.guess_type() should not change based on what programs you have installed. In short, "noregistry" gives more accurate results in most cases that matter, and I still strongly feel we should revert to that. (The only alternative, in my opinion, is to switch to Dave Chambers' version of read_windows_registry(), but not call it by default.)
messages: + msg187216
|2013-04-17 14:43:27||brian.curtin||set||messages: + msg187168|
|2013-04-17 14:29:27||dlchambers||set||messages: + msg187167|
messages: + msg187159
|2013-04-17 12:23:36||tim.golden||set||messages: + msg187158|
+ mt-tip.txt, mt-tip-newregistry.txt, mt-tip-noregistry.txt|
messages: + msg187157
messages: + msg187156
|2013-01-31 03:45:51||brian.curtin||set||messages: + msg181007|
|2013-01-31 03:36:42||benhoyt||set||messages: + msg181006|
|2012-12-10 20:22:55||tim.golden||set||messages: + msg177314|
|2012-12-10 20:21:26||pitrou||set||stage: commit review -> patch review|
versions: + Python 2.7, Python 3.3, Python 3.4
versions: - Python 2.7, Python 3.3
+ gagenellina, pitrou|
messages: + msg177311
|2012-12-10 13:35:12||r.david.murray||set||messages: + msg177289|
|2012-12-10 13:09:14||dlchambers||set||messages: + msg177286|
|2012-12-10 12:19:47||r.david.murray||set||messages: + msg177280|
|2012-12-10 03:35:24||benhoyt||set||messages: + msg177260|
|2012-12-10 03:31:56||dlchambers||set||messages: + msg177259|
|2012-12-10 03:18:31||benhoyt||set||messages: + msg177257|
|2012-12-07 14:51:13||r.david.murray||set||messages: + msg177096|
|2012-12-07 13:46:06||dlchambers||set||messages: + msg177092|
|2012-12-07 13:37:00||r.david.murray||set||messages: + msg177091|
|2012-12-07 13:33:44||r.david.murray||set||messages: + msg177089|
|2012-12-07 13:01:40||dlchambers||set||messages: + msg177085|
messages: + msg177030
messages: + msg168918
stage: needs patch -> commit review
messages: + msg168833
nosy: + ishimoto
messages: + msg166351
|2012-06-27 19:54:01||dlchambers||set||files: + mimetypes.py.diff.u|
+ tim.golden, brian.curtin|
messages: + msg164191
|2012-06-27 18:06:03||dlchambers||set||messages: + msg164180|
keywords: + patch
messages: + msg164177
+ Python 3.2, Python 3.3|
nosy: + barry, r.david.murray
messages: + msg164176
components: + email
stage: needs patch