classification
Title: mimetypes read from the registry should not overwrite standard mime mappings
Type: behavior Stage: resolved
Components: Library (Lib), Windows Versions: Python 3.2, Python 2.7
process
Status: open Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: BreamoreBoy, benhoyt, fhamand, ggenellina, kovid, loewis, ocean-city, pitrou, r.david.murray, tercero12, tim.golden
Priority: normal Keywords:

Created on 2010-11-27 19:15 by kovid, last changed 2014-07-30 15:43 by BreamoreBoy.

Messages (13)
msg122542 - Author: Kovid Goyal (kovid) Date: 2010-11-27 19:15
Hi,

I am the primary developer of calibre (http://calibre-ebook.com), and yesterday I released an upgrade of calibre based on Python 2.7. Here is a small sampling of the diverse errors that my users experienced, all related to reading mimetypes from the registry:

1. Permission denied when running from a non-privileged account
Traceback (most recent call last):
File "site.py", line 103, in main
File "site.py", line 84, in run_entry_point
File "site-packages\calibre\__init__.py", line 31, in <module>
File "mimetypes.py", line 344, in add_type
File "mimetypes.py", line 355, in init
File "mimetypes.py", line 261, in read_windows_registry
WindowsError: [Error 5] Acceso denegado (Access not allowed)

The fix for this is to trap WindowsError and ignore it in mimetypes.py
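
A minimal sketch of that kind of guard, written for Python 3. The helper name `read_registry_safely` and the stand-in reader `denied` are hypothetical and are not part of mimetypes.py. `WindowsError` is a subclass of `OSError` on Python 2.7 and an alias of it on Python 3 under Windows, so the sketch catches `OSError` to stay portable.

```python
def read_registry_safely(reader):
    """Call reader() and report whether it succeeded.

    `reader` is a hypothetical stand-in for the registry-reading step.
    """
    try:
        reader()
    except OSError:
        # Covers WindowsError such as "[Error 5] access denied";
        # the caller can then fall back to the built-in defaults.
        return False
    return True


def denied():
    # Simulates the failure shown in the traceback above.
    raise PermissionError(5, "Access is denied")


print(read_registry_safely(denied))
```

With a guard like this, an unreadable registry key would degrade to the default mappings instead of aborting the import.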

2. Mishandling of encoding of registry entries

Traceback (most recent call last):
  File "site.py", line 103, in main
  File "site.py", line 84, in run_entry_point
  File "site-packages\calibre\__init__.py", line 31, in <module>
  File "mimetypes.py", line 344, in add_type
  File "mimetypes.py", line 355, in init
  File "mimetypes.py", line 260, in read_windows_registry
  File "mimetypes.py", line 250, in enum_types
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe0 in position 0: invalid continuation byte

The fix for this is to change

except UnicodeEncodeError

to

except ValueError
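
The broader clause helps because `UnicodeDecodeError` and `UnicodeEncodeError` are siblings: both derive from `UnicodeError`, which derives from `ValueError`, so a handler for one does not catch the other. A small sketch in Python 3 syntax follows; `decode_or_skip` is a hypothetical helper and not the code in mimetypes.py.

```python
def decode_or_skip(raw):
    """Return the decoded text, or None if the bytes are not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except ValueError:
        # UnicodeDecodeError is a ValueError, so it is caught here.
        return None


# The decode error is not an encode error, but it is a ValueError.
print(issubclass(UnicodeDecodeError, UnicodeEncodeError))
print(issubclass(UnicodeDecodeError, ValueError))

# 0xe0 followed by a non-continuation byte, as in the traceback above.
print(decode_or_skip(b"\xe0abc"))
```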

3. python -c "import mimetypes; print mimetypes.guess_type('img.jpg')"
('image/pjpeg', None)

Where the output should have been

('image/jpeg', None)

The fix for this is to load the registry entries before the default entries defined in mimetypes.py
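
To illustrate why the load order matters, here is a sketch using plain dictionaries. The two mappings and the function `merge_defaults_last` are made up for the example and do not mirror the internals of mimetypes.py.

```python
# Hypothetical data: what a registry might report versus the built-in table.
registry_types = {".jpg": "image/pjpeg", ".foo": "application/x-foo"}
default_types = {".jpg": "image/jpeg", ".png": "image/png"}


def merge_defaults_last(registry, defaults):
    """Apply registry entries first, then let the defaults override them."""
    merged = dict(registry)
    merged.update(defaults)
    return merged


merged = merge_defaults_last(registry_types, default_types)
print(merged[".jpg"])  # the default wins for extensions it knows about
print(merged[".foo"])  # registry-only extensions are still available
```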


Of course, IMHO, the best possible fix is to simply remove the reading of mimetypes from the registry. But that is up to whoever maintains this module. 

Duplicate (less comprehensive) tickets on this issue in your tracker already are: 9291, 10490, 104314

If the maintainer of this module is unable to fix these issues, let me know and I will submit a patch, either removing _winreg or fixing the issues individually.
msg122543 - Author: Ned Deily (ned.deily) * (Python committer) Date: 2010-11-27 19:34
The first issue you note appears to be a duplicate of Issue10162, a fix for which should be available in the 2.7.1 maintenance release.

The second issue appears to be a duplicate of Issue9291.  Since that issue is still open, I suggest any further discussion be pursued there.  You may want to add yourself to the nosy list of that issue.
msg122583 - Author: Kovid Goyal (kovid) Date: 2010-11-27 22:54
And what about the third issue?

Allow me to elaborate:

mimetypes are a relatively standard set of mappings from well known file extensions to MIME descriptors. 

Reading mimetype mappings from the registry, a location that is writable by random programs the user may have installed on his machine, let alone malware, is a BAD idea.

It leads to situations like asking for the mimetype of file.jpg and getting image/pjpeg back. Or asking for the mimetype of file.png and getting image/x-png back.

If you still consider it good to read mimetypes from the registry, at the very least, they should be read before the standard mimetype mappings defined in mimetypes.py are applied. That way at least for that set of mappings, users of python can be assured of sane query results. 

As it stands now, mimetypes.py is useless, and to work around the problem I essentially had to define by hand the mimetype mappings for all the mimetypes my program knows about.
msg122587 - Author: Ned Deily (ned.deily) * (Python committer) Date: 2010-11-27 23:12
(Sorry, I skipped over the third: this is one reason why one should not include multiple problems in one tracker issue.)

As to your third point, a quick search of "mimetypes" in the bugtracker shows that looking in the Windows registry for mimetypes was a new feature in 2.7 and the upcoming 3.2 added by Issue4969.

Adding the Windows maintainers and the Nosy List from that issue.
msg122589 - Author: Kovid Goyal (kovid) Date: 2010-11-27 23:20
I apologize for the multiple issues in the ticket. To my mind they were all basically one issue, stemming from the decision to read mimetypes from the registry.

Since there are other tickets for the first two issues, I'll change the summary for this issue to reflect only the third.
msg122924 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-11-30 17:33
Kovid: so essentially what you are saying is that the windows platform is broken with respect to MIME types and with respect to its security model.  Why am I not surprised? :)

You would have the same problem if software installation altered the /etc/mimetypes file on a unix box and created weird entries.  Perhaps unix programmers are just better disciplined?

Reading the registry first and having the built in settings override would IMO defeat the purpose of reading the values from the registry: those are (theoretically!!) the settings the user chose to change.

However, working around it in your program should be simple: just call mimetypes.init with an empty file list.  The windows registry is only read if the files parameter is None.  This will also give you consistent behavior on windows and unix: only the default mime types in the mimetypes module will be used.  If, on the other hand, you want to retain the Unix behavior, you can pass init mimetypes.knownfiles instead of the empty list.
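
A sketch of that workaround, assuming the behavior described above, where the registry is consulted only when `files` is None. Other Python versions may treat the first call to `init` differently, so this is an illustration and not a guarantee.

```python
import mimetypes

# Pass an explicit (empty) file list instead of None, as suggested above;
# no extra mime.types files are requested.
mimetypes.init(files=[])

guessed_type, encoding = mimetypes.guess_type("img.jpg")
print(guessed_type, encoding)
```

To keep the Unix behavior instead, the same call can be made as `mimetypes.init(mimetypes.knownfiles)`.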

(By the way, thanks very much for calibre, I have used the CLI tools to great benefit, and love the fact that the CLI is the basis of the program.)
msg122925 - Author: Kovid Goyal (kovid) Date: 2010-11-30 18:07
It is, of course, your decision, but IMO, since the mimetypes database in windows appears to be always broken, the default behavior of the mimetypes module in python 2.7 on windows is broken for most (all?) windows installs. For me personally, it doesn't matter anymore, as I have already fixed calibre, but it would be surprising/unexpected behavior for someone new to using mimetypes.py on windows. Certainly, my expectation (perhaps naively) was that guess_type('image.jpg') would always return 'image/jpeg'. 

Users on windows rarely (ever?) modify the registry to change mimetypes. The only thing that does change mimetypes is installed software, without the users' knowledge/consent. So treating the registry as a reliable store of mime information is not a good idea.

On unix, the knownfiles are system files. I don't know about OS X, but on linux, since most software is installed by package managers, the package managers usually have policies that prevent application installs from clobbering system files. And of course, running userland applications don't have the necessary privileges to modify the files.

Out of curiosity, what is the upside of reading mimetypes from the registry, given that its information cannot be trusted?

And you're most welcome, for calibre :)
msg122926 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-11-30 18:42
I would expect that it would not be people new to mimetypes that would have the issues, but people like you for whom the behavior on Windows has changed.  And this is indeed a concern.

The people involved in making the windows mimetypes enhancement are nosy on this ticket, perhaps they will have thoughts on the issue of the (in)validity of the windows mime data.
msg122928 - Author: Kovid Goyal (kovid) Date: 2010-11-30 18:56
I actually had in mind people that (like me) develop primarily on unix and assume that mimetypes works the same way on both windows and unix. Of course, the changed behavior is also a concern.

At the very least, I would encourage the addition of a warning to the documentation of the mimetypes module.
msg172531 - Author: Ben Hoyt (benhoyt) * Date: 2012-10-09 21:21
This is definitely a real issue, and makes mimetypes.guess_type() useless out of the box on Windows.

However, I believe the reason it's broken is that the fix for Issue4969 doesn't actually work, and I'm not sure this is possible with the Windows registry.

You see, "MIME\Database\Content Type" in the Windows registry is a mime type -> file extension mapping, *not the other way around*. But read_windows_registry() tries to use it as a file extension -> mime type mapping, and bad things happen, because there are multiple mime types for certain file extensions.

As far as I can tell, there's nothing in the Windows registry that says which is the "canonical" mime type for a given extension. Again, this is because Microsoft intends it (and uses it) as a mime type -> extension mapping. See more here: http://msdn.microsoft.com/en-us/library/ms775148(v=vs.85).aspx

For example, in my "MIME\Database\Content Type" we have:

image/jpeg -> .jpg
image/jpg -> .jpg
image/pjpeg -> .jpg

And read_windows_registry() picks the last one for .jpg, which in this case is image/pjpeg -- NOT what users expect.
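
The effect can be reproduced with a few lines that invert a mime type -> extension listing the naive way. This is only a sketch: `naive_invert` is a hypothetical function and not the actual body of read_windows_registry(), and the pairs are the three entries listed above.

```python
# The three registry entries from the example, as (mime type, extension).
content_types = [
    ("image/jpeg", ".jpg"),
    ("image/jpg", ".jpg"),
    ("image/pjpeg", ".jpg"),
]


def naive_invert(pairs):
    """Build extension -> mime type; later entries silently overwrite earlier ones."""
    ext_to_type = {}
    for mime_type, ext in pairs:
        ext_to_type[ext] = mime_type
    return ext_to_type


print(naive_invert(content_types)[".jpg"])
```

Whichever entry happens to be enumerated last wins, which is why the result depends on the contents and ordering of the registry and not on any notion of a canonical type.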

In short, I think the fix for Issue4969 is broken as is, and that you can't actually use the mime types database in the Windows registry in this way. I suggest reverting the fix for Issue4969.

Or, we could get clever and only use the Windows registry value if there's a single mime type -> extension mapping for a given extension, and if there's more than one (meaning it'd be ambiguous), use the mimetypes default from types_map / common_types.
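
A rough sketch of that second idea, under the assumption that the registry entries are available as (mime type, extension) pairs. The function `merge_unambiguous` and the sample data are hypothetical and only illustrate the rule; this is not a proposed patch.

```python
from collections import defaultdict


def merge_unambiguous(registry_pairs, defaults):
    """Start from the defaults and add registry entries only when an
    extension maps to exactly one mime type in the registry."""
    types_by_ext = defaultdict(set)
    for mime_type, ext in registry_pairs:
        types_by_ext[ext].add(mime_type)

    merged = dict(defaults)
    for ext, mime_types in types_by_ext.items():
        if len(mime_types) == 1:
            # Unambiguous: trust the registry value.
            merged[ext] = next(iter(mime_types))
        # Ambiguous: keep whatever the defaults say (possibly nothing).
    return merged


registry_pairs = [
    ("image/jpeg", ".jpg"),
    ("image/pjpeg", ".jpg"),
    ("application/x-foo", ".foo"),
]
defaults = {".jpg": "image/jpeg"}
print(merge_unambiguous(registry_pairs, defaults))
```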
msg224275 - Author: Mark Lawrence (BreamoreBoy) * Date: 2014-07-30 00:02
msg185039 from #4969 also complains about this issue.  I agree with the solution put forward in the last sentence of msg172531. If we think this is the best idea I'll work on a patch unless anybody else wants to pick this up.
msg224281 - Author: Ben Hoyt (benhoyt) * Date: 2014-07-30 00:54
Mark, are you referring to part 3 of this issue, the image/pjpeg type of problem? This was fixed in Python 2.7.6 -- see changeset http://hg.python.org/cpython/rev/e8cead08c556 and http://bugs.python.org/issue15207
msg224317 - Author: Mark Lawrence (BreamoreBoy) * Date: 2014-07-30 15:43
Ben, you're correct.  The other issues have been addressed in #10162 and #9291 so I believe this can be closed.  One 2.7 regression regarding mixed str and unicode objects is addressed in #21652.
History
Date | User | Action | Args
2014-07-30 15:43:15 | BreamoreBoy | set | messages: + msg224317
2014-07-30 00:54:05 | benhoyt | set | messages: + msg224281
2014-07-30 00:02:30 | BreamoreBoy | set | nosy: + fhamand, BreamoreBoy, - brian.curtin; messages: + msg224275
2012-10-10 18:39:26 | ned.deily | set | nosy: - ned.deily
2012-10-09 21:21:43 | benhoyt | set | nosy: + benhoyt; messages: + msg172531
2010-11-30 18:56:41 | kovid | set | status: pending -> open; messages: + msg122928
2010-11-30 18:42:37 | r.david.murray | set | status: closed -> pending; messages: + msg122926
2010-11-30 18:07:50 | kovid | set | messages: + msg122925
2010-11-30 17:33:56 | r.david.murray | set | status: open -> closed; type: behavior; nosy: + r.david.murray; messages: + msg122924; resolution: not a bug; stage: resolved
2010-11-27 23:20:21 | kovid | set | messages: + msg122589; title: mimetypes reading from registry in windows completely broken -> mimetypes read from the registry should not overwrite standard mime mappings
2010-11-27 23:12:41 | ned.deily | set | versions: + Python 3.2; nosy: + ggenellina, pitrou, tim.golden, ocean-city, tercero12, loewis; messages: + msg122587; superseder: mimetypes initialization fails on Windows because of non-Latin characters in registry ->
2010-11-27 22:54:31 | kovid | set | status: closed -> open; resolution: duplicate -> (no value); messages: + msg122583
2010-11-27 19:34:18 | ned.deily | set | status: open -> closed; superseder: mimetypes initialization fails on Windows because of non-Latin characters in registry; components: + Windows; nosy: + brian.curtin, ned.deily; messages: + msg122543; resolution: duplicate
2010-11-27 19:15:19 | kovid | create |