classification
Title: mimetypes initialization fails on Windows because of non-Latin characters in registry
Type: behavior
Stage: patch review
Components: Library (Lib), Windows
Versions: Python 2.7
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: Dmitry.Jemerov, Vladimir Iofik, aclover, brian.curtin, eric.araujo, frankoid, haypo, kaizhu, r.david.murray, vldmit
Priority: normal
Keywords: easy, patch

Created on 2010-07-18 11:54 by Dmitry.Jemerov, last changed 2011-03-08 14:30 by pitrou.

Files
File name Uploaded Description
9291.patch Dmitry.Jemerov, 2010-07-23 12:17
9291a.patch Vladimir Iofik, 2010-10-22 06:11 Issue 9291 patch
Messages (9)
msg110637 Author: Dmitry Jemerov (Dmitry.Jemerov) Date: 2010-07-18 11:54
On Windows, mimetypes initialization reads the list of MIME types from the Windows registry. It assumes that every value is plain ASCII (the default encoding) and fails on non-Latin characters, with the following exception:

Traceback (most recent call last):
  File "mttest.py", line 3, in <module>
    mimetypes.init()
  File "c:\Python27\lib\mimetypes.py", line 355, in init
    db.read_windows_registry()
  File "c:\Python27\lib\mimetypes.py", line 260, in read_windows_registry
    for ctype in enum_types(mimedb):
  File "c:\Python27\lib\mimetypes.py", line 250, in enum_types
    ctype = ctype.encode(default_encoding) # omit in 3.x!
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 0: ordinal not in range(128)

This can be reproduced, for example, on a Russian Windows XP installation with QuickTime installed (QuickTime creates the non-Latin entries in the registry). The following line triggers the exception:

import mimetypes; mimetypes.init()
msg110760 Author: R. David Murray (r.david.murray) (Python committer) Date: 2010-07-19 14:30
I'm guessing this problem doesn't occur in 3.x?  If so, the quick fix would be to have the registry code catch UnicodeError instead of UnicodeEncodeError.  That may be the correct fix anyway.

The "fun" part of this bug is going to be creating a unit test for it.
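
As a rough illustration of the quick fix suggested above, the sketch below shows why catching UnicodeError covers both cases while catching UnicodeEncodeError alone does not. The helper `encode_or_skip` is hypothetical and is not the actual mimetypes code; it is written for Python 3, where only the encode error can be triggered this way.

```python
def encode_or_skip(values, encoding="ascii"):
    """Hypothetical helper: encode each value, skipping unencodable ones."""
    result = []
    for value in values:
        try:
            result.append(value.encode(encoding))
        except UnicodeError:
            # UnicodeError is the common base class, so this handler
            # covers UnicodeEncodeError and UnicodeDecodeError alike.
            continue
    return result


# Both specific exceptions derive from UnicodeError ...
assert issubclass(UnicodeEncodeError, UnicodeError)
assert issubclass(UnicodeDecodeError, UnicodeError)
# ... but a handler for UnicodeEncodeError would miss a decode error.
assert not issubclass(UnicodeDecodeError, UnicodeEncodeError)

# The second value starts with a Cyrillic letter and is skipped.
encoded = encode_or_skip(["text/plain", "\u0430udio/x-example"])
```

Skipping the offending entry is only one possible policy; whether skipping or preserving such registry values is the right behavior is a separate question from which exception to catch.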
msg110881 Author: Dmitry Jemerov (Dmitry.Jemerov) Date: 2010-07-20 10:11
The problem doesn't happen on Python 3.1.2 because it doesn't have the code in mimetypes that accesses the Windows registry. Haven't tried the 3.2 alphas yet.
msg111288 Author: Dmitry Jemerov (Dmitry.Jemerov) Date: 2010-07-23 12:17
Patch (suggested fix and unittest) attached.
msg111291 Author: Dmitry Jemerov (Dmitry.Jemerov) Date: 2010-07-23 12:17
And by the way I've verified that the problem doesn't happen in py3k trunk.
msg111318 Author: R. David Murray (r.david.murray) (Python committer) Date: 2010-07-23 13:35
And just for clarity: py3k trunk does contain the _winreg code path.
msg113662 Author: kai zhu (kaizhu) Date: 2010-08-12 07:13
Python 3.1.2 mimetypes initialization also fails on Red Hat Linux:


>>> import http.server
Traceback (most recent call last):
  File "/home/public/i386-redhat-linux-gnu/python/lib/python3.1/http/server.py", line 588, in <module>
    class SimpleHTTPRequestHandler(BaseHTTPRequestHandler):
  File "/home/public/i386-redhat-linux-gnu/python/lib/python3.1/http/server.py", line 764, in SimpleHTTPRequestHandler
    mimetypes.init() # try to read system mime.types
  File "/home/public/i386-redhat-linux-gnu/python/lib/python3.1/mimetypes.py", line 305, in init
    db.readfp(open(file))
  File "/home/public/i386-redhat-linux-gnu/python/lib/python3.1/mimetypes.py", line 209, in readfp
    line = fp.readline()
  File "/home/public/i386-redhat-linux-gnu/bin/../python/lib/python3.1/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 3921: ordinal not in range(128)
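
The traceback above suggests that the file was opened in text mode with an ASCII default encoding and contained a non-ASCII byte. The sketch below reproduces that situation with a temporary file; the sample content and the `read_lines` helper are hypothetical, and the lenient `errors="replace"` variant is only one possible workaround, not necessarily the fix adopted for this issue.

```python
import os
import tempfile

# Hypothetical stand-in for a system mime.types file containing a
# non-ASCII byte sequence (0xc2 0xa0 is a UTF-8 no-break space).
sample = b"text/plain txt\n# note\xc2\xa0here\naudio/mpeg mp3\n"

fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(sample)


def read_lines(path, **kwargs):
    """Read all lines in text mode, passing kwargs through to open()."""
    with open(path, **kwargs) as fp:
        return fp.readlines()


# Strict ASCII decoding fails on the 0xc2 byte, as in the traceback.
try:
    read_lines(path, encoding="ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Replacing undecodable bytes lets the ASCII-only entries be read.
lenient = read_lines(path, encoding="ascii", errors="replace")

os.remove(path)
```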
msg119362 Author: Vladimir Iofik (Vladimir Iofik) Date: 2010-10-22 06:11
Here is a better patch.
msg119364 Author: Vladimir Iofik (Vladimir Iofik) Date: 2010-10-22 06:43
UnicodeDecodeError is raised because 'ctype' is already a byte string, so it is first implicitly decoded with the default encoding (which is 'ascii') and then re-encoded. I see no reason for these steps, so I simply removed them. I think Antoine Pitrou (who is the author of these lines) can shed some light on this, but I guess it's just a copy-paste of the code below.
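
The implicit decode-then-encode sequence described in this message can be illustrated roughly as follows. This is a sketch for Python 3, where the hidden step has to be written out explicitly; `py2_style_encode` and the sample value are hypothetical and only emulate what Python 2 did when `.encode()` was called on a byte string.

```python
def py2_style_encode(raw, encoding, default_encoding="ascii"):
    """Hypothetical emulation of Python 2's str.encode() on a byte string."""
    # Step 1: the implicit decode with the default encoding.
    text = raw.decode(default_encoding)
    # Step 2: the encode that the caller actually asked for.
    return text.encode(encoding)


# Hypothetical registry value whose first byte is 0xe0, as in the report.
ctype = b"\xe0udio/x-example"

try:
    py2_style_encode(ctype, "ascii")
    error = None
except UnicodeDecodeError as exc:
    # The failure comes from step 1, which is why a *decode* error
    # appears even though the code only calls encode().
    error = exc

# Pure ASCII values pass through both steps unchanged.
unchanged = py2_style_encode(b"text/plain", "ascii")
```

For ASCII-only input the round trip is a no-op, which is consistent with the observation that the steps can simply be removed.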
History
Date User Action Args
2012-01-29 22:27:56 r.david.murray link issue13906 superseder
2011-08-30 22:53:16 amaury.forgeotdarc link issue12865 superseder
2011-03-08 14:30:56 pitrou set nosy: + haypo
2011-03-08 13:50:34 frankoid set nosy: + frankoid
2010-11-27 23:12:41 ned.deily unlink issue10551 superseder
2010-11-27 19:34:18 ned.deily link issue10551 superseder
2010-11-22 20:25:22 r.david.murray set nosy: + aclover
2010-11-22 20:24:33 r.david.murray link issue10490 superseder
2010-10-22 06:43:11 Vladimir Iofik set messages: + msg119364
2010-10-22 06:11:04 Vladimir Iofik set files: + 9291a.patch
    nosy: + Vladimir Iofik
    messages: + msg119362
2010-10-15 13:45:40 eric.araujo set nosy: + eric.araujo
2010-10-15 12:53:03 r.david.murray set nosy: + vldmit
2010-10-15 12:52:38 r.david.murray link issue10113 superseder
2010-08-12 07:13:44 kaizhu set nosy: + kaizhu
    messages: + msg113662
2010-07-23 13:36:27 brian.curtin set nosy: + brian.curtin
2010-07-23 13:35:12 r.david.murray set messages: + msg111318
    stage: test needed -> patch review
2010-07-23 12:17:43 Dmitry.Jemerov set messages: + msg111291
2010-07-23 12:17:13 Dmitry.Jemerov set files: + 9291.patch
    keywords: + patch
    messages: + msg111288
2010-07-20 10:11:50 Dmitry.Jemerov set messages: + msg110881
2010-07-19 14:30:25 r.david.murray set nosy: + r.david.murray
    messages: + msg110760
    keywords: + easy
    type: behavior
    stage: test needed
2010-07-18 11:54:53 Dmitry.Jemerov create