mimetypes initialization fails on Windows because of non-Latin characters in registry #53537

Closed
DmitryJemerov mannequin opened this issue Jul 18, 2010 · 32 comments
Labels
easy OS-windows stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@DmitryJemerov
Mannequin

DmitryJemerov mannequin commented Jul 18, 2010

BPO 9291
Nosy @loewis, @jaraco, @vstinner, @tjguk, @merwok, @bobince, @bitdancer, @briancurtin, @shimizukawa, @Fak3
Files
  • 9291.patch
  • 9291a.patch: Issue 9291 patch
  • sitecustomize.py
  • issue9291-key-utf8.ini: Offending REG key in Windows Registry file encoded with utf-8
  • issue9291-key.reg
  • issue9291.8.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/tjguk'
    closed_at = <Date 2014-04-29.17:27:43.543>
    created_at = <Date 2010-07-18.11:54:53.841>
    labels = ['easy', 'type-bug', 'library', 'OS-windows']
    title = 'mimetypes initialization fails on Windows because of non-Latin characters in registry'
    updated_at = <Date 2014-06-10.17:01:14.898>
    user = 'https://bugs.python.org/DmitryJemerov'

    bugs.python.org fields:

    activity = <Date 2014-06-10.17:01:14.898>
    actor = 'exarkun'
    assignee = 'tim.golden'
    closed = True
    closed_date = <Date 2014-04-29.17:27:43.543>
    closer = 'tim.golden'
    components = ['Library (Lib)', 'Windows']
    creation = <Date 2010-07-18.11:54:53.841>
    creator = 'Dmitry.Jemerov'
    dependencies = []
    files = ['18143', '19332', '33268', '34187', '34188', '34983']
    hgrepos = []
    issue_num = 9291
    keywords = ['patch', 'easy']
    message_count = 32.0
    messages = ['110637', '110760', '110881', '111288', '111291', '111318', '113662', '119362', '119364', '177044', '202755', '202840', '206494', '206510', '206528', '206540', '206579', '206938', '207076', '211921', '211931', '211932', '211934', '211935', '211936', '211938', '216571', '216903', '217137', '217271', '217273', '220175']
    nosy_count = 24.0
    nosy_names = ['loewis', 'exarkun', 'jaraco', 'vstinner', 'tim.golden', 'eric.araujo', 'kaizhu', 'aclover', 'r.david.murray', 'brian.curtin', 'Suzumizaki', 'frankoid', 'Dmitry.Jemerov', 'shimizukawa', 'vldmit', 'Vladimir Iofik', 'python-dev', 'Roman.Evstifeev', 'adamhj', 'me21', 'Hugo.Lol', 'Daniel.Szoska', 'Michał.Pasternak', 'quick.es']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue9291'
    versions = ['Python 2.7']

    On Windows, mimetypes initialization reads the list of MIME types from the Windows registry. It assumes that all characters are Latin-1 encoded, and fails when it's not the case, with the following exception:

    Traceback (most recent call last):
      File "mttest.py", line 3, in <module>
        mimetypes.init()
      File "c:\Python27\lib\mimetypes.py", line 355, in init
        db.read_windows_registry()
      File "c:\Python27\lib\mimetypes.py", line 260, in read_windows_registry
        for ctype in enum_types(mimedb):
      File "c:\Python27\lib\mimetypes.py", line 250, in enum_types
        ctype = ctype.encode(default_encoding) # omit in 3.x!
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 0: ordinal not in range(128)

    This can be reproduced, for example, on a Russian Windows XP installation which has QuickTime installed (QuickTime creates the non-Latin entries in the registry). The following line causes the exception to happen:

    import mimetypes; mimetypes.init()

    @DmitryJemerov DmitryJemerov mannequin added stdlib Python modules in the Lib dir OS-windows labels Jul 18, 2010
    @bitdancer
    Member

    I'm guessing this problem doesn't occur in 3.x? If so, the quick fix would be to have the registry code catch UnicodeError instead of UnicodeEncodeError. That may be the correct fix anyway.
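
A minimal sketch of why the suggested quick fix would work: `UnicodeDecodeError` and `UnicodeEncodeError` are siblings under `UnicodeError`, so a handler for `UnicodeEncodeError` never sees a decode failure. The helper name `caught_by` is made up for this illustration and is not part of mimetypes.

```python
def caught_by(handler_type, func):
    """Return True if an exception raised by func() matches `except handler_type`."""
    try:
        func()
    except handler_type:
        return True
    except Exception:
        return False
    return False


def bad_decode():
    # Decoding a non-ASCII byte as ASCII raises UnicodeDecodeError.
    b"\xe0".decode("ascii")


# Sibling classes: neither one catches the other.
siblings = not issubclass(UnicodeDecodeError, UnicodeEncodeError)
seen_by_encode_handler = caught_by(UnicodeEncodeError, bad_decode)
seen_by_base_handler = caught_by(UnicodeError, bad_decode)
```

Catching the base class hides the crash, but it also silently skips the affected registry entries.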

    The "fun" part of this bug is going to be creating a unit test for it.

    @bitdancer bitdancer added easy type-bug An unexpected behavior, bug, or error labels Jul 19, 2010
    @DmitryJemerov
    Mannequin Author

    DmitryJemerov mannequin commented Jul 20, 2010

    The problem doesn't happen on Python 3.1.2 because it doesn't have the code in mimetypes that accesses the Windows registry. Haven't tried the 3.2 alphas yet.

    @DmitryJemerov
    Mannequin Author

    DmitryJemerov mannequin commented Jul 23, 2010

    Patch (suggested fix and unittest) attached.

    @DmitryJemerov
    Mannequin Author

    DmitryJemerov mannequin commented Jul 23, 2010

    And by the way I've verified that the problem doesn't happen in py3k trunk.

    @bitdancer
    Member

    And just for clarity: py3k trunk does contain the _winreg code path.

    @kaizhu
    Mannequin

    kaizhu mannequin commented Aug 12, 2010

    Python 3.1.2 mimetypes initialization also fails on Red Hat Linux:

    >>> import http.server
    Traceback (most recent call last):
      File "/home/public/i386-redhat-linux-gnu/python/lib/python3.1/http/server.py", line 588, in <module>
        class SimpleHTTPRequestHandler(BaseHTTPRequestHandler):
      File "/home/public/i386-redhat-linux-gnu/python/lib/python3.1/http/server.py", line 764, in SimpleHTTPRequestHandler
        mimetypes.init() # try to read system mime.types
      File "/home/public/i386-redhat-linux-gnu/python/lib/python3.1/mimetypes.py", line 305, in init
        db.readfp(open(file))
      File "/home/public/i386-redhat-linux-gnu/python/lib/python3.1/mimetypes.py", line 209, in readfp
        line = fp.readline()
      File "/home/public/i386-redhat-linux-gnu/bin/../python/lib/python3.1/encodings/ascii.py", line 26, in decode
        return codecs.ascii_decode(input, self.errors)[0]
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 3921: ordinal not in range(128)

    @VladimirIofik
    Mannequin

    VladimirIofik mannequin commented Oct 22, 2010

    Here is a better patch.

    @VladimirIofik
    Mannequin

    VladimirIofik mannequin commented Oct 22, 2010

    UnicodeDecodeError is raised because 'ctype' is already a byte string, so it is first implicitly decoded with the default encoding (which is 'ascii') and then re-encoded. I see no reason for any of these steps, so I simply removed them. I think Antoine Pitrou (who is the author of these lines) can shed some light on this, but I guess it's just a copy-paste of the code below.
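
The implicit round trip described above can be imitated in Python 3, where bytes have no `.encode()` method. This is only an emulation of the Python 2 behaviour, under the assumption that the default encoding is 'ascii'; `py2_style_encode` is a hypothetical helper, not real mimetypes code.

```python
def py2_style_encode(raw, encoding, default="ascii"):
    """Emulate Python 2's str.encode() when called on a byte string."""
    text = raw.decode(default)    # implicit step: fails on any non-ASCII byte
    return text.encode(encoding)  # explicit step the caller asked for


# Pure-ASCII names survive the round trip unchanged.
ok = py2_style_encode(b"text/plain", "latin-1")

# A name starting with byte 0xe0 fails in the implicit decode step.
failure = None
try:
    py2_style_encode(b"\xe0bc", "latin-1")
except UnicodeDecodeError as exc:
    failure = exc
```

The error comes from the hidden decode, not from the requested encode, which is why the traceback reports a UnicodeDecodeError on a line that calls `.encode()`.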

    @vstinner
    Member

    vstinner commented Dec 6, 2012

    File "c:\Python27\lib\mimetypes.py", line 250, in enum_types
    ctype = ctype.encode(default_encoding) # omit in 3.x!
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 0: ordinal not in range(128)

    The encoding is wrong. We should read the registry using Unicode, or at least use the correct encoding. The correct encoding is the ANSI code page: sys.getfilesystemencoding().

    Can you please try with: default_encoding = sys.getfilesystemencoding() ?

    python 3.1.2 mimetypes initialization also fails in redhat linux: (...)

    In Python 3.3, MimeTypes.read() opens files in UTF-8. The issue bpo-13025 explains why UTF-8 is used instead of the locale encoding, or another encoding.

    I see that read_mime_types() uses the locale encoding. That looks like a bug; it should also use UTF-8.
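
As a rough illustration of reading a mime.types-style file with an explicit encoding rather than the locale default, the sketch below parses a small temporary file as UTF-8. The function name `read_mime_types_utf8` and the demo file contents are invented for this example; this is not the stdlib implementation.

```python
import os
import tempfile


def read_mime_types_utf8(path):
    """Parse a mime.types-style file, always decoding it as UTF-8."""
    types_map = {}
    with open(path, encoding="utf-8") as fp:
        for line in fp:
            # Drop trailing comments, then split into type and suffixes.
            words = line.split("#", 1)[0].split()
            if not words:
                continue
            ctype, suffixes = words[0], words[1:]
            for suffix in suffixes:
                types_map["." + suffix] = ctype
    return types_map


# Demo file containing a non-ASCII comment, which an ASCII locale would reject.
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".types",
                                 delete=False) as tmp:
    tmp.write("# Größe: comment with non-ASCII text\n")
    tmp.write("text/plain txt text\n")
    demo_path = tmp.name

demo_map = read_mime_types_utf8(demo_path)
os.remove(demo_path)
```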

    @tjguk
    Member

    tjguk commented Nov 13, 2013

    Only just been reminded of this one; it's possible that it's been superseded by bpo-15207. At the least, that issue resulted in a code change in this area of mimetypes. I'll have a look later.

    @adamhj
    Mannequin

    adamhj mannequin commented Nov 14, 2013

    The encoding is wrong. We should read the registry using Unicode, or at least use the correct encoding. The correct encoding is the ANSI code page: sys.getfilesystemencoding().

    Can you please try with: default_encoding = sys.getfilesystemencoding() ?

    This does not work. In fact, it doesn't matter what default_encoding is. The variable ctype, which is returned by _winreg.EnumKey(), is a byte string (b'blahblah'), at least on my computer (win2k3sp2, Python 2.7.6). Because the interpreter is asked to encode a byte string, it first tries to convert the byte string to a unicode string by implicitly calling decode with the 'ascii' encoding, hence the UnicodeDecodeError.

    The variable ctype, which is read from a registry key name, can be decoded correctly with sys.getfilesystemencoding() (which returns 'mbcs'), but what we actually need is a byte string, so there should be neither encoding nor decoding here.

    If there is a case where _winreg.EnumKey() returns a unicode string, then a type check should be added before the encode. Or maybe the return type of _winreg.EnumKey() differs between 2.x and 3.x?
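
The type check suggested here might look like the following sketch, written for Python 3 so that it can be run anywhere. `to_registry_bytes` is a hypothetical name, and the cp1251 code page is only an assumption chosen to match a Russian Windows setup; on Windows the thread suggests 'mbcs' would be the natural choice.

```python
def to_registry_bytes(name, encoding):
    """Return `name` as bytes, encoding it only if it is a text string."""
    if isinstance(name, bytes):
        # Already a byte string: leave it alone, no decode/encode round trip.
        return name
    return name.encode(encoding)


# Bytes pass through untouched, even when they are not ASCII.
raw_passthrough = to_registry_bytes(b"\xe0bc", "cp1251")

# Text is encoded exactly once with the requested code page.
encoded_text = to_registry_bytes("абв", "cp1251")
```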

    @Suzumizaki
    Mannequin

    Suzumizaki mannequin commented Dec 18, 2013

    It is possible that installing setuptools fails on any Windows machine because of this bug. I would like to raise the priority of this issue.

    Installing setuptools failed with Python 2.7.6 on my machine (Windows 8.1 Pro Japanese Edition, 64-bit), but there was no problem with either Python 2.7.4 or Python 3.3.3.

    @bitdancer
    Member

    OK, that means the bpo-15207 fix didn't fix it, since that's in 2.7.6.

    @tjguk
    Member

    tjguk commented Dec 18, 2013

    I'll try to look at this soonish. Thanks for bringing it back to the surface.

    @vstinner
    Member

    Issue bpo-20017 has been marked as a duplicate of this issue. Copy of the message:

    Running Windows 8 (64-bit) and Python 2.7.6 (64-bit).

    > python -m SimpleHTTPServer
    Traceback (most recent call last):
      File "C:\Python27\lib\runpy.py", line 162, in _run_module_as_main
        "__main__", fname, loader, pkg_name)
      File "C:\Python27\lib\runpy.py", line 72, in _run_code
        exec code in run_globals
      File "C:\Python27\lib\SimpleHTTPServer.py", line 27, in <module>
        class SimpleHTTPRequestHandler(BaseHTTPServer.BaseHTTPRequestHandler):
      File "C:\Python27\lib\SimpleHTTPServer.py", line 208, in SimpleHTTPRequestHandler
        mimetypes.init() # try to read system mime.types
      File "C:\Python27\lib\mimetypes.py", line 358, in init
        db.read_windows_registry()
      File "C:\Python27\lib\mimetypes.py", line 258, in read_windows_registry
        for subkeyname in enum_types(hkcr):
      File "C:\Python27\lib\mimetypes.py", line 249, in enum_types
        ctype = ctype.encode(default_encoding) # omit in 3.x!
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xd7 in position 2: ordinal not in range(128)

    @shimizukawa
    Mannequin

    shimizukawa mannequin commented Dec 19, 2013

    This issue affects mercurial too.
    http://bz.selenic.com/show_bug.cgi?id=3624

    @me21
    Mannequin

    me21 mannequin commented Dec 26, 2013

    An alternative solution, which worked for me, is to add a file named sitecustomize.py to the Lib\site-packages folder.

    The contents of the file:
    import sys
    sys.setdefaultencoding("cp1251")

    The default encoding has to be determined separately for every localized Windows version.
    Also, when creating virtual environments, the same file should be placed in the site-packages folder of the virtual environment being created, prior to installing setuptools in it.
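
One possible way to avoid hard-coding the code page in such a sitecustomize.py is to ask the locale module for the preferred encoding. This is a sketch under that assumption, not something proposed in the thread. `guess_ansi_encoding` is a hypothetical helper, and `sys.setdefaultencoding` exists only in Python 2 while site initialization is running, hence the `hasattr` guard. Changing the default encoding affects the whole process, so this is a workaround rather than a fix.

```python
import codecs
import locale
import sys


def guess_ansi_encoding(fallback="ascii"):
    """Return the locale's preferred encoding, or `fallback` if it is unusable."""
    name = locale.getpreferredencoding(False)
    try:
        codecs.lookup(name)
    except (LookupError, TypeError):
        return fallback
    return name


# Only Python 2 exposes this function, and only during site initialization.
if hasattr(sys, "setdefaultencoding"):
    sys.setdefaultencoding(guess_ansi_encoding())
```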

    @jaraco
    Member

    jaraco commented Dec 29, 2013

    @MichaPasternak
    Mannequin

    MichaPasternak mannequin commented Feb 22, 2014

    I just hit this bug on 2.7.6, running on Polish WinXP (I need to build some packages there; I hope I'll avoid a nasty py2exe bug). Is there any reason this is not fixed yet? Do you need any assistance?

    @loewis
    Mannequin

    loewis mannequin commented Feb 22, 2014

    Michał: Can you please report the exact registry key and value that is causing the problem? It's difficult to test a patch if one is not able to reproduce the problem.

    Of the patches suggested: does any of them fix the problem for you? If so, which one?

    I personally find Vladimir's patch more plausible (EnumKey gives bytes objects in 2.x, so it is pointless to apply .encode to them). The introduction of the count() call is unrelated, though, and should be omitted from a bug fix.

    @DanielSzoska
    Mannequin

    DanielSzoska mannequin commented Feb 22, 2014

    Martin: I had the same problem after upgrading to 2.7.6.

    System here: German XP 32 Bit

    I used the solution from Alexandr with sitecustomize.py (with cp1252) and it works fine for me.

    @MichaPasternak
    Mannequin

    MichaPasternak mannequin commented Feb 22, 2014

    Martin: the problematic key is "[HKEY_CLASSES_ROOT\BDATuner.Składniki]". I am pasting its name here because I assume that, since bugs.python.org uses UTF-8, the special characters will be pasted properly.

    Attached you will find a .REG file, which is a plaintext Windows Registry Editor file. It is encoded with the CP-1250 charset (I believe). In case of any confusion, I am also including the same file encoded with UTF-8.

    If you add this information to your Windows registry, you should be able to reproduce this bug simply by using "pip install" on anything: "pip install wokkel", for example.
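
To show why this key name trips the ASCII decode, the sketch below encodes it in both CP-1250 and UTF-8 and locates the first non-ASCII byte. It assumes the registry name is stored in CP-1250 on the affected machine, as the comment above suggests; `first_non_ascii` is a helper invented for this example.

```python
key_name = "BDATuner.Składniki"

cp1250_bytes = key_name.encode("cp1250")
utf8_bytes = key_name.encode("utf-8")


def first_non_ascii(raw):
    """Return (index, byte) of the first byte above 0x7F, or None."""
    for index, byte in enumerate(raw):
        if byte > 0x7F:
            return index, byte
    return None


# The letter "ł" is a single byte in CP-1250 but two bytes in UTF-8.
cp1250_hit = first_non_ascii(cp1250_bytes)
utf8_hit = first_non_ascii(utf8_bytes)

# Either byte sequence is rejected by the 'ascii' codec.
ascii_fails = False
try:
    cp1250_bytes.decode("ascii")
except UnicodeDecodeError:
    ascii_fails = True
```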

    @MichaPasternak
    Mannequin

    MichaPasternak mannequin commented Feb 22, 2014

    Another REG file, encoded with CP1250, I believe.

    @MichaPasternak
    Mannequin

    MichaPasternak mannequin commented Feb 22, 2014

    As for the fix, sitecustomize.py works for me too, but I believe that adding sitecustomize.py to new Python installations would probably do more harm than good. I'll check those 2 patches and let you know.

    @MichaPasternak
    Mannequin

    MichaPasternak mannequin commented Feb 22, 2014

    9291.patch works for me too, but I am unsure about its idea. Silently ignoring non-ASCII registry entries: does that sound like a good idea? Maybe. Is it pythonic? I doubt it.

    I don't exactly understand what 9291a.patch is doing. To me it looks like a re-iteration of the first patch. I have not tested it.

    @tjguk
    Member

    tjguk commented Apr 16, 2014

    The attached patch bpo-9291.7.patch (which is essentially an amalgam of 9291.patch & 9291a.patch with some tweaks of my own) does appear to solve the issue. My Windows setup is UK, so if any of the people still watching this issue could test against a non-English Windows, that would be useful.

    Even this fix does leave some room for encoding mismatches between the stored values (mbcs encoded) and any string passed to guess_type. But it's not clear how that should be handled, and at least it doesn't crash out on .init.

    @tjguk
    Member

    tjguk commented Apr 20, 2014

    Another version of the patch: this one, in addition to removing the unnecessary encodes, also does the check for extensions before attempting to open the registry key, and narrows down the try-catch block to just the attempt to read the "Content Type" value.

    This does mean that if any process is unable to read HKCR or its subkeys the mimetypes.init will fail. Frankly, I can't see how that could happen, but if anyone feels strongly enough I can add extra guards so it fails silently.
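
The shape of the change described above (filter on extension-like names first, then guard only the value read) can be sketched without touching a real registry. Everything here is a stand-in: `read_content_types`, `fake_registry` and `fake_query` are hypothetical names, the dictionary plays the role of HKCR, and OSError plays the role of the error raised when a key has no "Content Type" value. The actual patch may differ in detail.

```python
def read_content_types(subkey_names, query_content_type):
    """Yield (extension, mime type) pairs for extension-like subkeys."""
    for name in subkey_names:
        # Cheap check first: only names such as ".txt" can map to a MIME type.
        if not name.startswith("."):
            continue
        try:
            # The try block covers just the value lookup, nothing else.
            mimetype = query_content_type(name)
        except OSError:
            continue
        yield name, mimetype


# Stand-in for the registry: None means "no Content Type value".
fake_registry = {".txt": "text/plain", ".dat": None, "BDATuner.Składniki": None}


def fake_query(name):
    value = fake_registry[name]
    if value is None:
        raise OSError("no Content Type value")
    return value


found = dict(read_content_types(fake_registry, fake_query))
```

With this ordering, a key such as "BDATuner.Składniki" is skipped before any lookup or encoding is attempted.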

    @quickes
    Mannequin

    quickes mannequin commented Apr 24, 2014

    Alternative temporary solution:

    def enum_types(mimedb):
        ....
            try:
                ctype = ctype.encode(default_encoding) # omit in 3.x!
            except UnicodeEncodeError:
                pass
            except Exception: #<--
                pass #<--
            else:
                yield ctype

    @python-dev
    Mannequin

    python-dev mannequin commented Apr 27, 2014

    New changeset 18cfc2a42772 by Tim Golden in branch '2.7':
    Issue bpo-9291 Do not attempt to re-encode mimetype data read from registry in ANSI mode. Initial patches by Dmitry Jemerov & Vladimir Iofik
    http://hg.python.org/cpython/rev/18cfc2a42772

    @python-dev
    Mannequin

    python-dev mannequin commented Apr 27, 2014

    New changeset 0c8a7299c7e3 by Tim Golden in branch '2.7':
    Issue bpo-9291 Add ACKS & NEWS
    http://hg.python.org/cpython/rev/0c8a7299c7e3

    @tjguk tjguk closed this as completed Apr 29, 2014
    @tjguk tjguk self-assigned this Apr 29, 2014
    @exarkun
    Mannequin

    exarkun mannequin commented Jun 10, 2014

    Please see http://bugs.python.org/issue21652 for a regression introduced by this change.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022