
Author eryksun
Recipients eryksun, ezio.melotti, ionelmc, vstinner
Date 2014-06-19.13:06:23
Message-id <1403183184.83.0.336599168444.issue21808@psf.upfronthosting.co.za>
Content
For what it's worth, the cp65001 codec was added in Python 3.3. In my experience, codepage 65001 (CP_UTF8) is broken for most console programs.

For a console buffer handle, the Windows API function WriteFile gets routed to WriteConsoleA, but WriteConsoleA has a different spec: it returns the number of wide characters written instead of the number of bytes. WriteFile then returns this number without adjusting for the fact that 1 character != 1 byte. For example, the following writes 5 bytes (3 wide characters), but WriteFile reports NumberOfBytesWritten as 3:

    >>> import sys, msvcrt 
    >>> from ctypes import windll, c_uint, byref

    >>> windll.kernel32.SetConsoleOutputCP(65001)
    1

    >>> h_out = msvcrt.get_osfhandle(sys.stdout.fileno())
    >>> buf = '\u0100\u0101\n'.encode('utf-8')
    >>> n = c_uint()
    >>> windll.kernel32.WriteFile(h_out, buf, len(buf),                
    ...                           byref(n), None)
    Āā
    1

    >>> n.value
    3
    >>> len(buf)
    5
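
The mismatch between the two counts can be reproduced without a console. The following is a minimal sketch, not part of the original session: the helper names `count_units` and `leftover_if_count_trusted` are made up for illustration, and the "retry loop" is a hypothetical caller that treats the reported count as a byte count.

```python
def count_units(text):
    """Return (UTF-8 byte count, UTF-16 code unit count) for text."""
    utf8 = text.encode('utf-8')
    # A Windows "wide character" is one UTF-16 code unit (2 bytes).
    wide = len(text.encode('utf-16-le')) // 2
    return len(utf8), wide


def leftover_if_count_trusted(text):
    """Bytes a hypothetical retry loop would send again if it took the
    wide-character count to be the number of bytes written."""
    utf8 = text.encode('utf-8')
    _, wide = count_units(text)
    return utf8[wide:]


text = '\u0100\u0101\n'
print(count_units(text))                # (5, 3)
print(leftover_if_count_trusted(text))  # b'\x81\n'
```

Under that assumption, the caller believes 2 of the 5 bytes are still unwritten, and the "remainder" starts in the middle of a UTF-8 sequence.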

There's a similar problem with ReadFile calling ReadConsoleA.

ANSICON (github.com/adoxa/ansicon) can hook WriteFile to fix this for select programs. However, it doesn't hook ReadFile, so stdin.read remains broken. 

>    >>> import locale
>    >>> locale.getpreferredencoding()
>    'cp1252'

The preferred encoding is based on the Windows locale codepage, which is returned by kernel32!GetACP, i.e. the 'ANSI' codepage. If you want the console codepages that were set at program startup, look at sys.stdin.encoding and sys.stdout.encoding:

    >>> windll.kernel32.SetConsoleCP(1252)       
    1
    >>> windll.kernel32.SetConsoleOutputCP(65001)
    1
    >>> import subprocess
    >>> script = r'''
    ... import sys
    ... print(sys.stdin.encoding, sys.stdout.encoding)
    ... '''

    >>> subprocess.call('py -3 -c "%s"' % script)
    cp1252 cp65001
    0
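
To compare these values side by side in one process, something like the following sketch may help. The function name `describe_encodings` is hypothetical; the kernel32 calls (GetACP, GetConsoleCP, GetConsoleOutputCP) are only made on Windows, and what `sys.stdin.encoding` and `sys.stdout.encoding` report depends on the platform and the Python version.

```python
import locale
import sys


def describe_encodings():
    """Collect the locale's preferred encoding and the stdio encodings."""
    info = {
        # False: query only, without calling setlocale() temporarily.
        'preferred': locale.getpreferredencoding(False),
        # stdin/stdout can be None or replaced, hence getattr().
        'stdin': getattr(sys.stdin, 'encoding', None),
        'stdout': getattr(sys.stdout, 'encoding', None),
    }
    if sys.platform == 'win32':
        from ctypes import windll
        kernel32 = windll.kernel32
        info['GetACP'] = kernel32.GetACP()                # 'ANSI' codepage
        info['GetConsoleCP'] = kernel32.GetConsoleCP()    # console input
        info['GetConsoleOutputCP'] = kernel32.GetConsoleOutputCP()
    return info


for name, value in describe_encodings().items():
    print(name, value)
```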

>    >>> locale.getlocale()
>    (None, None)
>    >>> locale.getlocale(locale.LC_ALL)
>    (None, None)

On most POSIX platforms nowadays, Py_Initialize sets the LC_CTYPE category to its default value by calling setlocale(LC_CTYPE, "") in order to "obtain the locale's charset without having to switch locales". On the other hand, the bootstrapping process for Windows doesn't use the C runtime locale, so at startup LC_CTYPE is still in the default "C" locale:

    >>> locale.setlocale(locale.LC_CTYPE, None)
    'C'

This in turn gets parsed into the (None, None) tuple that getlocale() returns:

    >>> locale._parse_localename('C')
    (None, None)
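
For comparison, here is a rough sketch of querying LC_CTYPE, switching it to the user's default with setlocale(LC_CTYPE, ""), and restoring it afterwards. The function name is hypothetical, and the strings returned depend entirely on the platform and environment, so no particular output is assumed.

```python
import locale


def ctype_before_and_after_default():
    """Return (current LC_CTYPE, LC_CTYPE after selecting the user default)."""
    # Passing None queries the current setting without changing it.
    saved = locale.setlocale(locale.LC_CTYPE, None)
    try:
        try:
            # The empty string selects the user's default locale.
            after = locale.setlocale(locale.LC_CTYPE, '')
        except locale.Error:
            # The environment may name a locale that isn't installed.
            after = None
        return saved, after
    finally:
        # Put the original setting back so the caller is unaffected.
        locale.setlocale(locale.LC_CTYPE, saved)


before, after = ctype_before_and_after_default()
print('before:', before)
print('after: ', after)
```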