
classification
Title: 65001 code page not supported
Type:
Stage: resolved
Components: Interpreter Core, Unicode, Windows
Versions: Python 2.7
process
Status: closed
Resolution: wont fix
Dependencies:
Superseder:
Assigned To:
Nosy List: BreamoreBoy, eryksun, ezio.melotti, ionelmc, r.david.murray, vstinner
Priority: normal
Keywords:

Created on 2014-06-19 10:59 by ionelmc, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (9)
msg220964 - (view) Author: Ionel Cristian Mărieș (ionelmc) Date: 2014-06-19 10:59
cp65001 is purported to be an alias for utf8.

I get these results:

C:\Python27>chcp 65001
Active code page: 65001

C:\Python27>python
Python 2.7.6 (default, Nov 10 2013, 19:24:24) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale

LookupError: unknown encoding: cp65001
>>>

LookupError: unknown encoding: cp65001
>>> locale.getpreferredencoding()

LookupError: unknown encoding: cp65001
>>>




And on Python 3.4 chcp doesn't seem to have any effect:

C:\Python34>chcp 65001
Active code page: 65001

C:\Python34>python
Python 3.4.1 (v3.4.1:c0e311e010fc, May 18 2014, 10:38:22) [MSC v.1600 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.getpreferredencoding()
'cp1252'
>>> locale.getlocale()
(None, None)
>>> locale.getlocale(locale.LC_ALL)
(None, None)
msg220971 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2014-06-19 13:06
cp65001 was added in Python 3.3, for what it's worth. For me codepage 65001 (CP_UTF8) is broken for most console programs. 

Windows API WriteFile gets routed to WriteConsoleA for a console buffer handle, but WriteConsoleA has a different spec. It returns the number of wide characters written instead of the number of bytes. Then WriteFile returns this number without adjusting for the fact that 1 character != 1 byte. For example, the following writes 5 bytes (3 wide characters), but WriteFile returns that NumberOfBytesWritten is 3:

    >>> import sys, msvcrt 
    >>> from ctypes import windll, c_uint, byref

    >>> windll.kernel32.SetConsoleOutputCP(65001)
    1

    >>> h_out = msvcrt.get_osfhandle(sys.stdout.fileno())
    >>> buf = '\u0100\u0101\n'.encode('utf-8')
    >>> n = c_uint()
    >>> windll.kernel32.WriteFile(h_out, buf, len(buf),                
    ...                           byref(n), None)
    Āā
    1

    >>> n.value
    3
    >>> len(buf)
    5

There's a similar problem with ReadFile calling ReadConsoleA.

ANSICON (github.com/adoxa/ansicon) can hook WriteFile to fix this for select programs. However, it doesn't hook ReadFile, so stdin.read remains broken. 

>    >>> import locale
>    >>> locale.getpreferredencoding()
>    'cp1252'

The preferred encoding is based on the Windows locale codepage, which is returned by kernel32!GetACP, i.e. the 'ANSI' codepage. If you want the console codepages that were set at program startup, look at sys.stdin.encoding and sys.stdout.encoding:

    >>> windll.kernel32.SetConsoleCP(1252)       
    1
    >>> windll.kernel32.SetConsoleOutputCP(65001)
    1
    >>> script = r'''
    ... import sys
    ... print(sys.stdin.encoding, sys.stdout.encoding)
    ... '''

    >>> import subprocess
    >>> subprocess.call('py -3 -c "%s"' % script)
    cp1252 cp65001
    0
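
As a rough sketch of the distinction above, the three code pages can be queried side by side through ctypes. The helper name get_code_pages and the dictionary keys are made up for illustration; GetACP, GetConsoleCP and GetConsoleOutputCP are the kernel32 functions. The helper is assumed to return None on non-Windows platforms, and the console functions report 0 when no console is attached.

```python
import sys


def get_code_pages():
    """Return the ANSI and console code pages, or None when not on Windows.

    Hypothetical helper for illustration only.
    """
    if sys.platform != "win32":
        return None
    import ctypes  # imported here so ctypes.windll is only touched on Windows

    kernel32 = ctypes.windll.kernel32
    return {
        "ansi": kernel32.GetACP(),                        # basis of locale.getpreferredencoding()
        "console_input": kernel32.GetConsoleCP(),         # changed by SetConsoleCP / chcp
        "console_output": kernel32.GetConsoleOutputCP(),  # changed by SetConsoleOutputCP / chcp
    }


pages = get_code_pages()
print(pages)
```

After running chcp 65001, one would expect only the two console entries to change while the ANSI entry stays the same.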

>    >>> locale.getlocale()
>    (None, None)
>    >>> locale.getlocale(locale.LC_ALL)
>    (None, None)

On most POSIX platforms nowadays, Py_Initialize sets the LC_CTYPE category to its default value by calling setlocale(LC_CTYPE, "") in order to "obtain the locale's charset without having to switch locales". On the other hand, the bootstrapping process for Windows doesn't use the C runtime locale, so at startup LC_CTYPE is still in the default "C" locale:

    >>> locale.setlocale(locale.LC_CTYPE, None)
    'C'

This in turn gets parsed into the (None, None) tuple that getlocale() returns:

    >>> locale._parse_localename('C')
    (None, None)
msg220972 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-06-19 13:15
Support for code page 65001 (CP_UTF8, "cp65001") was added in Python 3.3. It is usually used for the OEM code page. The chcp command changes the Windows console encoding, which is used by sys.{stdin,stdout,stderr}.encoding. locale.getpreferredencoding() returns the ANSI code page.

Read also:
http://unicodebook.readthedocs.org/operating_systems.html#code-pages
http://unicodebook.readthedocs.org/programming_languages.html#windows

> cp65001 is purported to be an alias for utf8.

No, cp65001 is not an alias of utf8: it handles surrogate characters differently. The behaviour of CP_UTF8 depends on the flags and the Windows version.
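
A small sketch of the Python side of that difference: the built-in utf-8 codec rejects lone surrogates in strict mode, in both directions. It only exercises Python's own utf-8 codec; how CP_UTF8 treats the same input depends on the flags and the Windows version, as noted above, so it is not shown here.

```python
lone_surrogate = "\ud800"  # an unpaired high surrogate

# Strict encoding with the utf-8 codec refuses the lone surrogate.
try:
    lone_surrogate.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False

# The "surrogatepass" error handler emits the three-byte form anyway.
raw = lone_surrogate.encode("utf-8", "surrogatepass")

# Strict decoding of those bytes is refused as well.
try:
    raw.decode("utf-8")
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False

print(strict_encode_ok, raw, strict_decode_ok)
```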

If you really want to use the UTF-8 codec, force the stdio encoding using the PYTHONIOENCODING environment variable:
https://docs.python.org/dev/using/cmdline.html#envvar-PYTHONIOENCODING
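
A minimal sketch of that workaround, assuming a Python 3 interpreter is available as sys.executable: start a child interpreter with PYTHONIOENCODING set and let it report the encoding it picked for stdout. The variable names are illustrative only.

```python
import os
import subprocess
import sys

# Copy the current environment and force the stdio encoding for the child.
env = dict(os.environ, PYTHONIOENCODING="utf-8")

child_code = "import sys; print(sys.stdout.encoding)"
reported = subprocess.check_output(
    [sys.executable, "-c", child_code], env=env
).decode("ascii").strip()

print(reported)
```

On the command line the equivalent is to set the variable in the shell before starting python.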

Setting the Windows console encoding to cp65001 using the chcp command doesn't make the Windows console fully Unicode compliant. It is a little bit better using TTF fonts, but it's not enough. See the old issue #1602 opened 7 years ago and not fixed yet.

Backporting the cp65001 codec requires too many changes in the codec code. I made these changes between Python 3.1 and 3.3, and I don't want to redo them in Python 2.7 because it may break backward compatibility. For example, in Python 3.3, the "strict" mode really means "strict", whereas in Python 2.7, code page codecs use the default flags, which are not strict. See:
http://unicodebook.readthedocs.org/operating_systems.html#encode-and-decode-functions

So I'm in favor of closing the issue as "wont fix". The fix is to upgrade to Python 3!
msg220975 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2014-06-19 13:56
See also Issue20574.
msg220977 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-06-19 14:04
I agree with Haypo, because if he isn't interested in doing it, it is unlikely anyone else will find the problem tractable :)  Certainly not anyone else on the core team.  But, the danger of breaking things in 2.7 is the clincher.
msg220978 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2014-06-19 14:07
> Setting the Windows console encoding to cp65001 using the chcp 
> command doesn't make the Windows console fully Unicode compliant. 
> It is a little bit better using TTF fonts, but it's not enough. 
> See the old issue #1602 opened 7 years ago and not fixed yet.

It's annoyingly broken for me due to the problems with WriteFile and ReadFile.

    >>> print('\u0100')             
    Ā
    
    >>>

Note the extra line: write() reports that 2 characters were written instead of 3 bytes, so the final linefeed byte gets written again.

Let's buy 4 and get 1 free:

    >>> print('\u0100' * 4)
    ĀĀĀĀ
    Ā
    
    >>>
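
The duplicated output above can be modelled without a console. In this sketch, console_write is a hypothetical stand-in for the buggy call: it accepts the whole chunk but reports the number of characters rather than bytes, and flush_all is a simplified retry loop that resends whatever it believes is still unwritten. Counting a stray continuation byte as one character is an assumption of the model, not something taken from the console's actual behaviour.

```python
def console_write(chunk):
    """Hypothetical buggy write: consumes every byte, reports a character count."""
    return len(chunk.decode("utf-8", errors="replace"))


def flush_all(data):
    """Simplified retry loop: resend the tail that was reported as unwritten."""
    sent = []
    pos = 0
    while pos < len(data):
        chunk = data[pos:]
        sent.append(chunk)           # everything in chunk reaches the screen
        pos += console_write(chunk)  # but only a character count is credited
    return sent


data = ("\u0100" * 4 + "\n").encode("utf-8")  # 9 bytes, 5 characters
chunks = flush_all(data)
for chunk in chunks:
    print(chunk)
```

In this model the 9-byte buffer is sent three times: the full line, then a 4-byte tail that contains one more Ā and the linefeed, then the linefeed alone.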
msg220982 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-06-19 14:22
> It's annoyingly broken for me due to the problems with WriteFile and ReadFile.

sys.stdout.write() doesn't use WriteFile. Again, see issue #1602 if you are interested in improving the Unicode support of the Windows console.

A workaround is for example to play with IDLE which has a better Unicode support.
msg220985 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2014-06-19 14:38
> sys.stdout.write() doesn't use WriteFile. Again, see 
> issue #1602 if you are interested in improving the Unicode 
> support of the Windows console.

_write calls WriteFile because Python 3 sets standard I/O to binary mode. The source is distributed with Visual Studio, so here's the relevant excerpt from write.c:

        else {
                /* binary mode, no translation */
                if ( WriteFile( (HANDLE)_osfhnd(fh),
                                (LPVOID)buf,
                                cnt,
                               (LPDWORD)&written,
                                NULL) )
                {
                        dosretval = 0;
                        charcount = written;
                }
                else
                        dosretval = GetLastError();
        }

In a debugger you can trace that WriteFile detects the handle is a console buffer handle (the lower 2 tag bits are set on the handle), and redirects the call to WriteConsoleA, which makes an LPC interprocess call to the console server (e.g. csrss.exe or conhost.exe). The LPC call, and associated heap limit, is the reason you had to modify _io.FileIO.write to limit the buffer size to 32767 when writing to the Windows console. See issue 11395.
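
To illustrate the size cap mentioned above, here is a rough sketch of writing a large buffer in pieces of at most 32767 bytes. It is not the code from issue 11395; write_chunked and fake_write are hypothetical names, and the write callable is assumed to return the number of bytes it accepted.

```python
CONSOLE_WRITE_LIMIT = 32767  # the figure quoted in the message above


def write_chunked(write, data, limit=CONSOLE_WRITE_LIMIT):
    """Feed data to write() in slices no longer than limit bytes."""
    view = memoryview(data)
    total = 0
    while total < len(view):
        accepted = write(view[total:total + limit])
        if not accepted:  # stop instead of looping forever on a stalled writer
            break
        total += accepted
    return total


sizes = []


def fake_write(chunk):
    """Hypothetical writer that records each call and accepts everything."""
    sizes.append(len(chunk))
    return len(chunk)


total = write_chunked(fake_write, b"x" * 70000)
print(total, sizes)
```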
msg220988 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-06-19 15:02
@eryksun: I agree that using the Python interactive interpreter in the Windows console has many important issues when using non-ASCII characters. But the title of this issue and the initial message are about code page 65001. The *code page* is supported in Python 3.3 and we are not going to backport the Python codec to Python 2.7. For issues specific to the *Windows console*, there is already an open issue: #1602. It looks like you understand the problem well, so please continue the discussion there.

This issue is closed. Stop commenting on a closed issue; others will not see your messages (the issue is not listed on the main bug tracker page).

(Unless someone is interested in backporting the Python codec of the Windows code page 65001 to Python 2.7, in which case we may reopen the issue.)
History
Date                 User            Action  Args
2022-04-11 14:58:05  admin           set     github: 66007
2014-06-19 15:02:01  vstinner        set     messages: + msg220988
2014-06-19 14:38:46  eryksun         set     messages: + msg220985
2014-06-19 14:22:13  vstinner        set     messages: + msg220982
2014-06-19 14:07:10  eryksun         set     messages: + msg220978
2014-06-19 14:04:47  r.david.murray  set     status: open -> closed; versions: - Python 3.4; nosy: + r.david.murray; messages: + msg220977; resolution: wont fix; stage: resolved
2014-06-19 13:56:17  BreamoreBoy     set     nosy: + BreamoreBoy; messages: + msg220975
2014-06-19 13:15:08  vstinner        set     messages: + msg220972
2014-06-19 13:06:24  eryksun         set     nosy: + eryksun; messages: + msg220971
2014-06-19 10:59:05  ionelmc         create