test_site.StartupImportTests.test_startup_imports fails if default code page is cp65001 #80959

Closed
paulmon mannequin opened this issue May 2, 2019 · 31 comments
Labels
3.8 only security fixes OS-windows tests Tests in the Lib/test dir type-bug An unexpected behavior, bug, or error

Comments

@paulmon
Mannequin

paulmon mannequin commented May 2, 2019

BPO 36778
Nosy @pfmoore, @vstinner, @tjguk, @methane, @zware, @serhiy-storchaka, @eryksun, @zooba, @paulmon
PRs
  • bpo-36778: fix test_startup_imports if default codepage is UTF-8 #13069
  • bpo-36778: fix test_startup_imports if default codepage is cp65001 #13072
  • bpo-36778: Avoid functools in encodings.cp65001 #13110
  • bpo-36778: Avoid functools in encodings.cp65001 #13211
  • bpo-36778: cp65001 encoding becomes an alias to utf_8 #13230
  • bpo-36778: Update cp65001 codec documentation #13240
  • bpo-36778: Remove outdated comment #13807
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2019-05-10.03:30:36.027>
    created_at = <Date 2019-05-02.20:46:25.306>
    labels = ['3.8', 'type-bug', 'tests', 'OS-windows']
    title = 'test_site.StartupImportTests.test_startup_imports fails if default code page is cp65001'
    updated_at = <Date 2019-06-04.15:09:15.116>
    user = 'https://github.com/paulmon'

    bugs.python.org fields:

    activity = <Date 2019-06-04.15:09:15.116>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2019-05-10.03:30:36.027>
    closer = 'vstinner'
    components = ['Tests', 'Windows']
    creation = <Date 2019-05-02.20:46:25.306>
    creator = 'Paul Monson'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 36778
    keywords = ['patch']
    message_count = 31.0
    messages = ['341316', '341323', '341324', '341326', '341372', '341377', '341383', '341401', '341520', '341531', '341570', '341671', '341924', '341926', '341947', '341955', '341968', '342004', '342006', '342008', '342010', '342019', '342020', '342025', '342027', '342029', '342032', '342047', '342053', '342290', '344588']
    nosy_count = 9.0
    nosy_names = ['paul.moore', 'vstinner', 'tim.golden', 'methane', 'zach.ware', 'serhiy.storchaka', 'eryksun', 'steve.dower', 'Paul Monson']
    pr_nums = ['13069', '13072', '13110', '13211', '13230', '13240', '13807']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue36778'
    versions = ['Python 3.8']

    @paulmon
    Mannequin Author

    paulmon mannequin commented May 2, 2019

    Windows desktop SKUs have a default ANSI codepage (returned by GetACP()) of 1252 (Western European). Windows IoT Core and Windows Nano Server have a default codepage of 65001 (UTF-8).

    This causes test_site.StartupImportTests.test_startup_imports to fail on Windows IoT Core and Windows Nano Server because cp65001.py is loaded instead of the frozen cp1252.py at startup.

    I tried changing the default codepage to 65001 on my dev machine and rebuilding Python, but as far as I could tell it had no effect on the generated frozen importlib modules.

    The simplest solution would be to skip test_startup_imports, or change it so that it passes, when locale.getpreferredencoding() returns 'cp65001'.
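The skip-when-cp65001 idea could look roughly like the sketch below. It is only an illustration of the proposal, not the change that was eventually merged; the helper `prefers_cp65001` and the test class name are hypothetical.

```python
import locale
import unittest


def prefers_cp65001(encoding=None):
    """Return True if the preferred encoding is code page 65001.

    Hypothetical helper: `encoding` can be passed explicitly for testing;
    by default it is taken from locale.getpreferredencoding().
    """
    if encoding is None:
        encoding = locale.getpreferredencoding(False)
    return encoding.lower() == "cp65001"


class StartupImportSketch(unittest.TestCase):
    @unittest.skipIf(prefers_cp65001(),
                     "encodings.cp65001 imports functools at startup")
    def test_startup_imports(self):
        # Placeholder for the real check in Lib/test/test_site.py.
        pass
```

Skipping hides the extra imports rather than removing them, which is the trade-off the rest of the thread discusses.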

    @paulmon paulmon mannequin added 3.8 only security fixes tests Tests in the Lib/test dir OS-windows type-bug An unexpected behavior, bug, or error labels May 2, 2019
    @methane
    Member

    methane commented May 3, 2019

    Could you paste how the test fails?

    @paulmon
    Mannequin Author

    paulmon mannequin commented May 3, 2019

    ======================================================================
    FAIL: test_startup_imports (test.test_site.StartupImportTests)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "c:\docker\pythond\lib\test\test_site.py", line 542, in test_startup_imports
        self.assertFalse(modules.intersection(collection_mods), stderr)
    AssertionError: {'operator', 'keyword', 'functools', 'heapq', 'collections', 'reprlib'} is not false : import _frozen_importlib # frozen
    import _imp # builtin
    import '_thread' # <class '_frozen_importlib.BuiltinImporter'>
    import '_warnings' # <class '_frozen_importlib.BuiltinImporter'>
    import '_weakref' # <class '_frozen_importlib.BuiltinImporter'>
    import '_frozen_importlib_external' # <class '_frozen_importlib.FrozenImporter'>
    import '_io' # <class '_frozen_importlib.BuiltinImporter'>
    import 'marshal' # <class '_frozen_importlib.BuiltinImporter'>
    import 'nt' # <class '_frozen_importlib.BuiltinImporter'>
    import _thread # previously loaded ('_thread')
    import '_thread' # <class '_frozen_importlib.BuiltinImporter'>
    import _weakref # previously loaded ('_weakref')
    import '_weakref' # <class '_frozen_importlib.BuiltinImporter'>
    import 'winreg' # <class '_frozen_importlib.BuiltinImporter'>
    # installing zipimport hook
    import 'time' # <class '_frozen_importlib.BuiltinImporter'>
    import 'zipimport' # <class '_frozen_importlib.FrozenImporter'>
    # installed zipimport hook
    # c:\docker\pythond\lib\encodings\__pycache__\__init__.cpython-38.pyc matches c:\docker\pythond\lib\encodings\__init__.py
    # code object from 'c:\\docker\\pythond\\lib\\encodings\\__pycache__\\__init__.cpython-38.pyc'
    # c:\docker\pythond\lib\__pycache__\codecs.cpython-38.pyc matches c:\docker\pythond\lib\codecs.py
    # code object from 'c:\\docker\\pythond\\lib\\__pycache__\\codecs.cpython-38.pyc'
    import '_codecs' # <class '_frozen_importlib.BuiltinImporter'>
    import 'codecs' # <_frozen_importlib_external.SourceFileLoader object at 0x01D9DBD0>
    # c:\docker\pythond\lib\encodings\__pycache__\aliases.cpython-38.pyc matches c:\docker\pythond\lib\encodings\aliases.py
    # code object from 'c:\\docker\\pythond\\lib\\encodings\\__pycache__\\aliases.cpython-38.pyc'
    import 'encodings.aliases' # <_frozen_importlib_external.SourceFileLoader object at 0x01EFF900>
    import 'encodings' # <_frozen_importlib_external.SourceFileLoader object at 0x01D9DA50>
    # c:\docker\pythond\lib\encodings\__pycache__\utf_8.cpython-38.pyc matches c:\docker\pythond\lib\encodings\utf_8.py
    # code object from 'c:\\docker\\pythond\\lib\\encodings\\__pycache__\\utf_8.cpython-38.pyc'
    import 'encodings.utf_8' # <_frozen_importlib_external.SourceFileLoader object at 0x01D9DCC0>
    import '_signal' # <class '_frozen_importlib.BuiltinImporter'>
    # c:\docker\pythond\lib\encodings\__pycache__\cp65001.cpython-38.pyc matches c:\docker\pythond\lib\encodings\cp65001.py
    # code object from 'c:\\docker\\pythond\\lib\\encodings\\__pycache__\\cp65001.cpython-38.pyc'
    # c:\docker\pythond\lib\__pycache__\functools.cpython-38.pyc matches c:\docker\pythond\lib\functools.py
    # code object from 'c:\\docker\\pythond\\lib\\__pycache__\\functools.cpython-38.pyc'
    # c:\docker\pythond\lib\__pycache__\abc.cpython-38.pyc matches c:\docker\pythond\lib\abc.py
    # code object from 'c:\\docker\\pythond\\lib\\__pycache__\\abc.cpython-38.pyc'
    import '_abc' # <class '_frozen_importlib.BuiltinImporter'>
    import 'abc' # <_frozen_importlib_external.SourceFileLoader object at 0x01F16FC0>
    # c:\docker\pythond\lib\collections\__pycache__\__init__.cpython-38.pyc matches c:\docker\pythond\lib\collections\__init__.py
    # code object from 'c:\\docker\\pythond\\lib\\collections\\__pycache__\\__init__.cpython-38.pyc'
    # c:\docker\pythond\lib\__pycache__\_collections_abc.cpython-38.pyc matches c:\docker\pythond\lib\_collections_abc.py
    # code object from 'c:\\docker\\pythond\\lib\\__pycache__\\_collections_abc.cpython-38.pyc'
    import '_collections_abc' # <_frozen_importlib_external.SourceFileLoader object at 0x01F423C0>
    # c:\docker\pythond\lib\__pycache__\operator.cpython-38.pyc matches c:\docker\pythond\lib\operator.py
    # code object from 'c:\\docker\\pythond\\lib\\__pycache__\\operator.cpython-38.pyc'
    import '_operator' # <class '_frozen_importlib.BuiltinImporter'>
    import 'operator' # <_frozen_importlib_external.SourceFileLoader object at 0x01F4D630>
    # c:\docker\pythond\lib\__pycache__\keyword.cpython-38.pyc matches c:\docker\pythond\lib\keyword.py
    # code object from 'c:\\docker\\pythond\\lib\\__pycache__\\keyword.cpython-38.pyc'
    import 'keyword' # <_frozen_importlib_external.SourceFileLoader object at 0x01F58810>
    # c:\docker\pythond\lib\__pycache__\heapq.cpython-38.pyc matches c:\docker\pythond\lib\heapq.py
    # code object from 'c:\\docker\\pythond\\lib\\__pycache__\\heapq.cpython-38.pyc'
    import '_heapq' # <class '_frozen_importlib.BuiltinImporter'>
    import 'heapq' # <_frozen_importlib_external.SourceFileLoader object at 0x01F588D0>
    import 'itertools' # <class '_frozen_importlib.BuiltinImporter'>
    # c:\docker\pythond\lib\__pycache__\reprlib.cpython-38.pyc matches c:\docker\pythond\lib\reprlib.py
    # code object from 'c:\\docker\\pythond\\lib\\__pycache__\\reprlib.cpython-38.pyc'
    import 'reprlib' # <_frozen_importlib_external.SourceFileLoader object at 0x01F59900>
    import '_collections' # <class '_frozen_importlib.BuiltinImporter'>
    import 'collections' # <_frozen_importlib_external.SourceFileLoader object at 0x01F25810>
    import '_functools' # <class '_frozen_importlib.BuiltinImporter'>
    import 'functools' # <_frozen_importlib_external.SourceFileLoader object at 0x01EFFCC0>
    import 'encodings.cp65001' # <_frozen_importlib_external.SourceFileLoader object at 0x01EFF9F0>
    # c:\docker\pythond\lib\encodings\__pycache__\latin_1.cpython-38.pyc matches c:\docker\pythond\lib\encodings\latin_1.py
    # code object from 'c:\\docker\\pythond\\lib\\encodings\\__pycache__\\latin_1.cpython-38.pyc'
    import 'encodings.latin_1' # <_frozen_importlib_external.SourceFileLoader object at 0x01EFF810>
    # c:\docker\pythond\lib\__pycache__\io.cpython-38.pyc matches c:\docker\pythond\lib\io.py
    # code object from 'c:\\docker\\pythond\\lib\\__pycache__\\io.cpython-38.pyc'
    import 'io' # <_frozen_importlib_external.SourceFileLoader object at 0x01D88DB0>
    Python 3.8.0a3+ (heads/iot-merged-dirty:88716a51a3, Apr  5 2019, 11:11:18) [MSC v.1916 32 bit (ARM)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    # c:\docker\pythond\lib\__pycache__\site.cpython-38.pyc matches c:\docker\pythond\lib\site.py
    # code object from 'c:\\docker\\pythond\\lib\\__pycache__\\site.cpython-38.pyc'
    # c:\docker\pythond\lib\__pycache__\os.cpython-38.pyc matches c:\docker\pythond\lib\os.py
    # code object from 'c:\\docker\\pythond\\lib\\__pycache__\\os.cpython-38.pyc'
    # c:\docker\pythond\lib\__pycache__\stat.cpython-38.pyc matches c:\docker\pythond\lib\stat.py
    # code object from 'c:\\docker\\pythond\\lib\\__pycache__\\stat.cpython-38.pyc'
    import '_stat' # <class '_frozen_importlib.BuiltinImporter'>
    import 'stat' # <_frozen_importlib_external.SourceFileLoader object at 0x01F25990>
    # c:\docker\pythond\lib\__pycache__\ntpath.cpython-38.pyc matches c:\docker\pythond\lib\ntpath.py
    # code object from 'c:\\docker\\pythond\\lib\\__pycache__\\ntpath.cpython-38.pyc'
    # c:\docker\pythond\lib\__pycache__\genericpath.cpython-38.pyc matches c:\docker\pythond\lib\genericpath.py
    # code object from 'c:\\docker\\pythond\\lib\\__pycache__\\genericpath.cpython-38.pyc'
    import 'genericpath' # <_frozen_importlib_external.SourceFileLoader object at 0x01F9CDE0>
    import 'ntpath' # <_frozen_importlib_external.SourceFileLoader object at 0x01F9C5D0>
    import 'os' # <_frozen_importlib_external.SourceFileLoader object at 0x01F873F0>
    # c:\docker\pythond\lib\__pycache__\_sitebuiltins.cpython-38.pyc matches c:\docker\pythond\lib\_sitebuiltins.py
    # code object from 'c:\\docker\\pythond\\lib\\__pycache__\\_sitebuiltins.cpython-38.pyc'
    import '_sitebuiltins' # <_frozen_importlib_external.SourceFileLoader object at 0x01F87FC0>
    import 'site' # <_frozen_importlib_external.SourceFileLoader object at 0x01F16C60>
    # cleanup[3] wiping _functools
    # cleanup[3] wiping _collections
    # cleanup[3] wiping heapq
    # cleanup[3] wiping _heapq
    # destroy _heapq
    # cleanup[3] wiping _operator
    # cleanup[3] wiping _collections_abc
    # cleanup[3] wiping _abc
    # cleanup[3] wiping encodings.utf_8
    # cleanup[3] wiping encodings.aliases
    # cleanup[3] wiping codecs
    # cleanup[3] wiping _codecs
    # cleanup[3] wiping winreg
    # cleanup[3] wiping _weakref
    # cleanup[3] wiping _thread
    # cleanup[3] wiping nt
    # cleanup[3] wiping marshal
    # cleanup[3] wiping _io
    # cleanup[3] wiping _frozen_importlib_external
    # destroy io
    # destroy nt
    # destroy winreg
    # destroy marshal
    # cleanup[3] wiping _warnings
    # cleanup[3] wiping _imp
    # cleanup[3] wiping _frozen_importlib
    # destroy _frozen_importlib_external
    # destroy _imp
    # destroy _warnings
    # cleanup[3] wiping sys
    # clear builtins._
    # clear sys.path
    # clear sys.argv
    # clear sys.ps1
    # clear sys.ps2
    # clear sys.last_type
    # clear sys.last_value
    # clear sys.last_traceback
    # clear sys.path_hooks
    # clear sys.path_importer_cache
    # clear sys.meta_path
    # clear sys.__interactivehook__
    # clear sys.flags
    # clear sys.float_info
    # restore sys.stdin
    # restore sys.stdout
    # restore sys.stderr
    # cleanup[2] removing sys
    # cleanup[2] removing builtins
    # cleanup[2] removing _frozen_importlib
    # cleanup[2] removing _imp
    # cleanup[2] removing _warnings
    # cleanup[2] removing _frozen_importlib_external
    # cleanup[2] removing _io
    # cleanup[2] removing marshal
    # cleanup[2] removing nt
    # cleanup[2] removing _thread
    # cleanup[2] removing _weakref
    # cleanup[2] removing winreg
    # cleanup[2] removing time
    # cleanup[2] removing zipimport
    # destroy zipimport
    # cleanup[2] removing _codecs
    # cleanup[2] removing codecs
    # cleanup[2] removing encodings.aliases
    # cleanup[2] removing encodings
    # destroy encodings
    # cleanup[2] removing encodings.utf_8
    # cleanup[2] removing _signal
    # cleanup[2] removing __main__
    # destroy __main__
    # cleanup[2] removing _abc
    # cleanup[2] removing abc
    # cleanup[2] removing _collections_abc
    # cleanup[2] removing _operator
    # cleanup[2] removing operator
    # destroy operator
    # cleanup[2] removing keyword
    # destroy keyword
    # cleanup[2] removing _heapq
    # cleanup[2] removing heapq
    # cleanup[2] removing itertools
    # cleanup[2] removing reprlib
    # destroy reprlib
    # cleanup[2] removing _collections
    # cleanup[2] removing collections
    # destroy collections
    # cleanup[2] removing _functools
    # cleanup[2] removing functools
    # cleanup[2] removing encodings.cp65001
    # cleanup[2] removing encodings.latin_1
    # cleanup[2] removing io
    # destroy io
    # cleanup[2] removing _stat
    # cleanup[2] removing stat
    # cleanup[2] removing genericpath
    # cleanup[2] removing ntpath
    # cleanup[2] removing os.path
    # cleanup[2] removing os
    # cleanup[2] removing _sitebuiltins
    # cleanup[2] removing site
    # destroy site
    # destroy time
    # destroy _signal
    # destroy itertools
    # destroy _sitebuiltins
    # destroy abc
    # destroy ntpath
    # destroy _stat
    # destroy os
    # destroy stat
    # destroy genericpath
    # cleanup[3] wiping encodings.latin_1
    # cleanup[3] wiping encodings.cp65001
    # destroy functools
    # cleanup[3] wiping builtins
    # destroy _functools
    # destroy _collections_abc
    # destroy _operator
    # destroy heapq
    # destroy _weakref
    # destroy _collections
    # destroy _thread
    # destroy _abc
    # destroy _frozen_importlib
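The failure above comes from comparing the modules loaded by a fresh interpreter against a set of "collections-related" modules. A minimal sketch of that kind of check follows, assuming an isolated subprocess is representative; the function name `startup_modules` and the exact contents of `COLLECTION_MODS` are assumptions modeled on the names in the assertion message, not a copy of test_site.py.

```python
import ast
import subprocess
import sys

# Modules the startup-time work tries to keep out of a bare interpreter.
# This set is an assumption based on the names in the failure above.
COLLECTION_MODS = {
    "_collections", "collections", "functools", "heapq",
    "itertools", "keyword", "operator", "reprlib",
} - set(sys.builtin_module_names)


def startup_modules():
    """Return the set of module names loaded by a fresh isolated interpreter."""
    proc = subprocess.run(
        [sys.executable, "-I", "-c", "import sys; print(sorted(sys.modules))"],
        capture_output=True, text=True, check=True,
    )
    return set(ast.literal_eval(proc.stdout))


if __name__ == "__main__":
    unexpected = startup_modules() & COLLECTION_MODS
    print("unexpected startup imports:", sorted(unexpected))
```

An empty result means none of the listed modules were pulled in during startup on the machine where it runs.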

    @methane
    Member

    methane commented May 3, 2019

    @victor It seems you added cp65001 as a Windows-only encoding in bpo-13216.

    What do you think about removing the cp65001 encoding and adding a 'cp65001' -> 'utf_8' alias, which would be available on all platforms?

    @vstinner
    Member

    vstinner commented May 4, 2019

    cp65001 is *not* utf-8: Microsoft decided to handle surrogates differently
    for some reason.

    @eryksun
    Contributor

    eryksun commented May 4, 2019

    cp65001 is *not* utf-8: Microsoft decided to handle surrogates
    differently for some reason.

    Do you mean valid UTF-16 surrogate pairs? For example:

        >>> import codecs
        >>> codecs.code_page_encode(65001, '\ud800\udc00')
        (b'\xf0\x90\x80\x80', 2)

    PyUnicode_AsUnicodeAndSize is neutral about storing surrogate codes in a 16-bit wchar_t string. In particular, the Python string in this case contains two surrogate codes, but they're passed to WideCharToMultiByte as a UTF-16 surrogate pair for the single character U+10000.
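The same relationship can be checked without Windows. The sketch below does not call codecs.code_page_encode; it only uses the portable UTF-16 and UTF-8 codecs to show that the two surrogate codes form the single character U+10000, and that a lone surrogate is a different case.

```python
# A valid surrogate pair denotes one code point; U+D800 U+DC00 is U+10000.
pair = "\ud800\udc00"

# Round-trip through UTF-16 with "surrogatepass" to combine the pair.
combined = pair.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
assert combined == "\U00010000"

# UTF-8 encodes that single character as the same four bytes shown above.
assert combined.encode("utf-8") == b"\xf0\x90\x80\x80"

# A lone surrogate, by contrast, is rejected by the strict UTF-8 codec.
try:
    "\ud800".encode("utf-8")
except UnicodeEncodeError:
    lone_surrogate_rejected = True
else:
    lone_surrogate_rejected = False
```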

    Anyway, it seems to me this issue will be resolved if cp65001.py is rewritten without functools.partial.
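A rough sketch of what "without functools.partial" could mean is shown below. Since codecs.code_page_encode only exists on Windows, `_encode_with_code_page` is a hypothetical stand-in that simply delegates to the UTF-8 codec so the example runs anywhere; it is not the contents of cp65001.py.

```python
import codecs

CODE_PAGE = 65001


def _encode_with_code_page(code_page, text, errors="strict"):
    """Hypothetical stand-in for codecs.code_page_encode (Windows-only).

    It ignores `code_page` and uses the UTF-8 codec so that the sketch
    runs on any platform.
    """
    return codecs.utf_8_encode(text, errors)


# Before: encode = functools.partial(_encode_with_code_page, CODE_PAGE)
# After: a plain function, so functools never has to be imported.
def encode(text, errors="strict"):
    return _encode_with_code_page(CODE_PAGE, text, errors)
```

The plain wrapper keeps the same call signature as the partial object while dropping the import that drags in collections and its dependencies.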

    @serhiy-storchaka
    Member

    I think it is better to just make the check in the test conditional. It already contains some macOS-specific conditions.

    @eryksun
    Contributor

    eryksun commented May 4, 2019

    I think it is better to just make the check in the test conditional.

    Okay. The test verifies work done to minimize interpreter startup time, but probably the relative cost of importing functools (and thus collections et al) isn't significant compared to the overall cost of spawning a process in a Windows desktop environment. That may not be the case for Nano Server and IoT Core.
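One way to gauge the import cost mentioned above is the interpreter's `-X importtime` option (available since Python 3.7). The wrapper below is a small sketch; the function name `import_time_report` is made up for illustration, and the numbers it prints depend entirely on the machine.

```python
import subprocess
import sys


def import_time_report(module):
    """Run a fresh interpreter with -X importtime and return its report."""
    proc = subprocess.run(
        [sys.executable, "-X", "importtime", "-c", f"import {module}"],
        capture_output=True, text=True, check=True,
    )
    # The per-module timings are written to stderr, not stdout.
    return proc.stderr


if __name__ == "__main__":
    for line in import_time_report("functools").splitlines():
        if "functools" in line or "collections" in line:
            print(line)
```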

    @vstinner
    Member

    vstinner commented May 6, 2019

    Paul Monson: I'm unable to reproduce exactly your issue, but I tried to reproduce it partially using PYTHONIOENCODING=cp65001.

    My PR 13110 avoids "import functools" at startup. Can you please try it and check if it fixes test_site?

    @vstinner
    Member

    vstinner commented May 6, 2019

    Victor:

    cp65001 is *not* utf-8: Microsoft decided to handle surrogates differently for some reason.

    Eryk:

    Do you mean valid UTF-16 surrogate pairs? (...)

    Code page 65001 handled lone surrogates differently on Windows XP and older. This changed in Windows Vista:
    https://unicodebook.readthedocs.io/operating_systems.html#encode-and-decode-functions

    Steve Dower removed support for Vista from test_codecs.py 3 years ago:

    commit f5aba58
    Author: Steve Dower <steve.dower@microsoft.com>
    Date: Tue Sep 6 19:42:27 2016 -0700

    Issue bpo-27959: Adds oem encoding, alias ansi to mbcs, move aliasmbcs to codec lookup
    

    Maybe it's time to remove Lib/encodings/cp65001.py and add an alias cp65001 => utf_8 in Lib/encodings/aliases.py? See bpo-32592.

    @paulmon paulmon mannequin changed the title from "test_site.StartupImportTests.test_startup_imports fails if default code page is not cp1252" to "test_site.StartupImportTests.test_startup_imports fails if default code page is cp65001" May 6, 2019
    @paulmon
    Mannequin Author

    paulmon mannequin commented May 6, 2019

    Okay. The test verifies work done to minimize interpreter startup time, but probably the relative cost of importing functools (and thus collections et al) isn't significant compared to the overall cost of spawning a process in a Windows desktop environment. That may not be the case for Nano Server and IoT Core.

    Is there an easy way to measure this?

    PYTHONIOENCODING=cp65001

    I tried setting PYTHONIOENCODING=cp1252 on Windows IoT Core as a workaround and it didn't work.

    Victor> My PR 13110 avoids "import functools" at startup. Can you please try it and check if it fixes test_site?

    I tried the PR and it fixes test_startup_imports, which seems promising. However, the PR breaks other test_site tests on Windows IoT Core, the same ones you pointed out in the PR discussion.

    @methane
    Member

    methane commented May 7, 2019

    FYI, I expect cp65001 to be used more widely in the near future,
    because a non-UTF-8 default encoding hurts the developer experience (DX),
    and Microsoft has been trying to improve DX in recent years.

    Today, Microsoft announced a new Terminal application.
    It seems to use SetConsoleOutputCP(65001) and SetConsoleCP(65001).

    I think treating cp65001 as a proper "UTF-8" locale is better for all
    Windows developers.

    @paulmon
    Mannequin Author

    paulmon mannequin commented May 8, 2019

    cp65001 is the default codepage on Windows IoT Core and Windows NanoServer.

    There is also an option in control panel in Windows desktop 1809 (version 17763) and greater which changes the default codepage to cp65001.

    1. Run control.exe
    2. Click Clock and Region > change date, time or number formats
    3. Click the Administrative tab
    4. Click "Change System locale..." button
    5. Check "Beta: Use Unicode UTF-8 for worldwide language support"
    6. Click OK twice.
    7. You will be prompted to reboot.
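After the reboot, the setting can be confirmed from Python by asking Windows for the ANSI code page. This is a minimal sketch that calls GetACP() through ctypes; the function name `ansi_code_page` is made up here, and on non-Windows platforms it simply returns None.

```python
import sys


def ansi_code_page():
    """Return the ANSI code page from GetACP(), or None off Windows."""
    if sys.platform != "win32":
        return None
    import ctypes
    return ctypes.windll.kernel32.GetACP()


if __name__ == "__main__":
    # With the UTF-8 option enabled this should report 65001.
    print("ANSI code page:", ansi_code_page())
```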

    Code page 65001 handled lone surrogates differently on Windows XP and older.

    If I read the docs correctly, a lone surrogate is an error. I don't think a corner case like handling errors differently makes cp65001 not UTF-8. Am I misunderstanding this point?
    Also, why is Windows XP still relevant in this discussion?

    @zooba
    Member

    zooba commented May 8, 2019

    The XP/Vista change is just context - we don't have to worry about OSes that old any more.

    If we remove the functools.partial call, does that help?

    @paulmon
    Mannequin Author

    paulmon mannequin commented May 8, 2019

    Removing import functools from cp65001.py fixes test_startup_imports.

    Victor proposed this PR: #13110
    but the new test_codecs test fails, I think because it passes self on to the lambda.

    I tried to build on Victor's change but there is still one test failure I haven't tracked down yet: #13211

    FAIL: test_incremental_surrogatepass (test.test_codecs.CP65001Test)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "C:\master\pythond\lib\test\test_codecs.py", line 436, in test_incremental_surrogatepass
        self.assertEqual(dec.decode(data[i:], True), '\uD901')
    AssertionError: '' != '\ud901'
    + \ud901
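For context, the behaviour this test expects can be reproduced with the ordinary UTF-8 codec. The sketch below is modeled on the assertion in the traceback and uses "utf-8" as a stand-in for cp65001 (an assumption, so that it runs on any platform): the incremental decoder should buffer a partial sequence and only emit the surrogate once the final chunk arrives.

```python
import codecs

ENCODING = "utf-8"  # assumption: stands in for cp65001 on any platform

# A lone high surrogate encoded with surrogatepass is three bytes.
data = "\ud901".encode(ENCODING, "surrogatepass")

results = []
for i in range(1, len(data)):
    dec = codecs.getincrementaldecoder(ENCODING)("surrogatepass")
    first = dec.decode(data[:i])       # incomplete input: nothing emitted yet
    rest = dec.decode(data[i:], True)  # final chunk completes the character
    results.append((first, rest))
```

The failure reported above corresponds to the second value coming back as an empty string instead of '\ud901'.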

    @eryksun
    Contributor

    eryksun commented May 9, 2019

    FYI, I expect cp65001 to be used more widely in the near future,
    [...]
    It seems to use SetConsoleOutputCP(65001) and SetConsoleCP(65001).

    Unless PYTHONLEGACYWINDOWSSTDIO is defined, Python 3.6+ doesn't use the console's codepage-based interface (except for low-level os.read and os.write). Console files use the wide-character console API internally and have a "utf-8" encoding. "cp65001" isn't a factor in this context.

    This issue probably occurs due to the encoding returned by locale.getpreferredencoding(). This calls _locale._getdefaultlocale, which returns a tuple that mixes the user locale with the system ANSI codepage. For example, with ANSI set to UTF-8 (Windows 10):

        >>> import _locale
        >>> _locale._getdefaultlocale()
        ('en_GB', 'cp65001')

    The Universal CRT special cases CP_UTF8 (codepage 65001) as "utf8" and accepts "utf-8" as an alias. For example, after setting the ANSI codepage to UTF-8:

        >>> import locale
        >>> locale.setlocale(locale.LC_CTYPE, '')
        'English_United Kingdom.utf8'

    Python could similarly special case CP_UTF8 as "utf-8" in _locale._getdefaultlocale.

    @methane
    Member

    methane commented May 9, 2019

    @eryk I didn't say the new Terminal will cause this issue. I know about ConsoleIO too.

    I just meant that Microsoft uses cp65001 more widely for better UTF-8 support nowadays.
    So I want to make cp65001 an alias of UTF-8.

    Python could similarly special case CP_UTF8 as "utf-8" in _locale._getdefaultlocale.

    I like this idea too.

    @vstinner
    Member

    vstinner commented May 9, 2019

    I wrote PR 13230 to remove Lib/encodings/cp65001.py and simply reuse Lib/encodings/utf_8.py.
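The effect of treating cp65001 as an alias can be observed through the codec registry. The sketch below assumes Python 3.8 or later, where the alias exists on every platform; on older versions outside Windows the lookup raises LookupError instead.

```python
import codecs
from encodings import aliases

# With the alias in place, the alias table maps the normalized name
# "cp65001" to the module name "utf_8".
alias_target = aliases.aliases.get("cp65001")

# codecs.lookup() then resolves the name to the ordinary UTF-8 codec.
info = codecs.lookup("cp65001")

# Encoding through the alias gives plain UTF-8 bytes (here, the euro sign).
encoded = "\u20ac".encode("cp65001")
```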

    @vstinner
    Member

    vstinner commented May 9, 2019

    My PR 13110 (avoid functools) makes codecs.lookup('cp65001').encode() 2.7x slower:
    #13110 (comment)
    417 ns +- 17 ns

    My PR 13230 (remove cp65001.py) makes it 1.5x faster :-)
    #13230 (comment)
    105 ns +- 3 ns

    The reference is: 156 ns +- 3 ns.
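The quoted ratios follow directly from the three timings; the short check below just redoes that arithmetic using the numbers given above (the variable names are only for illustration).

```python
# Timings quoted above, in nanoseconds per call.
reference_ns = 156
avoid_functools_ns = 417   # PR 13110
alias_to_utf8_ns = 105     # PR 13230

slowdown = avoid_functools_ns / reference_ns   # about 2.67
speedup = reference_ns / alias_to_utf8_ns      # about 1.49

print(f"PR 13110: {slowdown:.1f}x slower")
print(f"PR 13230: {speedup:.1f}x faster")
```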

    @vstinner
    Member

    vstinner commented May 9, 2019

    Python could similarly special case CP_UTF8 as "utf-8" in _locale._getdefaultlocale.

    I dislike lying in the locale module. This change is basically useless with my PR 13230.

    @eryksun
    Contributor

    eryksun commented May 9, 2019

    I dislike lying in the locale module. This change is basically useless
    with my PR 13230.

    Yes, functionally it's no different than using 'cp65001' as an alias. That said, the CRT special cases 65001 as "utf8":

        >>> import ctypes, locale
        >>> locale.setlocale(locale.LC_CTYPE, '')
        'English_United Kingdom.utf8'
        >>> crt_locale = ctypes.CDLL('api-ms-win-crt-locale-l1-1-0', use_errno=True)
        >>> crt_locale.___lc_codepage_func()
        65001

    So the suggested change makes the locale module internally consistent on Windows and more transparent for anyone who doesn't know off the top of their head that "cp65001" is just UTF-8.

    @paulmon
    Mannequin Author

    paulmon mannequin commented May 10, 2019

    I can verify that PR 13110 fixes the issue with test_startup_imports on Windows IoT Core ARM32

    @paulmon
    Mannequin Author

    paulmon mannequin commented May 10, 2019

    Sorry that was supposed to say:
    I can verify that PR 13230 fixes the issue with test_startup_imports on Windows IoT Core ARM32

    @methane
    Member

    methane commented May 10, 2019

    I dislike lying in the locale module. This change is basically useless with my PR 13230.

    Note that it is Python, not Windows, that produces the "cpNNN" encoding name:

    PyOS_snprintf(encoding, sizeof(encoding), "cp%u", GetACP());

    So I don't think it is a lie. It is just a question of which encoding name we should choose when GetACP() returns 65001.
    With your PR 13230, cp65001 is truly utf-8, so returning "utf-8" seems like the right behavior.
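Expressed in Python, the naming choice being debated could be sketched as follows. `encoding_name_for_code_page` is a hypothetical helper that mirrors the "cp%u" formatting quoted above and adds the proposed special case; it is not code from the locale module.

```python
CP_UTF8 = 65001


def encoding_name_for_code_page(code_page):
    """Hypothetical helper: name the encoding for an ANSI code page.

    Mirrors the "cp%u" formatting, with the proposed special case that
    reports CP_UTF8 as "utf-8".
    """
    if code_page == CP_UTF8:
        return "utf-8"
    return f"cp{code_page}"
```

With this mapping, GetACP() returning 1252 still yields "cp1252", while 65001 is reported under the name most readers recognize.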

    @vstinner
    Member

    New changeset d267ac2 by Victor Stinner in branch 'master':
    bpo-36778: cp65001 encoding becomes an alias to utf_8 (GH-13230)
    d267ac2

    @vstinner
    Member

    About the ANSI code page, Lib/encodings/__init__.py calls _winapi.GetACP() to avoid relying on locale.getpreferredencoding() which lies when UTF-8 Mode is enabled:

                import _winapi
                ansi_code_page = "cp%s" % _winapi.GetACP()
                if encoding == ansi_code_page:
                    import encodings.mbcs
                    return encodings.mbcs.getregentry()

    INADA-san:

    So I don't think it is a lie. It is just a question of which encoding name we should choose when GetACP() returns 65001.
    With your PR 13230, cp65001 is truly utf-8, so returning "utf-8" seems like the right behavior.

    Well, feel free to propose a PR. I have no strong opinion on this level of detail :-)

    @vstinner
    Member

    Paul Monson: Your initial issue has been fixed in the master branch.

    I'm not sure what Windows IoT Core and Windows Nano Server are. Do you care about Python 3.7? If someone wants to support running test_site with the ANSI code page set to 65001, I suggest fixing test_site directly, as PR 13072 does, in Python 3.7. My attempt to avoid functools made the cp65001 codec way slower. Fixing one specific test should not make Python that much slower ;-)

    @paulmon
    Mannequin Author

    paulmon mannequin commented May 10, 2019

    Thanks Victor! Since we aren't backporting ARM32 changes, I don't think it's important to fix this test in 3.7. I am trying to get the buildbot tests for Windows ARM32 to zero errors.

    Windows IoT Core runs on Raspberry Pi and similar devices: https://developer.microsoft.com/en-us/windows/iot

    Windows NanoServer is a very small version of Windows Server for running in Docker containers hosted on Windows Server.

    @vstinner
    Member

    Since we aren't backporting ARM32 changes, I don't think it's important to fix this test in 3.7. I am trying to get the buildbot tests for Windows ARM32 to zero errors.

    OK, thanks. I'm closing the issue.

    @vstinner
    Member

    New changeset 3aef48e by Victor Stinner in branch 'master':
    bpo-36778: Update cp65001 codec documentation (GH-13240)
    3aef48e

    @vstinner
    Member

    vstinner commented Jun 4, 2019

    New changeset ca612a9 by Victor Stinner in branch 'master':
    bpo-36778: Remove outdated comment from CodePageTest (GH-13807)
    ca612a9

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022