classification
Title: PyUnicode_AsEncodedString, PyUnicode_Decode: add fast-path for "us-ascii" encoding
Type: performance
Components: Interpreter Core, Unicode
Versions: Python 3.7, Python 3.6
process
Status: closed
Resolution: fixed
Assigned To: vstinner
Nosy List: ezio.melotti, koobs, martin.panter, python-dev, scop, serhiy.storchaka, steve.dower, vstinner
Priority: normal
Keywords: patch

Created on 2016-09-02 10:19 by vstinner, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name Uploaded Description
normalize_encoding.patch vstinner, 2016-09-02 10:19
Pull Requests
URL Status Linked
PR 4871 merged scop, 2017-12-14 20:44
PR 4881 merged python-dev, 2017-12-15 10:20
Messages (17)
msg274222 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-09-02 10:19
The "us-ascii" encoding is an alias to the Python ASCII encoding. PyUnicode_AsEncodedString() and PyUnicode_Decode() functions have a fast-path for the "ascii" string, but not for "us-ascii".

The attached patch also uses the fast path for "us-ascii". It is a more generic change than issue #27915. The "us-ascii" name is common in the email and xml.etree modules.
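
The alias relationship described above can be checked from Python. This is a small illustration using only the standard codecs module; the sample text is made up for the example.

```python
import codecs

# Both names resolve to the same codec; "us-ascii" is only an alias.
alias_info = codecs.lookup("us-ascii")
ascii_info = codecs.lookup("ascii")
print(alias_info.name, ascii_info.name)

# Encoding and decoding give identical results under either name.
text = "Subject: hello"  # sample text, chosen for illustration
encoded = text.encode("us-ascii")
assert encoded == text.encode("ascii")
assert encoded.decode("us-ascii") == text
```

The results are the same either way; the issue is only about whether the C-level fast path is taken before falling back to the full codec lookup.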

Other changes of the patch:

* Rewrite _Py_normalize_encoding() as a C implementation of encodings.normalize_encoding(). For example, " utf-8 " is now normalized to "utf_8". So the fast path is now used for more name variants of the same encoding.
* Avoid strcpy() when encoding is NULL: call the UTF-8 codec directly
* Reorder encodings: UTF-8, ASCII, MBCS, Latin1, UTF-16
* Remove fast-path for UTF-32: seriously, nobody uses this codec. Latin9 is much faster but has no fast-path.
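
As a rough illustration of the normalization rule mentioned in the first bullet, here is a pure-Python sketch. It is not the C code from the patch: the function name is hypothetical, and lower-casing the result is an assumption made here so that later comparisons are case-insensitive.

```python
def normalize_encoding_sketch(name):
    """Hypothetical sketch: collapse runs of punctuation/whitespace into
    a single "_" and drop them at both ends of the name."""
    out = []
    pending_separator = False
    for ch in name:
        if ch.isascii() and (ch.isalnum() or ch == "."):
            # Emit one "_" for any run of separators seen since the
            # previous kept character, but never at the very start.
            if pending_separator and out:
                out.append("_")
            out.append(ch.lower())  # assumption: result is lower-cased
            pending_separator = False
        else:
            pending_separator = True
    return "".join(out)


print(normalize_encoding_sketch(" utf-8 "))
print(normalize_encoding_sketch("US-ASCII"))
```

With a rule like this, spelling variants of the same encoding name collapse to one string, so a single comparison can cover all of them.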
msg274232 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-09-02 11:37
See also get_standard_encoding() in Python/codecs.c. I suppose it is faster.

UTF-32 is rarely used as an external encoding, but it is still used as an internal encoding in some programming languages and libraries (e.g. wchar_t* in C and std::wstring in C++ on Linux). The codec itself is very fast. I would add a fast path for all UTF encodings (except UTF-7).
msg274455 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2016-09-05 22:48
New changeset 99818330b4c0 by Victor Stinner in branch 'default':
Issue #27938: Add a fast-path for us-ascii encoding
https://hg.python.org/cpython/rev/99818330b4c0
msg274456 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-09-05 22:51
> See also get_standard_encoding() in Python/codecs.c. I suppose it is faster.

I understand that PyCodec_SurrogatePassErrors() is already called with a normalized encoding name.

With my enhanced _Py_normalize_encoding(), strange syntaxes like " utf 8 " also take the fast path.


> UTF-32 is rarely used as external encoding, but ...

OK, I used the same design as get_standard_encoding() to match the "utf" prefix, so having fast paths for UTF-16 and UTF-32 doesn't add extra strcmp() calls for names such as "latin9".

I pushed my change, so I close the issue.
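
The prefix-matching design mentioned in this message could look roughly like the following Python sketch. It is an illustration only, not the committed C code: the function name, the returned labels, and the set of Latin-1 spellings are assumptions, and the input is expected to be an already normalized name such as "utf_8" or "us_ascii".

```python
def fast_path_for(normalized):
    """Hypothetical sketch: return a label for the codec that could be
    called directly, or None when the generic lookup must be used."""
    if normalized.startswith("utf"):
        # One prefix test covers every UTF variant, so other names
        # (for example "latin9") pay for a single check here.
        rest = normalized[3:]
        if rest.startswith("_"):
            rest = rest[1:]
        if rest in ("8", "16", "32"):
            return "utf-" + rest
        return None  # e.g. "utf_7" has no fast path
    if normalized in ("ascii", "us_ascii"):
        return "ascii"
    if normalized in ("latin1", "latin_1"):  # assumed spellings
        return "latin-1"
    return None


for name in ("utf_8", "utf32", "us_ascii", "utf_7", "latin9"):
    print(name, "->", fast_path_for(name))
```

Names that do not start with "utf" skip the whole UTF branch, which is the point made above about not adding comparisons for other encodings.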
msg274512 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-09-06 04:23
It seems this change is the cause of the FreeBSD buildbot failures. From memory, both failing cases involve sending or receiving non-ASCII bytes in child Python processes.

http://buildbot.python.org/all/builders/AMD64%20FreeBSD%20CURRENT%20Non-Debug%203.x/builds/110/steps/test/logs/stdio

======================================================================
FAIL: test_non_ascii (test.test_cmd_line_script.CmdLineTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/home/buildbot/python/3.x.koobs-freebsd-current.nondebug/build/Lib/test/test_cmd_line_script.py", line 517, in test_non_ascii
    rc, stdout, stderr = assert_python_ok(script_name)
  File "/usr/home/buildbot/python/3.x.koobs-freebsd-current.nondebug/build/Lib/test/support/script_helper.py", line 139, in assert_python_ok
    return _assert_python(True, *args, **env_vars)
  File "/usr/home/buildbot/python/3.x.koobs-freebsd-current.nondebug/build/Lib/test/support/script_helper.py", line 125, in _assert_python
    err))
AssertionError: Process return code is 1
command line: ['/usr/home/buildbot/python/3.x.koobs-freebsd-current.nondebug/build/python', '-X', 'faulthandler', '-I', './@test_60885_tmp\udce7w\udcf0.py']

stdout:
---

---

stderr:
---
UnicodeEncodeError: 'ascii' codec can't encode character '\xe7' in position 17: ordinal not in range(128)
---

======================================================================
FAIL: test_nonascii (test.test_readline.TestReadline)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/home/buildbot/python/3.x.koobs-freebsd-current.nondebug/build/Lib/test/test_readline.py", line 203, in test_nonascii
    self.assertIn(b"text 't\\xeb'\r\n", output)
AssertionError: b"text 't\\xeb'\r\n" not found in bytearray(b"^A^B^B^B^B^B^B^B\t\tx\t\r\n[\x07\r\x07\x07\x07\x07\x07\x07\x07\x07x[\x08\x07\r\nresult \'x[\'\r\nhistory \'x[\'\r\n")
msg274692 - (view) Author: Kubilay Kocak (koobs) (Python triager) Date: 2016-09-07 00:52
Re-open and assign for regressions. Observed in all koobs-freebsd* buildbots (9/10/11) and build types. Issue is in default branch (add version 3.7)

First failing test run: http://buildbot.python.org/all/builders/AMD64%20FreeBSD%20CURRENT%20Non-Debug%203.x/builds/110
msg274720 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-09-07 02:56
Koobs, if you can, it would be good to understand where the failure is. My guess is that Python doesn’t like running a non-ASCII filename. The following is hopefully a simplified version of the test_cmd_line_script test case:

import os, subprocess, sys

script_name = os.fsdecode(b'./\xE7w\xF0.py')
script_file = open(script_name, 'w', encoding='utf-8')
script_file.write('print(ascii(__file__))\n')
script_file.close()

cmd_line = [sys.executable, '-X', 'faulthandler', '-I', script_name]
env = os.environ.copy()
env['TERM'] = ''
proc = subprocess.Popen(cmd_line, stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE, env=env)
out, err = proc.communicate()
print(proc.returncode)  # Should be 0 but FreeBSD has 1
print(repr(err))  # Error is about encoding 0xE7 with ASCII
print(repr(out))  # If executed, this would be the file name

Hopefully fixing the above problem will help with the test_readline failure. The readline test case does Readline (tab) completions involving non-ASCII text, and it seems that the Python completion routine is no longer being called.
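
For context on the file name used in the script above: when the filesystem encoding is ASCII, undecodable bytes are carried through str as lone surrogates by the "surrogateescape" error handler. The sketch below uses the ASCII codec explicitly instead of os.fsdecode() so that it behaves the same on any platform; that substitution is an assumption made for the example.

```python
raw = b"./\xE7w\xF0.py"

# Bytes >= 0x80 become lone surrogates U+DC80..U+DCFF.
name = raw.decode("ascii", errors="surrogateescape")
print(ascii(name))

# The same error handler restores the original bytes exactly.
assert name.encode("ascii", errors="surrogateescape") == raw

# A strict ASCII encode of the same string fails, which is the kind of
# error reported in the buildbot log.
try:
    name.encode("ascii")
    strict_ok = True
except UnicodeEncodeError as exc:
    strict_ok = False
    print(exc)
```

Any code path that encodes such a name without the surrogateescape handler will therefore raise UnicodeEncodeError.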
msg274732 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-09-07 03:12
Sorry, but I don't have enough information to fix the issue. I don't see how my change can break the two failing tests. Could you please try to collect more information manually?
msg274796 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-09-07 11:15
Maybe the Windows buildbot failures are related:

http://buildbot.python.org/all/builders/AMD64%20Windows7%20SP1%203.x/builds/8294/steps/test/logs/stdio

======================================================================
FAIL: test_create_at_shutdown_without_encoding (test.test_io.PyTextIOWrapperTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\test_io.py", line 3174, in test_create_at_shutdown_without_encoding
    self.assertIn(self.shutdown_error, err.decode())
AssertionError: 'LookupError: unknown encoding: ascii' not found in 'Exception ignored in: <bound method C.__del__ of <__main__.C object at 0x000000000123BF60>>\r\nTraceback (most recent call last):\r\n  File "<string>", line 12, in __del__\r\n  File "C:\\buildbot.python.org\\3.x.kloth-win64\\build\\lib\\_pyio.py", line 1934, in __init__\r\n  File "C:\\buildbot.python.org\\3.x.kloth-win64\\build\\lib\\encodings\\__init__.py", line 158, in _alias_mbcs\r\nImportError: sys.meta_path is None, Python is likely shutting down'

----------------------------------------------------------------------
msg274831 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2016-09-07 16:29
The Windows buildbot failures are partly my fault and partly Ben's fault (I created a new error message - Ben added it to the wrong test), so I'll go and prevent the error message.

No idea on the other issue. It doesn't repro for me, but since it seems to be FreeBSD readline related that isn't a surprise.
msg274834 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-09-07 16:46
> FAIL: test_create_at_shutdown_without_encoding (test.test_io.PyTextIOWrapperTest)

Steve fixed it:
---
changeset:   103229:47b4dbd451f5
tag:         tip
user:        Steve Dower <steve.dower@microsoft.com>
date:        Wed Sep 07 09:31:52 2016 -0700
files:       Lib/encodings/__init__.py Lib/test/test_io.py
description:
Issue #27959: Prevent ImportError from escaping codec search function
---

Its new search function now catches ImportError as expected.
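
The general pattern of a codec search function that does not let ImportError escape can be sketched as follows. This is not the code from the changeset: the function name and the "demoalias" encoding name are hypothetical, and the sketch simply maps that name to the ASCII codec.

```python
import codecs
import importlib


def guarded_search(name):
    """Hypothetical codec search function that never raises ImportError."""
    if name != "demoalias":
        return None  # not ours: let other search functions try
    try:
        # Imports can fail late in interpreter shutdown.
        module = importlib.import_module("encodings.ascii")
    except ImportError:
        return None  # report "not found" instead of propagating
    return module.getregentry()


codecs.register(guarded_search)
print(codecs.lookup("demoalias").name)
```

Returning None tells the codec registry to continue with the next search function, so a failed import ends in the usual LookupError rather than an unexpected ImportError.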
msg275574 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2016-09-10 06:14
New changeset 3b185df3a3e2 by Victor Stinner in branch 'default':
Fix check_force_ascii()
https://hg.python.org/cpython/rev/3b185df3a3e2
msg275576 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-09-10 06:31
> New changeset 3b185df3a3e2 by Victor Stinner in branch 'default':
> Fix check_force_ascii()
> https://hg.python.org/cpython/rev/3b185df3a3e2

@koobs: That's my tiny gift for your birthday. Happy Birthday! ;-) (It should fix FreeBSD buildbots.)
msg275577 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-09-10 06:33
Sorry for the little breakage of the FreeBSD buildbots; it seems to be OK now ;-)
msg275625 - (view) Author: Kubilay Kocak (koobs) (Python triager) Date: 2016-09-10 11:14
@Victor I was just checking this issue to copy the test command, to provide results to you both when I saw the lovely surprise. Thank you :)
msg308374 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-12-15 10:19
New changeset 297fd876aad8ef443d8992618de22c46dbda258b by Victor Stinner (Ville Skyttä) in branch 'master':
bpo-28393: Update encoding lookup docs wrt bpo-27938 (#4871)
https://github.com/python/cpython/commit/297fd876aad8ef443d8992618de22c46dbda258b
msg308392 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-12-15 14:23
New changeset 77bf6da7258b4a312e224860ea50ac010aa17c1e by Victor Stinner (Miss Islington (bot)) in branch '3.6':
bpo-28393: Update encoding lookup docs wrt bpo-27938 (GH-4871) (#4881)
https://github.com/python/cpython/commit/77bf6da7258b4a312e224860ea50ac010aa17c1e
History
Date User Action Args
2022-04-11 14:58:35 admin set github: 72125
2018-04-04 17:55:28 serhiy.storchaka set pull_requests: - pull_request6086
2018-04-04 17:52:44 scop set pull_requests: + pull_request6086
2017-12-15 14:23:26 vstinner set messages: + msg308392
2017-12-15 10:20:03 python-dev set pull_requests: + pull_request4777
2017-12-15 10:19:30 vstinner set messages: + msg308374
2017-12-14 20:44:51 scop set pull_requests: + pull_request4763
2016-09-10 11:14:31 koobs set messages: + msg275625
2016-09-10 06:33:25 vstinner set status: open -> closed; resolution: fixed; messages: + msg275577
2016-09-10 06:31:32 vstinner set messages: + msg275576
2016-09-10 06:14:53 python-dev set messages: + msg275574
2016-09-07 16:46:48 vstinner set messages: + msg274834
2016-09-07 16:29:00 steve.dower set nosy: + steve.dower; messages: + msg274831
2016-09-07 11:15:38 serhiy.storchaka set messages: + msg274796
2016-09-07 03:12:55 vstinner set status: closed -> open; messages: + msg274732
2016-09-07 02:56:36 martin.panter set messages: + msg274720
2016-09-07 00:52:31 koobs set versions: + Python 3.7; nosy: + koobs; messages: + msg274692; assignee: vstinner; resolution: fixed -> (no value)
2016-09-06 04:23:10 martin.panter set nosy: + martin.panter; messages: + msg274512
2016-09-05 22:51:07 vstinner set status: open -> closed; resolution: fixed; messages: + msg274456
2016-09-05 22:48:49 python-dev set nosy: + python-dev; messages: + msg274455
2016-09-02 11:37:41 serhiy.storchaka set messages: + msg274232
2016-09-02 10:19:34 vstinner create