classification
Title: test_tools: test_reindent_file_with_bad_encoding() fails RHEL7 on x86-64 and s390x with GCC 4.8.5 and LTO
Type:
Stage: resolved
Components: Tests
Versions: Python 3.10

process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: methane, pablogsal, vstinner
Priority: normal
Keywords:

Created on 2021-03-29 20:32 by vstinner, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Messages (12)
msg389739 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-03-29 20:32
https://buildbot.python.org/all/#/builders/244/builds/931
At commit 9b999479c0022edfc9835a8a1f06e046f3881048

(...)
test_reindent_file_with_bad_encoding (test.test_tools.test_reindent.ReindentTests) ... FAIL
(...)

======================================================================
FAIL: test_reindent_file_with_bad_encoding (test.test_tools.test_reindent.ReindentTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/dje/cpython-buildarea/3.x.edelsohn-rhel-z.lto-pgo/build/Lib/test/test_tools/test_reindent.py", line 29, in test_reindent_file_with_bad_encoding
    rc, out, err = assert_python_ok(self.script, '-r', bad_coding_path)
  File "/home/dje/cpython-buildarea/3.x.edelsohn-rhel-z.lto-pgo/build/Lib/test/support/script_helper.py", line 160, in assert_python_ok
    return _assert_python(True, *args, **env_vars)
  File "/home/dje/cpython-buildarea/3.x.edelsohn-rhel-z.lto-pgo/build/Lib/test/support/script_helper.py", line 145, in _assert_python
    res.fail(cmd_line)
  File "/home/dje/cpython-buildarea/3.x.edelsohn-rhel-z.lto-pgo/build/Lib/test/support/script_helper.py", line 72, in fail
    raise AssertionError("Process return code is %d\n"
AssertionError: Process return code is 1
command line: ['/home/dje/cpython-buildarea/3.x.edelsohn-rhel-z.lto-pgo/build/python', '-X', 'faulthandler', '-I', '/home/dje/cpython-buildarea/3.x.edelsohn-rhel-z.lto-pgo/build/Tools/scripts/reindent.py', '-r', '/home/dje/cpython-buildarea/3.x.edelsohn-rhel-z.lto-pgo/build/Lib/test/bad_coding.py']

stdout:
---

---

stderr:
---
SyntaxError: encoding problem: encoding
---




Could it be related to the following change?

commit 261a452a1300eeeae1428ffd6e6623329c085e2c
Author: Pablo Galindo <Pablogsal@gmail.com>
Date:   Sun Mar 28 23:48:05 2021 +0100

    bpo-25643: Refactor the C tokenizer into smaller, logical units (GH-25050)
msg389742 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-03-29 20:37
Oh. Or maybe it's related to:

commit 4827483f47906fecee6b5d9097df2a69a293a85c
Author: Inada Naoki <songofacandy@gmail.com>
Date:   Mon Mar 29 12:28:14 2021 +0900

    bpo-43510: Implement PEP 597 opt-in EncodingWarning. (GH-19481)
    
    See [PEP 597](https://www.python.org/dev/peps/pep-0597/).
    
    * Add `-X warn_default_encoding` and `PYTHONWARNDEFAULTENCODING`.
    * Add EncodingWarning
    * Add io.text_encoding()
    * open(), TextIOWrapper() emits EncodingWarning when encoding is omitted and warn_default_encoding is enabled.
    * _pyio.TextIOWrapper() uses UTF-8 as fallback default encoding used when failed to import locale module. (used during building Python)
    * bz2, configparser, gzip, lzma, pathlib, tempfile modules use io.text_encoding().
    * What's new entry
msg389747 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-03-29 20:40
> https://buildbot.python.org/all/#/builders/244/builds/931

test.pythoninfo:

config[filesystem_encoding]: 'utf-8'
config[filesystem_errors]: 'surrogateescape'
config[stdio_encoding]: 'utf-8'
config[stdio_errors]: 'strict'
config[use_environment]: 1
config[warn_default_encoding]: 0

locale.encoding: UTF-8

os.environ[LANG]: en_US.UTF-8

os.uname: posix.uname_result(sysname='Linux', nodename='ztcpip3.pok.ibm.com', release='3.10.0-1160.11.1.el7.s390x', version='#1 SMP Mon Nov 30 13:07:00 EST 2020', machine='s390x')

platform.libc_ver: glibc 2.17
platform.platform: Linux-3.10.0-1160.11.1.el7.s390x-s390x-with-glibc2.17

pre_config[coerce_c_locale]: 0
pre_config[coerce_c_locale_warn]: 0
pre_config[configure_locale]: 1
pre_config[isolated]: 0
pre_config[utf8_mode]: 0

sys.filesystem_encoding: utf-8/surrogateescape

sys.stderr.encoding: utf-8/backslashreplace
sys.stdin.encoding: utf-8/strict
sys.stdout.encoding: utf-8/strict

sys.version: 3.10.0a6+ (heads/master:9b99947, Mar 29 2021, 08:53:44) [GCC 4.8.5 20150623 (Red Hat 4.8.5-44)]

sysconfig[CONFIG_ARGS]: '--prefix' '/home/dje/cpython-buildarea/3.x.edelsohn-rhel-z.lto-pgo/build/target' '--with-lto' '--enable-optimizations'

sysconfig[PY_CFLAGS]: -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall
sysconfig[PY_CFLAGS_NODIST]: -flto -fuse-linker-plugin -ffat-lto-objects -flto-partition=none -g -std=c99 -Wextra -Wno-unused-result -Wno-unused-parameter -Wno-missing-field-initializers -Werror=implicit-function-declaration -fvisibility=hidden -fprofile-use -fprofile-correction -I./Include/internal

sysconfig[PY_CORE_LDFLAGS]: -flto -fuse-linker-plugin -ffat-lto-objects -flto-partition=none -g
sysconfig[PY_LDFLAGS_NODIST]: -flto -fuse-linker-plugin -ffat-lto-objects -flto-partition=none -g

sysconfig[Py_DEBUG]: 0
msg389751 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-03-29 20:56
> SyntaxError: encoding problem: encoding

This "encoding problem: %s" error message comes from check_coding_spec() of Parser/tokenizer.c. The "%s" argument is the cs variable which is initialized by get_coding_spec().

test_tools.test_reindent_file_with_bad_encoding() uses Lib/test/bad_coding.py which contains a single line:

# -*- coding: uft-8 -*-

The expected encoding name is "uft-8", not "encoding".
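
To make the expected value concrete, here is a minimal Python sketch of coding-cookie extraction. It is not the C code in Parser/tokenizer.c: it uses the regular expression given in PEP 263, and the helper names find_cookie() and is_known_codec() are hypothetical, introduced only for illustration.

```python
import codecs
import re

# Cookie pattern as documented in PEP 263 (first or second source line).
COOKIE_RE = re.compile(r"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def find_cookie(line):
    """Return the encoding name declared on a source line, or None."""
    match = COOKIE_RE.match(line)
    return match.group(1) if match else None


def is_known_codec(name):
    """Report whether Python's codec registry recognises the name."""
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True


# The single line of Lib/test/bad_coding.py quoted above.
name = find_cookie("# -*- coding: uft-8 -*-")
print(name, is_known_codec(name))
```

Under these assumptions the extracted name is the misspelled "uft-8", which the codec registry rejects. A reported name of "encoding" therefore suggests the tokenizer read something other than this line.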
msg389752 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-03-29 21:02
test_tools.test_reindent_file_with_bad_encoding() runs Tools/scripts/reindent.py.

The check() function of this script calls:

    with open(file, 'rb') as f:
        try:
            encoding, _ = tokenize.detect_encoding(f.readline)
        except SyntaxError as se:
            errprint("%s: SyntaxError: %s" % (file, str(se)))

But I don't think that the buildbot reached this line since the stderr message doesn't start with the input filename. For example, locally, I get the expected error:

$ ./python Tools/scripts/reindent.py -r Lib/test/bad_coding.py; echo $?
Lib/test/bad_coding.py: SyntaxError: unknown encoding for 'Lib/test/bad_coding.py': uft-8
0
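
The expected path can be reproduced outside the CPython tree with a small sketch. It assumes a temporary file standing in for Lib/test/bad_coding.py, and the check() function below is a simplified, hypothetical version of the one in reindent.py that returns the message instead of printing it.

```python
import os
import tempfile
import tokenize


def check(path):
    """Return an error string if the coding cookie is invalid, else None."""
    with open(path, "rb") as f:
        try:
            tokenize.detect_encoding(f.readline)
        except SyntaxError as se:
            # Same message layout as the errprint() call quoted above.
            return "%s: SyntaxError: %s" % (path, se)
    return None


# Hypothetical stand-in for Lib/test/bad_coding.py.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "bad_coding.py")
    with open(path, "wb") as f:
        f.write(b"# -*- coding: uft-8 -*-\n")
    message = check(path)

print(message)
```

The message produced this way starts with the file name, unlike the bare "SyntaxError: encoding problem: encoding" seen on the buildbot. That is consistent with the error being raised before the script's own check runs.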
msg389753 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-03-29 21:07
Oh. The failure is random:

* 934 green
* 933 red: test_reindent_file_with_bad_encoding failed
* 932 green
* 931 red: test_reindent_file_with_bad_encoding failed
* 930 green
* 929 red: test_reindent_file_with_bad_encoding failed
* 928 green
* (... older builds are all green ...)
* 775 orange
* 774 green
* (... more green builds ...)

This buildbot uses PGO+LTO optimization on RHEL7 with "gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)". Could it be a compiler issue? Are other buildbots affected?
msg389756 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-03-29 21:17
We have 4 buildbot workers running RHEL7 and using LTO+PGO optimizations: aarch64, amd64, ppc64le, s390x.

I saw random failures on amd64 and s390x. amd64 failed builds:

* 910: test_reindent_file_with_bad_encoding() failed
* 911: test_reindent_file_with_bad_encoding() failed
* 914: test_reindent_file_with_bad_encoding() failed
msg389757 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-03-29 21:20
AMD64 RHEL7 LTO 3.x: builds 896 and 900 failed with test_reindent_file_with_bad_encoding(). This worker uses only LTO, not PGO.
msg389758 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-03-29 21:24
s390x RHEL7 LTO 3.x: builds 921, 924 and 925 failed with test_reindent_file_with_bad_encoding().
msg389759 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-03-29 21:43
vstinner@python-builder-rhel7$ echo|PYTHONMALLOC=malloc valgrind ./python Tools/scripts/reindent.py 
==26374== Memcheck, a memory error detector
==26374== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==26374== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==26374== Command: ./python Tools/scripts/reindent.py
==26374== 
==26374== Conditional jump or move depends on uninitialised value(s)
==26374==    at 0x4C305ED: __memcmp_sse4_1 (vg_replace_strmem.c:1112)
==26374==    by 0x5D0BCF: get_coding_spec.100964 (tokenizer.c:165)
==26374==    by 0x5D4C1B: check_coding_spec.part.6.100980 (tokenizer.c:214)
==26374==    by 0x5D213A: check_coding_spec (tokenizer.c:212)
==26374==    by 0x5D213A: tok_underflow_file.101007 (tokenizer.c:966)
==26374==    by 0x5D248D: tok_nextc.101010 (tokenizer.c:1031)
==26374==    by 0x5D2C80: tok_get.101023 (tokenizer.c:1213)
==26374==    by 0x5D4632: PyTokenizer_Get (tokenizer.c:1872)
==26374==    by 0x648E4C: _PyPegen_fill_token (pegen.c:633)
==26374==    by 0x6494D9: _PyPegen_expect_token (pegen.c:832)
==26374==    by 0x667497: _tmp_15_rule.137241 (parser.c:19552)
==26374==    by 0x649488: _PyPegen_lookahead (pegen.c:823)
==26374==    by 0x64EC04: compound_stmt_rule.138437 (parser.c:2008)
==26374== 
==26374== Conditional jump or move depends on uninitialised value(s)
==26374==    at 0x5D0BD2: get_coding_spec.100964 (tokenizer.c:165)
==26374==    by 0x5D4C1B: check_coding_spec.part.6.100980 (tokenizer.c:214)
==26374==    by 0x5D213A: check_coding_spec (tokenizer.c:212)
==26374==    by 0x5D213A: tok_underflow_file.101007 (tokenizer.c:966)
==26374==    by 0x5D248D: tok_nextc.101010 (tokenizer.c:1031)
==26374==    by 0x5D2C80: tok_get.101023 (tokenizer.c:1213)
==26374==    by 0x5D4632: PyTokenizer_Get (tokenizer.c:1872)
==26374==    by 0x648E4C: _PyPegen_fill_token (pegen.c:633)
==26374==    by 0x6494D9: _PyPegen_expect_token (pegen.c:832)
==26374==    by 0x667497: _tmp_15_rule.137241 (parser.c:19552)
==26374==    by 0x649488: _PyPegen_lookahead (pegen.c:823)
==26374==    by 0x64EC04: compound_stmt_rule.138437 (parser.c:2008)
==26374==    by 0x64DE4A: statement_rule.138374 (parser.c:1365)
==26374== 
==26374== 
==26374== HEAP SUMMARY:
==26374==     in use at exit: 406,507 bytes in 4,293 blocks
==26374==   total heap usage: 63,558 allocs, 59,265 frees, 9,156,496 bytes allocated
==26374== 
==26374== LEAK SUMMARY:
==26374==    definitely lost: 0 bytes in 0 blocks
==26374==    indirectly lost: 0 bytes in 0 blocks
==26374==      possibly lost: 390,522 bytes in 4,213 blocks
==26374==    still reachable: 15,985 bytes in 80 blocks
==26374==         suppressed: 0 bytes in 0 blocks
==26374== Rerun with --leak-check=full to see details of leaked memory
==26374== 
==26374== Use --track-origins=yes to see where uninitialised values come from
==26374== For lists of detected and suppressed errors, rerun with: -s
==26374== ERROR SUMMARY: 16322 errors from 2 contexts (suppressed: 0 from 0)
msg389763 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-03-29 22:25
It's a buffer overflow, or at least a crash related to uninitialized bytes. See:
https://github.com/python/cpython/pull/25080#issuecomment-809752737
msg389793 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-03-30 07:12
This issue should be fixed by:

commit 92a02c1f7e2dcdc62913a4236589e7e5d96172b9
Author: Pablo Galindo <Pablogsal@gmail.com>
Date:   Tue Mar 30 00:24:49 2021 +0100

    Fix tokenizer error when raw decoding null bytes (GH-25080)

The fix uses strlen() instead of "tok->end - tok->cur" to compute the line length.
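
The difference between the two length computations can be modelled in Python. This is only an illustrative sketch, not the tokenizer code: it assumes "end" marks the end of the allocated buffer, and a bytearray pre-filled with "?" stands in for uninitialized memory. Both function names are hypothetical.

```python
def line_length_by_pointers(cur, end):
    """Model of 'tok->end - tok->cur': trusts the buffer bounds."""
    return end - cur


def line_length_by_strlen(buf, cur):
    """Model of strlen(): stops at the first NUL byte after cur."""
    return buf.index(b"\0", cur) - cur


# A 32-byte buffer whose stale "?" bytes stand in for uninitialized memory.
buf = bytearray(b"?" * 32)
line = b"x = 1\n"
buf[:len(line) + 1] = line + b"\0"  # the line just read, NUL-terminated

cur, end = 0, len(buf)
print(line_length_by_pointers(cur, end))  # 32: includes the stale bytes
print(line_length_by_strlen(buf, cur))    # 6: only the line itself
```

In this model, a scan bounded by the pointer difference would read past the terminator into stale bytes. That would match the Valgrind reports of uninitialised values in get_coding_spec(), and may explain why the failure depended on what happened to be in memory.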

> https://buildbot.python.org/all/#/builders/244/builds/931

The latest 6 builds were successful, so I'm closing the issue.
History
Date                 User      Action  Args
2022-04-11 14:59:43  admin     set     github: 87828
2021-03-30 07:12:48  vstinner  set     status: open -> closed; resolution: fixed; messages: + msg389793; stage: resolved
2021-03-29 22:25:43  vstinner  set     messages: + msg389763
2021-03-29 21:43:18  vstinner  set     messages: + msg389759
2021-03-29 21:24:42  vstinner  set     messages: + msg389758; title: test_tools: test_reindent_file_with_bad_encoding() fails RHEL7 with LTO -> test_tools: test_reindent_file_with_bad_encoding() fails RHEL7 on x86-64 and s390x with GCC 4.8.5 and LTO
2021-03-29 21:20:23  vstinner  set     messages: + msg389757; title: test_tools: test_reindent_file_with_bad_encoding() fails on s390x RHEL7 LTO + PGO 3.x -> test_tools: test_reindent_file_with_bad_encoding() fails RHEL7 with LTO
2021-03-29 21:17:00  vstinner  set     messages: + msg389756
2021-03-29 21:07:27  vstinner  set     messages: + msg389753
2021-03-29 21:02:59  vstinner  set     messages: + msg389752
2021-03-29 20:56:29  vstinner  set     messages: + msg389751
2021-03-29 20:40:44  vstinner  set     messages: + msg389747
2021-03-29 20:37:08  vstinner  set     nosy: + methane; messages: + msg389742
2021-03-29 20:32:53  vstinner  create