
classification
Title: MD5SumTests.test_checksum_fodder fails on Windows
Type: behavior
Components: Tests, Unicode
Versions: Python 3.11, Python 3.10, Python 3.9
process
Status: open
Nosy List: andrei.avk, ezio.melotti, serhiy.storchaka, sobolevn, vstinner
Priority: normal

Created on 2021-08-30 18:22 by sobolevn, last changed 2022-04-11 14:59 by admin.

Messages (8)
msg400648 - Author: Nikita Sobolev (sobolevn) * (Python triager) Date: 2021-08-30 18:22
While working on https://github.com/python/cpython/pull/28060 we've noticed that `test.test_tools.test_md5sum.MD5SumTests.test_checksum_fodder` fails on Windows:

```
======================================================================
FAIL: test_checksum_fodder (test.test_tools.test_md5sum.MD5SumTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "D:\a\cpython\cpython\lib\test\test_tools\test_md5sum.py", line 41, in test_checksum_fodder
    self.assertIn(part.encode(), out)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: b'@test_1772_tmp\xc3\xa6' not found in b'd38dae2eb1ab346a292ef6850f9e1a0d @test_1772_tmp\xe6\\md5sum.fodder\r\n'
```

For now it is ignored.

Related issue: https://bugs.python.org/issue45042
msg400649 - Author: Nikita Sobolev (sobolevn) * (Python triager) Date: 2021-08-30 18:23
I would love to work on this issue :)
msg400656 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2021-08-30 18:49
The test is failing because TESTFN now contains non-ASCII characters.

The path is written to stdout using the default stdout encoding on Windows (such as cp1252), but the test searches for the path encoded as UTF-8. The test should also fail on other platforms with a non-UTF-8 locale.
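
A minimal sketch of that mismatch, assuming cp1252 as the stdout encoding. The `script_output` helper is hypothetical and only mimics the line format seen in the assertion message; it is not the code from `md5sum.py`.

```python
# File name taken from the assertion message in the first report; it ends with 'æ'.
name = "@test_1772_tmp\u00e6"


def script_output(path: str, stdout_encoding: str) -> bytes:
    """Hypothetical helper: bytes an md5sum-like script would emit."""
    line = "d38dae2eb1ab346a292ef6850f9e1a0d %s\\md5sum.fodder\r\n" % path
    return line.encode(stdout_encoding)


# Assumption: stdout uses cp1252, as is common on Windows.
out = script_output(name, "cp1252")

# The test searches for the UTF-8 form, where 'æ' takes two bytes ...
needle_utf8 = name.encode()            # b'@test_1772_tmp\xc3\xa6'
# ... but cp1252 encodes 'æ' as the single byte 0xE6.
needle_cp1252 = name.encode("cp1252")  # b'@test_1772_tmp\xe6'

print(needle_utf8 in out)    # False
print(needle_cp1252 in out)  # True
```

The two byte strings differ only in how `æ` is encoded, which is enough to make `assertIn` fail.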

The simplest way to "fix" the test is using TESTFN_ASCII instead of TESTFN.

But there is also an issue in the script itself. It fails or produces mojibake when the filesystem encoding and the stdout encoding do not match. Other scripts that output file names have similar issues.
msg400785 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-08-31 21:52
> But there is also an issue in the script itself. It fails or produces mojibake when the filesystem encoding and the stdout encoding do not match.

I don't know Tools/scripts/md5sum.py. Can you show an example which currently fails?
msg400812 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2021-09-01 06:28
$ touch тест
$ ./python Tools/scripts/md5sum.py тест
d41d8cd98f00b204e9800998ecf8427e тест
$ LC_ALL=uk_UA.koi8u PYTHONIOENCODING=koi8-u ./python Tools/scripts/md5sum.py тест
d41d8cd98f00b204e9800998ecf8427e тест
$ LC_ALL=uk_UA.koi8u PYTHONIOENCODING=utf-8 ./python Tools/scripts/md5sum.py тест
d41d8cd98f00b204e9800998ecf8427e я┌п╣я│я┌
$ PYTHONIOENCODING=koi8-u ./python Tools/scripts/md5sum.py тест
d41d8cd98f00b204e9800998ecf8427e ����
$ PYTHONIOENCODING=latin-1 ./python Tools/scripts/md5sum.py тест
Traceback (most recent call last):
  File "/home/serhiy/py/cpython/Tools/scripts/md5sum.py", line 93, in <module>
    sys.exit(main(sys.argv[1:], sys.stdout))
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/serhiy/py/cpython/Tools/scripts/md5sum.py", line 90, in main
    return sum(args, out)
           ^^^^^^^^^^^^^^
  File "/home/serhiy/py/cpython/Tools/scripts/md5sum.py", line 39, in sum
    sts = printsum(f, out) or sts
          ^^^^^^^^^^^^^^^^
  File "/home/serhiy/py/cpython/Tools/scripts/md5sum.py", line 53, in printsum
    sts = printsumfp(fp, filename, out)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/serhiy/py/cpython/Tools/scripts/md5sum.py", line 69, in printsumfp
    out.write('%s %s\n' % (m.hexdigest(), filename))
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 33-36: ordinal not in range(256)
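
The three bad outcomes in the session above can be approximated in-process. This is only a sketch: the `as_seen` helper is hypothetical, and it models each case as "bytes written with one codec, read with another", which is a simplification of what the interpreter and terminal actually do.

```python
def as_seen(text: str, written_as: str, read_as: str) -> str:
    """Hypothetical helper: encode with one codec, decode with another."""
    return text.encode(written_as).decode(read_as, errors="replace")


name = "тест"

# UTF-8 bytes interpreted with a KOI8-U filesystem encoding: mojibake.
garbled = as_seen(name, "utf-8", "koi8-u")
assert garbled == "я┌п╣я│я┌"

# KOI8-U bytes shown by a UTF-8 terminal: only replacement characters.
invalid = as_seen(name, "koi8-u", "utf-8")
assert set(invalid) == {"\ufffd"}

# Latin-1 has no Cyrillic letters, so encoding fails outright.
error = None
try:
    name.encode("latin-1")
except UnicodeEncodeError as exc:
    error = exc
print(type(error).__name__)  # UnicodeEncodeError
```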
msg401016 - Author: Nikita Sobolev (sobolevn) * (Python triager) Date: 2021-09-03 19:51
Yes, it was an encoding problem :)

This line solved it (here: https://github.com/python/cpython/blob/6f8bc464e006f672d1aeafbfd7c774a40215dab2/Tools/scripts/md5sum.py#L69):

```python
out.write('%s %s\n' % (m.hexdigest(), filename.encode(
        sys.getfilesystemencoding(),
    ).decode(sys.stdout.encoding)))
```

> The simplest way to "fix" the test is using TESTFN_ASCII instead of TESTFN.

I haven't changed this, because right now it should work for non-ASCII symbols as well. I can even add an explicit ASCII test if needed.

Shouldn't https://github.com/python/cpython/pull/28060 be merged before I submit a new PR, so we can be sure that the test now works? In the current state it will just be ignored.
msg401038 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2021-09-04 07:08
It will not work in all cases, for example if the stdio encoding is UTF-8 and the filesystem encoding is Latin1, or if the stdio encoding is CP1251 and the filesystem encoding is UTF-8. I am also not sure that it gives us the result we want even when it doesn't fail.
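
A rough sketch of those two cases. The `reencode` function is a hypothetical stand-in for the encode/decode transformation proposed earlier in this thread, with both encodings passed explicitly instead of being read from `sys`.

```python
def reencode(filename: str, fs_encoding: str, stdout_encoding: str) -> str:
    """Hypothetical stand-in for the proposed encode/decode round trip."""
    return filename.encode(fs_encoding).decode(stdout_encoding)


# Case 1: filesystem encoding Latin1, stdio encoding UTF-8.
# The lone byte 0xE6 is not valid UTF-8, so decoding raises.
case1_error = None
try:
    reencode("\u00e6", "latin-1", "utf-8")
except UnicodeDecodeError as exc:
    case1_error = exc
print(type(case1_error).__name__)  # UnicodeDecodeError

# Case 2: filesystem encoding UTF-8, stdio encoding CP1251.
# Nothing fails, but the result is no longer the original name.
case2 = reencode("тест", "utf-8", "cp1251")
print(case2 == "тест")  # False

# Writing case2 to a CP1251 stdout emits the original UTF-8 bytes,
# which a CP1251 terminal would then display as mojibake.
print(case2.encode("cp1251") == "тест".encode("utf-8"))  # True
```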

It is a general and complex issue, and every program which writes file names to stdout is affected.

For now I suggest just using TESTFN_ASCII instead of TESTFN. We will find a better solution in the future. I hesitate about merging PR 28060 because it can also fail on some non-Windows buildbots with uncommon locale settings.
msg406129 - Author: Andrei Kulakov (andrei.avk) * (Python triager) Date: 2021-11-10 19:51
This was fixed in https://github.com/python/cpython/commit/dd7b816ac87. Perhaps this should be closed as fixed?

It sounds like the general solution is beyond the scope of this issue and doesn't need to be tracked here.
History
| Date | User | Action | Args |
|---|---|---|---|
| 2022-04-11 14:59:49 | admin | set | github: 89216 |
| 2021-11-10 19:51:57 | andrei.avk | set | nosy: + andrei.avk; messages: + msg406129 |
| 2021-09-04 07:08:21 | serhiy.storchaka | set | messages: + msg401038 |
| 2021-09-03 19:51:55 | sobolevn | set | messages: + msg401016 |
| 2021-09-01 06:28:25 | serhiy.storchaka | set | messages: + msg400812 |
| 2021-08-31 21:52:56 | vstinner | set | messages: + msg400785 |
| 2021-08-30 18:49:54 | serhiy.storchaka | set | nosy: + vstinner, serhiy.storchaka, ezio.melotti; messages: + msg400656; components: + Unicode |
| 2021-08-30 18:23:16 | sobolevn | set | messages: + msg400649 |
| 2021-08-30 18:22:59 | sobolevn | create | |