classification
Title: Multiple test failures with OSError: [Errno 84] Invalid or incomplete multibyte or wide character on ZFS with utf8only=on
Type: behavior
Stage: test needed
Components: Tests, Unicode
Versions: Python 3.9, Python 3.8, Python 3.7
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: benjamin.peterson, dimitern, ezio.melotti, serhiy.storchaka, vstinner
Priority: normal
Keywords:

Created on 2019-07-13 10:13 by dimitern, last changed 2019-07-16 08:46 by vstinner.

Files
File name Uploaded Description
cpython_test_output.log dimitern, 2019-07-13 10:13 Tests output
Messages (4)
msg347794 - Author: Dimiter Naydenov (dimitern) Date: 2019-07-13 10:13
I'm running Ubuntu 19.04 on a ZFS mirrored pool, where my home partition is configured with the 'utf8only=on' attribute. I cloned cpython and ran the tests as described in devguide.python.org, and I get 11 test failures:

== Tests result: FAILURE ==

389 tests OK.

11 tests failed:
    test_cmd_line_script test_httpservers test_imp test_import
    test_ntpath test_os test_posixpath test_socket test_unicode_file
    test_unicode_file_functions test_zipimport

I've been looking for similar or matching reported issues, but could not find one. I'm on the EuroPython 2019 CPython sprint and we'll be looking into this with the help of some of the core devs.
msg347801 - Author: Dimiter Naydenov (dimitern) Date: 2019-07-13 10:35
Here's some additional information I found for that specific attribute:

From the documentation at
http://dlc.sun.com/osol/docs/content/ZFSADMIN/gazss.html
(link is dead, but here's where I found the section below: https://zfs-discuss.opensolaris.narkive.com/3NqQVG0H/utf8only-and-normalization-properties#post1)

utf8only
Boolean
Off
This property indicates whether a file system should reject file names
that include characters that are not present in the UTF-8 character code
set. If this property is explicitly set to off, the normalization
property must either not be explicitly set or be set to none. The
default value for the utf8only property is off. This property cannot be
changed after the file system is created.
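
A minimal sketch of how that rejection might be observed from Python, assuming a POSIX system where the filesystem reports the refusal as EILSEQ (errno 84 on Linux, matching the error in the title). The helper name rejects_non_utf8_names is hypothetical, and the temporary directory may live on a different filesystem than the one under test, so pass a directory on the dataset you care about:

```python
import errno
import os
import tempfile


def rejects_non_utf8_names(directory):
    """Return True if creating a file whose name is not valid UTF-8 fails
    with EILSEQ in `directory`, False if the file gets created."""
    # The byte 0xFF never occurs in well-formed UTF-8.
    bad_name = os.path.join(os.fsencode(directory), b"probe_\xff")
    try:
        fd = os.open(bad_name, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
    except OSError as exc:
        if exc.errno == errno.EILSEQ:
            return True
        raise  # Any other error is unrelated to this probe.
    os.close(fd)
    os.unlink(bad_name)
    return False


with tempfile.TemporaryDirectory() as tmp:
    print(rejects_non_utf8_names(tmp))
```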
msg347998 - Author: Ezio Melotti (ezio.melotti) (Python committer) Date: 2019-07-16 01:05
I think Dimiter was able to fix most of the failures, except test_unicode_file_functions.
Yesterday during the sprints we were looking at it, and we did some tests using the following snippet:

import os
import unicodedata

upsilon_diaeresis_and_hook = "ϔ"

for form in ["NFC", "NFD", "NFKC", "NFKD"]:
    unicode_filename = unicodedata.normalize(form, upsilon_diaeresis_and_hook)
    # Create one file per normalization form; its content records the form used.
    with open(unicode_filename, "w") as f:
        f.write(form)
    print("N:", ascii(unicode_filename))
    # List the directory after each creation to see which names actually exist.
    print([ascii(filename) for filename in os.listdir('.')])

On ext4 this creates 4 different files: ['\u03d4', '\u03d2\u0308', '\u03ab', '\u03a5\u0308']
On ZFS with utf8only=true (and I believe normalization=formD), only 2 files are created but each of the 4 filenames can be used to access either of the 2 files.
This is also the default behavior on Mac.
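
To illustrate why 4 names can collapse to 2, here is a sketch that needs no filesystem access. It assumes the filesystem compares names after canonical decomposition (NFD), which is only a guess about the dataset's normalization setting:

```python
import unicodedata

FORMS = ["NFC", "NFD", "NFKC", "NFKD"]
upsilon_diaeresis_and_hook = "\u03d4"

# The four candidate file names, one per normalization form.
names = [unicodedata.normalize(form, upsilon_diaeresis_and_hook) for form in FORMS]
for form, name in zip(FORMS, names):
    print(form, ascii(name))

# Assumption: the filesystem treats two names as equal when their NFD forms match.
canonical = {unicodedata.normalize("NFD", name) for name in names}
print(len(set(names)), "distinct names,", len(canonical), "distinct after NFD")
```

Under that assumption the NFC and NFD names map to one file and the NFKC and NFKD names map to the other, because the compatibility forms replace the hook symbol with a plain capital upsilon.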

The test is already skipped on darwin (Lib/test/test_unicode_file_functions.py:120), and should be skipped for ZFS too (might depend on the exact flags used), however we weren't able to find a portable way to determine the filesystem and flags.

An alternative is to try creating the 4 files and skip the test if only 2 get created and all 4 names can be used to open these two files; however, this might mask other failures.  Unless someone can come up with a better way to do this, I think this is the only option.

In addition, different filesystems that don't exhibit this behavior can be used on Mac, so the test shouldn't be skipped in those cases.
msg348006 - Author: STINNER Victor (vstinner) (Python committer) Date: 2019-07-16 08:46
"""
On ext4 this creates 4 different files: ['\u03d4', '\u03d2\u0308', '\u03ab', '\u03a5\u0308']
On ZFS with utf8only=true (and I believe normalization=formD), only 2 files are created but each of the 4 filenames can be used to access either of the 2 files.
This is also the default behavior on Mac.

The test is already skipped on darwin (Lib/test/test_unicode_file_functions.py:120), and should be skipped for ZFS too (might depend on the exact flags used), however we weren't able to find a portable way to determine the filesystem and flags.
"""

I suggest creating a temporary directory, creating the 4 files, and checking how many files os.listdir() returns. If you get 4, the FS doesn't normalize anything. If you get fewer, it's likely that the FS normalizes names.
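
A rough sketch of that probe, not a proposed patch. The helper name fs_normalizes_filenames and the placeholder test class are hypothetical, and the sketch assumes the filesystem encoding can represent the Greek characters (true for UTF-8 locales):

```python
import os
import tempfile
import unicodedata
import unittest

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def fs_normalizes_filenames(parent=None):
    """Return True if fewer files than names show up after creating one
    file per normalization form of U+03D4 in a fresh temporary directory.

    `parent` selects the filesystem to probe; None uses the default
    temporary directory, which may be on a different filesystem."""
    names = {unicodedata.normalize(form, "\u03d4") for form in FORMS}
    with tempfile.TemporaryDirectory(dir=parent) as tmp:
        for name in names:
            with open(os.path.join(tmp, name), "w", encoding="utf-8") as f:
                f.write(name)
        # 4 entries: names are stored as given. Fewer: names were merged.
        return len(os.listdir(tmp)) < len(names)


# Hypothetical usage: skip normalization-sensitive tests when names are merged.
@unittest.skipIf(fs_normalizes_filenames(), "filesystem normalizes file names")
class PlaceholderNormalizationTests(unittest.TestCase):
    def test_placeholder(self):
        self.assertTrue(True)
```

Probing the directory the tests actually write to (by passing it as parent) matters, since the default temporary directory and the test working directory can sit on different filesystems.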
History
Date User Action Args
2019-07-16 08:46:21 vstinner set messages: + msg348006
2019-07-16 01:05:20 ezio.melotti set nosy: + serhiy.storchaka; messages: + msg347998; stage: test needed
2019-07-13 10:35:22 dimitern set messages: + msg347801
2019-07-13 10:13:57 dimitern create