Title: Pathlib does not support a Cyrillic character 'й'
Type: behavior Stage: resolved
Components: macOS, Unicode Versions: Python 3.8
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: ezio.melotti, hidr0.frbg, ned.deily, ronaldoussoren, steven.daprano
Priority: normal Keywords:

Created on 2020-12-10 11:32 by hidr0.frbg, last changed 2020-12-14 12:55 by ronaldoussoren. This issue is now closed.

Files: uploaded by hidr0.frbg, 2020-12-10 12:18
Messages (7)
msg382823 - (view) Author: Mihail Kirilov (hidr0.frbg) Date: 2020-12-10 11:32
I have a file with a Cyrillic name, "Файл на български", and when I load it with pathlib.Path and access .name, the result behaves differently from what I expect:

(Pdb) pathlib.Path("/tmp/pytest-of-root/pytest-15/test_bulgarian_name0/data/encoding/Файл на български.ldr").name
'Файл на български.ldr'
(Pdb) pathlib.Path("/tmp/pytest-of-root/pytest-15/test_bulgarian_name0/data/encoding/Файл на български.ldr").name[2]
(Pdb) pathlib.Path("/tmp/pytest-of-root/pytest-15/test_bulgarian_name0/data/encoding/Файл на български.ldr").name == "Файл на български"
msg382824 - (view) Author: Ronald Oussoren (ronaldoussoren) * (Python committer) Date: 2020-12-10 12:01
What platform are you using?
msg382826 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2020-12-10 12:07
You are comparing the name with the file extension against the name without the file extension:

>>> "Файл на български.ldr" == "Файл на български"
False
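
A minimal sketch of the difference between the name with and without the extension. The shortened directory `/tmp/data` is a hypothetical stand-in for the path in the report, and `PurePosixPath` is used so that nothing touches the filesystem:

```python
from pathlib import PurePosixPath

# The expected name without an extension, as used in the report.
expected = "Файл на български"

# Hypothetical, shortened directory; only the final component matters here.
p = PurePosixPath("/tmp/data") / (expected + ".ldr")

name_with_ext = p.name      # final path component, suffix included
name_without_ext = p.stem   # final path component with the last suffix removed
suffix = p.suffix           # the last suffix, including the leading dot

print(name_with_ext == expected)     # the extension makes these differ
print(name_without_ext == expected)  # .stem is the value to compare against
```

Comparing against `.stem` only removes the extension mismatch; it does not address any normalization difference between the two strings.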
msg382827 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2020-12-10 12:15
In addition, you are probably hitting normalization issues. There are two ways to get the Cyrillic character 'й' in your string, one of them is a single code point, the other is two code points:

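>>> import unicodedata
>>> a = '\u0439'        # 'й' as a single precomposed code point
>>> b = '\u0438\u0306'  # 'й' as a base letter plus a combining mark
>>> len(a), unicodedata.name(a)
(1, 'CYRILLIC SMALL LETTER SHORT I')
>>> len(b), unicodedata.name(b[0]), unicodedata.name(b[1])
(2, 'CYRILLIC SMALL LETTER I', 'COMBINING BREVE')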
msg382828 - (view) Author: Mihail Kirilov (hidr0.frbg) Date: 2020-12-10 12:18
I am uploading an archive with:

1 - mac.png
Using a Mac, I cannot generate the other 'й', but I can load the file; it exists, but .name is wrong.

2 - linux.png
On Linux, the exact same steps report that the file does not exist.

3 - The file itself.

It is very tricky to reproduce the problem on the Mac. I can hop on a call with you to show you exactly what I do.
msg382906 - (view) Author: Ronald Oussoren (ronaldoussoren) * (Python committer) Date: 2020-12-12 10:14
What filesystem is used on macOS? If it is HFS+, you're likely running into Unicode normalisation in the filesystem.

That is, 'й' can be represented as a single unicode codepoint (and likely is in your script), but in the NFD normalisation used by HFS+ the same character is represented using two codepoints (one of which is a combining character). Python string comparison compares code points and is not normalisation aware.
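
A minimal sketch of a normalisation-aware comparison, assuming both strings should be treated as equal when they differ only in composed versus decomposed form. The helper name `same_text` is hypothetical; `unicodedata.normalize` is the standard library call doing the work:

```python
import unicodedata


def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after converting both to the same normal form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


composed = "\u0439"          # CYRILLIC SMALL LETTER SHORT I
decomposed = "\u0438\u0306"  # CYRILLIC SMALL LETTER I + COMBINING BREVE

print(composed == decomposed)           # False: plain comparison sees different code points
print(same_text(composed, decomposed))  # True: both normalise to the same string
```

The same approach applies to a name obtained from a path: normalise both the path component and the expected string to one form before comparing them.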

For APFS (used by default in recent macOS versions) the situation is more complicated, according to what I've found on Google. However, APFS doesn't seem to normalise names (I created a file named 'й' and os.listdir() returns a name with a single codepoint).
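
One way to run this kind of check on a given directory is sketched below. The helper name `describe_names` is hypothetical, and `unicodedata.is_normalized` requires Python 3.8 or later; the results depend on the filesystem and on how each file was created:

```python
import os
import unicodedata


def describe_names(directory):
    """Return (name, is_nfc, is_nfd) for each entry in directory, sorted by name."""
    report = []
    for name in sorted(os.listdir(directory)):
        report.append((
            name,
            unicodedata.is_normalized("NFC", name),
            unicodedata.is_normalized("NFD", name),
        ))
    return report


if __name__ == "__main__":
    # ascii() makes combining characters visible as escape sequences.
    for name, is_nfc, is_nfd in describe_names("."):
        print(ascii(name), "NFC" if is_nfc else "", "NFD" if is_nfd else "")
```

A name that is NFD but not NFC contains decomposed characters, which is what a normalising filesystem such as HFS+ would be expected to return.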
msg382984 - (view) Author: Ronald Oussoren (ronaldoussoren) * (Python committer) Date: 2020-12-14 12:55
I'm closing this as "not a bug" because this is likely caused by different unicode normalisations for strings.
Date                 User            Action  Args
2020-12-14 12:55:48  ronaldoussoren  set     status: open -> closed
                                             type: crash -> behavior
                                             messages: + msg382984
                                             resolution: not a bug
                                             stage: resolved
2020-12-14 09:16:55  vstinner        set     nosy: - vstinner
2020-12-12 10:15:03  ronaldoussoren  set     nosy: + ned.deily
                                             components: + macOS
2020-12-12 10:14:46  ronaldoussoren  set     messages: + msg382906
2020-12-10 12:18:02  hidr0.frbg      set     files: +
                                             messages: + msg382828
2020-12-10 12:15:13  steven.daprano  set     messages: + msg382827
2020-12-10 12:07:14  steven.daprano  set     nosy: + steven.daprano
                                             messages: + msg382826
2020-12-10 12:01:37  ronaldoussoren  set     nosy: + ronaldoussoren
                                             messages: + msg382824
2020-12-10 11:32:57  hidr0.frbg      create