Mac OS X: Decompose filenames on encode, and precompose filenames on decode #54418
PyUnicode_EncodeFSDefault() and os.fsencode() should decompose the filename (NFD) before encoding it to UTF-8. PyUnicode_DecodeFSDefault(AndSize)() and os.fsdecode() should precompose the filename (NFC) after decoding it from UTF-8. The Qt library does this on Mac: see the locale_encode()/locale_decode() (filename encoder/decoder) functions in src/corelib/io/qfile.cpp. It should fix some issues of test_pep277 on Mac OS X (see bpo-8423). I'm not completely sure that we should do that :-) (I used the nosy list from issues bpo-4388 and bpo-8423). -- Technical Q&A QA1173, Text Encodings in VFS: Q: I'm writing a file system (VFS) plug-in for Mac OS X. How do I handle text encodings correctly? |
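A minimal Python sketch of the behaviour proposed above, for illustration only. The names proposed_fsencode and proposed_fsdecode are hypothetical, not part of the os module, and the sketch assumes a plain UTF-8 filesystem encoding (the real functions also apply an error handler, which is left out here).

```python
import unicodedata

def proposed_fsencode(name: str) -> bytes:
    """Hypothetical helper: decompose the name (NFD), then encode it to UTF-8."""
    return unicodedata.normalize("NFD", name).encode("utf-8")

def proposed_fsdecode(raw: bytes) -> str:
    """Hypothetical helper: decode from UTF-8, then precompose the name (NFC)."""
    return unicodedata.normalize("NFC", raw.decode("utf-8"))

# U+00E9 (é) is decomposed to 'e' + U+0301 before encoding.
encoded = proposed_fsencode("\u00e9")   # b'e\xcc\x81'
# Decoding recomposes the pair into the single code point U+00E9.
decoded = proposed_fsdecode(encoded)    # '\u00e9'
```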
Patch for os.fsencode/fsdecode importing unicodedata in the function (instead of a global import). The unicodedata module is not builtin and is dynamically loaded. We should maybe ignore ImportError if the module is not available? With a warning? For PyUnicode_EncodeFSDefault() and PyUnicode_DecodeFSDefault(AndSize)() (C implementation), we can maybe use a hook (e.g. implemented as a configurable callback) and set the hook after loading the unicodedata module. It would be easier if unicodedata were a builtin module :-) |
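The import-inside-the-function idea with an ImportError fallback could look roughly like this. It is a sketch, not the attached patch: normalize_if_available is a hypothetical name, and the choice of RuntimeWarning is an assumption.

```python
import warnings

def normalize_if_available(form: str, text: str) -> str:
    """Hypothetical helper: normalize text, or return it unchanged
    (with a warning) when unicodedata cannot be imported."""
    try:
        # Imported lazily because unicodedata is a dynamically loaded module.
        import unicodedata
    except ImportError:
        warnings.warn(
            "unicodedata is unavailable; filename left unnormalized",
            RuntimeWarning,
        )
        return text
    return unicodedata.normalize(form, text)
```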
I'd like to see this patch reverted. I don't think it is useful.
To give an analogy: if we have a case-insensitive file system, we don't normalize into lower-case, either, do we? |
I created a specific branch to test the patch (I also patched PyUnicode_EncodeFSDefault() and PyUnicode_DecodeFSDefaultAndSize()): bpo-10209. test_pep277 now passes in this branch!
Yes, but not exactly... Mac OS X NFD normalization is a little bit different from Python's normalization: see msg105669. I don't understand why test_pep277 passes on the bpo-10209 branch, but it works. I suppose that normalizing the filename to NFD in Python avoids some Mac OS X normalization bugs?
I propose to normalize to NFC because Qt does that. On Linux, the keyboard uses NFC. E.g. pressing the é key writes U+00e9, not U+0065 U+0301. If you ask the user to write a filename, the filename will be stored in the same normalization form. So indirectly, Linux stores filenames as NFC. Which form is used on Mac OS X, e.g. for the keyboard? To display a filename, the form is not important. With my patch, the form also no longer matters when accessing the filesystem (no more strange Mac OS X normalization bug). So it's only important when comparing two filenames. If the two filenames are normalized to different forms (e.g. NFC vs NFD), they will be seen as different even if they are the same name. -- Anyway, I think that os.fsencode(os.fsdecode(name)) should be equal to name. If it's different, "open(name, 'w').close(); name in listdir()" is False (on systems storing filenames as bytes). So if you change fsdecode(), fsencode() should also be changed. |
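To make the comparison problem concrete, here is a small sketch. same_filename is a hypothetical helper, and the last lines check the round trip with the current, non-normalizing os.fsencode()/os.fsdecode().

```python
import os
import unicodedata

precomposed = "\u00e9"    # é as a single code point (NFC)
decomposed = "e\u0301"    # 'e' followed by a combining acute accent (NFD)

def same_filename(a: str, b: str) -> bool:
    """Hypothetical helper: compare two names after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# A plain comparison treats the two spellings as different names.
plain_equal = precomposed == decomposed
# Comparing after normalization treats them as the same name.
normalized_equal = same_filename(precomposed, decomposed)

# Without any normalization, decoding then encoding gives back the same bytes.
raw = decomposed.encode("utf-8")
round_trips = os.fsencode(os.fsdecode(raw)) == raw
```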
Some pointers. "MacFUSE" "FILENAME_ENCODING_PROPOSAL" (MacFUSE) "Converting to Precomposed Unicode" "Unicode NFD and file attachment on Mac OS X" (filenames of email attachments) "Bug: TWiki on Mac OS X server with I18N generates odd looking file names" |
I see. This is one more reason not to convert strings into NFD, no?
My question is rather why it failed in the first place, when bpo-8207
Hmm. I find that a weak argument - in particular given that the
I think this is technically incorrect. When you press é, then some
Same reasoning: pressing a key initially does not generate any Unicode http://developer.apple.com/library/mac/#qa/qa2001/qa1235.html which says "Macintosh keyboards generally produce precomposed Unicode"
I agree, and that is currently already the case.
I'm saying that fsdecode shouldn't change, either, the primary reason |
r79426 (of bpo-8207) only disabled some tests. The problem with test_normalize() and test_listdir() of test_pep277 is maybe that these tests are irrelevant on Mac OS X? I still don't understand exactly why the tests fail and what the tests check. |
I tried a different approach (different than my patch and the svn branch):
Let's watch the buildbots... |
It looks like r85897 is enough to fix test_pep277 on "x86 Tiger 3.x" buildbot. But r85899 should not make the situation worse :-) |
I now agree with Martin: "Mac OS X: Decompose filenames on encode, and precompose filenames on decode" was a bad idea, and fixing the test is the right solution. test_pep277 now passes on the "x86 Tiger 3.x" buildbot, so I can close this issue and issue bpo-8423. |
For completeness' sake: Apple's Cocoa APIs do not renormalize strings. That is, I've created a file named 'één' in the Terminal, then (using a python 3.2 build):
# Terminal input seems NFC:
>>> len('één')
3
# Output from os.listdir isn't:
>>> import os
>>> os.listdir('.')
['één']
>>> len(_[0])
5
# Output from the Cocoa equivalent also isn't:
>>> import Foundation
>>> mgr = Foundation.NSFileManager.defaultManager()
>>> mgr.directoryContentsAtPath_('.')
(
"e\U0301e\U0301n"
)
>>> len(_[0])
5
BTW, fsdecode(fsencode(x)) cannot in general be a no-op: Unicode normalizations can screw things up (with the now withdrawn proposal the expression wouldn't be a no-op for NFD strings). |
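The lengths in the session above and the round-trip remark can be reproduced without a Mac. This is only a sketch: withdrawn_round_trip is a hypothetical stand-in for fsdecode(fsencode(x)) under the withdrawn proposal, assuming UTF-8.

```python
import unicodedata

nfc_name = "\u00e9\u00e9n"                          # 'één', precomposed
nfd_name = unicodedata.normalize("NFD", nfc_name)   # what os.listdir returned above

def withdrawn_round_trip(name: str) -> str:
    """Hypothetical stand-in for fsdecode(fsencode(name)) under the
    withdrawn proposal: NFD before encoding, NFC after decoding."""
    raw = unicodedata.normalize("NFD", name).encode("utf-8")
    return unicodedata.normalize("NFC", raw.decode("utf-8"))

nfc_length = len(nfc_name)   # 3 code points
nfd_length = len(nfd_name)   # 5 code points: e, U+0301, e, U+0301, n

# An NFD input comes back as NFC, so the round trip is not a no-op for it.
nfd_survives = withdrawn_round_trip(nfd_name) == nfd_name
# An NFC input comes back unchanged.
nfc_survives = withdrawn_round_trip(nfc_name) == nfc_name
```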