Message 126656 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	vstinner
Recipients	belopolsky, loewis, vstinner
Date	2011-01-20.22:23:20
SpamBayes Score	2.2150892e-11
Marked as misclassified	No
Message-id	<1295562193.29835.15.camel@marge>
In-reply-to	<AANLkTim0POiAYAH8e0gxTSqAmFuPNvhf7j8P2P3+dufm@mail.gmail.com>

Content
> A packaging mechanism that prepares code developed on a Latin-1 > filesystem for distribution, would have to NFKC-normalize > filenames before encoding them using UTF-8. It causes portability issues: if you copy a non-ASCII module on a new host, the program will work or not depending on the filesystem encoding. Having to transform the filename when you copy a file, just to fix a corner case, is a pain. > One possible solution to this problem is to define a 'compat' error > handler that would detect unencodable strings with encodable > compatibility equivalents and produce encoding of an NFKC equivalent > string instead of raising an error. Only few people use non-ASCII module names and most operating systems are able to store all Unicode characters, so I don't think that we need to support U+00B5 in a module name with Latin1 filesystem at all. If you use an old system using Latin1 filesystem, you have to limit your expectation on Python unicode support :-) os.fsencode() and os.fsdecode() already use a custom error handler: surrogateescape. compat will conflict with surrogateescape. Loading a module concatenates two parts: a path from sys.path (decoded from the filesystem encoding and surrogateescape error handler) and a module name. If custom is used to encode the filename, the module name will be encoded correctly, but not the path.

> A packaging mechanism that prepares code developed on a Latin-1
> filesystem for distribution, would have to NFKC-normalize 
> filenames before encoding them using UTF-8.

It causes portability issues: if you copy a non-ASCII module on a new
host, the program will work or not depending on the filesystem encoding.
Having to transform the filename when you copy a file, just to fix a
corner case, is a pain.

> One possible solution to this problem is to define a 'compat' error
> handler that would detect unencodable strings with encodable
> compatibility equivalents and produce encoding of an NFKC equivalent
> string instead of raising an error.

Only few people use non-ASCII module names and most operating systems
are able to store all Unicode characters, so I don't think that we need
to support U+00B5 in a module name with Latin1 filesystem at all. If you
use an old system using Latin1 filesystem, you have to limit your
expectation on Python unicode support :-)

os.fsencode() and os.fsdecode() already use a custom error handler:
surrogateescape. compat will conflict with surrogateescape. Loading a
module concatenates two parts: a path from sys.path (decoded from the
filesystem encoding and surrogateescape error handler) and a module
name. If custom is used to encode the filename, the module name will be
encoded correctly, but not the path.

History
Date	User	Action	Args
2011-01-20 22:23:22	vstinner	set	recipients: + vstinner, loewis, belopolsky
2011-01-20 22:23:20	vstinner	link	issue10952 messages
2011-01-20 22:23:20	vstinner	create