Message 149924 - Python tracker

Author	gz
Recipients	benjamin.peterson, gz
Date	2011-12-20.19:02:20
Message-id	<1324407741.91.0.375379899581.issue13643@psf.upfronthosting.co.za>
In-reply-to

Content
Currently, when running Python in a non-OS X POSIX environment under the C locale, or with an invalid or missing locale, it's not possible to operate on Unicode filenames outside the ASCII range. Using bytes works, as does reading filenames as Unicode, which relies on the surrogates hack (the surrogateescape error handler).
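To make the asymmetry concrete, here is a minimal sketch that applies the ascii codec with the surrogateescape handler directly, as an assumed stand-in for what happens to filenames when the filesystem encoding is ascii. The filename is made up for illustration.

```python
# Bytes of a UTF-8 encoded filename, as the OS would hand them over.
raw = "caf\u00e9.txt".encode("utf-8")

# Reading works: undecodable bytes become lone surrogates and round-trip.
name = raw.decode("ascii", "surrogateescape")
round_tripped = name.encode("ascii", "surrogateescape")

# Writing does not: a real non-ASCII character has no ascii encoding,
# and surrogateescape only maps lone surrogates back to bytes.
try:
    "caf\u00e9.txt".encode("ascii", "surrogateescape")
    encodable = True
except UnicodeEncodeError:
    encodable = False
```

So a name obtained from the filesystem can be passed back unchanged, but a name typed as ordinary Unicode text cannot be encoded at all.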

This makes working robustly with non-ASCII filenames across platforms needlessly annoying, given that no modern *nix should have problems just using UTF-8 in these cases.

See the downstream bzr bug for more:
<https://bugs.launchpad.net/bzr/+bug/794353>

One option is to just use UTF-8 for encoding and decoding filenames whenever ASCII would otherwise be used. As UTF-8 is a strict superset of ASCII, this shouldn't break too many existing assumptions, and it's unlikely that non-UTF-8 filenames will accidentally be mangled due to a locale setting blip. See the attached patch for this behaviour change. It does not currently include a test, but it's possible to write one using subprocess and overridden LANG and LC_ALL vars.
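A rough sketch of how such a subprocess-based check might start is below. It is not the test from the attached patch; the helper name `filesystem_encoding_under` is hypothetical, and the value it reports depends on the interpreter version and platform, so the sketch only prints it rather than asserting a particular encoding.

```python
import os
import subprocess
import sys


def filesystem_encoding_under(locale_name):
    """Return sys.getfilesystemencoding() as seen by a child interpreter.

    Hypothetical helper: the child runs with LANG and LC_ALL overridden.
    """
    env = dict(os.environ)
    env["LANG"] = locale_name
    env["LC_ALL"] = locale_name  # LC_ALL takes precedence over LANG
    output = subprocess.check_output(
        [sys.executable, "-c",
         "import sys; print(sys.getfilesystemencoding())"],
        env=env,
    )
    return output.decode("ascii").strip()


# Report what a child interpreter picks under the C locale.
print(filesystem_encoding_under("C"))
```

A real test would go on to create and list a non-ASCII filename inside the child process and compare the result with the expected name.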

History
Date	User	Action	Args
2011-12-20 19:02:22	gz	set	recipients: + gz, benjamin.peterson
2011-12-20 19:02:21	gz	set	messageid: <1324407741.91.0.375379899581.issue13643@psf.upfronthosting.co.za>
2011-12-20 19:02:21	gz	link	issue13643 messages
2011-12-20 19:02:20	gz	create