Message 150039 - Python tracker


Author	gz
Recipients	benjamin.peterson, gz, pitrou, poolie, r.david.murray, vila, vstinner
Date	2011-12-21.19:47:04
Message-id	<1324496825.25.0.531524103923.issue13643@psf.upfronthosting.co.za>

Content

> Nope. The locale encoding is chosen using LC_ALL, LC_CTYPE or LANG 
> variable: use the first non-empty variable. LC_MESSAGES doesn't affect 
> the encoding. Example:

That's good to know, thanks. That only leaves the case where setlocale is called again with a different value.

> Yes, and ASCII and UTF-8 are incompatible. ASCII is unable to decode an 
> UTF-8 encoded string.

I think we're envisioning different things here.

  os.stat("\u2601") # with LANG=C
    current -> UnicodeEncodeError
    changed -> works if utf-8 encoded file exists

  os.listdir() # with LANG=C
    current -> returns non-ascii as unicode with funky surrogates
    changed -> returns non-utf-8 as unicode with funky surrogates
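
To make the comparison above concrete, here is a minimal sketch that imitates the two behaviours with plain bytes and the surrogateescape error handler (PEP 383), so it does not depend on the locale or on any file actually existing. The helper names decode_name and has_surrogates are made up for illustration, and "ascii" here only stands in for what LANG=C selects.

```python
# Filename bytes as they might be stored on disk.
utf8_name = "\u2601".encode("utf-8")      # b'\xe2\x98\x81', valid UTF-8
latin1_name = "\xe9".encode("latin-1")    # b'\xe9', not valid UTF-8


def decode_name(raw: bytes, encoding: str) -> str:
    """Decode filename bytes, mapping undecodable bytes to lone surrogates."""
    return raw.decode(encoding, "surrogateescape")


def has_surrogates(name: str) -> bool:
    """True if the name contains surrogates produced by surrogateescape."""
    return any(0xDC80 <= ord(ch) <= 0xDCFF for ch in name)


# Listing side: how each name would come back as str.
for raw in (utf8_name, latin1_name):
    for encoding in ("ascii", "utf-8"):
        name = decode_name(raw, encoding)
        # ascii() keeps the output printable even with lone surrogates.
        print(encoding, ascii(name), has_surrogates(name))
        # Either way the original bytes can be recovered.
        assert name.encode(encoding, "surrogateescape") == raw

# Stat side: encoding a real non-ASCII name fails under ASCII.
try:
    "\u2601".encode("ascii", "surrogateescape")
except UnicodeEncodeError:
    print("ascii cannot encode U+2601")
```

With the ASCII codec both names come back with surrogates; with UTF-8 only the name that is not valid UTF-8 does.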

> It affects everything because filenames are used everywhere.

But currently everything handling filenames as unicode on *nix already needs to worry about surrogates (which can't be encoded as ASCII), or it will still be passing values that can't be interpreted by other processes, as you highlighted earlier. Making UTF-8 names come out correctly rather than as surrogates doesn't seem like it increases the burden.
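
As a rough illustration of that burden, the check below is the kind of thing code has to do before handing a name to another process. It is only a sketch: can_pass_to_other_process is a hypothetical helper, and strict UTF-8 is assumed as the interchange encoding.

```python
def can_pass_to_other_process(name: str) -> bool:
    """True if the name encodes as strict UTF-8, i.e. has no lone surrogates."""
    try:
        name.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


raw = "\u2601".encode("utf-8")  # a UTF-8 encoded filename on disk

# Current behaviour under LANG=C, imitated with the ASCII codec.
current = raw.decode("ascii", "surrogateescape")
# Proposed behaviour, imitated with the UTF-8 codec.
changed = raw.decode("utf-8", "surrogateescape")

print(can_pass_to_other_process(current))  # surrogates, needs special handling
print(can_pass_to_other_process(changed))  # plain text, safe to pass on

# A name that is not valid UTF-8 still needs the check in both cases.
odd = b"\xe9".decode("utf-8", "surrogateescape")
print(can_pass_to_other_process(odd))
```

The check is needed in both cases; the change only reduces how many names fail it.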
