classification
Title: os.confstr() doesn't decode result according to PEP 383
Type: behavior
Stage:
Components: Extension Modules
Versions: Python 3.1, Python 3.2
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: baikie, vstinner
Priority: normal
Keywords: patch

Created on 2010-08-12 19:16 by baikie, last changed 2010-09-10 23:51 by vstinner. This issue is now closed.

Files
File name Uploaded Description
confstr-pep383.diff baikie, 2010-08-12 19:16 Decode confstr() result according to PEP 383
confstr-bytes-3.2.diff baikie, 2010-08-19 18:48 Make os.confstr() return a bytes object, attributing the change to Python 3.2
Messages (7)
msg113700 - Author: David Watson (baikie) Date: 2010-08-12 19:16
The attached patch applies on top of the patch from issue #9579 to
make it use PyUnicode_DecodeFSDefaultAndSize().  (You could use
it in the existing code, but until that issue is fixed, there is
sometimes nothing to decode!)
msg113723 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-08-12 23:52
Can you give me examples of configuration keys with undecodable values?

The encoding used by PyUnicode_DecodeFSDefault(AndSize) depends on the locale, whereas PyUnicode_FromString always uses UTF-8. I don't know which encoding confstr() values use.

You can decode a UTF-8 value using surrogateescape (the PEP 383 error handler) with PyUnicode_DecodeUTF8(value, strlen(value), "surrogateescape").
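
For illustration, here is a minimal Python-level sketch of the same idea, using bytes.decode() rather than the C API. The raw value is hypothetical and was chosen only because the byte 0xFF is never valid in UTF-8; it is not a value that confstr() is known to return.

```python
# Hypothetical raw value: ASCII plus one byte (0xFF) that is invalid in UTF-8.
raw = b"/opt/posix-\xff/bin"

# Strict decoding rejects the value outright.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape maps each undecodable byte 0xNN to the lone surrogate U+DCNN.
text = raw.decode("utf-8", "surrogateescape")

# Encoding with the same codec and handler restores the original bytes.
restored = text.encode("utf-8", "surrogateescape")
```

The decoded string is only a carrier for the original bytes: it round-trips, but the escaped characters are not meaningful text.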
msg113807 - Author: David Watson (baikie) Date: 2010-08-13 18:36
The CS_PATH variable is a colon-separated list of directories ("the value for the PATH environment variable that finds all standard utilities"), so the file system encoding is certainly correct there.

I don't see any reference to an encoding in the POSIX spec for confstr().
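
As a rough sketch of how CS_PATH might be read from Python: the helper name standard_utility_dirs is hypothetical, and the code assumes a platform where os.confstr and os.confstr_names are available (they are Unix-only), returning None otherwise.

```python
import os

def standard_utility_dirs():
    """Return the CS_PATH directories as a list of str, or None if unavailable."""
    # os.confstr and os.confstr_names only exist on Unix-like platforms.
    names = getattr(os, "confstr_names", {})
    if not hasattr(os, "confstr") or "CS_PATH" not in names:
        return None
    try:
        value = os.confstr("CS_PATH")
    except (OSError, ValueError):
        return None
    # os.confstr() returns None when the configuration value is not defined.
    if value is None:
        return None
    # CS_PATH is a colon-separated list of directories.
    return value.split(":")

dirs = standard_utility_dirs()
```

Because each entry is a directory name, it would normally be passed on to file system APIs, which is why the file system encoding matters for this key.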
msg113846 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-08-13 22:31
On Friday 13 August 2010 20:36:22, you wrote:
> The CS_PATH variable is a colon-separated list of directories ("the value
> for the PATH environment variable that finds all standard utilities"), so
> the file system encoding is certainly correct there.

CS_PATH is hardcoded to "/bin:/usr/bin" in the GNU libc for UNIX. Do you know
another key for which the value can be controlled by the user (or the system
administrator)?

> I don't see any reference to an encoding in the POSIX spec for confstr().

CS_PATH is just an example; there are other keys. I'm not sure that all values
are encoded in the filesystem encoding. Might some use another encoding?

Well, if we really don't know the encoding, one solution is to use a bytes API
(which may avoid the question of whether to use PEP 383).
msg113923 - Author: David Watson (baikie) Date: 2010-08-14 19:17
> CS_PATH is hardcoded to "/bin:/usr/bin" in the GNU libc for UNIX. Do you know
> another key for which the value can be controlled by the user (or the system
> administrator)?

No, not a specific example, but CS_PATH could conceivably refer
to some POSIX compatibility suite that's been installed in a
non-ASCII location, and implementations can add their own
variables for whatever they want.

> CS_PATH is just an example; there are other keys. I'm not sure that all values
> are encoded in the filesystem encoding. Might some use another encoding?
>
> Well, if we really don't know the encoding, one solution is to use a bytes API
> (which may avoid the question of whether to use PEP 383).

The other variables defined by POSIX refer to environment
variables and command-line options for the C compiler and the
getconf utility, all of which would use the FS encoding in
Python, but I agree there's no way to know the appropriate
encoding in general, or even whether anything cares about
encodings.

Personally, I have no objections to making it return bytes.
msg114402 - Author: David Watson (baikie) Date: 2010-08-19 18:48
I wrote this patch to make confstr() return bytes (with code
similar to 2.x), and document the change in "Porting to Python
3.2" and elsewhere, but it then occurred to me that you might
have been talking about making a separate bytes API like
os.environb.  Which did you have in mind?

There is another option for a str API, which is to decode the
value as ASCII with the surrogateescape error handler.  The
returned string will then round-trip correctly through
PyUnicode_FSConverter(), etc., as long as the file system
encoding is compatible with ASCII, which PEP 383 requires it to
be.  This is how undecodable command line arguments are currently
handled when mbrtowc() is unavailable.
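
The ASCII-plus-surrogateescape option can be sketched at the Python level as follows. The raw value and the list of candidate encodings are assumptions made for illustration; the point is only that the escaped string encodes back to the same bytes under any ASCII-compatible encoding.

```python
# Hypothetical raw value containing the UTF-8 bytes for "é" (0xC3 0xA9).
raw = b"/opt/caf\xc3\xa9/bin"

# Decode as ASCII: every non-ASCII byte 0xNN becomes the lone surrogate U+DCNN.
text = raw.decode("ascii", "surrogateescape")

# Re-encode with several ASCII-compatible encodings standing in for
# possible file system encodings; each should reproduce the raw bytes.
round_trips = {
    enc: text.encode(enc, "surrogateescape") == raw
    for enc in ("utf-8", "iso-8859-1", "ascii")
}
```

The trade-off is that the str never contains a real "é", only escaped bytes, so the value round-trips correctly but does not display as readable text.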
msg116063 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-09-10 23:51
Fixed in r84696 and r84697: confstr-minimal.diff from #9579, plus decoding with PyUnicode_DecodeFSDefaultAndSize().

Thanks for the patch, sorry for the delay.
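
For readers working from Python rather than C, os.fsdecode() and os.fsencode() apply the file system encoding and its error handler, which is broadly what PyUnicode_DecodeFSDefaultAndSize() does on the decoding side. The sketch below uses a hypothetical raw value and assumes a POSIX platform where that error handler is surrogateescape; elsewhere it skips the check.

```python
import os
import sys

# Hypothetical raw value with one byte that is not valid ASCII or UTF-8.
raw = b"/opt/posix-\xff/bin"

if sys.getfilesystemencodeerrors() == "surrogateescape":
    # Decode with the file system encoding, escaping undecodable bytes.
    text = os.fsdecode(raw)
    # Callers that need the original bytes can recover them from the str.
    round_trip_ok = os.fsencode(text) == raw
else:
    # Other error handlers (for example on Windows) are out of scope here.
    round_trip_ok = None
```

With this approach a str-returning API can still serve callers that want bytes, since os.fsencode() reverses the decoding.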
History
Date User Action Args
2010-09-10 23:51:51 vstinner set resolution: duplicate -> fixed
2010-09-10 23:51:42 vstinner set status: open -> closed; resolution: duplicate; messages: + msg116063
2010-08-19 18:48:18 baikie set files: + confstr-bytes-3.2.diff; messages: + msg114402
2010-08-14 19:17:19 baikie set messages: + msg113923
2010-08-13 22:32:00 vstinner set messages: + msg113846
2010-08-13 18:36:20 baikie set messages: + msg113807
2010-08-12 23:52:38 vstinner set messages: + msg113723
2010-08-12 20:38:38 pitrou set nosy: + vstinner
2010-08-12 19:16:53 baikie create