Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create a bytes version of os.environ and getenvb() #52849

Closed
vstinner opened this issue May 3, 2010 · 38 comments
Closed

Create a bytes version of os.environ and getenvb() #52849

vstinner opened this issue May 3, 2010 · 38 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) stdlib Python modules in the Lib dir topic-unicode type-feature A feature request or enhancement

Comments

@vstinner
Copy link
Member

vstinner commented May 3, 2010

BPO 8603
Nosy @malemburg, @loewis, @gpshead, @pitrou, @vstinner, @ezio-melotti
Files
  • os_environb-3.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2010-09-10.22:19:03.024>
    created_at = <Date 2010-05-03.08:56:24.666>
    labels = ['interpreter-core', 'type-feature', 'library', 'expert-unicode']
    title = 'Create a bytes version of os.environ and getenvb()'
    updated_at = <Date 2010-09-10.22:19:03.023>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2010-09-10.22:19:03.023>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2010-09-10.22:19:03.024>
    closer = 'vstinner'
    components = ['Interpreter Core', 'Library (Lib)', 'Unicode']
    creation = <Date 2010-05-03.08:56:24.666>
    creator = 'vstinner'
    dependencies = []
    files = ['17236']
    hgrepos = []
    issue_num = 8603
    keywords = ['patch']
    message_count = 38.0
    messages = ['104825', '104827', '104838', '104839', '104871', '104877', '104878', '104879', '104880', '104883', '104884', '104886', '104889', '104894', '104895', '104897', '104898', '104900', '104901', '104913', '104920', '104987', '105003', '105006', '105080', '105115', '105116', '105124', '105131', '105142', '105143', '105167', '111878', '111880', '111897', '111972', '111976', '116050']
    nosy_count = 7.0
    nosy_names = ['lemburg', 'loewis', 'gregory.p.smith', 'pitrou', 'vstinner', 'ezio.melotti', 'Arfrever']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue8603'
    versions = ['Python 3.2']

    @vstinner
    Copy link
    Member Author

    vstinner commented May 3, 2010

    As discussed in issue bpo-8514, I propose a bytes version of os.envionb which would be synchronized with os.environ (which is possible thanks to surrogateescape error handler).

    I also propose a os.getenvb()->bytes function.

    I don't know yet if it's a good idea of not, but my patch accepts both bytes and str for os.environ(b).get() and os.getenv(b)().

    antoine> In posixmodule.c, (...) if memory allocation of the bytes
    antoine> object fails, we should error out.

    I would require to change also the Windows version and the code specific to OS/2. Ok to do that, but after closing this issue ;-) I don't want to change to much things at the same time.

    @vstinner vstinner added interpreter-core (Objects, Python, Grammar, and Parser dirs) stdlib Python modules in the Lib dir topic-unicode labels May 3, 2010
    @vstinner
    Copy link
    Member Author

    vstinner commented May 3, 2010

    The patch creates also fsencode()/fsdecode() functions proposed in bpo-8514: I can rename them to use protected name (eg. "_encodeenv" and "_decodeenv") until we decided for these functions.

    @malemburg
    Copy link
    Member

    A view comments on the patch:

    + def __init__(self, data, encodekey, decodekey, encodevalue, decodevalue, putenv, unsetenv):

    As general guideline: When adding new parameter, please add them to the
    end of the parameter list and preferably with a default argument in order
    to not break the API.

    Doesn't matter much in this case, since _Environ is only used internally,
    but it's still good practice.

    --

    +data = {}
    +for key, value in environ.items():
    + data[_keymap(key)] = fsencode(value)

    Please put such init code into a function or make sure that the global
    module space is not polluted with temporary variables such as data,
    key, value.

    --

    + # bytes environ
    + environb = _Environ(data, _keymap, fsencode, fsencode, fsencode, _putenv, _unsetenv)

    This looks wrong even though it will work, but that's only a
    side-effect of how fsencode is coded and that's not how the
    stdlib should be coded (see below).

    --

    + def fsencode(value):
    + """
    + unicode to file system
    + """
    + if isinstance(value, bytes):
    + return value
    + else:
    + return value.encode(sys.getfilesystemencoding(), 'surrogateescape')

    The function should not accept bytes as input or at least
    document this pass-through behavior, leaving the user to decide
    whether that's wanted or not.

    --

    The patch is also missing code which keeps the two dictionaries in
    sync. If os.environ changes, os.environb would have to change as
    well.

    @vstinner
    Copy link
    Member Author

    vstinner commented May 3, 2010

    The patch is also missing code which keeps the two dictionaries in
    sync. If os.environ changes, os.environb would have to change as
    well.

    No, it doesn't :-) os.environ and os.environb are synchronized and there is a test for this! ;-)

    I will see later for your other comments.

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented May 3, 2010

    Can somebody please explain what problem is being solved with this patch?

    @malemburg
    Copy link
    Member

    Martin v. Löwis wrote:

    Martin v. Löwis <martin@v.loewis.de> added the comment:

    Can somebody please explain what problem is being solved with this patch?

    The way os.environ is currently set up on Unix breaks applications
    using the environment to pass data to helper applications.

    Please see the discussion on http://bugs.python.org/issue8514
    for details.

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented May 3, 2010

    Please see the discussion on http://bugs.python.org/issue8514
    for details.

    I can't see any report of actual breakage in that report, only claims of
    potential breakage (with no supporting examples)

    @vstinner
    Copy link
    Member Author

    vstinner commented May 3, 2010

    An issue was opened 2 years ago: "It was brought up in a discussion of sending non-ASCii data to a CGI-WSGI script where the data would be transferred via os.environ." => bpo-4006 (closed as "wont fix").

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented May 3, 2010

    An issue was opened 2 years ago: "It was brought up in a discussion
    of sending non-ASCii data to a CGI-WSGI script where the data would
    be transferred via os.environ." => bpo-4006 (closed as "wont fix").

    Fortunately, that issue could now be reconsidered as "fixed"; the
    example in the report (msg74118) now works correctly - thanks to PEP-383. So I still fail to see the problem.

    @malemburg
    Copy link
    Member

    Martin v. Löwis wrote:

    Martin v. Löwis <martin@v.loewis.de> added the comment:

    > Please see the discussion on http://bugs.python.org/issue8514
    > for details.

    I can't see any report of actual breakage in that report, only claims of
    potential breakage (with no supporting examples)

    Set your CODESET to ASCII and watch the surrogate escaping
    begin... seriously, Martin, if you've ever worked with CGI
    or WSGI or FastCGI or SCGI or any of the many other protocols
    that use the OS environment for passing data between processes,
    it doesn't take much imagination to come up with examples
    that fail left and right.

    Here's one (RFC 3875, sections 4.1.7 and 4.1.5):

    LANG = 'en_US.utf8'
    CONTENT_TYPE = 'application/x-www-form-urlencoded'
    QUERY_STRING = 'type=example&name=Löwis'
    PATH_INFO = '/home/löwis/bin/mycgi.py'

    (HTML uses Latin-1 as default encoding and so do many of the
    protocols invented for it !)

    The file system encoding simply doesn't relate to the OS
    environment at all - it's just a collection of name=value mappings
    with no explicit encoding information. It may be a good guess,
    but that's it.

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented May 3, 2010

    Set your CODESET to ASCII and watch the surrogate escaping
    begin... seriously, Martin, if you've ever worked with CGI
    or WSGI or FastCGI or SCGI or any of the many other protocols
    that use the OS environment for passing data between processes,
    it doesn't take much imagination to come up with examples
    that fail left and right.

    Here's one (RFC 3875, sections 4.1.7 and 4.1.5):

    LANG = 'en_US.utf8'
    CONTENT_TYPE = 'application/x-www-form-urlencoded'
    QUERY_STRING = 'type=example&name=Löwis'
    PATH_INFO = '/home/löwis/bin/mycgi.py'

    I still don't see a *failure* here. AFAICT, it all works correctly.
    In particular, I fail to see the advantage of using bytes over using
    escaped strings, in terms of correctness. I'm even skeptical that there
    is an advantage in terms of usability (and if there is, I'd like to see
    a demonstration of that, as well).

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented May 3, 2010

    Here's one (RFC 3875, sections 4.1.7 and 4.1.5):

    LANG = 'en_US.utf8'
    CONTENT_TYPE = 'application/x-www-form-urlencoded'
    QUERY_STRING = 'type=example&name=Löwis'
    PATH_INFO = '/home/löwis/bin/mycgi.py'

    (HTML uses Latin-1 as default encoding and so do many of the
    protocols invented for it !)

    BTW, I think you are misinterpreting the RFC. It doesn't actually say
    that QUERY_STRING is Latin-1 encoded, but instead, it says

    "the details of the parsing, reserved characters and support for non
    US-ASCII characters depends on the context"

    Latin-1 is only given as a possible example. Apache passes the URL from
    the HTTP request unescaped; browsers will likely CGI-escape it. So most
    likely, it will be

    QUERY_STRING = 'type=example&name=L%F6wis'
    or
    QUERY_STRING = 'type=example&name=L%C3%B6wis'

    IMO, applications are much better off to consider QUERY_STRING as a
    character string.

    @malemburg
    Copy link
    Member

    Martin v. Löwis wrote:

    Martin v. Löwis <martin@v.loewis.de> added the comment:

    > Set your CODESET to ASCII and watch the surrogate escaping
    > begin... seriously, Martin, if you've ever worked with CGI
    > or WSGI or FastCGI or SCGI or any of the many other protocols
    > that use the OS environment for passing data between processes,
    > it doesn't take much imagination to come up with examples
    > that fail left and right.
    >
    > Here's one (RFC 3875, sections 4.1.7 and 4.1.5):
    >
    > LANG = 'en_US.utf8'
    > CONTENT_TYPE = 'application/x-www-form-urlencoded'
    > QUERY_STRING = 'type=example&name=Löwis'
    > PATH_INFO = '/home/löwis/bin/mycgi.py'

    I still don't see a *failure* here. AFAICT, it all works correctly.

    Your name will end up being partially escaped as surrogate:

    'L\udcf6wis'

    Further processing will fail, since the application would
    correctly assume that the data is Latin-1 only (see the RFC):

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'latin-1' codec can't encode character '\udcf6' in position 1: ordinal not in
    range(256)

    In particular, I fail to see the advantage of using bytes over using
    escaped strings, in terms of correctness. I'm even skeptical that there
    is an advantage in terms of usability (and if there is, I'd like to see
    a demonstration of that, as well).

    The use of the 'surrogateescape' error handler modifies the
    encoding used for the decoding of the bytes data and does this
    implicitly.

    This works fine as long as the data is only used *as reference* to
    some entity (e.g. as in a file name) and manipulation of that
    data is limited to concatenation and slicing. Things that you do
    with file names and paths.

    It doesn't work if an application tries to work *with* the data,
    e.g. tries to convert it, parse it, decode it, etc. The reason is
    that information included by the use of the 'surrogateescape'
    error handler is lost along the way and this then causes data
    corruption.

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented May 3, 2010

    Your name will end up being partially escaped as surrogate:

    'L\udcf6wis'

    Further processing will fail

    That depends on the further processing, no?

    > Traceback (most recent call last):
    >   File "<stdin>", line 1, in <module>
    > UnicodeEncodeError: 'latin-1' codec can't encode character '\udcf6' in position 1: ordinal not in
    > range(256)

    Where did you get this error from?

    It doesn't work if an application tries to work *with* the data,
    e.g. tries to convert it

    Converting it to what?

    parse it

    Parsing will work fine.

    decode it

    It's a string. You shouldn't decode it.

    The reason is
    that information included by the use of the 'surrogateescape'
    error handler is lost along the way and this then causes data
    corruption.

    And how would that not happen if it was bytes? The problems you describe
    were one of the primary motivations to switch to Unicode: it's *byte*
    strings that have these problems.

    @vstinner
    Copy link
    Member Author

    vstinner commented May 3, 2010

    Can somebody please explain what problem is being solved with this patch?

    The problem is that we don't have reliable algorithm to get the encoding of each environment variable. We cannot ensure that all variables are encoded "correctly".

    Using os.getenvb(), you can decode the string using the right encoding (which may be different for each variable).

    Marc proposed os.getenv(key, encoding=...) but I don't like this solution: you have to split correctly all unicode things and all bytes things. You should have the choice to keep environment unchanged, as byte strings, and manipulate only byte strings, or to use the classic API (unicode only, os.getenv, os.environ(), ....).

    os.environb and os.getenvb() will be required to applications in real world application supporting all applications and misconfigured environments. Python3 shouldn't try to fix misconfigured systems but leave this problem to the developer.

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented May 3, 2010

    Using os.getenvb(), you can decode the string using the right
    encoding (which may be different for each variable).

    Ok. If that's the motivation, the documentation should make that
    clear (there isn't any documentation in the patch, anyway). I'm
    worried that people start processing the bytes as-is (i.e. without
    decoding them), and then start complaining that this and that library
    doesn't support bytes. Then they start complaining that you can't
    mix bytes and Unicode...

    I also worry that people won't get it right any better than Python:
    when they find that it doesn't work, they ask what to do, and people
    will tell them decode with "iso-8859-1" (say). Then they do that,
    and end up in moji-bake the next day.

    @malemburg
    Copy link
    Member

    Martin v. Löwis wrote:

    Martin v. Löwis <martin@v.loewis.de> added the comment:

    > Here's one (RFC 3875, sections 4.1.7 and 4.1.5):
    >
    > LANG = 'en_US.utf8'
    > CONTENT_TYPE = 'application/x-www-form-urlencoded'
    > QUERY_STRING = 'type=example&name=Löwis'
    > PATH_INFO = '/home/löwis/bin/mycgi.py'
    >
    > (HTML uses Latin-1 as default encoding and so do many of the
    > protocols invented for it !)

    BTW, I think you are misinterpreting the RFC. It doesn't actually say
    that QUERY_STRING is Latin-1 encoded, but instead, it says

    "the details of the parsing, reserved characters and support for non
    US-ASCII characters depends on the context"

    Please read on:

    """
    For example, form submission from an HTML
    document [18] uses application/x-www-form-urlencoded encoding, in
    which the characters "+", "&" and "=" are reserved, and the ISO
    8859-1 encoding may be used for non US-ASCII characters.
    """

    I could have also given you an example using 'multipart/form-data'
    in which each part uses a different encoding or even sends binary
    data by means of 'Content-Transfer-Encoding: binary'

    These are not made up examples, they do occur in the real world for
    which we are coding.

    Latin-1 is only given as a possible example. Apache passes the URL from
    the HTTP request unescaped; browsers will likely CGI-escape it. So most
    likely, it will be

    QUERY_STRING = 'type=example&name=L%F6wis'
    or
    QUERY_STRING = 'type=example&name=L%C3%B6wis'

    IMO, applications are much better off to consider QUERY_STRING as a
    character string.

    Believe me, I've been working with HTML, forms, web apps, etc.
    for almost 20 years now. In the real world, your application has
    to cope with any kind of data in QUERY_STRING.

    And this is just one example of how the OS environment can
    be used, e.g. to provide the user meta-data, license data,
    company names.

    Even if these all use UTF-8, a user might still want to stick
    to ASCII as her CODESET and then all her Python application would
    start to fail at first sight of a French accent or German
    Umlaut.

    PEP-383 is nice for file names and paths, but it's unfortunately
    not going to save the world...

    @vstinner
    Copy link
    Member Author

    vstinner commented May 3, 2010

    I also worry that people won't get it right any better than Python (...)

    Developers coming from Python2 will continue to use os.getenv() and will not worry about encoding, and maybe not notice that the result is now unicode (and not more a byte string).

    I think that a developer will only switch to os.getenvb() if he/she has troubles with the encodings.

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented May 3, 2010

    I think that a developer will only switch to os.getenvb() if he/she
    has troubles with the encodings.

    That's indeed a positive feature of this proposed change.

    @malemburg
    Copy link
    Member

    Martin v. Löwis wrote:
    > 
    > Martin v. Löwis <martin@v.loewis.de> added the comment:
    > 
    >> Your name will end up being partially escaped as surrogate:
    >>
    >> 'L\udcf6wis'
    >>
    >> Further processing will fail
    > 
    > That depends on the further processing, no?
    > 
    >> Traceback (most recent call last):
    >>   File "<stdin>", line 1, in <module>
    >> UnicodeEncodeError: 'latin-1' codec can't encode character '\udcf6' in position 1: ordinal not in
    >> range(256)
    > 
    > Where did you get this error from?
    The roundup email interface must have eaten this
    first line of the traceback: >>> _.encode('latin-1')

    > It doesn't work if an application tries to work *with* the data,
    > e.g. tries to convert it

    Converting it to what?

    > parse it

    Parsing will work fine.

    > decode it

    It's a string. You shouldn't decode it.

    > The reason is
    > that information included by the use of the 'surrogateescape'
    > error handler is lost along the way and this then causes data
    > corruption.

    And how would that not happen if it was bytes? The problems you describe
    were one of the primary motivations to switch to Unicode: it's *byte*
    strings that have these problems.

    Martin, it's obvious that you are not even trying to understand
    what I'm saying. That's not a good basis for discussion.

    @vstinner
    Copy link
    Member Author

    vstinner commented May 4, 2010

    New version of my patch, which looks much better. Summary:

    Issue bpo-8603: Create os.environb and os.getenvb() on POSIX system.
    os.unsetenv() encodes str argument using file system encoding and
    surrogateescape error handler (instead of utf8/strict), and accept bytes.

    Changes with my previous patch:

    • update os module documentation
    • os.getenv() only accepts str, os.getenvb() only accepts bytes (to avoid mojibake)
    • fix test_environb() of test_os for ASCII locale
    • fix os.putenv() if key is an unicode string: use the string encoded to bytes as key for posix_putenv_garbage
    • _Envion.__setitem__() encodes the key and value before calling putenv()
    • _Environ.__delitem__() encodes the key before calling unsetenv()
    • create a temporary function to create os.environ (to use temporary variables like encode, decode, keymap, data)
    • annotation types in os.getenv() and os.getenvb()
    • remove fsencode()/fsdecode() from the patch and don't touch os._execvpe() (will be fixed in other issues)
    • putenv() uses PyBytes_GET_SIZE() instead of strlen()

    @vstinner
    Copy link
    Member Author

    vstinner commented May 5, 2010

    @loewis: So do you agree to add os.environb and os.getenvb()?

    The documentation of os.environb and os.getenvb() in my last patch is very short. I'm not inspired.

    We told me on IRC to not use function annotations because annotation semantic was not decided yet.

    I will try to improve the documentation and remove the annotations in my next patch.

    @malemburg
    Copy link
    Member

    STINNER Victor wrote:

    The documentation of os.environb and os.getenvb() in my last patch is very short. I'm not inspired.

    We told me on IRC to not use function annotations because annotation semantic was not decided yet.

    I will try to improve the documentation and remove the annotations in my next patch.

    Patch looks good. +1 on adding it.

    One nit: I'd rename the keymap function to encodekey.

    @vstinner
    Copy link
    Member Author

    vstinner commented May 5, 2010

    MaL> Patch looks good. +1 on adding it.

    Cool. I didn't understood if MvL is +1, but at least he's not -1 on this, and
    we are at least two at +1 :-)

    MaL> One nit: I'd rename the keymap function to encodekey.

    Ok, I will also change that in the final patch.

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented May 5, 2010

    @loewis: So do you agree to add os.environb and os.getenvb()?

    I agree with the patch (-2) in principle. I think the error handling
    needs to improve:

    py> os.getenvb('PATH')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/martin/work/3k/Lib/os.py", line 484, in getenvb
        return environb.get(key, default)
      File "/home/martin/work/3k/Lib/_abcoll.py", line 357, in get
        return self[key]
      File "/home/martin/work/3k/Lib/os.py", line 400, in __getitem__
        value = self.data[self.encodekey(key)]
    TypeError: string argument without an encoding

    which then leads to the natural, but incorrect

    py> os.getenvb('PATH', encoding='ascii')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: getenvb() got an unexpected keyword argument 'encoding'

    The first error should remain TypeError, but say something like
    "unsupported type string", or "bytes expected".

    I notice an incompatible change: posix.environ has now a different
    element type. This is probably fine.

    There is a couple of white-space only changes in the patch; it would be
    good if you could reduce them.

    @vstinner
    Copy link
    Member Author

    vstinner commented May 6, 2010

    There is a couple of white-space only changes in the patch;
    it would be good if you could reduce them.

    "When two paths open to you, you should always choose the most difficult" (in french: "Quand deux chemins s'ouvrent à nous, il faut toujours choisir le plus difficile"). I fixed posixmodule.c instead of my patch :-)

    @vstinner
    Copy link
    Member Author

    vstinner commented May 6, 2010

    I notice an incompatible change: posix.environ has now a
    different element type. This is probably fine.

    I don't understand, what is an "element type"?

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented May 6, 2010

    > I notice an incompatible change: posix.environ has now a
    > different element type. This is probably fine.

    I don't understand, what is an "element type"?

    In a container, the contents is sometimes called "elements"; their
    type is the element type. Currently, keys and values are strings;
    with the change, they are bytes.

    @vstinner
    Copy link
    Member Author

    vstinner commented May 6, 2010

    Aaaah, *posix*.environ, not *os*.environ, ok. I will fix posix documentation. I didn't knew this dictionary :-)

    @vstinner
    Copy link
    Member Author

    vstinner commented May 6, 2010

    Patch version 3:

    • update posix documentation
    • improve os.environ and os.getenv() documentation: specify the type and document the encoding/error handler, add a link to environb/getenvb
    • os.environ and os.environb now check the argument types (raise a better error), on Windows and Unix. Before my patch, os.environ[b'key']=b'value' sets the variable "b'key'" to "b'value'" :-(
    • restore os.environb in TestEnviron.tearDown() (test_os)
    • fix patch on posixmodule.c indentation
    • fix a regression introduced by my patch: set PyErr_NoMemory() on error (newstr==NULL in posixmodule.c)
    • rename keymap to encodekey

    @vstinner
    Copy link
    Member Author

    vstinner commented May 6, 2010

    Oh no, I forgot to remove the annotations from getenv() and getenvb() in os.py. I only removed them from the documentation.

    @vstinner
    Copy link
    Member Author

    vstinner commented May 6, 2010

    Commited as r80885 (py3k), blocked in 3.1 (r80886).

    Thank you Martin and Marc for your great help and all your reviews ;-)

    @vstinner vstinner closed this as completed May 6, 2010
    @ezio-melotti
    Copy link
    Member

    FWIW os.environb is missing from os.__all__.

    @ezio-melotti ezio-melotti added the type-feature A feature request or enhancement label Jul 29, 2010
    @ezio-melotti
    Copy link
    Member

    A quick search0 also shows that environ.data is used by several projects. Changing it from str to bytes will most likely break these programs, so I'm not sure it's a good idea.
    Also, if I understand correctly, on Windows os.environ.data still contains str, so it's inconsistent with other systems.
    IMHO os.environ.data should contain str, whereas os.environb.data bytes. If they must share the same data I would prefer them to be both str.

    @ezio-melotti ezio-melotti reopened this Jul 29, 2010
    @malemburg
    Copy link
    Member

    Ezio Melotti wrote:

    Ezio Melotti <ezio.melotti@gmail.com> added the comment:

    A quick search0 also shows that environ.data is used by several projects. Changing it from str to bytes will most likely break these programs, so I'm not sure it's a good idea.
    Also, if I understand correctly, on Windows os.environ.data still contains str, so it's inconsistent with other systems.
    IMHO os.environ.data should contain str, whereas os.environb.data bytes. If they must share the same data I would prefer them to be both str.

    Direct use of os.environ.data is not permitted as it is not a documented
    feature, so I wouldn't worry about this.

    The few uses you found are easy to fix.

    @vstinner
    Copy link
    Member Author

    Le jeudi 29 juillet 2010 03:08:27, Ezio Melotti a écrit :

    A quick search[0] also shows that environ.data is used by several projects.

    _Environ is a wrapper on data: call putenv() when a value is changed and
    unputenv() when a value is removed. Since os.environb, it does also convert
    key and value types. So .data should not be used directly.

    It looks like a common usage of .data is to get get to "a real dict" (and not
    "an user dict"). I think that os.environ.data or os.environ.data.copy() can be
    replaced by dict(os.environ) is that case.

    Changing it from str to bytes will most likely break these programs,

    I'm not sure of that. It looks (in the search result) that os.environ.data is
    just stored and passed to a function, but not used as a dictionary to get a
    value (well, I'm not sure because it's hard to say with just 3 lines of the
    program). In that case, it doesn't matter if the dictionary contains byte or
    unicode strings.

    Eg. zope2instance:

      old_env = os.environ.data.copy()
      ...
      os.environ.data = old_env

    It does still work with bytes.

    so I'm not sure it's a good idea

    Yes, it's not a good idea to use .data :-) This attribute should be protected,
    not public (even if it is not documented).

    IMHO os.environ.data should contain str, whereas os.environb.data
    bytes.

    data is shared between os.environ and os.environb, because data is the
    solution to synchronize both dictionaries.

    If they must share the same data I would prefer them to be both
    str.

    No, data should use the native type: bytes on UNIX and BSD, str on Windows.

    --

    If you still consider that the change on .data as a bug, I think that the fix
    is to remove .data (mark it as protected: environ.data => environ._data).

    @vstinner
    Copy link
    Member Author

    Le jeudi 29 juillet 2010 03:00:03, Ezio Melotti a écrit :

    FWIW os.environb is missing from os.__all__.

    Fixed by r83237

    @vstinner
    Copy link
    Member Author

    If you still consider that the change on .data as a bug,
    I think that the fix is to remove .data (mark it as
    protected: environ.data => environ._data).

    r84690 marks os.environ.data as protected. Close this issue again.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    interpreter-core (Objects, Python, Grammar, and Parser dirs) stdlib Python modules in the Lib dir topic-unicode type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants