
classification
Title: incorrect sys.stdout.encoding within a io.StringIO buffer
Type: behavior
Stage:
Components: Library (Lib)
Versions: Python 3.9
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: pcarbonn, steven.daprano
Priority: normal
Keywords:

Created on 2021-07-30 08:52 by pcarbonn, last changed 2022-04-11 14:59 by admin.

Messages (6)
msg398528 - Author: Pierre Carbonnelle (pcarbonn) Date: 2021-07-30 08:52
The following code 

    import io
    import sys
    from contextlib import redirect_stdout

    print("outside:", sys.stdout.encoding)
    with redirect_stdout(io.StringIO()) as f:
        print("inside: ", sys.stdout.encoding)
    print(f.getvalue())

yields:

    outside: utf-8
    inside:  None

Because StringIO is a string buffer, the expected result is:

    outside: utf-8
    inside:  utf-8

This creates a problem when using packages whose output depends on sys.stdout.encoding, such as z3-solver.
msg398534 - Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2021-07-30 10:21
Why do you expect UTF-8 inside the redirect_stdout block?

io.StringIO doesn't have an encoding - it stores strings, not bytes.

If z3-solver cannot deal with StringIO, then surely that's a bug in z3-solver?
msg398536 - Author: Pierre Carbonnelle (pcarbonn) Date: 2021-07-30 11:09
I expect sys.stdout to have utf-8 encoding inside the redirect because the buffer accepts unicode code points (not bytes), just as it does outside of the redirect.  In other words, I expect the 'encoding' attribute of sys.stdout to have the same value inside and outside this redirect.

It so happens that sys.stdout is an io.StringIO() object inside the redirect.  The getvalue() method on this object returns a string (not bytes), i.e. a sequence of Unicode code points.

StringIO inherits from TextIOBase, which has an 'encoding' attribute.  For some reason, the encoding of a StringIO object is None, which is inconsistent with its semantics: it should be 'utf-8'.
msg398539 - Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2021-07-30 11:56
> I expect sys.stdout to have utf-8 encoding inside the redirect because 
> the buffer accepts unicode code points (not bytes)

And the buffer stores unicode code points, not bytes, so why would there 
be an encoding?

Just to get this out of the way, in case you are thinking along these 
lines. Python strings are not arrays of UTF-8 bytes, like Go strings. 
Python strings are arrays of abstract code points.

The specific details will vary from interpreter to interpreter, and from 
version to version, but current versions of CPython use a flexible 
in-memory representation where the width of the code points (1, 2 or 4 
bytes) depends on the string. This is not UTF-8: the code points are 
stored as Latin-1, UCS-2, or UTF-32 depending on the string.
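
The per-code-point width can be observed in CPython with sys.getsizeof. 
This is a rough sketch that assumes CPython 3.3 or later (PEP 393); the 
helper name bytes_per_code_point is made up for this illustration, and 
other interpreters may report different numbers.

```python
import sys

def bytes_per_code_point(ch):
    """Estimate the storage CPython uses per code point for strings made of ch.

    Hypothetical helper: it compares the sizes of two strings whose lengths
    differ by one, so the fixed object header cancels out.
    """
    return sys.getsizeof(ch * 101) - sys.getsizeof(ch * 100)

# One sample each from the ASCII, Latin-1, BMP and astral ranges.
for ch in ("a", "\u00e9", "\u0394", "\U0001F600"):
    print(f"U+{ord(ch):04X}: {bytes_per_code_point(ch)} byte(s) per code point")
```

None of these widths says anything about sys.stdout.encoding: the 
in-memory layout is an implementation detail, separate from any codec 
used when text is written out as bytes.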

> For some reason, the encoding of a StringIO object is None

Because StringIO objects store strings, not bytes. There is no encoding 
involved. The inputs are strings, and the storage is strings.

> which is inconsistent with its semantics: it should be 'utf-8'.

It is completely consistent: the encoding should be None, because there 
is no encoding.

> I expect the 'encoding' attribute of sys.stdout to have the same value 
> inside and outside this redirect.

Why? If you redirect to an actual file using, let's say Mac-Roman 
encoding, or ASCII, or UTF-32, or any one of dozens of other encodings, 
you should expect the encoding inside the block to reflect the actual 
encoding used inside the block.

The encoding outside the block is the encoding used by the original 
stdout; the encoding inside the block is the encoding used by the 
replacement stdout. Why would you expect them to always be the same?

>>> import sys
>>> from contextlib import redirect_stdout
>>> print("outside:", sys.stdout.encoding)
outside: utf-8
>>> with open("/tmp/junk.txt", "w", encoding="ascii") as f:
...     with redirect_stdout(f):
...         print("inside:", sys.stdout.encoding)
...
>>> with open("/tmp/junk.txt", encoding="ascii") as f:
...     print(f.read())
...
inside: ascii

> It so happens that sys.stdout is an io.StringIO() object inside the 
> redirect.  The getvalue() method on this object returns a string (not 
> a bytes), i.e. a sequence of unicode points.

Exactly. And that is why there is no encoding involved. It is purely a 
sequence of Unicode code points, not bytes, and at no point was a 
Unicode string encoded to bytes to go to the filesystem.

> StringIO inherits from TextIOBase, which has an 'encoding' attribute.  

And StringIO has an encoding attribute because of inheritance, and it is 
set to None because there is no actual encoding codec used.
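
To make the distinction concrete, here is a small sketch comparing the 
two kinds of text stream. It also shows one way a library could cope 
with a stream whose encoding is None; the helper stream_encoding and its 
utf-8 default are assumptions for illustration, not anything z3-solver 
actually does.

```python
import io

def stream_encoding(stream, default="utf-8"):
    """Return the stream's encoding, or a default when it has none.

    Hypothetical helper: getattr covers objects without the attribute,
    and `or` covers streams such as StringIO whose encoding is None.
    """
    return getattr(stream, "encoding", None) or default

# StringIO stores str objects directly, so no codec is involved.
text_only = io.StringIO()
# TextIOWrapper encodes text into the underlying byte buffer.
byte_backed = io.TextIOWrapper(io.BytesIO(), encoding="ascii")

print(text_only.encoding)
print(byte_backed.encoding)
print(stream_encoding(text_only))
print(stream_encoding(byte_backed))
```

Whether falling back to a default is appropriate depends on what the 
library does with the encoding; if it only uses it to decide which 
characters are safe to print, a text-only stream can accept any of them.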
msg398541 - Author: Pierre Carbonnelle (pcarbonn) Date: 2021-07-30 12:13
As a workaround, I had to use a temporary file (instead of a memory buffer):

    import sys
    from contextlib import redirect_stdout

    print("outside:", sys.stdout.encoding)
    with open("/tmp/log.txt", mode='w', encoding='utf-8') as buf:
        with redirect_stdout(buf):
            print("inside: ", sys.stdout.encoding)
    with open("/tmp/log.txt", mode='r', encoding='utf-8') as f:
        print(f.read())

and get:

    outside: utf-8
    inside:  utf-8

as expected.
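
If the temporary file is unwanted, an in-memory stream that does carry an encoding may work as well. The following is a sketch, not a tested fix for z3-solver: it wraps io.BytesIO in io.TextIOWrapper, and the function name capture_stdout is invented for this example.

```python
import io
import sys
from contextlib import redirect_stdout

def capture_stdout(func, encoding="utf-8"):
    """Run func() with stdout redirected to an in-memory, encoded text stream."""
    raw = io.BytesIO()
    # newline="\n" disables newline translation, so the captured text
    # is the same on every platform.
    stream = io.TextIOWrapper(raw, encoding=encoding, newline="\n")
    with redirect_stdout(stream):
        func()
    # Push any buffered text through the codec into the byte buffer.
    stream.flush()
    return raw.getvalue().decode(encoding)

def report():
    print("inside: ", sys.stdout.encoding)

print("outside:", sys.stdout.encoding)
print(capture_stdout(report))
```

Inside the redirect, sys.stdout is the TextIOWrapper, so its encoding attribute reports the codec passed to capture_stdout rather than None.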
msg398542 - Author: Pierre Carbonnelle (pcarbonn) Date: 2021-07-30 12:18
I can live with the workaround, so you can close the issue if you wish.  As you say, maybe it's an issue with z3.

Thank you for your time.
History
Date                 User            Action  Args
2022-04-11 14:59:48  admin           set     github: 88937
2021-07-30 12:18:18  pcarbonn        set     messages: + msg398542
2021-07-30 12:13:58  pcarbonn        set     messages: + msg398541
2021-07-30 11:56:44  steven.daprano  set     messages: + msg398539
2021-07-30 11:09:39  pcarbonn        set     messages: + msg398536
2021-07-30 10:21:48  steven.daprano  set     nosy: + steven.daprano; messages: + msg398534
2021-07-30 08:52:55  pcarbonn        create