classification
Title: Is there a mojibake problem rendering interactive help in the REPL on Windows?
Type: behavior
Stage:
Components: Windows
Versions: Python 3.11, Python 3.10, Python 3.9
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: aroberge, docs@python, eryksun, jessevsilverman, methane, paul.moore, steve.dower, terry.reedy, tim.golden, zach.ware
Priority: normal
Keywords:

Created on 2021-05-31 16:48 by jessevsilverman, last changed 2022-04-11 14:59 by admin.

Messages (17)
msg394817 - (view) Author: Jesse Silverman (jessevsilverman) Date: 2021-05-31 16:48
I didn't know whether to file this under DOCUMENTATION or WINDOWS.
I recently discovered the joys of the interactive help in the REPL, rather than just help(whatever).
I was exploring the topics and noticed multiple encoding or rendering errors.
I realized I stupidly wasn't using the Windows Terminal program but the default console.  I addressed that and they persisted in Windows Terminal.
I upgraded from 3.9.1 to 3.9.5...same deal.

I tried running:
Set-Item -Path Env:PYTHONUTF8 -Value 1

before starting the REPL, still no dice.

I checked whether non-ASCII text renders correctly in the same session:
>>> ustr2='ʑʒʓʔʕʗʘʙʚʛʜʝʞ'
>>> ustr2
'ʑʒʓʔʕʗʘʙʚʛʜʝʞ'
It does.

The help stuff that doesn't render correctly is under topic COMPARISON:
lines 20, 21 and 25 of this output contain head-scratch-inducing mystery characters:
help> COMPARISON
Comparisons
***********

Unlike C, all comparison operations in Python have the same priority,
which is lower than that of any arithmetic, shifting or bitwise
operation.  Also unlike C, expressions like "a < b < c" have the
interpretation that is conventional in mathematics:

   comparison    ::= or_expr (comp_operator or_expr)*
   comp_operator ::= "<" | ">" | "==" | ">=" | "<=" | "!="
                     | "is" ["not"] | ["not"] "in"

Comparisons yield boolean values: "True" or "False".

Comparisons can be chained arbitrarily, e.g., "x < y <= z" is
equivalent to "x < y and y <= z", except that "y" is evaluated only
once (but in both cases "z" is not evaluated at all when "x < y" is
found to be false).

Formally, if *a*, *b*, *c*, …, *y*, *z* are expressions and *op1*,
*op2*, …, *opN* are comparison operators, then "a op1 b op2 c ... y
opN z" is equivalent to "a op1 b and b op2 c and ... y opN z", except
that each expression is evaluated at most once.

Note that "a op1 b op2 c" doesnΓÇÖt imply any kind of comparison between
*a* and *c*, so that, e.g., "x < y > z" is perfectly legal (though
perhaps not pretty).

That is: …, …, ’

An em-dash or ellipsis might be involved somehow, and maybe a typographic apostrophe?
My current guess is that this is no longer a rendering problem and that something went awry further upstream.

Thanks!
msg394818 - (view) Author: Andre Roberge (aroberge) * Date: 2021-05-31 17:00
I observe something similar, though with different symbols. My Windows installation uses French (fr-ca) as default language.
===
help> COMPARISON
Comparisons
***********

...

Formally, if *a*, *b*, *c*, à, *y*, *z* are expressions and *op1*,
*op2*, à, *opN* are comparison operators, then "a op1 b op2 c ... y
opN z" is equivalent to "a op1 b and b op2 c and ... y opN z", except
that each expression is evaluated at most once.

Note that "a op1 b op2 c" doesnÆt imply any kind of comparison between
*a* and *c*, so that, e.g., "x < y > z" is perfectly legal (though
perhaps not pretty).

So, in my case, the unusual characters are: à, Æ.  In this case, the French word 'à' would make some sense in this context (as it means 'to' in English).
msg394820 - (view) Author: Jesse Silverman (jessevsilverman) Date: 2021-05-31 17:41
I looked around some more and it definitely is not just one isolated instance.  I noted a similar issue on the lines from CLASSES topic pasted here.  I think it is all usages of the ellipsis in the context of the help text?  Maybe also fancy quote marks that didn't survive the jump from ASCII to Unicode?  And some fancy dashes.
The theme of my day was excitement at how much more documentation and help ships with the most basic Python install than I had realized, and is thus at my fingertips anywhere, everywhere, internet access or not.  This mars that exuberance only slightly.
help> CLASSES
The standard type hierarchy
***************************

Below is a list of the types that are built into Python.  Extension
modules (written in C, Java, or other languages, depending on the
implementation) can define additional types.  Future versions of
Python may add types to the type hierarchy (e.g., rational numbers,
efficiently stored arrays of integers, etc.), although such additions
will often be provided via the standard library instead.

Some of the type descriptions below contain a paragraph listing
ΓÇÿspecial attributes.ΓÇÖ  These are attributes that provide access to the
...
methodΓÇÖs documentation (same as "__func__.__doc__"); "__name__"
...
dictionary containing the classΓÇÖs namespace; "__bases__" is a tuple
   containing the base classes, in the order of their occurrence in
   the base class list; "__doc__" is the classΓÇÖs documentation string,

ΓÇ£ClassesΓÇ¥.  See section Implementing Descriptors for another way
msg394835 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2021-06-01 02:17
Do you use PowerShell?
Please run this command and paste the output.

```
PS> $OutputEncoding
PS> [System.Console]::OutputEncoding
```
msg394836 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2021-06-01 02:38
> PS> [System.Console]::OutputEncoding

The console's current output encoding is irrelevant to this problem.

In Windows, pydoc uses the old "more.com" pager with a temporary file that's encoded with the default encoding, which is the process active codepage (i.e. "ansi" or "mbcs"), unless UTF-8 mode is enabled. The "more.com" utility, however, decodes the file using the console's current input codepage from GetConsoleCP(), and then it writes the decoded text via wide-character WriteConsoleW(). 

The only supported way to query the console input codepage in the standard library is via os.device_encoding(0), and that works only if stdin isn't redirected to a file or pipe. Alternatively, ctypes can be used via ctypes.WinDLL('kernel32').GetConsoleCP(). For pydoc itself, we would need to add _winapi.GetConsoleCP(), since using ctypes in the standard library is discouraged.
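
The mismatch described in this message can be reproduced without a console, purely with codecs. The sketch below is illustrative only: the helper name `mojibake` is hypothetical, and the codepage pairs (UTF-8 written and codepage 437 read, codepage 1252 written and codepage 850 read) are assumptions about the two reporters' systems, not facts stated in this issue. Under those assumptions the output matches the characters pasted in the earlier messages.

```python
# Typographic characters that occur in the help text for COMPARISON and CLASSES.
SAMPLES = {
    "ellipsis": "\u2026",
    "right single quote": "\u2019",
    "left single quote": "\u2018",
}


def mojibake(text, written_as, read_as):
    """Encode text the way the writer would, then decode it the way the reader would."""
    return text.encode(written_as).decode(read_as, errors="replace")


# Assumption 1: the temporary file is UTF-8 (UTF-8 mode enabled) and
# the console input codepage is 437.
for name, char in SAMPLES.items():
    print("utf-8 -> cp437:", name, repr(mojibake(char, "utf-8", "cp437")))

# Assumption 2: the temporary file uses ANSI codepage 1252 and
# the console input codepage is 850.
for name, char in SAMPLES.items():
    print("cp1252 -> cp850:", name, repr(mojibake(char, "cp1252", "cp850")))
```

Because the corruption is a plain encode/decode mismatch, it does not depend on which terminal front end displays the result.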
msg394837 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2021-06-01 03:00
> In Windows, pydoc uses the old "more.com" pager with a temporary file that's encoded with the default encoding, which is the process active codepage (i.e. "ansi" or "mbcs"), unless UTF-8 mode is enabled. The "more.com" utility, however, decodes the file using the console's current input codepage from GetConsoleCP(), and then it writes the decoded text via wide-character WriteConsoleW(). 

Then we need to check `[System.Console]::InputEncoding` too. It corresponds to `GetConsoleCP()`.
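
For checking the same value from Python rather than PowerShell, here is a minimal sketch that uses the two routes mentioned earlier in the thread, ctypes and os.device_encoding(0). The function name `console_input_codepage` is hypothetical, and treating a result of 0 as "no console attached" is an assumption of this sketch.

```python
import os
import sys


def console_input_codepage():
    """Return the console input codepage as an int, or None if it is unavailable."""
    if sys.platform != "win32":
        return None  # GetConsoleCP() is a Windows API
    import ctypes

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    codepage = kernel32.GetConsoleCP()
    # Assumption: a result of 0 is treated here as "no console attached".
    return codepage or None


print("GetConsoleCP():", console_input_codepage())
# On Windows this reports the same setting as a codec name, and it
# returns None when stdin is not attached to a console or terminal.
print("os.device_encoding(0):", os.device_encoding(0))
```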
msg394838 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2021-06-01 04:02
I confirmed this fixes the mojibake:

```
PS > $OutputEncoding =  [System.Text.Encoding]::GetEncoding("UTF-8")
PS > [System.Console]::OutputEncoding = $OutputEncoding
PS > [System.Console]::InputEncoding = $OutputEncoding
```
msg394841 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2021-06-01 05:39
> PS > [System.Console]::InputEncoding = $OutputEncoding

If changing the console input codepage to UTF-8 fixes the mojibake problem, then probably you're running Python in UTF-8 mode. pydoc.tempfilepager() encodes the temporary file with the preferred encoding, which normally would not be UTF-8. There are possible variations in how your system and the console are configured, so I can't say for sure.

tempfilepager() could temporarily set the console's input codepage to UTF-8 via SetConsoleCP(65001). However, if python.exe is terminated or crashes before it can reset the codepage, the console will be left in a bad state. By bad state, I mean that leaving the input code page set to UTF-8 is broken. Legacy console applications rely on the input codepage for reading input via ReadFile() and ReadConsoleA(), but the console host (conhost.exe or openconsole.exe) doesn't support reading input as UTF-8. It simply replaces each non-ASCII character (i.e. characters that require 2-4 bytes as UTF-8) with a null byte, e.g. "abĀcd" is read as "ab\x00cd". 

If you think the risk of crashing is negligible, and the downside of breaking legacy applications in the console session is trivial, then paging with full Unicode support is easily possible. Implement _winapi.GetConsoleCP() and _winapi.SetConsoleCP(). Write UTF-8 text to the temporary file. Change the console input codepage to UTF-8 before spawning "more.com". Revert to the original input codepage in the finally block.

A more conservative fix would be to change tempfilepager() to encode the file using the console's current input codepage, GetConsoleCP(). At least there's no mojibake.
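
The temporary-codepage approach outlined above could look roughly like the following. This is a sketch under stated assumptions, not pydoc's implementation: it calls the Windows API through ctypes because the proposed _winapi.GetConsoleCP() and _winapi.SetConsoleCP() do not exist, and the name `utf8_console_input` is hypothetical.

```python
import contextlib
import sys

UTF8_CODEPAGE = 65001


@contextlib.contextmanager
def utf8_console_input():
    """Switch the console input codepage to UTF-8 for the duration of the block."""
    if sys.platform != "win32":
        yield None  # nothing to change on other platforms
        return
    import ctypes

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    old_codepage = kernel32.GetConsoleCP()
    # Give up quietly if there is no console or the switch is refused.
    if not old_codepage or not kernel32.SetConsoleCP(UTF8_CODEPAGE):
        yield None
        return
    try:
        yield old_codepage
    finally:
        # Restore the previous codepage; this cannot run if the process is killed.
        kernel32.SetConsoleCP(old_codepage)


with utf8_console_input() as previous:
    print("previous input codepage:", previous)
```

The finally block only covers ordinary exits and exceptions, so the risk described above remains: if python.exe is terminated before the block ends, the console keeps the UTF-8 input codepage.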

> PS > $OutputEncoding =  [System.Text.Encoding]::GetEncoding("UTF-8")

FYI, $OutputEncoding in PowerShell has nothing to do with the python.exe and more.com processes, nor the console session to which they're attached.

> PS > [System.Console]::OutputEncoding = $OutputEncoding

The console output code page is irrelevant since more.com writes wide-character text via WriteConsoleW() and decodes the file using the console input code page, GetConsoleCP(). The console output codepage from GetConsoleOutputCP() isn't used for anything here.
msg394845 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2021-06-01 09:17
> If changing the console input codepage to UTF-8 fixes the mojibake problem, then probably you're running Python in UTF-8 mode.

I forget where I saw them, but there are some places where we incorrectly use the stdin encoding for writing to stdout (I suppose on the assumption that they'll always match). This issue may be affected by one of those.
msg394858 - (view) Author: Jesse Silverman (jessevsilverman) Date: 2021-06-01 14:14
Thank you so much Inada and Eryk and Steve!

I was one of the people who mistakenly thought that Python 3 operating in the new Windows Terminal was going to magically leave us sitting happily in completely UTF-8 compatible territory on Windows, not realizing the complex long-term dependencies and regressions that still remain problematic.

I had spent a lot of time paying attention to the Python 2 vs. 3 debates with people shouting "I don't care about Unicode!" and mistook dedication to preventing regressions and breakages for a lesser appreciation of the value of UTF-8 support.  I have been schooled.  We all want the same thing, but getting there on Windows from where we are at the moment remains non-trivial.

Heartfelt appreciation to all on the front lines of dealing with this complex and sticky situation. ❤️❤️❤️

Also, much love to those who had put in the work to have much more help than I realized existed even when one finds oneself isolated on a single disconnected machine with only the standard docs as a guide🧭 -- I didn't realize the pages I found mojibake on even existed until this weekend.
msg394862 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2021-06-01 16:57
> I was one of the people who mistakenly thought that Python 3 operating 
> in the new Windows Terminal was going to magically leave us sitting 
> happily in completely UTF-8 compatible territory on Windows, not 
> realizing the complex long-term dependencies and regressions that 
> still remain problematic.

Windows Terminal provides a tab-based UI that supports font fallback and complex scripts, which is a significant improvement over the builtin terminal that conhost.exe provides. Each tab is a headless (conpty) console session that's hosted by an instance of OpenConsole.exe. The latter is based on the same source tree as the system conhost.exe, but typically it's a more recent build. The console host is what implements most of the API for client applications. That the host is in headless mode and connected to an alternate terminal doesn't matter from the perspective of client applications.

The console has been Unicode (UCS-2) since its initial release in 1993. Taking advantage of this requires reading and writing wide-character strings via ReadConsoleW and WriteConsoleW, as Python does in 3.6+ (except not for os.read and os.write). Many console applications instead use encoded byte strings with ReadFile / ReadConsoleA and WriteFile / WriteConsoleA. Updating the console host to support UTF-8 for this has been a drawn-out process. It finally has full support for writing UTF-8 in Windows 10, including splitting a sequence across multiple writes. But reading non-ASCII UTF-8 is still broken.

"more.com" uses the console input codepage to decode the file, so a workaround is to run `chcp.com 65001` and run Python in UTF-8 mode, e.g. `py -X utf8=1`. Since reading non-ASCII UTF-8 is broken, you'll have to switch back to the old input codepage if you need to enter non-ASCII characters in an app that reads from the console via ReadFile or ReadConsoleA.
msg394869 - (view) Author: Jesse Silverman (jessevsilverman) Date: 2021-06-01 20:27
> "more.com" uses the console input codepage to decode the file, so a workaround is to run `chcp.com 65001` and run Python in UTF-8 mode, e.g. `py -X utf8=1`. Since reading non-ASCII UTF-8 is broken, you'll have to switch back to the old input codepage if you need to enter non-ASCII characters in an app that reads from the console via ReadFile or ReadConsoleA.

Confirmed that this workaround done in Windows Terminal causes all mojibake to immediately evaporate, leaving me with the most readable and enjoyable more/console experience I have ever had since first hitting a spacebar on MS-DOS.
(Windows Terminal and the open-sourcing of the CONSOLE code is in a three-way tie with open-sourcing of .Net Core and the C++ STL for changing how I feel about Windows.  I keep finding new reasons to love it, except for reading non-ASCII UTF-8 being broken which I just learned about today.)
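
When trying the workaround, it can help to confirm from inside the interpreter that UTF-8 mode is actually active, since that determines the default encoding used for text files such as the pager's temporary file. A small sketch; the function name `pager_file_encoding` is hypothetical, and the exact spelling of the reported encoding name may differ between Python versions.

```python
import locale
import sys


def pager_file_encoding():
    """Report whether UTF-8 mode is on and which encoding a default open() would use."""
    return {
        "utf8_mode": bool(sys.flags.utf8_mode),
        "preferred_encoding": locale.getpreferredencoding(False),
    }


print(pager_file_encoding())
```

Started with `py -X utf8=1`, this should report UTF-8 mode as enabled and a UTF-8 preferred encoding; without it, Windows normally reports the ANSI codepage instead.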
msg394877 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2021-06-01 22:26
>> PS > $OutputEncoding =  [System.Text.Encoding]::GetEncoding("UTF-8")

> FYI, $OutputEncoding in PowerShell has nothing to do with the python.exe and more.com processes, nor the console session to which they're attached.

>> PS > [System.Console]::OutputEncoding = $OutputEncoding

> The console output code page is irrelevant since more.com writes wide-character text via WriteConsoleW() and decodes the file using the console input code page, GetConsoleCP(). The console output codepage from GetConsoleOutputCP() isn't used for anything here.

Yes, both are unrelated to this specific issue. But both are highly recommended for using Python with UTF-8 mode.

Now many people want to use Python with UTF-8 mode in PowerShell in Windows Terminal. And they don't want to care about legacy encoding at all.
msg394890 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2021-06-02 05:21
> Now many people want to use Python with UTF-8 mode in PowerShell 
> in Windows Terminal. And they don't want to care about legacy 
> encoding at all.

Windows Terminal isn't relevant to the encoding issue. Applications interact with the console session to which they're attached, e.g. python.exe <-> condrv.sys <-> conhost.exe (openconsole.exe). It doesn't matter whether or not the console session is headless (conpty) and connected to another process via pipes (i.e. a console session created via CreatePseudoConsole and set for a child process via PROC_THREAD_ATTRIBUTE_PSEUDOCONSOLE). Some settings and behaviors do change in conpty mode (e.g. the session has virtual-terminal mode enabled by default), but the way encoding and decoding are implemented for ReadFile/ReadConsoleA and WriteFile/WriteConsoleA doesn't change.

There are still a lot of programs that read and write from the console as a regular file via ReadFile and WriteFile, so the fact that ReadFile is broken when the input code page is set to UTF-8 is relevant to most people. However, people who run `chcp 65001` in Western locales usually only care about being able to write non-ASCII UTF-8 via WriteFile. Reading non-ASCII UTF-8 from console input via ReadFile doesn't come up as a common problem, but it definitely is a problem. For example:

    >>> s = os.read(0, 12)
    Привет мир
    >>> s
    b'\x00\x00\x00\x00\x00\x00 \x00\x00\x00\r\n'

Thus I don't like 'solving' this mojibake issue by simply recommending that users set the console input codepage to UTF-8. I previously proposed two solutions. (1) a radical change to get full Unicode support: modify pydoc to temporarily change the console input codepage to UTF-8 and write the temp file as UTF-8. (2) a conservative change just to avoid mojibake: modify pydoc to query the console input codepage and write the file using that encoding, as always with the backslashreplace error handler.
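
As an illustration of option (2), the sketch below encodes text for a given console codepage with the backslashreplace error handler. It is not a patch for pydoc: the function name `encode_for_pager` is hypothetical, and codepage 437 is only an example value.

```python
def encode_for_pager(text, console_codepage):
    """Encode text for a pager that decodes with the given console codepage."""
    encoding = f"cp{console_codepage}"
    # Characters the codepage cannot represent become \uXXXX escapes
    # instead of bytes that decode to unrelated characters.
    return text.encode(encoding, errors="backslashreplace")


data = encode_for_pager("doesn\u2019t \u2026 caf\u00e9", 437)
print(data)
print(data.decode("cp437"))
```

Characters that exist in the codepage survive the round trip unchanged, while the typographic apostrophe and the ellipsis show up as readable escapes rather than mojibake.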
msg395114 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2021-06-04 19:41
Jesse or Andre, please test interactive help in IDLE as pydoc then knows that it is *not* talking to Windows console.  It then does not use more.com but prints the entire text at once.  It should send it 'as is' and should ignore the console language and encoding settings.  Use a Start menu icon or

> pyw -m idlelib

'py' works too.  It blocks the console, but it displays the occasional tk/tkinter/IDLE error message, because sys.__stderr__, etc., are then the console streams instead of None.

Long output such as the 240 lines for COMPARISON is, by default, 'squeezed' to a little box.  Double click to expand or right click to move it to a separate window.

If there is still a problem with garbage in the text, we should try to fix it.

PS: one can select more than one Component, but Documentation usually refers to the content, not the means of displaying it.
msg395115 - (view) Author: Andre Roberge (aroberge) * Date: 2021-06-04 19:54
Terry: I just checked with Idle on Windows with Python 3.9.5 and the display works perfectly, with no incorrect characters.
msg395140 - (view) Author: Jesse Silverman (jessevsilverman) Date: 2021-06-04 23:38
As Andre noted, it is good in IDLE.
I also realize how convenient it is to read the real docs from there.
I learned a lot about the state of console programming on Windows, in and out of Python, but I have no problem using IDLE when on Windows.
History
Date                 User             Action  Args
2022-04-11 14:59:46  admin            set     github: 88441
2021-06-04 23:38:26  jessevsilverman  set     messages: + msg395140
2021-06-04 19:54:32  aroberge         set     messages: + msg395115
2021-06-04 19:41:45  terry.reedy      set     nosy: + terry.reedy; messages: + msg395114
2021-06-02 05:21:08  eryksun          set     messages: + msg394890
2021-06-01 22:26:49  methane          set     messages: + msg394877
2021-06-01 20:27:47  jessevsilverman  set     messages: + msg394869
2021-06-01 16:57:01  eryksun          set     messages: + msg394862
2021-06-01 14:14:47  jessevsilverman  set     messages: + msg394858
2021-06-01 09:17:37  steve.dower      set     assignee: docs@python ->; messages: + msg394845; versions: + Python 3.10, Python 3.11
2021-06-01 05:39:20  eryksun          set     messages: + msg394841
2021-06-01 04:02:14  methane          set     messages: + msg394838
2021-06-01 03:00:37  methane          set     messages: + msg394837
2021-06-01 02:38:06  eryksun          set     nosy: + eryksun; messages: + msg394836
2021-06-01 02:17:57  methane          set     nosy: + methane; messages: + msg394835
2021-06-01 01:20:41  ned.deily        set     nosy: + paul.moore, tim.golden, zach.ware, steve.dower; components: + Windows, - Documentation
2021-05-31 17:41:24  jessevsilverman  set     messages: + msg394820
2021-05-31 17:00:23  aroberge         set     nosy: + aroberge; messages: + msg394818
2021-05-31 16:48:01  jessevsilverman  create