Python interactive console doesn't use sys.stdin for input #61820

Open
Drekin mannequin opened this issue Apr 2, 2013 · 29 comments
Labels
3.8 only security fixes 3.9 only security fixes 3.10 only security fixes type-feature A feature request or enhancement

Comments

@Drekin
Mannequin

Drekin mannequin commented Apr 2, 2013

BPO 17620
Nosy @gvanrossum, @birkenfeld, @pfmoore, @ncoghlan, @pitrou, @benjaminp, @merwok, @zooba

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2013-04-02.16:59:14.835>
labels = ['type-feature', '3.8', '3.9', '3.10']
title = "Python interactive console doesn't use sys.stdin for input"
updated_at = <Date 2021-03-15.21:01:39.375>
user = 'https://bugs.python.org/Drekin'

bugs.python.org fields:

activity = <Date 2021-03-15.21:01:39.375>
actor = 'eryksun'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = []
creation = <Date 2013-04-02.16:59:14.835>
creator = 'Drekin'
dependencies = []
files = []
hgrepos = []
issue_num = 17620
keywords = []
message_count = 29.0
messages = ['185848', '186121', '186553', '186576', '186580', '193815', '221176', '221179', '223414', '224312', '224313', '224330', '224334', '224338', '224396', '224397', '226021', '226098', '226099', '226100', '226126', '226140', '226933', '234439', '242173', '255549', '272681', '272683', '275242']
nosy_count = 10.0
nosy_names = ['gvanrossum', 'georg.brandl', 'paul.moore', 'ncoghlan', 'pitrou', 'benjamin.peterson', 'eric.araujo', 'tshepang', 'Drekin', 'steve.dower']
pr_nums = []
priority = 'normal'
resolution = None
stage = 'needs patch'
status = 'open'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue17620'
versions = ['Python 3.8', 'Python 3.9', 'Python 3.10']

@Drekin
Mannequin Author

Drekin mannequin commented Apr 2, 2013

The Python interactive console doesn't actually use sys.stdin for input; it uses the standard C stdin. Is there any reason for this? Why does it then use sys.stdin's encoding attribute? (Assigning sys.stdin something that doesn't have an encoding attribute freezes the interpreter.) If anything, wouldn't it make more sense for it to use sys.__stdin__.encoding instead of sys.stdin.encoding? sys.stdin is intended to be set by the user (it affects input() and code.interact(), which tries to mimic the standard interactive console).

@Drekin Drekin mannequin added the type-bug An unexpected behavior, bug, or error label Apr 2, 2013
@Drekin
Mannequin Author

Drekin mannequin commented Apr 6, 2013

Sorry for typos.
• interactive console doesn't use sys.stdin for input, why?
• it uses sys.stdin.encoding, shouldn't it rather use sys.__stdin__.encoding if anything?
• input() and hence code.interact() uses sys.stdin

@pitrou pitrou removed their assignment Apr 6, 2013
@pitrou
Member

pitrou commented Apr 11, 2013

• interactive console doesn't use sys.stdin for input, why?

Modules/main.c calls PyRun_AnyFileFlags(stdin, "<stdin>", ...). At this point, sys.stdin *is* the same as C stdin by construction, so I'm not sure how you came to encounter the issue.

However, it's also true that if you later redirect sys.stdin, it will be ignored and the original C stdin (as passed to PyRun_InteractiveLoopFlags) will continue to be used. On the other hand, the input() implementation has dedicated logic to find out whether sys.stdin is the same as C stdin.
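The asymmetry described here can be observed from Python itself. The following is only a minimal sketch, not something taken from the thread, and the helper name `read_with_replaced_stdin` is made up for illustration. It shows that input() follows a replaced sys.stdin, which is exactly what the interactive loop does not do.

```python
import io
import sys


def read_with_replaced_stdin(text):
    """Call input() while sys.stdin is temporarily an in-memory stream."""
    old_stdin, old_stdout = sys.stdin, sys.stdout
    sys.stdin = io.StringIO(text)
    sys.stdout = io.StringIO()  # swallow the prompt that input() writes
    try:
        # StringIO has no usable fileno(), so input() cannot treat it as
        # the C stdin and falls back to calling sys.stdin.readline().
        return input("prompt> ")
    finally:
        sys.stdin, sys.stdout = old_stdin, old_stdout


line = read_with_replaced_stdin("hello\n")
```

Doing the same assignment to sys.stdin inside the REPL does not change where the next statement is read from, because the loop keeps using the FILE* it was started with.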

(by the way, the issue should also apply to 2.7)

• it uses sys.stdin.encoding, shouldn't it rather use sys.__stdin__.encoding if anything?

Assuming the previous bug gets fixed, then no :-)

@Drekin
Mannequin Author

Drekin mannequin commented Apr 11, 2013

I encountered it when I changed sys.stdin at runtime (I thought it was a supported feature) to affect the interactive console, see http://bugs.python.org/issue1602 .

@pitrou
Member

pitrou commented Apr 11, 2013

Ok, I guess it would need a new API (PyRun_Stdio()?) to run the interactive loop from sys.stdin, rather than from a fixed FILE*.

@pitrou pitrou added type-feature A feature request or enhancement and removed type-bug An unexpected behavior, bug, or error labels Apr 11, 2013
@Drekin
Mannequin Author

Drekin mannequin commented Jul 28, 2013

Is there any chance the API will be added and used by python.exe?

@ncoghlan
Contributor

Steve, another one to look at in the context of improving the Unicode handling situation at the Windows command prompt.

@zooba
Member

zooba commented Jun 21, 2014

Thanks Nick, but this has a pretty clear scope that may help the Unicode situation in cmd but doesn't directly relate to it.

@Drekin
Mannequin Author

Drekin mannequin commented Jul 18, 2014

There is still the serious inconsistency that sys.stdin is not used for input by the interactive loop, but its encoding is. So if I replace sys.stdin with a custom object that has its own encoding attribute, the standard interactive loop tries to use this encoding, which may result in an exception on any input.

@gvanrossum
Member

Is this at all related to the use of GNU readline?

@pitrou
Member

pitrou commented Jul 30, 2014

Yes, it is. GNU readline will use a FILE*. Apparently, one can customize this behaviour, see http://cnswww.cns.cwru.edu/php/chet/readline/readline.html#SEC25

"""Variable: rl_getc_func_t * rl_getc_function
If non-zero, Readline will call indirectly through this pointer to get a character from the input stream. By default, it is set to rl_getc, the default Readline character input function (see section 2.4.8 Character Input). In general, an application that sets rl_getc_function should consider setting rl_input_available_hook as well. """

It is not obvious how that interacts with special keys, e.g. arrows.

@gvanrossum
Member

I propose not to mess with GNU readline. But that doesn't mean we can't try to fix this issue by detecting that sys.stdin has changed and use it if it isn't referring to the original process stdin. It will be tricky however to make sure nothing breaks.
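One conceivable form of the detection mentioned here, expressed in Python rather than in the C code where it would actually have to live, is an identity check against sys.__stdin__. This is an assumption about how such a check could look, not a description of anything CPython does; `stdin_was_replaced` is a hypothetical name.

```python
import io
import sys


def stdin_was_replaced():
    """Return True if sys.stdin is no longer the object created at startup.

    This is an identity check only: a wrapper around the very same file
    descriptor still counts as "replaced".
    """
    return sys.stdin is not sys.__stdin__


# Demonstration: swap in an in-memory stream, check, then restore.
saved_stdin = sys.stdin
sys.stdin = io.StringIO("")
try:
    replaced_during_swap = stdin_was_replaced()
finally:
    sys.stdin = saved_stdin
```

The tricky part raised in the comment remains: a replacement object may still wrap FD 0, so identity alone does not say whether GNU readline could safely keep reading from the original stdin.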

(The passage quoted from the GNU readline docs seems to imply that it's in non-blocking mode, and that the FD is a raw tty device, probably with echo off. It will give escape sequences for e.g. arrow keys.)

@Drekin
Mannequin Author

Drekin mannequin commented Jul 30, 2014

My naive picture of the ideal situation looks like this: when the interactive loop wants input, it just calls sys.stdin.readline, which delegates to sys.stdin.buffer.raw.readinto or .read, and these can use GNU readline, if available, to get the data. May I ask what's wrong with my picture?

@gvanrossum
Member

sys.stdin.readline() never delegates to GNU readline. The REPL calls GNU readline directly. There's clearly some condition that determines whether to call GNU readline or sys.stdin.readline, but it may not correspond to what you want (e.g. it may just test whether FD 0 is a tty). Can you find in the CPython source code where this determination is made?

@Drekin
Mannequin Author

Drekin mannequin commented Jul 31, 2014

I looked into the source code and found the following.

First, the code path by which the interactive loop gets its input is as follows:
Python/pythonrun.c:PyRun_InteractiveLoopFlags
Python/pythonrun.c:PyRun_InteractiveOneObject
Python/pythonrun.c:PyParser_ASTFromFileObject
Parser/parsetok.c:PyParser_ParseFileObject
Parser/parsetok.c:parsetok
Parser/tokenizer.c:PyTokenizer_Get
Parser/tokenizer.c:tok_get
Parser/tokenizer.c:tok_nextc
Parser/myreadline.c:PyOS_Readline OR Parser/tokenizer.c:decoding_fgets

PyRun_InteractiveOneObject tries to get the input encoding via sys.stdin.encoding. The encoding is then passed along and finally stored in a tokenizer object. It is the tok_nextc function that gets the input. If the prompt is not NULL, it gets the data via PyOS_Readline and uses the encoding to recode it to UTF-8. This is unfortunate, since the encoding, which originates in sys.stdin.encoding, may have nothing to do with the data returned by PyOS_Readline. Also note that the stdin argument to PyOS_Readline is hardcoded, but tok->fp == stdin probably holds, so it doesn't matter.
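To make the recoding problem concrete, here is a small Python model of the step "decode with the declared encoding, re-encode to UTF-8". It is only an illustration of the failure mode; `recode_to_utf8` is a made-up name and does not correspond to a function in the tokenizer.

```python
def recode_to_utf8(raw_line, declared_encoding):
    """Decode bytes with the declared encoding, then re-encode as UTF-8."""
    return raw_line.decode(declared_encoding).encode("utf-8")


# Bytes as a UTF-8 terminal might deliver them through the C-level reader.
raw = "x = 'é'\n".encode("utf-8")

# Declared encoding matches the data: the line survives intact.
same = recode_to_utf8(raw, "utf-8")

# Declared encoding does not match: decoding can fail outright...
try:
    recode_to_utf8(raw, "ascii")
except UnicodeDecodeError:
    mismatch_failed = True
else:
    mismatch_failed = False

# ...or succeed and silently produce different text.
garbled = recode_to_utf8(raw, "latin-1")
```

Either outcome is possible in the REPL once sys.stdin.encoding comes from an object that has nothing to do with the bytes PyOS_Readline returns.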

If the prompt in tok_nextc is NULL, then the data are obtained by the decoding_fgets function, which uses either fp_readl > tok->decoding_readline or Objects/fileobject.c:Py_UniversalNewlineFgets, depending on the tokenizer state. The tok->decoding_readline handler may be set to io.open("isisOOO", fileno(tok->fp), …) (I have no idea what "isisOOO" might be).

The PyOS_Readline function calls either PyOS_StdioReadline or the function pointed to by PyOS_ReadlineFunctionPointer, which by default is again PyOS_StdioReadline but is usually set to support GNU readline by the code in Modules/readline.c. The PyOS_StdioReadline function uses my_fgets, which calls fgets.

Now for what the input() function does. input is implemented as Python/bltinmodule.c:builtin_input. It tests whether we are on a tty by comparing sys.stdin.fileno() to fileno(stdin) and testing isatty. Note that this may not be enough: if I install a custom sys.stdin but let it have the standard fileno, then the test may succeed. If we are on a tty, then PyOS_Readline is used (again together with sys.std*.encoding); if we aren't, then Objects/fileobject.c:PyFile_WriteObject > sys.stdout.write (for the prompt) and :PyFile_GetLine > sys.stdin.readline are used.
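The tty test can be approximated in Python as below. This is a rough model under assumptions, not the C code itself: the function name `input_would_use_c_stdin` is invented, the file descriptors 0 and 1 stand in for fileno(stdin) and fileno(stdout), and the matching check on sys.stdout is included on the assumption that builtin_input treats both streams the same way.

```python
import io
import os
import sys


def input_would_use_c_stdin(stdin=None, stdout=None):
    """Approximate the check input() makes before calling PyOS_Readline."""
    stdin = sys.stdin if stdin is None else stdin
    stdout = sys.stdout if stdout is None else stdout

    def is_std_tty(stream, expected_fd):
        try:
            fd = stream.fileno()
        except (AttributeError, OSError, ValueError):
            # No usable descriptor: cannot be the C-level stream.
            return False
        return fd == expected_fd and os.isatty(fd)

    return is_std_tty(stdin, 0) and is_std_tty(stdout, 1)


class FakeStdin(io.StringIO):
    """Custom stream that claims the standard descriptor, as in the note above."""

    def fileno(self):
        return 0


# An in-memory stream never takes the PyOS_Readline path.
plain = input_would_use_c_stdin(io.StringIO("x\n"), io.StringIO())

# With FakeStdin the answer depends on whether FD 0 and FD 1 are terminals,
# so the custom object's readline() may be bypassed entirely.
fooled = input_would_use_c_stdin(FakeStdin("x\n"), None)
```

The second call is the case described in the paragraph: the object is custom, but because its fileno() matches, the check cannot tell it apart from the real stdin.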

As we can see, the API is rather FILE*-based. The only places where sys.std* objects are used are one branch of builtin_input and the lookup of the encoding used in the tokenizer. Would it be possible to configure the tokenizer so that it uses sys.stdin.readline for input, and also to rewrite builtin_input to always use sys.std*? Then it would be the responsibility of the sys.stdin.buffer.raw.read* methods to decide whether to use GNU readline, whatever PyOS_Readline uses, or something else (e.g. ReadConsoleW on a Windows tty), and also to check for Ctrl-C afterwards.

@Drekin
Mannequin Author

Drekin mannequin commented Aug 28, 2014

I have found another example of where the current interaction between readline and the Python core leads to confusion. It started with the following report on my package: Drekin/win-unicode-console#2 .

Basically, the IPython interactive console on Windows uses the pyreadline package, which provides GNU readline functionality. To get input from the user, it just calls input(prompt). input() calls readline both for writing the prompt and for reading the input. It interprets ANSI control sequences, so a colored prompt is displayed rather than garbage, and when the user types, things like auto-completion work. sys.stdin is not used at all and points to the standard object.

One easily gets the impression that since sys.stdin is bypassed, changing it doesn't matter, but it actually does. With a changed sys.stdin, input() now uses it rather than readline, and ANSI control sequences result in a mess. See ipython/ipython#17 (comment) .

I just think it would be better if input() always delegated to sys.stdin and print() to sys.stdout, and this was the standard way to interact with the terminal. It would then be the responsibility of the sys.std* objects to do the right thing: to read from a file, to delegate to readline, to interact directly with the console in some way, to interpret the ANSI control sequences or not.

Solving issues like bpo-1602 or bpo-18597, or adding readline support to Windows, would then be just a matter of providing the right sys.std* implementation.

@Drekin
Mannequin Author

Drekin mannequin commented Aug 29, 2014

I realized that the behavior I want can be achieved by setting PyOS_ReadlineFunctionPointer to a function calling sys.stdin.readline(). However, I found another problem: the Python REPL just doesn't work when sys.stdin.encoding is UTF-16-LE. The tokenizer (Parser/tokenizer.c:tok_nextc) reads a line using PyOS_Readline and then tries to recode it to UTF-8. The problem is that PyOS_Readline returns just a plain *char, and strlen() is used to determine its length when decoding, which makes no sense for a UTF-16-LE encoded line, since it's full of null bytes.

Why does PyOS_Readline return *char, rather than Python string object? When PyOS_ReadlineFunctionPointer points to something producing a Unicode string (e.g. my new approach to solving bpo-1602, or the pyreadline package), the string must be encoded and cast to *char to be returned from PyOS_Readline; then it is decoded by the tokenizer and encoded again to UTF-8.
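The strlen() problem is easy to reproduce in Python by counting the bytes before the first NUL, which is what a C strlen() reports. The helper `c_strlen` is a stand-in written for this illustration, not an existing API.

```python
def c_strlen(buf):
    """Length a C strlen() would report: the bytes before the first NUL."""
    end = buf.find(b"\0")
    return len(buf) if end == -1 else end


line = "print(1)\n"

# UTF-8 has no NUL bytes for this text, so the whole line is seen.
utf8_len = c_strlen(line.encode("utf-8"))

# In UTF-16-LE every ASCII character is followed by a NUL byte, so a
# NUL-terminated reading stops after the very first byte.
utf16_bytes = line.encode("utf-16-le")
utf16_len = c_strlen(utf16_bytes)
```

Of the 18 bytes in the UTF-16-LE line, only the first one would be handed to the decoder, so the tokenizer never sees a complete statement.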

@pitrou
Member

pitrou commented Aug 29, 2014

Why does PyOS_Readline return *char, rather than Python string object?

For historical reasons and now for compatibility: we can't change the hook's signature without breaking obvious applications, obviously.
If necessary, we could add a new hook that would take precedence over the old one if defined. Feel free to post a patch for that.

@pitrou
Member

pitrou commented Aug 29, 2014

without breaking obvious applications

without breaking *existing* applications ;-)

@vstinner
Member

The Python parser works well with UTF8. If you know the encoding, decode from your encoding and encode to UTF8. You should pass the UTF8 flag to the parser.

@Drekin
Mannequin Author

Drekin mannequin commented Aug 30, 2014

Antoine Pitrou: I understand. It would be nice to have that new Python string based readline hook. Its default implementation could be to call PyOS_Readline and decode the bytes using sys.stdin.encoding (as the tokenizer currently does). The tokenizer then wouldn't need to decode if it called the new hook.

Victor Stinner: I'm going to try the approach of reencoding my stream to UTF-8. So then my UTF-16-LE encoded stream is decoded, then encoded to UTF-8, interpreted as null-terminated *char, which is returned to the tokenizer, which again decodes it and encodes to UTF-8. I wonder if the last step could be short-circuited. What is this UTF8 flag to Python parser? I couldn't find any information.

@Drekin
Mannequin Author

Drekin mannequin commented Sep 15, 2014

I have found another problem. PyOS_Readline can be called from two different places – from Parser/tokenizer.c:tok_nextc (by REPL), which uses sys.stdin.encoding to encode prompt argument, and from Python/bltinmodule.c:builtin_input_impl (by input() function), which uses sys.stdout.encoding. So readline hook cannot be implemented correctly if sys.stdin and sys.stdout don't have the same encoding.

Either the tokenizer should have two encodings, one for input and one for output, or better, it should have no encoding at all and use a Python string based alternative to PyOS_Readline, which could be added.
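Seen from the hook's side, the problem is that the same prompt arrives as different bytes depending on the caller, and nothing in the call says which encoding was used. A small sketch, with encodings chosen arbitrarily to stand in for sys.stdin.encoding and sys.stdout.encoding, and with `decode_prompt` as a hypothetical helper:

```python
def decode_prompt(prompt_bytes, assumed_encoding):
    """What a bytes-based hook has to do: guess an encoding for the prompt."""
    return prompt_bytes.decode(assumed_encoding, errors="replace")


prompt = "λ> "

# Stand-in for the tokenizer, which encodes with sys.stdin.encoding.
from_tokenizer = prompt.encode("utf-8")

# Stand-in for input(), which encodes with sys.stdout.encoding.
from_input = prompt.encode("iso-8859-7")

# A hook that assumes UTF-8 handles one caller and garbles the other.
seen_from_tokenizer = decode_prompt(from_tokenizer, "utf-8")
seen_from_input = decode_prompt(from_input, "utf-8")
```

Whichever encoding the hook assumes, one of the two callers gets a wrong prompt as soon as the encodings differ.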

@Drekin
Mannequin Author

Drekin mannequin commented Jan 21, 2015

Unfortunately, I have little or no experience with Python's C code, and I don't even have a C compiler installed, so I cannot experiment. I'll just put my ideas on how to solve this here.

• Add a sys.__readlinehook__ attribute, which can be set to a function taking a prompt string and returning a line.
• Add a C function PyOS_UnicodeReadline (possibly with a better name) which has the same signature as sys.__readlinehook__ (in contrast with the signature of PyOS_Readline). If sys.__readlinehook__ is set, call it; otherwise encode the prompt string using the stdout encoding, delegate to PyOS_Readline, and decode the returned string using the stdin encoding.
• Change the tokenizer and the implementation of input() so that they use PyOS_UnicodeReadline rather than PyOS_Readline.

This would solve the problem that a UTF-16 encoded string cannot be given to the tokenizer, and would also bypass the silent assumption that the stdin and stdout encodings are the same. Also, the readline hook could easily be set from Python code, with no need for ctypes. The pyreadline package could use this. Also, issue bpo-1602 could then be solved just by changing the sys.std* streams and providing a trivial sys.__readlinehook__ delegating to sys.stdout.write and sys.stdin.readline.
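The dispatch described in these bullets could be modelled in Python roughly as follows. Everything here is hypothetical: sys.__readlinehook__ and PyOS_UnicodeReadline are proposals from this comment, not existing APIs, so the hook is passed in as a plain argument and `bytes_readline` stands in for the bytes-based PyOS_Readline.

```python
import io
import sys


def stdio_readline_hook(prompt):
    """The 'trivial' hook: write the prompt to sys.stdout, read from sys.stdin."""
    sys.stdout.write(prompt)
    sys.stdout.flush()
    return sys.stdin.readline()


def unicode_readline(prompt, bytes_readline, hook=None,
                     stdin_encoding="utf-8", stdout_encoding="utf-8"):
    """Model of the proposed str-based entry point (all names hypothetical)."""
    if hook is not None:
        # A hook is installed: it works purely with str, no encodings involved.
        return hook(prompt)
    # No hook: fall back to the bytes API, encoding the prompt with the
    # output encoding and decoding the result with the input encoding.
    raw = bytes_readline(prompt.encode(stdout_encoding))
    return raw.decode(stdin_encoding)


# Fallback path, with a fake bytes-level reader.
fallback_line = unicode_readline(">>> ", lambda prompt_bytes: b"y = 2\n")

# Hook path, with sys.stdin and sys.stdout temporarily replaced.
saved_streams = sys.stdin, sys.stdout
sys.stdin, sys.stdout = io.StringIO("x = 1\n"), io.StringIO()
try:
    hooked_line = unicode_readline(">>> ", None, hook=stdio_readline_hook)
    written_prompt = sys.stdout.getvalue()
finally:
    sys.stdin, sys.stdout = saved_streams
```

In this model the two encodings are separate parameters, which is the point of the proposal: the caller never has to pretend that stdin and stdout share one encoding.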

@Drekin
Mannequin Author

Drekin mannequin commented Apr 28, 2015

Note that under the status quo PyOS_Readline is called from two places: the tokenizer during an interactive session and the builtin function input. The tokenizer passes the prompt string encoded in sys.stdin.encoding, while input() passes the prompt string encoded in sys.stdout.encoding, so it is not possible to implement a readline hook correctly when the encodings differ. This might be considered a bug.

@Drekin
Mannequin Author

Drekin mannequin commented Nov 28, 2015

I've formulated a proposal regarding this issue: https://mail.python.org/pipermail/python-dev/2015-November/142246.html . Does it make sense?

@zooba
Member

zooba commented Aug 14, 2016

I'm working on this as part of my fix for bpo-1602. Not yet sure how this will come out - compatibility with GNU readline seems to be the biggest issue, as if we want to keep that then we can't allow embedded '\0' in the encoded text (i.e. UTF-16 cannot be used, which implies that sys.stdin.encoding cannot always be used directly).

Adding __readlinehook__ as an alternative may be feasible, but a decent amount of work given how we call into the current readline implementation. Unfortunately, it looks like detecting when a readline hook has been added is going to involve significant changes to the tokenizer, which I really don't want to do.

The easiest approach wrt bpo-1602 seems to be to special case the console by reencoding from utf-16-le to utf-8 and forcing the encoding in the tokenizer to utf-8 (instead of sys.stdin.encoding) in this case. I'll start here so that at least we can parse Unicode from the interactive prompt.

@zooba zooba self-assigned this Aug 14, 2016
@Drekin
Mannequin Author

Drekin mannequin commented Aug 14, 2016

Unfortunately, it looks like detecting when a readline hook has been added is going to involve significant changes to the tokenizer, which I really don't want to do.

We don't need to detect the presence of a readline hook; there could simply always be one. Whenever we have interactive stdio, and so PyOS_Readline would be called, the newly proposed API PyIO_Readline would be called instead. This would return a Unicode str as a PyObject *, so the result can be returned directly by input() and should somehow be encoded afterwards by the tokenizer (these are the only consumers of PyOS_Readline).

We may even leave the tokenizer alone and redefine PyOS_Readline as a wrapper around PyIO_Readline, having full control of the encoding process there. So it would be enough to set up the tokenizer with UTF-8 encoding, despite the fact that sys.std*.encoding would be UTF-16.

(I hope that if the tokenizer were designed nowadays, it would operate on strings rather than bytes, so there wouldn't be any encoding problems at all.)

Also, third parties would benefit from sys.readlinehook – at least win_unicode_console and pyreadline would just set the attribute rather than messing with ctypes.
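The wrapper idea, with PyOS_Readline redefined on top of a str-based reader, can be sketched in Python like this. PyIO_Readline is only a proposed name from this comment; in the sketch `str_readline` plays its role and `legacy_readline` plays the role of the redefined PyOS_Readline. The NUL check reflects the constraint zooba mentioned for *char results.

```python
def legacy_readline(prompt_bytes, str_readline, prompt_encoding="utf-8"):
    """Bytes-based wrapper around a str-based reader (names are hypothetical)."""
    prompt = prompt_bytes.decode(prompt_encoding)
    line = str_readline(prompt)
    # Always hand UTF-8 to the tokenizer, whatever sys.std*.encoding is.
    encoded = line.encode("utf-8")
    if b"\0" in encoded:
        raise ValueError("embedded NUL cannot be returned through a *char API")
    return encoded


seen_prompts = []


def fake_str_readline(prompt):
    """Stand-in for a console reader that already produces str."""
    seen_prompts.append(prompt)
    return "print('ž')\n"


result = legacy_readline(b">>> ", fake_str_readline)
```

With this arrangement the only encode step happens in one place, and the tokenizer can be configured for UTF-8 unconditionally.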

@zooba
Member

zooba commented Sep 9, 2016

Unassigning this. I meant to close it with another fix, but that would be wrong as we really ought to keep this open until we solve it properly. All I've done is make it use the right APIs on Windows, but we still don't handle it properly when we change stdin.

@zooba zooba removed their assignment Sep 9, 2016
@eryksun eryksun added 3.8 only security fixes 3.9 only security fixes 3.10 only security fixes labels Mar 15, 2021
@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022

6 participants