This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer's Guide.

classification
Title: Add encoding & errors parameters to TextIOWrapper.reconfigure()
Type: enhancement
Stage: resolved
Components:
Versions: Python 3.7

process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: Arfrever, berker.peksag, ishimoto, jwilk, loewis, martin.panter, methane, mrabarnett, ncoghlan, nikratio, pitrou, quad, rurpy2, serhiy.storchaka, vstinner
Priority: normal
Keywords: patch

Created on 2012-06-28 10:49 by ncoghlan, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description
set_encoding.patch vstinner, 2012-08-07 02:01
set_encoding-2.patch vstinner, 2012-08-09 20:42
set_encoding-3.patch nikratio, 2014-02-02 02:33
set_encoding-4.patch nikratio, 2014-02-02 05:15
set_encoding-5.patch nikratio, 2014-02-02 21:51
set_encoding-6.patch nikratio, 2014-02-04 04:23
set_encoding-7.patch methane, 2017-01-08 02:07
set_encoding-newline.patch martin.panter, 2017-01-11 10:05 Allow setting newline
set_encoding-8.patch methane, 2017-01-11 12:50
Pull Requests
URL Status Linked
PR 199 closed methane, 2017-02-20 13:05
PR 2343 merged methane, 2017-06-23 04:39
Messages (55)
msg164239 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2012-06-28 10:49
As discussed on python-ideas, swapping out the sys.std* streams can be fraught with peril. This can make writing code that wants to support an "--encoding" option for pipeline processing difficult.

The proposal [1] is that TextIOWrapper support a set_encoding() method that is only supported between creation of the stream and the first read or write operation. Once the stream is "dirty", you can no longer change the encoding - you must replace the stream (e.g. by passing stream.fileno() to open()).

[1] http://mail.python.org/pipermail/python-ideas/2012-June/015535.html
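For illustration, here is a minimal sketch of that stream-replacement workaround (reopening the same file descriptor via fileno(); the function name and encoding argument are only placeholders):

    import sys

    def replace_stdout(encoding):
        # Reopen file descriptor 1 with the requested encoding. The old
        # sys.stdout object (and sys.__stdout__) keeps its original settings,
        # which is exactly the limitation this issue is about.
        sys.stdout.flush()
        sys.stdout = open(sys.stdout.fileno(), 'w', encoding=encoding,
                          closefd=False)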
msg164256 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-06-28 14:08
> The proposal [1] is that TextIOWrapper support a set_encoding() method that is only supported between creation of the stream and the first read or write operation.

IMHO "encoding", "errors", "buffering", "line_buffering" and "newline" should be writable properties (unless first read/write).
msg164377 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-06-30 12:39
> The proposal [1] is that TextIOWrapper support a set_encoding() method 
> that is only supported between creation of the stream and the first
> read or write operation.

That will be fragile. A bit of premature input or output (for whatever reason) and your program breaks.
msg164379 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2012-06-30 12:56
Indeed. However, the current alternatives (based on detach() and fileno()) are also problematic - using detach() breaks the corresponding sys.__std*__ entry, while using fileno() means you now have two independent IO stacks using the same underlying descriptor.

My own preference is to have a set_encoding() method that *can* be used even if IO has already occurred. Yes, there's the chance of mixing encodings if data is processed before the encoding is changed, but that's also true for both of the alternatives.

Pipeline applications written in Python should have an easy way to implement "--encoding" parameters for the standard IO streams, and a method to update the settings is the obvious choice.
msg164394 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2012-06-30 16:24
Would a "set_encoding" method be Pythonic? I would've preferred an "encoding" property which flushes the output when it's changed.
msg164398 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-06-30 16:35
> Would a "set_encoding" method be Pythonic? I would've preferred an
> "encoding" property which flushes the output when it's changed.

I would prefer to have a method. The side-effect is too violent to be
hidden behind a property.
Besides, you want to set encoding and errors in a single operation.
msg167483 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-08-05 10:52
I fail to see why this is a release blocker; no rationale is given in the original message, nor in the quoted message. So unblocking.
msg167597 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-08-07 02:01
Here is a Python implementation of TextIOWrapper.set_encoding().

The main limitation is that it is not possible to set the encoding on a non-seekable stream after the first read (if the read buffer is not empty, i.e. if there are pending decoded characters).

+        # flush read buffer, may require to seek backward in the underlying
+        # file object
+        if self._decoded_chars:
+            if not self.seekable():
+                raise UnsupportedOperation(
+                    "It is not possible to set the encoding "
+                    "of a non seekable file after the first read")
+            assert self._snapshot is not None
+            dec_flags, next_input = self._snapshot
+            offset = self._decoded_chars_used - len(next_input)
+            if offset:
+                self.buffer.seek(offset, SEEK_CUR)

--

I don't have a use case for setting the encoding of sys.stdout/stderr after startup, but I would like to be able to control the *error handler* after startup! My implementation permits changing either or both (encoding only, errors only, or encoding and errors together).

For example, Lib/test/regrtest.py uses the following function to force the backslashreplace error handler on sys.stdout. It changes the error handler to avoid UnicodeEncodeError when displaying the result of tests.

def replace_stdout():
    """Set stdout encoder error handler to backslashreplace (as stderr error
    handler) to avoid UnicodeEncodeError when printing a traceback"""
    import atexit

    stdout = sys.stdout
    sys.stdout = open(stdout.fileno(), 'w',
        encoding=stdout.encoding,
        errors="backslashreplace",
        closefd=False,
        newline='\n')

    def restore_stdout():
        sys.stdout.close()
        sys.stdout = stdout
    atexit.register(restore_stdout)

The doctest module uses another trick to change the error handler:

        save_stdout = sys.stdout
        if out is None:
            encoding = save_stdout.encoding
            if encoding is None or encoding.lower() == 'utf-8':
                out = save_stdout.write
            else:
                # Use backslashreplace error handling on write
                def out(s):
                    s = str(s.encode(encoding, 'backslashreplace'), encoding)
                    save_stdout.write(s)
        sys.stdout = self._fakeout
msg167598 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-08-07 02:04
> That will be fragile. A bit of premature input or output
> (for whatever reason) and your program breaks.

I agree that it is not the purest solution, but sometimes practicality beats purity (rule #9) ;-) We can add an ugly big red warning in the doc ;-)
msg167599 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-08-07 02:09
> My implementation permits changing either or both (encoding only, errors only, or encoding and errors together).

We may also add a set_errors() method:

def set_errors(self, errors):
    self.set_encoding(self.encoding, errors)
msg167604 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2012-08-07 05:34
The reason I marked this as a release blocker for 3.4 is that it's a key piece of functionality for writing command line apps which accept an encoding argument. I'll use "high" instead.

An interesting proposal was posted to the python-dev thread [1]: using self.detach() and self.__init__() to reinitialise the wrapper *in-place*.

With that approach, the pure Python version of set_encoding() would look something like this:

    _sentinel = object()
    def set_encoding(self, encoding=_sentinel, errors=_sentinel):
        if encoding is _sentinel:
            encoding = self.encoding
        if errors is _sentinel:
            errors = self.errors
        self.__init__(self.detach(),
                      encoding, errors,
                      self._line_buffering,
                      self._readnl,
                      self._write_through)

(The pure Python version currently has no self._write_through attribute - see #15571)

Note that this approach addresses my main concern with the use of detach() for this: because the wrapper is reinitialised in place, old references (such as the sys.__std*__ attributes) will also see the change.

Yes, such a function would need a nice clear warning to say "Calling this may cause data loss or corruption if used without due care and attention", but it should *work*. (Without automatic garbage collection, the C implementation would need an explicit internal "reinitialise" function rather than being able to just use the existing init function directly, but that shouldn't be a major problem).

[1] http://mail.python.org/pipermail/python-ideas/2012-August/015898.html
msg167754 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2012-08-09 02:08
To bring back Victor's comments from the list:

- stdout/stderr are fairly easy to handle, since the underlying buffers can be flushed before switching the encoding and error settings. Yes, there's a risk of creating mojibake, but that's unavoidable and, for this use case, trumped by the pragmatic need to support overriding the output encoding in a robust fashion (i.e. not breaking sys.__stdout__ or sys.__stderr__, and not crashing if something else displays output during startup, for example, when running under "python -v")

- stdin is more challenging, since it isn't entirely clear yet how to handle the case where data is already buffered internally. Victor proposes that it's acceptable to simply disallow changing the encoding of a stream that isn't seekable. My feeling is that such a restriction would largely miss the point, since the original use case that prompted the creation of this issue was shell pipeline processing, where stdin will often be a PIPE.

I think the guiding use case here really needs to be this one: "How do I implement the equivalent of 'iconv' as a Python 3 script, without breaking internal interpreter state invariants?"

My current thought is that, instead of seeking, the input case can better be handled by manipulating the read ahead buffer directly. Something like (for the pure Python version):

    old_encoding = self._encoding
    self._encoding = new_encoding
    if self._decoder is not None:
        old_data = self._get_decoded_chars().encode(old_encoding)
        old_data += self._decoder.getstate()[0]
        decoder = self._get_decoder()
        new_chars = ''
        if old_data:
            new_chars = decoder.decode(old_data)
        self._set_decoded_chars(new_chars)

(A similar mechanism could actually be used to support an "initial_data" parameter to TextIOWrapper, which would help in general encoding detection situations where changing encoding *in-place* isn't needed, but the application would like an easy way to "put back" the initial data for inclusion in the text stream without making assumptions about the underlying buffer implementation)

Also, StringIO should implement this new API as a no-op.
msg167772 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-08-09 08:31
> Victor proposes that it's acceptable to simply disallow changing the encoding of a stream that isn't seekable.

That is not what I said. My patch raises an exception if you have
already started to read from stdin. It should work fine if stdin is a
pipe, as long as the read buffer is empty.
msg167779 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2012-08-09 11:25
Ah, you're right - peeking into the underlying buffer would be enough to
handle encoding detection.
msg167831 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-08-09 20:42
Oh, set_encoding.patch is wrong:

+            offset = self._decoded_chars_used - len(next_input)

self._decoded_chars_used is a count of Unicode characters, while len(next_input) is a count of bytes. It only works with 7- and 8-bit encodings like ascii or latin1, but not with multibyte encodings like utf-8 or ucs-4.

> peeking into the underlying buffer would be enough to
> handle encoding detection.

I wrote a new patch using this idea. It does not work (yet?) with non-seekable streams. The raw read buffer (byte string) is not stored in the _snapshot attribute if the stream is not seekable. It may be changed to solve this issue.

set_encoding-2.patch is still a work in progress. For example, it does not patch the _io module yet.
msg167832 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-08-09 20:46
Note: it is not possible to reencode the buffer of decoded characters to compute the offset in bytes. Some codecs are not bijective.

Examples:

 * b'\x00'.decode('utf7').encode('utf7') == b'+AAA-'
 * b'\xff'.decode('ascii', 'replace').encode('ascii', 'replace') == b'?'
 * b'\xff'.decode('ascii', 'ignore').encode('ascii', 'ignore') == b''
msg183427 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-03-04 09:18
That's a fair point - I think it's acceptable to throw an error in the case of *already decoded* characters that haven't been read.

There's also a discussion on python-ideas about an explicit API for clearing the internal buffers, and pushing data back into a stream. If that is added, then set_encoding() would be free to error out if there was any already buffered data - it would be up to the application to call clear_buffer() before calling set_encoding(), and deal with any such data appropriately (such as calling push_data() with the results of the clear_buffer() call).
msg183428 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-03-04 09:19
Oops, meant to link to my post in the thread about a buffer manipulation API: http://mail.python.org/pipermail/python-ideas/2013-March/019769.html
msg202214 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-11-05 15:07
It would be good to have this addressed in Python 3.4 (I'm following up on a few other encoding related issues at the moment).

Is that a reasonable possibility before beta 1, or do we need to bump this one to 3.5?
msg202223 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-11-05 16:48
> Is that a reasonable possibility before beta 1, or do we need to bump this one to 3.5?

My patch has not been reviewed yet and must be reimplemented in C. I will not have time to finish the work before beta 1.
msg202258 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-11-06 10:41
Just reviewed Victor's patch - aside from a trivial typo and not covering the C implementation or documentation yet, it looks good to me. Most importantly, the tests in that patch could be used to validate a C implementation as well.

I'll see if I can find anyone interested in creating a patch for the C io implementation, otherwise we can postpone the addition until 3.5.
msg202259 - (view) Author: -- (elixir) * Date: 2013-11-06 11:09
I am interested in working on this, but I might need guidance at times. Is that acceptable? If yes, I'm willing to start as soon as possible.
msg202260 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-11-06 12:03
Thanks, Andrei, that would be great.

The tests and the Python version in Victor's patch show the desired API and behaviour.

In theory, the C version should just be a matter of adding an equivalent textiowrapper_set_encoding method as a static function in 
hg.python.org/cpython/file/default/Modules/_io/textio.c
msg203210 - (view) Author: -- (elixir) * Date: 2013-11-17 20:47
I have to postpone this for a few weeks. Sorry if I got in anyone's way.
msg209364 - (view) Author: Nikolaus Rath (nikratio) * Date: 2014-01-26 23:15
I am working on this now.
msg209376 - (view) Author: Nikolaus Rath (nikratio) * Date: 2014-01-27 02:09
Question:

What is the point of the

        old_encoding = codecs.lookup(self._encoding).name
        encoding = codecs.lookup(encoding).name
        if encoding == old_encoding and errors == self._errors:
            # no change
            return

dance? Isn't this equivalent to

        if encoding == self.encoding and errors == self._errors:
            # no change
            return

except that it doesn't validate the given encoding? But if the encoding is invalid, isn't it enough that the exception is raised a few lines further down?
msg209377 - (view) Author: Nikolaus Rath (nikratio) * Date: 2014-01-27 02:13
Second question:

The following looks as if a BOM might be written for writable, non-seekable streams:

        # don't write a BOM in the middle of a file
        if self._seekable and self.writable():
            position = self.buffer.tell()
            if position != 0:
                try:
                    self._get_encoder().setstate(0)
                except LookupError:
                    # Sometimes the encoder doesn't exist
                    pass

Is that really desirable? It seems to me the safe choice is to *not* write the BOM, except when we know it's safe. E.g.:

        # don't write a BOM unless we know we're at the beginning of a file
        if (self.writable() and not
           (self._seekable and self.buffer.tell() == 0)):
            try:
                self._get_encoder().setstate(0)
            except LookupError:
                # Sometimes the encoder doesn't exist
                pass
msg209385 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2014-01-27 03:44
A given encoding may have multiple aliases, and also variant spellings that are normalized before doing the codec lookup. Doing the lookup first means we run through all of the normalisation and aliasing machinery and then compare the *canonical* names. For example:

>>> import codecs
>>> codecs.lookup('ANSI_X3.4_1968').name
'ascii'
>>> codecs.lookup('ansi_x3.4_1968').name
'ascii'
>>> codecs.lookup('ansi-x3.4-1968').name
'ascii'
>>> codecs.lookup('ASCII').name
'ascii'
>>> codecs.lookup('ascii').name
'ascii'

A public "codecs.is_same_encoding" API might be a worthwhile and self-documenting addition, rather than just adding a comment that explains the need for the canonicalisation dance.

As far as the second question goes, for non-seekable output streams, this API is inherently a case of "here be dragons" - that's a large part of the reason why it took so long for us to accept it as a feature we really should provide. We need to support writing a BOM to sys.stdout and sys.stderr - potentially doing so in the middle of existing output isn't really any different from the chance of implicitly switching encodings mid-stream.
msg209394 - (view) Author: Nikolaus Rath (nikratio) * Date: 2014-01-27 04:40
Thanks Nick! Will take this into account.

I've stumbled over another question in the meantime:

It seems to me that after the call to set_encoding(), self._snapshot contains the decoder flags from the state of the *old* decoder. On the next call of e.g. tell(), the flags from the old decoder are then passed to setstate() of the new decoder.

Isn't this a problem? I thought the flags could mean different things to different codecs.
msg209395 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2014-01-27 04:50
That's entirely plausible - Victor's current patch is definitely a proof-of-concept rather than a battle-tested implementation.

I'm not sure you could actually *provoke* that bug with any of our existing codecs, though.
msg209612 - (view) Author: Nikolaus Rath (nikratio) * Date: 2014-01-29 03:46
I'm about 40% done with translating Victor's patch into C. However, in the process I got the feeling that this approach may not be so good after all.

Note that:

 * The only use-case for set_encoding that I have found was changing the encoding of sys.{stdin,stdout,stderr}

 * When using non-seekable streams, set_encoding() has to be called before anything has been read from the stream, so it's unlikely that there is a situation (with the exception of sys.std*) where the stream cannot simply be opened with the right encoding instead (if you can't change the open() call, you probably cannot call set_encoding() early enough either).

 * When using seekable streams, using set_encoding() breaks seeking, because the position cookie does not contain information about the decoder that was used at the given position. Example:

$ cat ~/tmp/test.py
import _pyio as io
data = ('0123456789\r'*5).encode('utf-16_le')
bstream = io.BytesIO(data)
tstream = io.TextIOWrapper(bstream, encoding='latin1')
tstream.readline()
pos = tstream.tell()
tstream.read(6)
tstream.set_encoding('utf-16_le')
tstream.seek(pos)

$ ./python ~/tmp/test.py 
Traceback (most recent call last):
  File "/home/nikratio/tmp/test.py", line 9, in <module>
    tstream.seek(pos)
  File "/home/nikratio/clones/cpython/Lib/_pyio.py", line 1989, in seek
    raise OSError("can't restore logical file position")
OSError: can't restore logical file position


I don't think there is a way to fix that without making the whole tell/seek and set_encoding code even more complicated than it already is. (It would probably involve keeping track of the history of encoders that have been used for different parts of the stream).

In summary, using set_encoding() with seekable streams breaks seeking, using it with non-seekable streams requires it to be called right after open(), and the only reported case where one cannot simply change the open call instead is sys.std*.

Given all that, do we really want to add a new public method to the TextIOWrapper class that can only reasonably be used with three specific streams?


Personally, I think it would make much more sense to instead introduce three new functions in the sys module: sys.change_std{out,err,in}_encoding(). That solves the reported use-case just as well without polluting the namespace of all text streams.



That said, I am happy to complete the implementation of set_encoding in C. However, I'd like a core developer to first reconfirm that this is really the better solution.
msg209619 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2014-01-29 07:46
The specific motivating use cases I'm aware of involve the standard
streams (for example, "How would you implement the equivalent of iconv
in Python 3?"). There's actually the workaround for the missing
feature right now: replace the standard streams with new streams,
either by detaching the old ones or using the file descriptor with
open(). It's also specifically the shadow references in __stdin__,
__stdout__ and __stderr__ that make that replacement approach
problematic (detaching breaks the shadow streams, using the file
descriptor means you now have two independent IO stacks sharing the
same descriptor).

However, the other case where I can see this being useful is in pipes
created by the subprocess module, and any other situation where an API
creates a stream on your behalf, and you can't readily ensure you have
replaced all the other references to that stream. In those cases, you
really want to change the settings on the existing stream, rather than
tracking down all the other references and rebinding them.

Another question is whether or not we want to implement this as a
no-op on StringIO.
msg209942 - (view) Author: Nikolaus Rath (nikratio) * Date: 2014-02-02 02:33
Wow, I didn't realize that programming Python using the C interface was that tedious and verbose. I have attached a work-in-progress patch. It is not complete yet, but maybe somebody could already take a look to make sure that I'm not heading completely in the wrong direction.

Regarding StringIO.set_encoding(): I thought about this a bit, but I couldn't come up with a convincing scenario where having a no-op implementation would clearly help or clearly hurt. Personally, I would therefore prefer not to provide it for now. If it turns out to be a problem, it can easily be added. But if we add it now and it turns out to have been a bad choice, it's probably much harder to remove it again.
msg209983 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-02-02 13:33
What about other TextIOWrapper parameters? I think "newline" should be changed in the same cases as "encoding" and "errors". And changing "buffering" and "line_buffering" could be useful too.

What if we add a "configure" or "reopen" method which accepts all these parameters? That would reduce the swelling of the interface.
msg210020 - (view) Author: Nikolaus Rath (nikratio) * Date: 2014-02-02 21:51
The attached patch now passes all test cases.

However, every invocation of set_encoding() when there is buffered data leaks one reference. I haven't been able to find the error yet.

As for adding a reopen() or configure() method: I don't like it very much, but for the same reasons that I still don't like set_encoding() itself: it makes a rather fragile operation appear well-supported. set_encoding() already inserts a BOM in the middle of non-seekable streams, and invalidates tell() cookies for seekable streams. Changing the buffering would probably face similar problems if there is buffered data but buffering is supposed to be turned off.


I think it would be better to restrict this functionality strictly to sys.stdin/out/err and in all other situations fix the API that results in the TextIO object with undesired parameters. For example, in the case of the subprocess module, wouldn't it be better to return pipes as byte streams and have the caller wrap them into text streams?
msg210050 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2014-02-02 23:18
It's a blunt instrument to be sure, but a necessary one, I think -
otherwise we end up with a scattering of API-specific solutions rather than a
single consistent approach.

Here's a thought, though: what if we went with Serhiy's "reconfigure"
idea, limited it to seekable streams (resetting them back to the start of
the file) and then also had a "force_reconfigure" that bypassed the safety
checks?

The main problem I see with that idea is that some changes (binary vs text,
universal newlines or not, buffered or not) would potentially require
adding or removing a class from the stream's IO stack, thus rendering it
difficult to handle as an in-place operation.

However the "seekable streams only by default, flag or separate method to
force an encoding change for a non-seekable stream" approach would be
applicable even for the basic "set_encoding" API.

I'm beginning to think that this is all complicated enough that it should
be written up in a PEP for 3.5 before we actually commit anything (even if
what we end up committing is just set_encoding with a "force=False" keyword
only parameter).
msg210175 - (view) Author: Nikolaus Rath (nikratio) * Date: 2014-02-04 04:04
I think that any API that gives you a TextIOWrapper object with undesired properties is broken and needs to be fixed, rather than a workaround added to TextIOWrapper. 

That said, I defer to the wisdom of the core developers, and I'll be happy to implement whatever is deemed to be the right thing(tm). However, my opinion on this feature probably makes me unsuitable to drive a PEP for it :-).
msg210176 - (view) Author: Nikolaus Rath (nikratio) * Date: 2014-02-04 04:23
Newest version of the patch attached; the reference leak problem has been fixed.
msg210205 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2014-02-04 11:28
If operating systems always exposed accurate metadata and configuration settings, I'd agree with you. They don't though, so sometimes developers need to be able to override whatever the interpreter figured out automatically.

In addition, needing to cope with badly designed APIs is an unfortunate fact of life - that's why monkeypatching is only *discouraged*, rather than disallowed :)
msg210270 - (view) Author: Nikolaus Rath (nikratio) * Date: 2014-02-04 21:50
On 02/04/2014 03:28 AM, Nick Coghlan wrote:
> If operating systems always exposed accurate metadata and configuration settings, I'd agree with you. They don't though, so sometimes developers need to be able to override whatever the interpreter figured out automatically.

Where does the operating system come into play? What I'm trying to say
is that any code of the form

fh = some_python_function()
fh.set_encoding(foo)

should really be written as

fh = some_python_function(encoding=foo)

or

fh = TextIOWrapper(some_python_function(raw=True), encoding=foo)

The only exception to this is sys.std{in,out,err}. But this special case
should, in my opinion, be solved with a corresponding special function
in sys.*, rather than a general purpose method in TextIOWrapper.

> In addition, needing to cope with badly designed APIs is an unfortunate fact of life - that's why monkeypatching is only *discouraged*, rather than disallowed :)

Well, yes, but introducing a set_encoding() or reconfigure() function is
actively *encouraging* bad API design. "I'll just return my favorite
encoding from this function, after all, the caller can use set_encoding
afterwards".

Best,
Nikolaus

msg210280 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2014-02-05 00:25
Metaclasses and monkeypatching encourage bad API design, too, but we still
provide them because they're essential tools for some use cases. In Python,
the system integrator is always right, and we provide them with power tools
to deal with the cases where reality doesn't want to cooperate with the
nice neat model of the world that we provide by default.

For the standard streams, the problem is that the stream is created
automatically by the interpreter, and sometimes we will get the encoding
choice wrong (because the info we retrieve from the OS doesn't match what
the user actually wants), but rebinding the names to a new IO stream causes
other problems. Now, we have two choices at this point:

1. Assume there is *no other API involving implicit text stream creation in
any third party Python 3 library anywhere* that will ever have this problem
and create a solution that only works with the standard streams

2. Acknowledge that while the implicit way we create the standard streams
is unusual, it is unlikely to be *unique*, so it actually does make sense
to provide the ability to change the encoding of an existing stream as a
general purpose feature fenced about with as many "you probably don't want
to use this unless you really know what you are doing, otherwise it may eat
your data" warnings as seems appropriate.
msg226896 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2014-09-15 04:36
Some more use cases for temporarily switching the error handler in the middle of writing to a stream:
* Possibly simplify the implementation of sys.displayhook()
* I have done a similar hack at <https://bitbucket.org/Gfy/pyrescene/commits/42ad75e375d84e090a32d024acc865de341c22aa#Lrescene/srr.pyF132>, to output a comment field with a permissive error handler, while other data is output with strict error handling.

A use case for changing a reader’s newline translation mode is to use standard input with the built-in “csv” module. My current idea is to do something like this:

encoding = sys.stdin.encoding
errors = sys.stdin.errors
line_buffering = sys.stdin.line_buffering
# No way to retain write_through mode, but shouldn’t matter for reading
sys.stdin = TextIOWrapper(sys.stdin.detach(), encoding, errors,
    newline="", line_buffering=line_buffering)

for row in csv.reader(sys.stdin):
    ...

On the other hand, I wonder about rewinding an input file after already having read and buffered text in the wrong encoding. From a distance, the Python native version of the code seems rather complex and full of dark magic. Is there a use case, or would it be simpler to have it only work when nothing has been read or buffered?
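With the reconfigure() API that this issue eventually added in Python 3.7 (it accepts encoding, errors and newline, see msg308838), the csv case above can be handled in place, provided nothing has been read from stdin yet; a rough sketch:

    import csv
    import sys

    # Disable newline translation before any data has been read, keeping
    # the existing encoding and error handler.
    sys.stdin.reconfigure(newline="")
    for row in csv.reader(sys.stdin):
        ...  # process each row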
msg242942 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2015-05-12 05:08
Reviewing the items I had flagged as dependencies of issue 22555 for personal tracking purposes, I suggest we defer further consideration of this idea to 3.6 after folks have had a chance to get some experience with the interpreter defaulting to using "surrogateescape" on the standard streams when an underlying *nix operating system (other than Mac OS X) claims that the system encoding is ASCII.
msg243324 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2015-05-16 14:42
Revisiting the idea Nikolaus raised last year regarding whether or not this could be done using a dedicated API in the sys module, I realised today that even if we decided to use a separate public API, *that API* would still need a defined way to modify the stream encoding and error handling in place.

Since we're going to need the object level in-place modification API regardless, a separate sys module API for modifying the standard streams in particular would then introduce additional complexity without providing a proportional benefit.
msg261641 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-03-12 07:33
quad, please don't change issue attributes to unrelated values.
msg284950 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2017-01-08 02:07
Updated the patch for the default branch.
msg284955 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2017-01-08 03:23
Reviewing Inada-san's latest version of the patch, we seem to be in a somewhat hybrid state where:

1. The restriction to only being used with seekable() streams if there is currently unread data in the read buffer is in place

2. We don't actually call seek() anywhere to set the stream back to the beginning of the file. Instead, we try to shuffle data out of the old decoder and into the new one.

I'm starting to wonder if the best option here might be to attempt to make the API work for arbitrary codecs and non-seekable streams, and then simply accept that it may take a few maintenance releases before that's actually true. If we decide to go down that path, then I'd suggest the following stress test:

- make a longish test string out of repeated copies of "ℙƴ☂ℌøἤ"
- pick a few pairs of multibyte non-universal/universal encodings for use with surrogateescape and strict as their respective error handlers (e.g. ascii/utf8, ascii/utf16le, ascii/utf32, ascii/shift_jis, ascii/iso2022_jp, ascii/gb18030, gbk/gb18030)
- for each pair, make the test data by encoding from str to bytes with the relevant universal encoding
- switch the encoding multiple times on the same stream at different points

Optionally:

- extract "codecs._switch_decoder" and "codecs._switch_encoder" helper functions to make this all a bit easier to test and debug (with a Python version in the codecs module and the C version accessible via the _codecs modules)

That way, confidence in the reliability of the feature (including across Python implementations) can be based on the strength of the test cases covering it.
msg285111 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2017-01-10 13:04
I had not read the patch carefully when updating.
Now I'm reading the pure Python part of the patch.

For the writing side, this patch is very simple.
For the reading side, this patch is very complex, because TextIOWrapper.tell()
and .seek() are already complex.

How about starting with an API that works only when `self._decoder is None`?
Regardless of the acceptance of PEP 538 and PEP 540, this is important for
changing the stdio encoding.

We already failed to ship this important API in 3.6.
msg285112 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2017-01-10 13:20
This patch doesn't change the current behavior of "decode by chunk".
That makes it difficult to support changing the encoding via a header,
like Python's encoding cookie:

>>> import io
>>> buff = io.BytesIO("hello\nこんにちは".encode('utf-8'))
>>> f = io.TextIOWrapper(buff, encoding='ascii')
>>> f.readline()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/inada-n/work/python/cpython/Lib/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 6: ordinal not in range(128)
>>>
>>> buff = io.BytesIO("hello\nこんにちは".encode('utf-8'))
>>> f = io.TextIOWrapper(buff, encoding='ascii')
>>> f.read(6)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/inada-n/work/python/cpython/Lib/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 6: ordinal not in range(128)
msg285205 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2017-01-11 10:05
Inada, I think you messed up the positioning of bits of the patch. E.g. there are now test methods declared inside a helper function (rather than a test class).

Since it seems other people are in favour of this API, I would like to expand it a bit to cover two use cases (see set_encoding-newline.patch):

* change the error handler without affecting the main character encoding
* set the newline encoding (also suggested by Serhiy)

Regarding Serhiy’s other suggestion about buffering parameters, perhaps TextIOWrapper.line_buffering could become a writable attribute instead, and the class could grow a similar write_through attribute. I don’t think these affect encoding or decoding, so they could be treated independently.

The algorithm for rewinding unread data is complicated and can fail. What is the advantage of using it? What is the use case for reading from a stream and then changing the encoding, without a guarantee that it will work?

Even if it is enhanced to never “fail”, it will still have strange behaviour, such as data loss when a decoder is fed a single byte and produces multiple characters (e.g. CR newline, backslashreplace, UTF-7).

One step in the right direction IMO would be to only support calling set_encoding() when no extra read data has been buffered (or to explicitly say that any buffered data is silently dropped). So there is no support for changing the encoding halfway through a disk file, but it may be appropriate if you can regulate the bytes being read, e.g. from a terminal (user input), pipe, socket, etc.

But I would be happy enough without set_encoding(), and with something like my rewrap() function at the bottom of <https://github.com/vadmium/data/blob/master/data.py#L526>. It returns a fresh TextIOWrapper, but when you exit the context manager you can continue to reuse the old stream with the old settings.
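The following is not the linked rewrap() implementation, just a rough sketch of the idea, assuming the stream exposes a .buffer attribute the way the standard streams do:

    import contextlib
    import io

    @contextlib.contextmanager
    def rewrap(stream, **kwargs):
        # Wrap the same underlying binary buffer with different text-layer
        # settings (encoding, errors, newline, ...); detach the temporary
        # wrapper on exit so it never closes the shared buffer.
        stream.flush()  # avoid interleaving output still buffered in `stream`
        wrapper = io.TextIOWrapper(stream.buffer, **kwargs)
        try:
            yield wrapper
        finally:
            wrapper.flush()
            wrapper.detach()

For example, `with rewrap(sys.stdout, errors='backslashreplace') as out: print(text, file=out)` writes with the temporary settings, while sys.stdout itself keeps its original configuration afterwards.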
msg285210 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2017-01-11 10:35
> Inada, I think you messed up the positioning of bits of the patch. E.g. there are now test methods declared
> inside a helper function (rather than a test class).

I'm sorry.  `patch -p1` merged the previous patch into the wrong place, and the tests passed accidentally.

> Since it seems other people are in favour of this API, I would like to expand it a bit to cover two use cases (see set_encoding-newline.patch):
> 
> * change the error handler without affecting the main character encoding
> * set the newline encoding (also suggested by Serhiy)

+1.  Since stdio is configured before the Python program starts running, TextIOWrapper should be as configurable after creation as possible.

> Regarding Serhiy’s other suggestion about buffering parameters, perhaps TextIOWrapper.line_buffering could become a writable attribute instead, and the class could grow a similar write_through attribute. I don’t think these affect encoding or decoding, so they could be treated independently.

Could those go in another new issue?
This issue is already too long to read.

> The algorithm for rewinding unread data is complicated and can fail. What is the advantage of using it? What is the use case for reading from a stream and then changing the encoding, without a guarantee that it will work?
>
> Even if it is enhanced to never “fail”, it will still have strange behaviour, such as data loss when a decoder is fed a single byte and produces multiple characters (e.g. CR newline, backslashreplace, UTF-7).

When I posted set_encoding-7.patch, I hadn't read the io module deeply.  I just resolved the conflicts and ran the tests.
After that, I read the code and I feel the same (see msg285111 and msg285112).
Let's drop support for changing the encoding while reading.
Allowing the stdin encoding to be changed only before anything has been read from it is already a significant step.


> One step in the right direction IMO would be to only support calling set_encoding() when no extra read data has been buffered (or to explicitly say that any buffered data is silently dropped). So there is no support for changing the encoding halfway through a disk file, but it may be appropriate if you can regulate the bytes being read, e.g. from a terminal (user input), pipe, socket, etc.

Totally agree.


> But I would be happy enough without set_encoding(), and with something like my rewrap() function at the bottom of <https://github.com/vadmium/data/blob/master/data.py#L526>. It returns a fresh TextIOWrapper, but when you exit the context manager you can continue to reuse the old stream with the old settings.

I want one obvious way to control the encoding and error handler from Python (not via an environment variable).
Rewrapping the stream seems like a hacky way, rather than the obvious way.
msg285221 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2017-01-11 12:50
set_encoding-8.patch drops support for changing the encoding after a read.
It is based on set_encoding-newline.patch.
msg295018 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2017-06-02 13:27
Antoine posted a simpler reconfiguration RFE over in https://bugs.python.org/issue30526 (for setting line buffering).

While technically orthogonal to this RFE, we're thinking it might be possible to expose them to users as a common "reconfigure()" API, rather than as independent methods.
msg295113 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2017-06-04 05:57
`TextIOWrapper.reconfigure()` has been added for 3.7 as part of issue 30526 (currently covering the `line_buffering` and `write_through` options), so I've updated the issue title here to reflect that that's now the relevant API to use to address this particular RFE.
msg308838 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2017-12-21 00:59
New changeset 507434fd504f3ebc1da72aa77544edc0d73f136e by INADA Naoki in branch 'master':
bpo-15216: io: TextIOWrapper.reconfigure() accepts encoding, errors and newline (GH-2343)
https://github.com/python/cpython/commit/507434fd504f3ebc1da72aa77544edc0d73f136e
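For reference, a minimal example of the merged API on Python 3.7+ (the encoding and error handler chosen here are only illustrative):

    import sys

    # Writer side: reconfigure() flushes pending output, then switches settings.
    sys.stdout.reconfigure(encoding="utf-8", errors="backslashreplace")

    # Reader side: the encoding can only be changed before anything is read.
    sys.stdin.reconfigure(encoding="latin-1")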
History
Date  User  Action  Args
2022-04-11 14:57:32  admin  set  github: 59421
2017-12-21 06:37:36  methane  set  status: open -> closed; resolution: fixed; stage: patch review -> resolved
2017-12-21 00:59:55  methane  set  messages: + msg308838
2017-06-23 04:39:58  methane  set  pull_requests: + pull_request2387
2017-06-04 05:57:47  ncoghlan  set  messages: + msg295113; title: Support setting the encoding on a text stream after creation -> Add encoding & errors parameters to TextIOWrapper.reconfigure()
2017-06-02 13:27:49  ncoghlan  set  messages: + msg295018
2017-02-20 13:05:12  methane  set  pull_requests: + pull_request166
2017-01-11 12:50:57  methane  set  files: + set_encoding-8.patch; messages: + msg285221
2017-01-11 10:35:16  methane  set  messages: + msg285210
2017-01-11 10:05:57  martin.panter  set  files: + set_encoding-newline.patch; messages: + msg285205
2017-01-10 13:20:09  methane  set  messages: + msg285112
2017-01-10 13:04:20  methane  set  messages: + msg285111
2017-01-08 03:23:28  ncoghlan  set  messages: + msg284955
2017-01-08 02:07:23  methane  set  files: + set_encoding-7.patch; messages: + msg284950; versions: + Python 3.7, - Python 3.6
2016-04-12 19:11:38  zach.ware  set  hgrepos: - hgrepo334
2016-03-12 07:33:54  serhiy.storchaka  set  components: - 2to3 (2.x to 3.x conversion tool)
2016-03-12 07:33:31  serhiy.storchaka  set  type: security -> enhancement; messages: + msg261641; versions: + Python 3.6
2016-03-12 07:30:09  quad  set  hgrepos: + hgrepo334; components: + 2to3 (2.x to 3.x conversion tool); versions: - Python 3.6
2016-03-12 07:28:56  quad  set  nosy: + quad; type: enhancement -> security
2016-02-22 17:53:27  elixir  set  nosy: - elixir
2015-05-16 14:42:19  ncoghlan  set  messages: + msg243324
2015-05-12 05:08:31  ncoghlan  set  priority: high -> normal; messages: + msg242942; versions: + Python 3.6, - Python 3.5
2015-04-22 07:51:43  berker.peksag  set  nosy: + berker.peksag; stage: needs patch -> patch review
2014-10-05 04:09:50  ncoghlan  link  issue22555 dependencies
2014-09-15 04:36:49  martin.panter  set  nosy: + martin.panter; messages: + msg226896
2014-02-05 00:25:42  ncoghlan  set  messages: + msg210280
2014-02-04 21:50:46  nikratio  set  messages: + msg210270
2014-02-04 11:28:15  ncoghlan  set  messages: + msg210205
2014-02-04 04:23:57  nikratio  set  files: + set_encoding-6.patch; messages: + msg210176
2014-02-04 04:04:23  nikratio  set  messages: + msg210175
2014-02-02 23:18:41  ncoghlan  set  messages: + msg210050
2014-02-02 21:51:28  nikratio  set  files: + set_encoding-5.patch; messages: + msg210020
2014-02-02 13:33:14  serhiy.storchaka  set  messages: + msg209983
2014-02-02 05:16:00  nikratio  set  files: + set_encoding-4.patch
2014-02-02 02:33:15  nikratio  set  files: + set_encoding-3.patch; messages: + msg209942
2014-01-29 07:46:46  ncoghlan  set  messages: + msg209619
2014-01-29 03:46:54  nikratio  set  messages: + msg209612
2014-01-27 04:50:23  ncoghlan  set  messages: + msg209395
2014-01-27 04:40:59  nikratio  set  messages: + msg209394
2014-01-27 03:44:40  ncoghlan  set  messages: + msg209385
2014-01-27 02:13:41  nikratio  set  messages: + msg209377
2014-01-27 02:09:10  nikratio  set  messages: + msg209376
2014-01-26 23:15:58  nikratio  set  nosy: + nikratio; messages: + msg209364
2013-12-22 17:48:32  pitrou  set  versions: + Python 3.5, - Python 3.4
2013-12-21 17:10:09  jwilk  set  nosy: + jwilk
2013-11-17 20:47:36  elixir  set  messages: + msg203210
2013-11-06 12:03:11  ncoghlan  set  messages: + msg202260
2013-11-06 11:09:24  elixir  set  nosy: + elixir; messages: + msg202259
2013-11-06 10:41:26  ncoghlan  set  messages: + msg202258
2013-11-05 16:48:24  vstinner  set  messages: + msg202223
2013-11-05 15:07:22  ncoghlan  set  messages: + msg202214
2013-03-04 09:19:49  ncoghlan  set  messages: + msg183428
2013-03-04 09:18:56  ncoghlan  set  messages: + msg183427
2012-08-09 20:46:42  vstinner  set  messages: + msg167832
2012-08-09 20:42:49  vstinner  set  files: + set_encoding-2.patch; messages: + msg167831
2012-08-09 11:25:09  ncoghlan  set  messages: + msg167779
2012-08-09 08:31:22  vstinner  set  messages: + msg167772
2012-08-09 02:08:19  ncoghlan  set  messages: + msg167754
2012-08-09 01:25:55  rurpy2  set  nosy: + rurpy2
2012-08-07 05:34:01  ncoghlan  set  priority: normal -> high; messages: + msg167604
2012-08-07 02:09:03  vstinner  set  messages: + msg167599
2012-08-07 02:04:22  vstinner  set  messages: + msg167598
2012-08-07 02:01:36  vstinner  set  files: + set_encoding.patch; nosy: + vstinner; messages: + msg167597; keywords: + patch
2012-08-07 01:09:18  methane  set  nosy: + methane
2012-08-05 10:52:10  loewis  set  priority: release blocker -> normal; nosy: + loewis; messages: + msg167483
2012-08-02 16:11:35  ishimoto  set  nosy: + ishimoto
2012-06-30 16:35:39  pitrou  set  messages: + msg164398
2012-06-30 16:24:08  mrabarnett  set  nosy: + mrabarnett; messages: + msg164394
2012-06-30 12:56:40  ncoghlan  set  messages: + msg164379
2012-06-30 12:39:24  pitrou  set  nosy: + pitrou; messages: + msg164377
2012-06-28 18:27:37  Arfrever  set  nosy: + Arfrever
2012-06-28 14:08:27  serhiy.storchaka  set  nosy: + serhiy.storchaka; messages: + msg164256
2012-06-28 10:49:17  ncoghlan  create