Message 289378 - Python tracker

Author	vstinner
Recipients	ezio.melotti, lemburg, serhiy.storchaka, vstinner
Date	2017-03-10.15:41:50
Message-id	<CAMpsgwasZKg=C1uL-uha+v0RKHfdatF8foXWDdK=51cHh0VBFw@mail.gmail.com>
In-reply-to	<69886c48-0d2f-3f91-7457-41043ddcb1ac@egenix.com>

Content

> The reason for the problem is the UTF-8 decoder (and other
> decoders) expecting an extension to the codec decoder API,
> which are not implemented in its StreamReader class (it simply
> uses the base class). It's not a problem of the base class, but
> that of the codec.
>
> And no: it doesn't have anything to do with codec.open()
> or the StreamReaderWriter class.

open("document.txt", encoding="utf-8") uses the IncrementalDecoder of
encodings.utf_8. This object doesn't seem to have the discussed issue.

IMHO the issue is that StreamReader doesn't use an incremental
decoder. I don't see how it could support multibyte encodings and
error handlers without an incremental decoder.
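
To illustrate the point about incremental decoders, here is a minimal sketch of what such a decoder does when a multibyte sequence is split across two chunks. The variable names are only for illustration, and the example says nothing about how StreamReader is implemented internally.

```python
import codecs

# "é" is encoded in UTF-8 as two bytes: 0xC3 0xA9.
data = "é".encode("utf-8")
first, second = data[:1], data[1:]

# Stateless decoding of the truncated first chunk fails in strict mode.
try:
    codecs.decode(first, "utf-8")
    stateless_ok = True
except UnicodeDecodeError:
    stateless_ok = False

# An incremental decoder keeps the incomplete sequence in its own state.
decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
part1 = decoder.decode(first)           # nothing can be emitted yet
part2 = decoder.decode(second)          # the sequence is now complete
tail = decoder.decode(b"", final=True)  # flush: no pending bytes remain

print(stateless_ok, repr(part1), repr(part2), repr(tail))
```

The decoder object carries the pending bytes between calls, so the caller does not have to track chunk boundaries itself.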

I like the TextIOWrapper design because it only handles codecs and text
buffering. Bytes buffering is done at a lower level in a different
object.
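
The layering described above can be sketched by building the stack by hand. This is only an illustration with an in-memory byte source; io.open() normally assembles these objects itself.

```python
import io

# Byte source (stands in for a raw file here).
raw = io.BytesIO("héllo\r\nwörld\n".encode("utf-8"))

# Lower level: bytes buffering only.
buffered = io.BufferedReader(raw)

# Upper level: decoding and text buffering only.
# newline="" disables newline translation, so "\r\n" is kept as-is.
text = io.TextIOWrapper(buffered, encoding="utf-8", newline="")

lines = text.readlines()
print(lines)
```

Each layer can be inspected separately: text.buffer is the BufferedReader, which in turn wraps the byte source.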

I'm not comfortable modifying StreamReader because it combines the
roles of TextIOWrapper and BufferedReader and so is more complex.

>> I propose to modify codecs.open() to reuse the io module: call io.open() with newline=''. The io module is now battle-tested and handles well many corner cases of incremental codecs with multibyte encodings.
>
> -1. People who want to use the io module should use it directly.

When porting code to Python 3, many people chose to use codecs.open()
to get text files using a single code base for Python 2 and Python 3.
Once the code is ported, I don't expect that anyone will replace
codecs.open() with io.open(). You know, nobody cares about technical
debt...
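
As a rough sketch of the proposal quoted above (not an actual patch), a codecs.open()-like function could delegate to io.open() with newline=''. The name open_text_via_io is hypothetical, and the sketch assumes a text mode with an explicit encoding; it ignores the binary-mode and buffering details of the real codecs.open().

```python
import io
import os
import tempfile


def open_text_via_io(filename, mode="r", encoding=None, errors="strict"):
    """Hypothetical stand-in for codecs.open() built on io.open()."""
    # codecs.open() opens the underlying file in binary mode and does no
    # newline conversion; newline="" asks io.open() for the same behavior.
    return io.open(filename, mode, encoding=encoding, errors=errors,
                   newline="")


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "document.txt")
    with open(path, "wb") as f:
        f.write("café\r\nline2\n".encode("utf-8"))

    with open_text_via_io(path, encoding="utf-8") as f:
        content = f.read()

print(repr(content))
```

With newline='', the "\r\n" written to the file comes back unchanged, which is what code written against codecs.open() would expect.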

>> The next step would be to deprecate the codecs.StreamReaderWriter class and the codecs.open(). But my latest attempt to deprecate them was the PEP 400 and it wasn't a full success, so I now prefer to move step by step :-)
>
> I'm still -1 on the deprecations in PEP 400. You are essentially
> suggesting to replace the complete codecs subsystem with the
> io module, but forgetting that all codecs use StreamWriter and
> StreamReader as base classes.

Can you elaborate on "all codecs use StreamWriter and StreamReader as
base classes"? Only codecs.open() uses StreamReader and StreamWriter,
no?

All codecs implement a StreamReader and a StreamWriter class, but my
question is: how are these classes used?
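
For reference, here is a minimal sketch of one way these per-codec classes can be reached directly, without going through codecs.open(). It only shows the public lookup API; it is not a claim about how widely the classes are used in practice.

```python
import codecs
import io

info = codecs.lookup("utf-8")            # CodecInfo for the codec
reader_cls = codecs.getreader("utf-8")   # the codec's StreamReader class

# Wrap a binary stream with the codec's StreamReader.
stream = io.BytesIO("naïve\n".encode("utf-8"))
reader = reader_cls(stream, errors="strict")
text = reader.read()

print(reader_cls is info.streamreader,
      issubclass(reader_cls, codecs.StreamReader),
      repr(text))
```

codecs.getwriter() is the matching entry point for the StreamWriter side.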

> The codecs sub system has a clean design. If used correctly
> and maintained with more care, it works really well.

It seems like we lack such a maintainer: since I wrote the PEP, many
issues are still open:

http://bugs.python.org/issue7262
http://bugs.python.org/issue8630
http://bugs.python.org/issue10344
http://bugs.python.org/issue12508
http://bugs.python.org/issue12512

See also issue #5445 (wontfix, whereas TextIOWrapper.writelines()
uses "for line in lines") and issue #12513 (this one is not fair, io
also has the same bug: issue #12215 :-)).

> I'm tired of having to fight these fights every few years.
> Can't we just stop having them, please ?

The status quo is to do nothing, but as a consequence, these bugs are
still not fixed and users are still affected by them :-( I'm
trying to find a solution.

History
Date	User	Action	Args
2017-03-10 15:41:50	vstinner	set	recipients: + vstinner, lemburg, ezio.melotti, serhiy.storchaka
2017-03-10 15:41:50	vstinner	link	issue29783 messages
2017-03-10 15:41:50	vstinner	create