Message 284955 - Python tracker


Author	ncoghlan
Recipients	Arfrever, berker.peksag, ishimoto, jwilk, loewis, martin.panter, methane, mrabarnett, ncoghlan, nikratio, pitrou, quad, rurpy2, serhiy.storchaka, vstinner
Date	2017-01-08.03:23:27
Message-id	<1483845808.33.0.1652145511.issue15216@psf.upfronthosting.co.za>

Content
Reviewing Inada-san's latest version of the patch, we seem to be in a somewhat hybrid state where:

1. The restriction is in place that the API can only be used with seekable() streams if there is currently unread data in the read buffer

2. We don't actually call seek() anywhere to set the stream back to the beginning of the file. Instead, we try to shuffle data out of the old decoder and into the new one.

I'm starting to wonder if the best option here might be to attempt to make the API work for arbitrary codecs and non-seekable streams, and then simply accept that it may take a few maintenance releases before that's actually true. If we decide to go down that path, then I'd suggest the following stress test:

- make a longish test string out of repeated copies of "ℙƴ☂ℌøἤ"
- pick a few pairs of multibyte non-universal/universal encodings for use with surrogateescape and strict as their respective error handlers (e.g. ascii/utf8, ascii/utf16le, ascii/utf32, ascii/shift_jis, ascii/iso2022_jp, ascii/gb18030, gbk/gb18030)
- for each pair, make the test data by encoding from str to bytes with the relevant universal encoding
- switch the encoding multiple times on the same stream at different points
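The steps above could be sketched roughly as follows. This is only an illustration, not the patch's actual test: it emulates the switches with plain bytes.decode() calls rather than the stream API under review, the function names are hypothetical, and it assumes switches happen on character boundaries. It also restricts itself to encodings that can represent every character in the sample (utf-8, utf-16-le, utf-32-le, gb18030), using the "-le" variants so that no BOM is emitted.

```python
import itertools

SAMPLE = "ℙƴ☂ℌøἤ" * 20

# Assumption of this sketch: only encodings able to represent all of SAMPLE.
UNIVERSAL_ENCODINGS = ["utf-8", "utf-16-le", "utf-32-le", "gb18030"]


def char_boundaries(text, encoding):
    """Return the byte offset at which each character of text ends."""
    return list(itertools.accumulate(len(ch.encode(encoding)) for ch in text))


def decode_with_switches(data, encoding, switch_points):
    """Decode data in segments, alternating between ascii/surrogateescape
    and encoding/strict at each of the given byte offsets."""
    handlers = itertools.cycle([("ascii", "surrogateescape"), (encoding, "strict")])
    segments = []
    start = 0
    for end, (name, errors) in zip(list(switch_points) + [len(data)], handlers):
        segments.append((name, errors, data[start:end].decode(name, errors)))
        start = end
    return segments


def roundtrip(segments):
    """Re-encode every segment with the codec that decoded it."""
    return b"".join(text.encode(name, errors) for name, errors, text in segments)


def run_stress_test():
    for encoding in UNIVERSAL_ENCODINGS:
        data = SAMPLE.encode(encoding)
        boundaries = char_boundaries(SAMPLE, encoding)
        for step in (1, 3, 7):
            # Switch every `step` characters, never at the very end.
            switch_points = boundaries[step - 1:-1:step]
            segments = decode_with_switches(data, encoding, switch_points)
            # No bytes may be lost or altered across the switches.
            assert roundtrip(segments) == data
            # Strictly decoded segments must be genuine pieces of the sample.
            assert all(t in SAMPLE for _, e, t in segments if e == "strict")
    return True
```

A real test would drive the actual stream object instead of slicing bytes by hand, and would also need to cover switches that land in the middle of a multibyte sequence.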

Optionally:

- extract "codecs._switch_decoder" and "codecs._switch_encoder" helper functions to make this all a bit easier to test and debug (with a Python version in the codecs module and the C version accessible via the _codecs module)
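
As a rough idea of what a pure Python decoder helper might look like (the name switch_decoder_sketch and its signature are hypothetical, not part of the codecs module), the sketch below moves the bytes buffered by the old incremental decoder into a new one. It relies only on the documented IncrementalDecoder.getstate() contract and deliberately ignores the second element of the state, so it would not be sufficient for codecs that carry extra state such as a BOM flag or shift sequences.

```python
import codecs


def switch_decoder_sketch(old_decoder, new_encoding, errors="strict"):
    """Hypothetical helper: build a decoder for new_encoding and feed it
    the bytes the old decoder had buffered but not yet decoded."""
    # getstate() returns (buffered_input, additional_state); only the
    # buffered input is carried over in this simplified sketch.
    pending, _extra_state = old_decoder.getstate()
    new_decoder = codecs.getincrementaldecoder(new_encoding)(errors)
    text = new_decoder.decode(pending)
    return new_decoder, text


# Usage: the old decoder has seen only part of a multibyte character.
encoded = "ℙ".encode("utf-8")
old = codecs.getincrementaldecoder("utf-8")("strict")
assert old.decode(encoded[:2]) == ""          # incomplete sequence is buffered
new, carried = switch_decoder_sketch(old, "utf-8")
assert carried == ""                          # still incomplete in the new decoder
assert new.decode(encoded[2:]) == "ℙ"         # completes once the rest arrives
```

Having such a helper separate from the stream class would let the shuffling logic be tested directly against pairs of codecs, without any I/O involved.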

That way, confidence in the reliability of the feature (including across Python implementations) can be based on the strength of the test cases covering it.

History
Date	User	Action	Args
2017-01-08 03:23:28	ncoghlan	set	recipients: + ncoghlan, loewis, ishimoto, pitrou, vstinner, jwilk, mrabarnett, Arfrever, methane, nikratio, rurpy2, berker.peksag, martin.panter, serhiy.storchaka, quad
2017-01-08 03:23:28	ncoghlan	set	messageid: <1483845808.33.0.1652145511.issue15216@psf.upfronthosting.co.za>
2017-01-08 03:23:28	ncoghlan	link	issue15216 messages
2017-01-08 03:23:27	ncoghlan	create