Classification
Title: Incremental encoder incompatibility between 2.x and py3k
Type: behavior
Stage:
Components:
Versions: Python 2.7

Process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: doerwalter, pitrou, vstinner
Priority: normal
Keywords: patch

Created on 2009-06-05 20:17 by pitrou, last changed 2010-07-28 01:59 by vstinner. This issue is now closed.

Files
File name        Uploaded                      Description
utf_8_16.patch   vstinner, 2010-07-24 03:41
Messages (8)
msg88972 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-06-05 20:17
The behaviour of several incremental encoders is inconsistent between
2.x and py3k.

In 2.x:
>>> import codecs
>>> enc = codecs.getincrementalencoder('utf-16')()
>>> enc.getstate()
0
>>> enc.setstate(0)
>>> enc.encode(u'abc')
'\xff\xfea\x00b\x00c\x00'

In py3k:
>>> import codecs
>>> enc = codecs.getincrementalencoder('utf-16')()
>>> enc.getstate()
2
>>> enc.setstate(0)
>>> enc.encode('abc')
b'a\x00b\x00c\x00'
msg89073 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2009-06-08 11:13
This was done because the codec state is part of the return value of
tell(). To get a reasonable return value (i.e. one with just the
position itself) in as many cases as possible, it makes sense to design
the codec state so that the most common state is 0. This is what was
done for py3k: the default state (no BOM read/written yet) is 2, not 0,
so that 0 is free for the common case once the BOM has been handled.
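
A minimal sketch of the state values described above, assuming Python 3 and the standard codecs module (the variable names are illustrative only). It shows the utf-16 incremental encoder reporting 2 before anything has been written and 0 once the BOM is out:

```python
import codecs

enc = codecs.getincrementalencoder('utf-16')()

# Fresh encoder: no BOM written yet, so the state is 2.
state_before = enc.getstate()

# The first call emits the BOM followed by the encoded text.
first = enc.encode('a')

# The BOM has been written; the state drops to the "common" value 0.
state_after = enc.getstate()

# Later calls emit encoded text only, without a second BOM.
second = enc.encode('b')

print(state_before, state_after)
print(first, second)
```

Because 0 is the state for nearly all of a stream's lifetime, a tell() cookie that folds in the codec state usually reduces to the plain position.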
msg89074 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-06-08 11:19
Yes, I agree with py3k's behaviour. But it should be backported to 2.x
as well. I don't know where the changes must be made, so it would be
nice if someone else could do it :-)
(I'm backporting the py3k IO lib and I had to disable two tests because
of this)
msg89075 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2009-06-08 11:59
AFAICR the difference is: 2.x may return any object in getstate(), but
py3k must return a (buffered input, integer) tuple. Simply moving py3k's
getstate/setstate implementation over to 2.x might do the trick.
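
A short sketch of the py3k contract mentioned above, assuming Python 3. In the documented codecs API the (buffered input, integer) tuple is what incremental decoders return from getstate(), while incremental encoders return a plain integer. The variable names are illustrative:

```python
import codecs

dec = codecs.getincrementaldecoder('utf-16')()

# Fresh decoder: empty buffer, byte order not yet determined.
initial_state = dec.getstate()

# Feed an explicit little-endian BOM plus half of a code unit.
# The BOM is consumed; the dangling byte stays in the buffer.
partial = dec.decode(codecs.BOM_UTF16_LE + b'a')
buffered, flag = dec.getstate()

# The saved state can be replayed into a second decoder,
# which then finishes the pending character.
dec2 = codecs.getincrementaldecoder('utf-16')()
dec2.setstate((buffered, flag))
rest = dec2.decode(b'\x00')

print(initial_state, partial, buffered, rest)
```

The integer half of the tuple depends on the platform byte order, so the sketch passes it through unchanged instead of assuming a particular value.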
msg111423 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-07-24 03:41
Codecs are inconsistent: utf-32 has working getstate() / setstate() methods, whereas utf-8-sig and utf-16 don't (getstate() always returns 0, setstate() does nothing).

> Simply moving py3k's getstate/setstate implementation
> over to 2.x might do the trick.

That's what my patch does :-) It's just a copy/paste of the Python 3 code. It fixes the #5006 tests (which are re-enabled by the patch). With the patch, it's possible to save/restore the state of the utf-8-sig and utf-16 codecs.
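
As a rough illustration of what saving and restoring codec state means here, the following sketch uses the Python 3 utf-8-sig incremental encoder (the code the patch is described as copying). It is not taken from the patch itself, and the variable names are illustrative:

```python
import codecs

enc = codecs.getincrementalencoder('utf-8-sig')()

# Save the state before anything has been written.
saved = enc.getstate()

# The first call emits the UTF-8 signature (BOM) plus the text.
first = enc.encode('abc')

# Restoring the saved state rewinds the encoder to "BOM not yet written",
# so the next call emits the signature again.
enc.setstate(saved)
again = enc.encode('abc')

# Without a restore, subsequent calls emit the text only.
later = enc.encode('abc')

print(first, again, later)
```

With a getstate() that always returns 0 and a setstate() that does nothing, the restore step would have no effect and `again` would come out without the signature.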
msg111745 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-07-27 22:47
The patch looks ok to me (I suppose you have tested it).
msg111760 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-07-28 01:45
> The patch looks ok to me

Ok, committed to 2.7 (r83198).

> (I suppose you have tested it)

I ran test_io, which does test the incremental encoders.

--

I'm not brave enough to commit it to 2.6 (test_io in 2.6 doesn't use incremental encoders).
msg111762 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-07-28 01:59
> I'm not brave enough to commit it to 2.6 
> (test_io in 2.6 doesn't use incremental encoders)

Oh, I just remembered that I chose to fix this issue to be able to backport #5006 to 2.6 :-)

So r83199 is the incremental encoder fix for 2.6, and r83200 is the BOM fix for the io library.
History
Date                 User        Action  Args
2010-07-28 01:59:06  vstinner    set     messages: + msg111762
2010-07-28 01:45:22  vstinner    set     status: open -> closed; resolution: fixed; messages: + msg111760
2010-07-27 22:47:30  pitrou      set     messages: + msg111745; versions: - Python 3.2
2010-07-24 03:41:56  vstinner    set     files: + utf_8_16.patch; nosy: + vstinner; messages: + msg111423; keywords: + patch
2009-06-08 11:59:30  doerwalter  set     messages: + msg89075
2009-06-08 11:19:07  pitrou      set     messages: + msg89074
2009-06-08 11:13:55  doerwalter  set     nosy: + doerwalter; messages: + msg89073
2009-06-05 20:17:41  pitrou      create