msg80221 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2009-01-19 23:24 |
Copy/paste of message79330 from the issue #4862:
--------------
>>> f = open('utf16.txt', 'w', encoding='utf-16')
>>> f.write('abc')
3
>>> f.close()
>>> f = open('utf16.txt', 'a', encoding='utf-16')
>>> f.write('def')
3
>>> f.close()
>>> open('utf16.txt', 'r', encoding='utf-16').read()
'abc\ufeffdef'
--------------
|
msg80223 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2009-01-19 23:37 |
Bug is reproductible with:
* Python 2.5 : charset utf-8-sig and utf-16 for codecs.open()
* trunk : charset utf-8-sig, utf-16 and utf-32 for codecs.open()
* py3k : charset utf-8-sig, utf-16 and utf-32 for open()
With utf-7 or utf-8, no BOM is written.
Note: with UTF-32, the BOM is 4 bytes long (0xff 0xfe 0x00 0x00 on
little endian) but it's still the character (BOM) \ufeff (little
endian).
|
msg80303 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2009-01-21 00:41 |
See also issues #5008 (f.tell()) and #5016 (f.seekable()), not
directly related to this issue.
|
msg83554 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2009-03-14 00:49 |
One possible solution would be to call tell() in the TextIOWrapper
constructor, and then decoder.setstate((b"", 0)) if the current pos is >0.
|
msg83834 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2009-03-20 00:37 |
@pitrou: You're right, but the state have to be changed for the
encoder, not the decoder. I added the following code to TextIOWrapper
constructor (for the C and the Python version of io library):
if self._seekable and self.writable():
position = self.buffer.tell()
if position != 0:
self._encoder = self._get_encoder()
self._encoder.setstate(0)
|
msg84185 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2009-03-26 16:55 |
As for the C version of the patch:
- I'm not sure why you make self->ok handling more complicated than it was
- encodefunc should not be forced to NULL, otherwise it will yield a big
decrease in write() performance
|
msg84188 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2009-03-26 17:26 |
> - I'm not sure why you make self->ok handling more complicated than it was
tell() requires self->ok=1. I choosed to reset self->ok to zero on error, but
it's maybe useless.
> - encodefunc should not be forced to NULL, otherwise it will yield a big
> decrease in write() performance
self>encodefunc value have to be changed because <encoder>.setstate() changes
the encoding function: UTF-16 => UTF-16-LE or UTF-16-BE.
I don't know how to get the new value of self>encodefunc. I will try to write
another patch.
|
msg84193 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2009-03-26 17:41 |
Le jeudi 26 mars 2009 à 17:26 +0000, STINNER Victor a écrit :
> STINNER Victor <victor.stinner@haypocalc.com> added the comment:
>
> > - I'm not sure why you make self->ok handling more complicated than it was
>
> tell() requires self->ok=1. I choosed to reset self->ok to zero on error, but
> it's maybe useless.
You are calling tell() on the underlying binary buffer, not on the
TextIOWrapper object. So self->ok should have no impact. Am I missing
something?
> self>encodefunc value have to be changed because <encoder>.setstate() changes
> the encoding function: UTF-16 => UTF-16-LE or UTF-16-BE.
I know, but then the logic should probably be changed and use an
additional struct member to signal that the start of file has been
skipped.
|
msg84194 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2009-03-26 17:52 |
Faster patch!
- add fast encoder for UTF-32, UTF-32-LE and UTF-32-BE (copy/paste of
utf16 functions)
- move utf-8 before utf-16-* because utf8 more popular than utf16 :-p
- don't set self->encodefunc=NULL (loose all the encoder
optimisation), but only fix self->encodefunc for two special cases:
utf16=>utf16le or utf16be and utf32=>utf32le or utf32be
- remove self->ok: it was may be usefull for an older version of my
patch, but it's not more needed
|
msg84195 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2009-03-26 17:57 |
Hum, it's a detail, but is it a good idea to keep the reference the int(0)? Or
would it be better to release it at exit?
|
msg84319 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2009-03-28 18:07 |
It's better now, although I think it's not good to duplicate the
encoding switch logic. It would be better to have a separate flag
indicate whether it's the start of stream or not. I'm gonna produce a
new patch, unless you beat me to it.
Also, I'm adding Amaury to the nosy list so that he tells us whether he
thinks the approach is sound.
|
msg84321 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2009-03-28 18:53 |
Here is a new patch catering to more cases (seek()) in addition to just
opening in append mode.
|
msg84373 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2009-03-29 10:58 |
seek() has also the problem? It's really hard to encode UTF-16/32
correctly...
|
msg87554 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2009-05-10 22:52 |
Updated patch against py3k.
|
msg87748 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2009-05-14 17:50 |
If no-one objects, I'm going to commit this in a couple of days.
|
msg87753 - (view) |
Author: Benjamin Peterson (benjamin.peterson) * |
Date: 2009-05-14 18:43 |
As I said on IRC a few days ago, I think the patch is ready to go.
|
msg87754 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2009-05-14 18:56 |
Ok, committed in r72635.
|
msg109014 - (view) |
Author: Conrad.Irwin (Conrad.Irwin) |
Date: 2010-06-30 20:20 |
Shouldn't this fix be back-ported to the 2.6 branch too?
|
msg111763 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2010-07-28 02:00 |
I fixed #6213 in 2.6 and 2.7, and so it's now possible to backport this fix to 2.6 => r83200
|
|
Date |
User |
Action |
Args |
2022-04-11 14:56:44 | admin | set | github: 49256 |
2010-07-28 10:18:37 | flox | set | nosy:
+ flox
|
2010-07-28 02:00:02 | vstinner | set | messages:
+ msg111763 |
2010-06-30 20:20:25 | Conrad.Irwin | set | nosy:
+ Conrad.Irwin messages:
+ msg109014
|
2009-05-14 18:56:23 | pitrou | set | status: open -> closed resolution: fixed messages:
+ msg87754
|
2009-05-14 18:43:09 | benjamin.peterson | set | nosy:
+ benjamin.peterson messages:
+ msg87753
|
2009-05-14 17:52:45 | ezio.melotti | set | nosy:
+ ezio.melotti
|
2009-05-14 17:50:50 | pitrou | set | messages:
+ msg87748 |
2009-05-10 22:52:27 | pitrou | set | files:
+ append_bom-4.patch
messages:
+ msg87554 |
2009-05-01 21:50:41 | pitrou | set | nosy:
+ lemburg
|
2009-03-29 10:58:08 | vstinner | set | messages:
+ msg84373 |
2009-03-28 18:53:10 | pitrou | set | files:
+ append_bom-3.patch
messages:
+ msg84321 |
2009-03-28 18:07:14 | pitrou | set | nosy:
+ amaury.forgeotdarc messages:
+ msg84319
|
2009-03-26 17:57:49 | vstinner | set | messages:
+ msg84195 |
2009-03-26 17:52:52 | vstinner | set | files:
- append_bom.patch |
2009-03-26 17:52:40 | vstinner | set | files:
+ append_bom-2.patch
messages:
+ msg84194 |
2009-03-26 17:41:34 | pitrou | set | messages:
+ msg84193 |
2009-03-26 17:26:28 | vstinner | set | messages:
+ msg84188 |
2009-03-26 16:55:37 | pitrou | set | messages:
+ msg84185 |
2009-03-20 00:37:44 | vstinner | set | files:
+ append_bom.patch keywords:
+ patch messages:
+ msg83834
|
2009-03-14 00:49:57 | pitrou | set | messages:
+ msg83554 |
2009-02-28 17:25:56 | benjamin.peterson | link | issue4565 dependencies |
2009-01-21 00:41:46 | vstinner | set | messages:
+ msg80303 |
2009-01-20 00:12:20 | pitrou | set | nosy:
+ pitrou |
2009-01-19 23:37:22 | vstinner | set | messages:
+ msg80223 versions:
+ Python 2.6, Python 2.7 |
2009-01-19 23:24:35 | vstinner | create | |