Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Duplicate UTF-16 BOM if a file is open in append mode #49256

Closed
vstinner opened this issue Jan 19, 2009 · 19 comments
Closed

Duplicate UTF-16 BOM if a file is open in append mode #49256

vstinner opened this issue Jan 19, 2009 · 19 comments

Comments

@vstinner
Copy link
Member

BPO 5006
Nosy @malemburg, @amauryfa, @pitrou, @vstinner, @benjaminp, @ezio-melotti, @florentx
Files
  • append_bom-2.patch
  • append_bom-3.patch
  • append_bom-4.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2009-05-14.18:56:23.019>
    created_at = <Date 2009-01-19.23:24:35.284>
    labels = []
    title = 'Duplicate UTF-16 BOM if a file is open in append mode'
    updated_at = <Date 2010-07-28.10:18:37.955>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2010-07-28.10:18:37.955>
    actor = 'flox'
    assignee = 'none'
    closed = True
    closed_date = <Date 2009-05-14.18:56:23.019>
    closer = 'pitrou'
    components = []
    creation = <Date 2009-01-19.23:24:35.284>
    creator = 'vstinner'
    dependencies = []
    files = ['13421', '13445', '13953']
    hgrepos = []
    issue_num = 5006
    keywords = ['patch']
    message_count = 19.0
    messages = ['80221', '80223', '80303', '83554', '83834', '84185', '84188', '84193', '84194', '84195', '84319', '84321', '84373', '87554', '87748', '87753', '87754', '109014', '111763']
    nosy_count = 8.0
    nosy_names = ['lemburg', 'amaury.forgeotdarc', 'pitrou', 'vstinner', 'benjamin.peterson', 'ezio.melotti', 'Conrad.Irwin', 'flox']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = None
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue5006'
    versions = ['Python 2.6', 'Python 3.0', 'Python 3.1', 'Python 2.7']

    @vstinner
    Copy link
    Member Author

    Copy/paste of message79330 from the issue bpo-4862:
    --------------

    >>> f = open('utf16.txt', 'w', encoding='utf-16')
    >>> f.write('abc')
    3
    >>> f.close()
    
    >>> f = open('utf16.txt', 'a', encoding='utf-16')
    >>> f.write('def')
    3
    >>> f.close()
    >>> open('utf16.txt', 'r', encoding='utf-16').read()
    'abc\ufeffdef'

    @vstinner
    Copy link
    Member Author

    Bug is reproductible with:

    • Python 2.5 : charset utf-8-sig and utf-16 for codecs.open()
    • trunk : charset utf-8-sig, utf-16 and utf-32 for codecs.open()
    • py3k : charset utf-8-sig, utf-16 and utf-32 for open()

    With utf-7 or utf-8, no BOM is written.

    Note: with UTF-32, the BOM is 4 bytes long (0xff 0xfe 0x00 0x00 on
    little endian) but it's still the character (BOM) \ufeff (little
    endian).

    @vstinner
    Copy link
    Member Author

    See also issues bpo-5008 (f.tell()) and bpo-5016 (f.seekable()), not
    directly related to this issue.

    @pitrou
    Copy link
    Member

    pitrou commented Mar 14, 2009

    One possible solution would be to call tell() in the TextIOWrapper
    constructor, and then decoder.setstate((b"", 0)) if the current pos is >0.

    @vstinner
    Copy link
    Member Author

    @pitrou: You're right, but the state have to be changed for the
    encoder, not the decoder. I added the following code to TextIOWrapper
    constructor (for the C and the Python version of io library):

            if self._seekable and self.writable():
                position = self.buffer.tell()
                if position != 0:
                    self._encoder = self._get_encoder()
                    self._encoder.setstate(0)

    @pitrou
    Copy link
    Member

    pitrou commented Mar 26, 2009

    As for the C version of the patch:

    • I'm not sure why you make self->ok handling more complicated than it was
    • encodefunc should not be forced to NULL, otherwise it will yield a big
      decrease in write() performance

    @vstinner
    Copy link
    Member Author

    • I'm not sure why you make self->ok handling more complicated than it was

    tell() requires self->ok=1. I choosed to reset self->ok to zero on error, but
    it's maybe useless.

    • encodefunc should not be forced to NULL, otherwise it will yield a big
      decrease in write() performance

    self>encodefunc value have to be changed because <encoder>.setstate() changes
    the encoding function: UTF-16 => UTF-16-LE or UTF-16-BE.

    I don't know how to get the new value of self>encodefunc. I will try to write
    another patch.

    @pitrou
    Copy link
    Member

    pitrou commented Mar 26, 2009

    Le jeudi 26 mars 2009 à 17:26 +0000, STINNER Victor a écrit :

    STINNER Victor <victor.stinner@haypocalc.com> added the comment:

    > - I'm not sure why you make self->ok handling more complicated than it was

    tell() requires self->ok=1. I choosed to reset self->ok to zero on error, but
    it's maybe useless.

    You are calling tell() on the underlying binary buffer, not on the
    TextIOWrapper object. So self->ok should have no impact. Am I missing
    something?

    self>encodefunc value have to be changed because <encoder>.setstate() changes
    the encoding function: UTF-16 => UTF-16-LE or UTF-16-BE.

    I know, but then the logic should probably be changed and use an
    additional struct member to signal that the start of file has been
    skipped.

    @vstinner
    Copy link
    Member Author

    Faster patch!

    • add fast encoder for UTF-32, UTF-32-LE and UTF-32-BE (copy/paste of
      utf16 functions)
    • move utf-8 before utf-16-* because utf8 more popular than utf16 :-p
    • don't set self->encodefunc=NULL (loose all the encoder
      optimisation), but only fix self->encodefunc for two special cases:
      utf16=>utf16le or utf16be and utf32=>utf32le or utf32be
    • remove self->ok: it was may be usefull for an older version of my
      patch, but it's not more needed

    @vstinner
    Copy link
    Member Author

    Hum, it's a detail, but is it a good idea to keep the reference the int(0)? Or
    would it be better to release it at exit?

    @pitrou
    Copy link
    Member

    pitrou commented Mar 28, 2009

    It's better now, although I think it's not good to duplicate the
    encoding switch logic. It would be better to have a separate flag
    indicate whether it's the start of stream or not. I'm gonna produce a
    new patch, unless you beat me to it.

    Also, I'm adding Amaury to the nosy list so that he tells us whether he
    thinks the approach is sound.

    @pitrou
    Copy link
    Member

    pitrou commented Mar 28, 2009

    Here is a new patch catering to more cases (seek()) in addition to just
    opening in append mode.

    @vstinner
    Copy link
    Member Author

    seek() has also the problem? It's really hard to encode UTF-16/32
    correctly...

    @pitrou
    Copy link
    Member

    pitrou commented May 10, 2009

    Updated patch against py3k.

    @pitrou
    Copy link
    Member

    pitrou commented May 14, 2009

    If no-one objects, I'm going to commit this in a couple of days.

    @benjaminp
    Copy link
    Contributor

    As I said on IRC a few days ago, I think the patch is ready to go.

    @pitrou
    Copy link
    Member

    pitrou commented May 14, 2009

    Ok, committed in r72635.

    @pitrou pitrou closed this as completed May 14, 2009
    @ConradIrwin
    Copy link
    Mannequin

    ConradIrwin mannequin commented Jun 30, 2010

    Shouldn't this fix be back-ported to the 2.6 branch too?

    @vstinner
    Copy link
    Member Author

    I fixed bpo-6213 in 2.6 and 2.7, and so it's now possible to backport this fix to 2.6 => r83200

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    None yet
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants