Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

utf-16 BOM is not skipped after seek(0) #49112

Closed
amauryfa opened this issue Jan 7, 2009 · 11 comments
Closed

utf-16 BOM is not skipped after seek(0) #49112

amauryfa opened this issue Jan 7, 2009 · 11 comments

Comments

@amauryfa
Copy link
Member

amauryfa commented Jan 7, 2009

BPO 4862
Nosy @malemburg, @amauryfa, @pitrou, @vstinner, @benjaminp
Files
  • io_utf16.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2009-03-04.21:36:57.479>
    created_at = <Date 2009-01-07.00:21:16.662>
    labels = []
    title = 'utf-16 BOM is not skipped after seek(0)'
    updated_at = <Date 2009-03-05.01:00:49.629>
    user = 'https://github.com/amauryfa'

    bugs.python.org fields:

    activity = <Date 2009-03-05.01:00:49.629>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2009-03-04.21:36:57.479>
    closer = 'benjamin.peterson'
    components = []
    creation = <Date 2009-01-07.00:21:16.662>
    creator = 'amaury.forgeotdarc'
    dependencies = []
    files = ['12627']
    hgrepos = []
    issue_num = 4862
    keywords = ['patch']
    message_count = 11.0
    messages = ['79299', '79325', '79326', '79330', '79410', '80211', '80222', '83141', '83167', '83173', '83176']
    nosy_count = 5.0
    nosy_names = ['lemburg', 'amaury.forgeotdarc', 'pitrou', 'vstinner', 'benjamin.peterson']
    pr_nums = []
    priority = 'critical'
    resolution = 'fixed'
    stage = None
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue4862'
    versions = ['Python 3.0']

    @amauryfa
    Copy link
    Member Author

    amauryfa commented Jan 7, 2009

    First write a utf-16 file with its signature:

    >> f1 = open('utf16.txt', 'w', encoding='utf-16')
    >> f1.write('0123456789')
    >> f1.close()

    Then read it twice:

    >>> f2 = open('utf16.txt', 'r', encoding='utf-16')
    >>> print('read1', ascii(f2.read()))
    read1 '0123456789'
    >>> f2.seek(0)
    0
    >>> print('read2', ascii(f2.read()))
    read2 '\ufeff0123456789'

    The second read returns the BOM!
    This is because the zero in seek(0) is a "cookie" which contains both the position
    and the decoder state. Unfortunately, state=0 means 'endianness has been determined:
    native order'.

    maybe a suggestion: handle seek(0) as a special value which calls decoder.reset().
    The patch implement this idea.

    @vstinner
    Copy link
    Member

    vstinner commented Jan 7, 2009

    This is because the zero in seek(0) is a "cookie"
    which contains both the position and the decoder state.
    Unfortunately, state=0 means 'endianness has been determined:
    native order'.

    The problem is maybe that TextIOWrapper._pack_cookie() can create a
    cookie=0. Example to create a non-null value, replace:
    def _pack_cookie(self, position, ...):
    return (position | (dec_flags<<64) | ...
    def _unpack_cookie(self, bigint):
    rest, position = divmod(bigint, 1<<64)
    ...
    by
    def _pack_cookie(self, position, ...):
    return (1 | (position<<1) | (dec_flags<<65) | ...
    def _unpack_cookie(self, bigint):
    if not (bigint & 1):
    raise ValueError("invalid cookie")
    bigint >>= 1
    rest, position = divmod(bigint, 1<<64)
    ...

    Why the cookie is an integer and not an object with attributes?

    @amauryfa
    Copy link
    Member Author

    amauryfa commented Jan 7, 2009

    The problem is maybe that TextIOWrapper._pack_cookie()
    can create a cookie=0

    But only when position==0.
    And in this case, at the beginning of the stream, it makes sense to
    reset everything to its initial value: zero for the various counts, and
    call decoder.reset()

    @pitrou
    Copy link
    Member

    pitrou commented Jan 7, 2009

    Well, there are other problems with utf-16, e.g. when opening an
    existing file for appending, the BOM is written again:

    >>> f = open('utf16.txt', 'w', encoding='utf-16')
    >>> f.write('abc')
    3
    >>> f.close()
    >>> f = open('utf16.txt', 'a', encoding='utf-16')
    >>> f.write('def')
    3
    >>> f.close()
    >>> open('utf16.txt', 'r', encoding='utf-16').read()
    'abc\ufeffdef'

    Who said TextIOWrapper was sane? :-o

    @malemburg
    Copy link
    Member

    On 2009-01-07 01:21, Amaury Forgeot d'Arc wrote:
    > First write a utf-16 file with its signature:
    > 
    >>>> f1 = open('utf16.txt', 'w', encoding='utf-16')
    >>>> f1.write('0123456789')
    >>>> f1.close()
    > 
    > Then read it twice:
    > 
    >>>> f2 = open('utf16.txt', 'r', encoding='utf-16')
    >>>> print('read1', ascii(f2.read()))
    > read1 '0123456789'
    >>>> f2.seek(0)
    > 0
    >>>> print('read2', ascii(f2.read()))
    > read2 '\ufeff0123456789'
    > 
    > The second read returns the BOM!
    > This is because the zero in seek(0) is a "cookie" which contains both the position 
    > and the decoder state. Unfortunately, state=0 means 'endianness has been determined: 
    > native order'.
    > 
    > maybe a suggestion: handle seek(0) as a special value which calls decoder.reset().
    > The patch implement this idea.

    This is a problem with the utf_16.py codec, not the io layer.
    Opening a file in append mode is something that the io layer
    would have to handle, since the codec doesn't know anything about
    the underlying file mode.

    Using .reset() will not help. The code for the StreamReader
    and StreamWriter in utf_16.py will have to be modified to undo
    the adjustment of the .encode() and .decode() method after using
    .seek(0).

    Note that there's also the case .seek(1) - I guess this must
    be considered as resulting in undefined behavior.

    @pitrou
    Copy link
    Member

    pitrou commented Jan 19, 2009

    I support Amaury's suggestion (actually I implemented it in the io-c
    branch). Resetting the decoder when seeking to the beginning of the
    stream is a reasonable way to deal with those incremental decoders for
    which the start state is something else than (b"", 0).

    (and, you're right, opening in append mode is a different problem...)

    @vstinner
    Copy link
    Member

    I opened a different issue (bpo-5006) for the duplicate BOM in append
    mode.

    @benjaminp
    Copy link
    Contributor

    This has been fixed by the io-c branch merge.

    @vstinner
    Copy link
    Member

    vstinner commented Mar 5, 2009

    This has been fixed by the io-c branch merge.

    Can you at least include the patch to test_io.py from amaury's patch?

    And why not fixing the Python version of the io module (i'm not sure
    of the new name: _pyio?) since we have a working patch?

    @benjaminp
    Copy link
    Contributor

    Ah, I forgot this wasn't applied to the Python implementation. Fixed in
    r70184.

    @vstinner
    Copy link
    Member

    vstinner commented Mar 5, 2009

    @benjamin: ok, great.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    None yet
    Projects
    None yet
    Development

    No branches or pull requests

    5 participants