SyntaxError: encoding problem: iso-8859-1 on Windows #65043

Closed
miwa mannequin opened this issue Mar 3, 2014 · 15 comments
Labels
3.7 (EOL): end of life · 3.8: only security fixes · interpreter-core: (Objects, Python, Grammar, and Parser dirs) · OS-windows · type-bug: An unexpected behavior, bug, or error

Comments

@miwa
Mannequin

miwa mannequin commented Mar 3, 2014

BPO 20844
Nosy @vstinner, @nedbat, @tjguk, @benjaminp, @methane, @schlamar, @zware, @stevenwinfield, @eryksun
PRs
  • bpo-20844: open script file with "rb" mode #12616
  • [3.7] bpo-20844: open script file with "rb" mode (GH-12616) #12647
Files
  • test2.py
  • issue20844.py: Script used when reproducing the bug in slightly different ways

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2019-04-01.12:03:46.170>
    created_at = <Date 2014-03-03.13:54:42.008>
    labels = ['interpreter-core', '3.8', 'type-bug', '3.7', 'OS-windows']
    title = 'SyntaxError: encoding problem: iso-8859-1 on Windows'
    updated_at = <Date 2019-04-01.12:03:46.165>
    user = 'https://bugs.python.org/miwa'

    bugs.python.org fields:

    activity = <Date 2019-04-01.12:03:46.165>
    actor = 'methane'
    assignee = 'none'
    closed = True
    closed_date = <Date 2019-04-01.12:03:46.170>
    closer = 'methane'
    components = ['Interpreter Core', 'Windows']
    creation = <Date 2014-03-03.13:54:42.008>
    creator = 'miwa'
    dependencies = []
    files = ['34276', '47065']
    hgrepos = []
    issue_num = 20844
    keywords = ['patch']
    message_count = 15.0
    messages = ['212637', '212638', '213012', '213014', '213189', '213196', '214330', '221089', '221134', '224351', '233129', '233130', '299935', '339288', '339290']
    nosy_count = 10.0
    nosy_names = ['vstinner', 'nedbat', 'tim.golden', 'benjamin.peterson', 'miwa', 'methane', 'schlamar', 'zach.ware', 'steven.winfield', 'eryksun']
    pr_nums = ['12616', '12647']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue20844'
    versions = ['Python 3.7', 'Python 3.8']

    @miwa
    Mannequin Author

    miwa mannequin commented Mar 3, 2014

    Microsoft Windows [Version 6.1.7601]
    Copyright (c) 2009 Microsoft Corporation. All rights reserved.

    C:\bug>python
    Python 3.3.5rc2 (v3.3.5rc2:ca5635efe090, Mar  2 2014, 18:18:29) [MSC v.1600 64 bit (AMD64)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> exit()

    C:\bug>python test2.py
    File "test2.py", line 1
    SyntaxError: encoding problem: iso-8859-1

    @miwa miwa mannequin added the OS-windows label Mar 3, 2014
    @vstinner
    Member

    vstinner commented Mar 3, 2014

    It's a duplicate of the issue bpo-20731.

    @miwa
    Mannequin Author

    miwa mannequin commented Mar 10, 2014

    It seems that this is not fixed in 3.3.5. Someone please reproduce it.

    @BreamoreBoy
    Mannequin

    BreamoreBoy mannequin commented Mar 10, 2014

    Works fine for me

    @miwa
    Mannequin Author

    miwa mannequin commented Mar 12, 2014

    Thanks Mark.

    Perhaps the problem is text-mode handling. When using a Windows text-mode stream, ftell() may return -1 even if no error occurred.

    @miwa
    Mannequin Author

    miwa mannequin commented Mar 12, 2014

    When a file with LF newlines is opened in text mode, ftell() may return zero even when the position is not at the beginning of the file.

    Maybe files with LF newlines should be opened in binary mode.
    http://support.microsoft.com/kb/68337

    @schlamar
    Mannequin

    schlamar mannequin commented Mar 21, 2014

    I can reproduce this one. There are a few conditions that need to be met:

    • Linux line endings
    • File needs to have at least x lines (empty lines are fine). I guess this is why no one could reproduce it: the attached file has 19 lines, but probably no one copied and pasted the empty lines. Downloading the file reproduces this in my case. The number of newlines required depends on the length of the encoding declaration: #coding:latin-1 fails with a file of 19 lines, while #coding: latin-1 (whitespace added) requires 20 lines.

    More observations:

    • Also reproducible if utf8 is used as alias for utf-8 (#coding: utf8 + 17 lines), but not reproducible with utf-8
    • Python 3.4 is affected, too
    • No issues on Python 3.3.2
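
A minimal sketch for building a test file that meets the conditions listed above (a coding declaration followed by empty lines, all with LF-only endings). The helper name `make_repro` and its parameters are hypothetical and not part of the attached files. The number of empty lines that triggers the error depends on the declaration, so it is left as a parameter rather than a fixed value.

```python
def make_repro(coding_line: str = "#coding:latin-1", blank_lines: int = 17) -> bytes:
    """Return script bytes: a coding declaration plus empty LF-only lines."""
    # Build bytes directly so that no CRLF translation can happen on Windows.
    return coding_line.encode("ascii") + b"\n" + b"\n" * blank_lines


if __name__ == "__main__":
    data = make_repro()
    # Show the size and confirm there are no carriage returns in the output.
    print(len(data), b"\r" in data)
```

To write the file, use binary mode so the line endings stay untouched, for example `pathlib.Path("repro_lf.py").write_bytes(make_repro())` (the filename is only an example).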

    @BreamoreBoy
    Mannequin

    BreamoreBoy mannequin commented Jun 20, 2014

    I can reproduce this with 3.4.1 and 3.5.0.

    @eryksun
    Contributor

    eryksun commented Jun 20, 2014

    The fix for bpo-20731 doesn't address this bug completely because it's possible for ftell to return -1 without an actual error, as test2.py demonstrates.

    In text mode, CRLF is translated to LF by the CRT's _read function (Win32 ReadFile). So the buffer that's used by FILE streams is already translated. To get the stream position, ftell first calls _lseek (Win32 SetFilePointer) to get the file pointer. Then it adjusts the file pointer for the unwritten/unread bytes in the buffer. The problem for reading is how to tell whether or not LF in the buffer was translated from CRLF? The chosen 'solution' is to just assume CRLF.

    The example file test2.py is 33 bytes. At the time fp_setreadl calls ftell(tok->fp), the file pointer is 33, and Py_UniversalNewlineFgets has read the stream up to '#coding:latin-1\n'. That leaves 17 newline characters buffered. As stated above, ftell assumes CRLF, so it calculates the stream position as 33 - (17 * 2) == -1. That happens to be the value returned for an error, but who's checking? In this case, errno is 0 instead of the documented errno constants EBADF or EINVAL.
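
The arithmetic above can be modelled in a few lines. This is only a sketch of the behaviour as described in this comment, not the CRT's actual implementation; the function name `estimated_text_mode_position` is hypothetical.

```python
def estimated_text_mode_position(file_pointer: int, unread_buffer: bytes) -> int:
    """Model of the described estimate: each buffered LF is assumed to come from CRLF."""
    lf_count = unread_buffer.count(b"\n")
    other_bytes = len(unread_buffer) - lf_count
    # Subtract one byte per ordinary character and two per LF (assumed CRLF).
    return file_pointer - other_bytes - 2 * lf_count


# Values from the description: file pointer at 33, 17 newlines still buffered.
position = estimated_text_mode_position(33, b"\n" * 17)
print(position)  # -1
```

With real CRLF line endings the same file would be 51 bytes, and the estimate would be 51 - 34 == 17, which is the true offset just past the first line. The assumption only goes wrong when the file uses bare LF.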

    Here's an example in 2.7.7, since it uses FILE streams:

        >>> f = open('test2.py')
        >>> f.read(16)
        '#coding:latin-1\n'
        >>> f.tell()
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
        IOError: [Errno 0] Error

    Can the file be opened in binary mode in Modules/main.c? Currently it's using _Py_wfopen(filename, L"r"). But decoding_fgets calls Py_UniversalNewlineFgets, which expects binary mode anyway.
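
As an illustration of why binary mode avoids the ambiguity, the sketch below reads the first line of an LF-only file opened with "rb" and reports the position. Note the assumption: Python 3's io module does not use C FILE streams, so this shows only the property that binary mode provides (positions are plain byte offsets), not the C-level change itself. The helper name `position_after_first_line` is hypothetical.

```python
import os
import tempfile


def position_after_first_line(data: bytes) -> int:
    """Write data to a temporary file, read one line in binary mode, return tell()."""
    fd, path = tempfile.mkstemp(suffix=".py")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        with open(path, "rb") as f:
            f.readline()
            # In binary mode there is no newline translation to compensate for.
            return f.tell()
    finally:
        os.remove(path)


print(position_after_first_line(b"#coding:latin-1\n" + b"\n" * 17))  # 16
```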

    @BreamoreBoy
    Mannequin

    BreamoreBoy mannequin commented Jul 30, 2014

    I've tried to make the title more meaningful, feel free to change it if you can think of something better.

    @BreamoreBoy BreamoreBoy mannequin added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Jul 30, 2014
    @BreamoreBoy BreamoreBoy mannequin changed the title from "coding bug remains in 3.3.5rc2" to "SyntaxError: encoding problem: iso-8859-1 on Windows" Jul 30, 2014
    @BreamoreBoy BreamoreBoy mannequin added the type-bug An unexpected behavior, bug, or error label Jul 30, 2014
    @nedbat
    Member

    nedbat commented Dec 27, 2014

    This bug just bit me. Changing "# coding: utf8" to "# coding: utf-8" works around it.
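
The workaround should not change how the source is decoded, since both spellings name the same codec. A quick check with the standard `codecs` module (the helper name `same_codec` is hypothetical):

```python
import codecs


def same_codec(name_a: str, name_b: str) -> bool:
    """Return True if both encoding names resolve to the same registered codec."""
    return codecs.lookup(name_a).name == codecs.lookup(name_b).name


print(same_codec("utf8", "utf-8"))  # True
```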

    @nedbat
    Member

    nedbat commented Dec 27, 2014

    (oops: with Python 3.4.1 on Windows)

    @stevenwinfield
    Mannequin

    stevenwinfield mannequin commented Aug 8, 2017

    I've just been bitten by this on 3.6.2, Windows Server 2008 R2, when running the setup.py script for QuantLib-SWIG:
    https://github.com/lballabio/QuantLib-SWIG/blob/v1.10.x/Python/setup.py

    It seems there is different behaviour depending on whether:

    • Unix (LF) or Windows (CRLF) line endings are used
    • The file is >4096 bytes or <=4096 bytes
    • The module docstring has an initial space

    Some of that has been mentioned previously, but I think the 4096-byte limit might be new, which is why I'm posting.

    I've attached a script I used to come up with the results below. It contains:

    • a -*- coding line (for iso-8859-1 in this case)
    • a docstring consisting entirely of lines of x's, of length 78
    • Unix line endings

    The file's length is exactly 4096 bytes.

    Running this, or slightly modified versions of this, with a 3.6.2 interpreter gave the following results:

    • In all cases, when Windows line endings were used there was no issue - running the script produced no errors or output.

    • With Unix line endings:

      • File length <= 4096, with no leading spaces in the docstring:
        File "bpo-20844.py", line 1
        SyntaxError: encoding problem: iso-8859-1

      • File length > 4096, with no leading spaces in the docstring:
        File "bpo-20844.py", line 56
        xxxxx"""
        ^
        SyntaxError: EOF while scanning triple-quoted string literal

      • Any file length, with the first 'x' on line 3 replaced with a space (line 2 if the coding line is ignored):
        File "bpo-20844.py", line 2
        xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
        ^
        IndentationError: unexpected indent

    I had no issues with python 2.7.13.
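
A sketch of how a file like the one described above could be generated at an exact size. The layout (opening quotes on their own line, padding in the last docstring line) is an assumption and may differ from the attached issue20844.py; the function name `build_script` is hypothetical.

```python
def build_script(target_size: int = 4096, width: int = 78) -> bytes:
    """Build a coding line plus a docstring of x's, LF-only, exactly target_size bytes."""
    header = b"# -*- coding: iso-8859-1 -*-\n"
    opening = b'"""\n'
    closing = b'"""\n'
    remaining = target_size - len(header) - len(opening) - len(closing)
    if remaining < 0:
        raise ValueError("target_size is too small for the fixed parts")
    # Fill with full lines of x's, then a shorter last line to hit the exact size.
    full_lines, tail = divmod(remaining, width + 1)
    body = (b"x" * width + b"\n") * full_lines + b"x" * tail
    return header + opening + body + closing


if __name__ == "__main__":
    for size in (4096, 4097):
        print(size, len(build_script(size)))
```

Writing the result with `open(path, "wb")` keeps the Unix line endings intact, which makes it possible to compare the <=4096 and >4096 cases on an affected interpreter.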

    @methane
    Member

    methane commented Apr 1, 2019

    New changeset 10654c1 by Inada Naoki in branch 'master':
    bpo-20844: open script file with "rb" mode (GH-12616)
    10654c1

    @methane
    Member

    methane commented Apr 1, 2019

    New changeset 8384670 by Inada Naoki in branch '3.7':
    bpo-20844: open script file with "rb" mode (GH-12616)
    8384670

    @methane methane added 3.7 (EOL) end of life 3.8 only security fixes labels Apr 1, 2019
    @methane methane closed this as completed Apr 1, 2019
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022