string.decode() fails on long strings #45862

Closed
eisele mannequin opened this issue Nov 29, 2007 · 16 comments
Labels: topic-unicode, type-bug (An unexpected behavior, bug, or error)

@eisele
Mannequin

eisele mannequin commented Nov 29, 2007

BPO 1521
Nosy @doerwalter, @amauryfa
Files
  • getargs.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = 'https://github.com/amauryfa'
    closed_at = <Date 2007-11-30.21:55:07.469>
    created_at = <Date 2007-11-29.15:33:06.429>
    labels = ['type-bug', 'expert-unicode']
    title = 'string.decode() fails on long strings'
    updated_at = <Date 2007-11-30.21:55:07.468>
    user = 'https://bugs.python.org/eisele'

    bugs.python.org fields:

    activity = <Date 2007-11-30.21:55:07.468>
    actor = 'amaury.forgeotdarc'
    assignee = 'amaury.forgeotdarc'
    closed = True
    closed_date = <Date 2007-11-30.21:55:07.469>
    closer = 'amaury.forgeotdarc'
    components = ['Unicode']
    creation = <Date 2007-11-29.15:33:06.429>
    creator = 'eisele'
    dependencies = []
    files = ['8832']
    hgrepos = []
    issue_num = 1521
    keywords = []
    message_count = 16.0
    messages = ['57932', '57934', '57935', '57936', '57938', '57962', '57969', '57970', '57972', '57973', '57993', '57994', '57995', '57996', '58008', '58015']
    nosy_count = 3.0
    nosy_names = ['doerwalter', 'amaury.forgeotdarc', 'eisele']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = None
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue1521'
    versions = ['Python 2.5']

    @eisele
    Mannequin Author

    eisele mannequin commented Nov 29, 2007

    s.decode("utf-8")

    silently truncates the result in some cases when s has more than 2E9 bytes,
    and in other cases raises a fairly incomprehensible exception:

    Traceback (most recent call last):
      File "<stdin>", line 2, in <module>
      File "/usr/lib64/python2.5/encodings/utf_8.py", line 16, in decode
        return codecs.utf_8_decode(input, errors, True)
    TypeError: utf_8_decode() argument 1 must be (unspecified), not str

    @eisele eisele mannequin added topic-unicode type-bug An unexpected behavior, bug, or error labels Nov 29, 2007
    @doerwalter
    Contributor

    Can you attach a (small) example that demonstrates the bug?

    @eisele
    Mannequin Author

    eisele mannequin commented Nov 29, 2007

    For instance:

    Python 2.5.1 (r251:54863, Aug 30 2007, 16:15:51) 
    [GCC 4.1.0 (SUSE Linux)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    __[1] >>> s=" "*int(5E9)
    6.050000 sec
    __[1] >>> u=s.decode("utf-8")
    4.710000 sec
    __[1] >>> len(u) 
    705032704
    __[2] >>> len(s)
    5000000000
    __[3] >>> 

    I would have expected both lengths to be 5E9

    @eisele
    Mannequin Author

    eisele mannequin commented Nov 29, 2007

    An instance of the other problem:

    Python 2.5.1 (r251:54863, Aug 30 2007, 16:15:51) 
    [GCC 4.1.0 (SUSE Linux)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    __[1] >>> s=" "*int(25E8)
    2.990000 sec
    __[1] >>> u=s.decode("utf-8")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/cl-home/eisele/lns-root-07/lib/python2.5/encodings/utf_8.py", line 16, in decode
        return codecs.utf_8_decode(input, errors, True)
    TypeError: utf_8_decode() argument 1 must be (unspecified), not str
    __[1] >>>

    @amauryfa
    Member

    I don't have any 64bit machine to test with,
    but it seems to me that there is a problem in the function
    getargs.c::convertsimple(): the t# and w# formats use the buffer
    interface, but the code uses an int to store its length!

    Look for the variables declared as "int count;". I suggest replacing them
    with Py_ssize_t in both places.

    Shouldn't the compiler emit some warning in this case?
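
As an illustration of the diagnosis above (not part of the patch), here is a minimal Python sketch of what happens when a 64-bit length is stored in a 32-bit signed C int. The helper name `as_c_int` is hypothetical, and the sketch assumes a 32-bit two's-complement `int`, which is typical but not guaranteed by C.

```python
def as_c_int(n, bits=32):
    """Return the value n would read back as from a signed integer of `bits` bits."""
    n &= (1 << bits) - 1          # keep only the low `bits` bits
    if n >= 1 << (bits - 1):      # top bit set: the value reads back as negative
        n -= 1 << bits
    return n

# The two string lengths reported earlier in this issue.
print(as_c_int(5 * 10**9))    # 705032704, the truncated len(u) from the first report
print(as_c_int(25 * 10**8))   # -1794967296, a negative length
```

Under these assumptions, the first length matches the truncated result exactly, and the second becomes negative, which is consistent with the buffer being rejected with a TypeError.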

    @amauryfa
    Member

    Here is a patch, with a unit test (I was surprised that test_bigmem.py
    already contained a test_decode function, which was left empty).

    But I still don't have access to any 64bit machine.
    Can someone try and see if the new tests in test_bigmem.py fail, and
    that the patch in getargs.c corrects the problem?

    @eisele
    Mannequin Author

    eisele mannequin commented Nov 30, 2007

    Thanks a lot for the patch, which indeed seems to solve the issue.
    Alas, the extended test code still does not catch the problem, at
    least in my installation. Someone with a better understanding of
    how these tests work and with access to a 64bit machine should
    still have a look.

    @amauryfa
    Member

    > Alas, the extended test code still does not catch the problem

    Can you please try again by changing in the tests:
    minsize=_2G
    into
    minsize=_2G * 2 + 2
    The length has to be greater than 4G for an int to lose digits.

    @eisele
    Mannequin Author

    eisele mannequin commented Nov 30, 2007

    Tried
    @bigmemtest(minsize=_2G*2+2, memuse=3)
    but no change; the test is done only once with a small
    size (5147). Apparently something does not work as
    expected here. I'm trying this with 2.6 (Revision 59231).

    @amauryfa
    Member

    > the test is done only once with a small size (5147)

    How do you run the test? Do you specify a maximum available size?
    If you run test_bigmem.py directly, try to run it with an additional
    argument like this:
    ./test_bigmem.py 7G
    If you run regrtest.py, you should add an option like "-M 7G".
    (assuming you have enough RAM...)

    @eisele
    Mannequin Author

    eisele mannequin commented Nov 30, 2007

    > How do you run the test? Do you specify a maximum available size?

    I naively assumed that running "make test" from the toplevel would be
    clever about finding plausible parameters. However, it runs the bigmem
    tests in a minimalistic way, skipping essentially all interesting bits.

    Thanks for the hints on giving the maximal available size explicitly,
    which work in principle, but make testing rather slow. Also, if the
    encode/decode tests are decorated with
    @bigmemtest(minsize=_2G*2+2, memuse=3)
    one needs to specify at least -M 15g, otherwise the tests are still
    skipped. No wonder that people do not normally run them...

    @amauryfa
    Member

    > @bigmemtest(minsize=_2G*2+2, memuse=3)

    minsize=_2G + 2 should trigger your second problem (where the size wraps
    to a negative number). Then 7G is "enough" for the test to run.

    @eisele
    Mannequin Author

    eisele mannequin commented Nov 30, 2007

    > Then 7G is "enough" for the test to run.

    yes, indeed, thanks for pointing this out.
    It runs and detects an ERROR, and after applying your patch it succeeds.

    What else needs to be done to make sure your patch finds its way to the
    Python core?

    @amauryfa
    Member

    > What else needs to be done to make sure your patch finds its way
    > to the Python core?

    Nothing, I suppose. It looks like an inconsistency in the source code,
    and fixing it happens to correct a real problem. I will commit it in a few hours.

    @amauryfa
    Member

    Committed revision 59241. Will backport after the buildbots run the test.

    @amauryfa amauryfa self-assigned this Nov 30, 2007
    @amauryfa
    Member

    Committed revision 59244 in release25-maint.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022