This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: Large tarfiles cause overflow
Type: Stage:
Components: Library (Lib) Versions: Python 2.4
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: georg.brandl Nosy List: complex, georg.brandl, lars.gustaebel, loewis, rhettinger, tree
Priority: normal Keywords:

Created on 2005-06-06 19:19 by tree, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description
bz2module-lfs-seek.diff georg.brandl, 2005-06-10 11:45
Messages (10)
msg25497 - (view) Author: Tom Emerson (tree) Date: 2005-06-06 19:19
I have a 4 gigabyte bz2 compressed tarfile containing some 3.3 
million documents. I have a script which opens this file with "r:bz2" 
and is simply iterating over the contents using next(). With 2.4.1 I 
still get an Overflow error (originally tried with 2.3.5 as packaged in 
Mac OS 10.4.1):

Traceback (most recent call last):
  File "extract_part.py", line 47, in ?
    main(sys.argv)
  File "extract_part.py", line 39, in main
    pathnames = find_valid_paths(argv[1], 1024, count)
  File "extract_part.py", line 13, in find_valid_paths
    f = tf.next()
  File "/usr/local/lib/python2.4/tarfile.py", line 1584, in next
    self.fileobj.seek(self.offset)
OverflowError: long int too large to convert to int
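
For reference, the access pattern described in this report can be sketched as follows. This is only an illustration in present-day Python 3, not the reporter's extract_part.py (which is not attached to the issue); the helper names build_sample_archive and list_members are hypothetical, and the tiny in-memory archive is far too small to trigger the overflow.

```python
import io
import tarfile


def build_sample_archive(names):
    """Return an in-memory bz2-compressed tar holding one tiny file per name."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:bz2") as tf:
        for name in names:
            payload = name.encode("utf-8")
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tf.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf


def list_members(fileobj):
    """Walk the archive with next(), as in the report, collecting member names."""
    found = []
    with tarfile.open(fileobj=fileobj, mode="r:bz2") as tf:
        member = tf.next()  # returns None once the archive is exhausted
        while member is not None:
            found.append(member.name)
            member = tf.next()
    return found


sample_names = ["doc-%d.txt" % i for i in range(3)]
member_names = list_members(build_sample_archive(sample_names))
print(member_names)
```

Each call to next() makes tarfile seek the underlying file object to the offset of the following header, which is the seek() call shown in the traceback.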
msg25498 - (view) Author: Lars Gustäbel (lars.gustaebel) * (Python committer) Date: 2005-06-07 13:23
A quick look at the problem reveals that this is a bug in
bz2.BZ2File. The seek() method does not allow position
values >= 2GiB.
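
The 2 GiB boundary is what one would expect if the offset is stored in a signed 32-bit C int, whose largest value is 2**31 - 1. The sketch below only illustrates that range using the struct module; it does not reproduce the bz2 code itself, and fits_in_c_int is a hypothetical helper name.

```python
import struct

INT32_MAX = 2**31 - 1  # largest value of a signed 32-bit integer


def fits_in_c_int(offset):
    """Return True if offset can be packed into a signed 32-bit integer."""
    try:
        struct.pack("=i", offset)  # "=i" is a standard-size (4-byte) signed int
        return True
    except struct.error:
        return False


print(fits_in_c_int(INT32_MAX))      # last offset that still fits
print(fits_in_c_int(INT32_MAX + 1))  # exactly 2 GiB
print(fits_in_c_int(4 * 2**30))      # an offset near the end of a 4 GiB stream
```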
msg25499 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2005-06-09 20:31
Attaching a patch which mimics the behaviour of normal file
objects. This should resolve the issue on platforms with
large file support.
msg25500 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2005-06-10 11:45
Attaching corrected patch.
msg25501 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2005-06-13 01:32
Is there a way to write a test for this?
Can it be done without a conditional compile?
Is the problem one that occurs in other code outside of bz?
msg25502 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2005-06-18 21:26
I looked into this a bit further, and noticed the following:

The modules bz2, cStringIO and mmap all use plain integers
to represent file offsets given to or returned by seek(),
tell() and truncate().

They should be corrected to use a 64-bit type when large
file support is available. fileobject.c defines its own type
for this, Py_off_t, which should be shared among the other modules.

A conditional compile is needed, since different
macros/functions must be used.
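
As a point of comparison, a regular file object on a build with large file support accepts offsets beyond the signed 32-bit range. The sketch below uses present-day Python 3 and assumes such a platform; seek_past_2gib and LARGE_OFFSET are hypothetical names, and nothing is written, so the temporary file stays empty.

```python
import os
import tempfile

LARGE_OFFSET = 2**31 + 10  # just past the signed 32-bit limit


def seek_past_2gib(path):
    """Seek a regular file beyond 2 GiB and return the position tell() reports."""
    with open(path, "wb") as f:
        f.seek(LARGE_OFFSET)  # no data is written at this position
        return f.tell()


fd, tmp_path = tempfile.mkstemp()
os.close(fd)
try:
    reported_offset = seek_past_2gib(tmp_path)
finally:
    os.remove(tmp_path)

print(reported_offset)
```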
msg25503 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2005-06-18 22:05
Martin, please look at this when you get a chance.
msg25504 - (view) Author: Viktor Ferenczi (complex) Date: 2005-06-20 23:44
The bug has been reproduced with a 90 MB bz2 file containing more than 4 GB of fairly similar documents. I've diagnosed the same problem with large offsets. Thanks for the patch.

Platform: WinXP Intel P4, Python 2.4.1
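
A small compressed file can still produce large offsets because seek() positions refer to the uncompressed stream, and repetitive input compresses very well. A minimal sketch of that effect, using made-up sample data rather than the documents from this report:

```python
import bz2

# Hypothetical sample data: one short line repeated many times.
line = b"fairly similar document\n"
data = line * 200_000  # about 4.8 MB of uncompressed input

compressed = bz2.compress(data)
ratio = len(data) / len(compressed)

print(len(data), len(compressed))
```

The exact compressed size depends on the bz2 implementation, so no figure is quoted here; the point is only that the uncompressed length, not the file size on disk, determines how large the offsets get.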
msg25505 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2005-08-25 11:24
The patch is fine, please apply.

As for generalising Py_off_t: there are some issues which I
keep forgetting. fpos_t is not guaranteed to be an integral
type, and indeed, on Linux, it is not. I'm not entirely
sure why this patch works; I think that on all
platforms where fpos_t is not integral, off_t happens to be
large enough. The only case where off_t is not large enough
is (IIRC) Windows, where fpos_t can be used.

So this is all somewhat muddy, and if this gets generalized,
a more elaborate comment seems to be in order.
msg25506 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2005-08-25 13:11
I just realized that I accidentally committed the patch
together with the fix for #1191043.

Modules/bz2module r1.25, r1.23.2.2.
History
Date User Action Args
2022-04-11 14:56:11 admin set github: 42059
2005-06-06 19:19:18 tree create