classification
Title: tarfile use wrong code when read from fileobj
Type: behavior Stage: test needed
Components: Library (Lib) Versions: Python 3.5
process
Status: closed Resolution: wont fix
Dependencies: Superseder:
Assigned To: Nosy List: lars.gustaebel, martin.panter, socketpair
Priority: normal Keywords:

Created on 2016-04-28 21:44 by socketpair, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (9)
msg264450 - (view) Author: Марк Коренберг (socketpair) * Date: 2016-04-28 21:44
tarfile.py: _FileInFile():

(near line 687)

b = self.fileobj.read(length)
if len(b) != length:
    raise ReadError("unexpected end of data")

No read() API guarantees that it will return `length` bytes. So if fileobj returns fewer bytes than requested, that is not an error (!)

In my case it was a pipe...
msg264454 - (view) Author: Марк Коренберг (socketpair) * Date: 2016-04-28 22:59
The same applies to tarfile.copyfileobj().
msg264469 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-04-29 06:52
Can you give a demonstration script? I don’t see how this could be triggered. If you use a tarfile mode like "r|", it internally uses a _Stream object which has a loop to do exact reads: <https://hg.python.org/cpython/annotate/v3.5.1/Lib/tarfile.py#l567>.
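
For illustration, here is a minimal sketch of reading an archive through an OS pipe with the stream mode "r|" mentioned above. The helper names (build_tar_bytes, read_members_from_pipe) and the member name hello.txt are made up for this example; only tarfile.open(), TarInfo, addfile() and extractfile() are real tarfile APIs. The writer runs in a thread so the example does not depend on the pipe buffer size.

```python
import io
import os
import tarfile
import threading


def build_tar_bytes():
    """Build a tiny in-memory tar archive with one member (hypothetical data)."""
    payload = b"hello from a pipe\n"
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        info = tarfile.TarInfo(name="hello.txt")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def read_members_from_pipe(data):
    """Push `data` through an OS pipe and read it back using mode "r|"."""
    read_fd, write_fd = os.pipe()

    def writer():
        # Write the archive into the pipe, then close the write end.
        try:
            with os.fdopen(write_fd, "wb") as w:
                w.write(data)
        except BrokenPipeError:
            pass  # the reader closed its end early

    thread = threading.Thread(target=writer)
    thread.start()
    result = {}
    with os.fdopen(read_fd, "rb") as r:
        # "r|" reads sequentially and never seeks on the underlying object.
        with tarfile.open(fileobj=r, mode="r|") as tar:
            for member in tar:
                extracted = tar.extractfile(member)
                result[member.name] = extracted.read() if extracted else None
    thread.join()
    return result


members = read_members_from_pipe(build_tar_bytes())
print(members)
```

Members have to be consumed in order in this mode, since the stream cannot go back to an earlier offset.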
msg264479 - (view) Author: Марк Коренберг (socketpair) * Date: 2016-04-29 09:15
Well, I don't use "r|" (but I will, thanks for the suggestion).

In any case, assuming that read() returns exactly the requested length is wrong. There is .readexactly() (in asyncio, for example). Otherwise, one should use a simple loop that calls .read() multiple times.

Reading from a plain file does not guarantee that the read() syscall will return the requested number of bytes, even when those bytes are available.
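
A minimal sketch of the kind of loop described here, assuming a blocking file-like object. read_exactly and DribbleReader are hypothetical names for this example and are not part of tarfile or the standard library.

```python
import io


class DribbleReader:
    """Hypothetical file-like object whose read() returns at most 3 bytes."""

    def __init__(self, data):
        self._src = io.BytesIO(data)

    def read(self, size):
        return self._src.read(min(size, 3))


def read_exactly(fileobj, length):
    """Call fileobj.read() repeatedly until `length` bytes are collected.

    Raises EOFError if the stream ends first. Assumes a blocking stream,
    so an empty result is treated as end of data.
    """
    chunks = []
    remaining = length
    while remaining > 0:
        chunk = fileobj.read(remaining)
        if not chunk:
            raise EOFError("unexpected end of data")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


single = DribbleReader(b"0123456789").read(8)           # one short read
exact = read_exactly(DribbleReader(b"0123456789"), 8)   # looped read
print(single, exact)
```

A single read() here comes back short, while the loop only reports an error when the data really runs out.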
msg264481 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-04-29 10:08
On the other hand, you cannot use a pipe with mode="r" because that mode does seeking; that is why I asked for more details on what you are doing:

$ cat | python3 -c 'import tarfile, sys; tarfile.open(fileobj=sys.stdin.buffer, mode="r")'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.5/tarfile.py", line 1580, in open
    return func(name, filemode, fileobj, **kwargs)
  File "/usr/lib/python3.5/tarfile.py", line 1610, in taropen
    return cls(name, mode, fileobj, **kwargs)
  File "/usr/lib/python3.5/tarfile.py", line 1467, in __init__
    self.offset = self.fileobj.tell()
OSError: [Errno 29] Illegal seek

Python 3 has the io.RawIOBase class which models the low level read() system call and does partial reads, and the io.BufferedIOBase class whose read() method guarantees an exact read. You can often wrap a raw object with BufferedReader to easily convert to the buffered kind.
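
As a rough sketch of that wrapping: PartialRaw below is a made-up raw stream that hands out at most 4 bytes per call, standing in for a pipe or socket. io.RawIOBase and io.BufferedReader are the real classes; everything else is an assumption for the demonstration.

```python
import io


class PartialRaw(io.RawIOBase):
    """Hypothetical raw stream that returns at most `max_chunk` bytes per call."""

    def __init__(self, data, max_chunk=4):
        super().__init__()
        self._data = bytes(data)
        self._pos = 0
        self._max_chunk = max_chunk

    def readable(self):
        return True

    def readinto(self, buf):
        # Copy a deliberately small slice to imitate a partial read.
        n = min(len(buf), self._max_chunk, len(self._data) - self._pos)
        buf[:n] = self._data[self._pos:self._pos + n]
        self._pos += n
        return n


payload = bytes(range(100))

# The raw object models the system call: one call, possibly fewer bytes.
raw_result = PartialRaw(payload).read(50)

# BufferedReader keeps calling the raw object until 50 bytes or EOF.
buffered_result = io.BufferedReader(PartialRaw(payload)).read(50)

print(len(raw_result), len(buffered_result))
```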
msg264569 - (view) Author: Марк Коренберг (socketpair) * Date: 2016-04-30 17:57
Well, there is more than one workaround for that.

man read:
===============
If a read() is interrupted by a signal after it has successfully read some data, it shall return the number of bytes read.
=====================

So this is one way to build an "exploit" for this bug.
msg264570 - (view) Author: Lars Gustäbel (lars.gustaebel) * (Python committer) Date: 2016-04-30 18:11
Please give us some example test code that shows us what goes wrong exactly.
msg265968 - (view) Author: Марк Коренберг (socketpair) * Date: 2016-05-20 21:58
Cannot reproduce on Linux. Will reopen if it reproduces on MacOS X or BSD.

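#!/usr/bin/python3.5

import os
import signal


p = os.getpid()

# Child process: keep sending SIGCHLD to the parent, trying to
# interrupt its read() calls, until the parent exits.
if os.fork() == 0:
    while True:
        try:
            os.kill(p, signal.SIGCHLD)
        except (ProcessLookupError, KeyboardInterrupt):
            break
    os._exit(0)

# Parent process: create a large sparse file (2 TiB plus one byte).
qwe = open('qwe.dat', 'w+b')
qwe.seek(2*1024*1024*1024*1024)
qwe.write(b'0')
qwe.flush()
qwe.seek(0)
# Read in 2 MiB chunks and fail on the first short read.
while True:
    d = qwe.read(65536 * 32)
    if len(d) != 65536 * 32:
        raise Exception('!')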
msg270207 - (view) Author: Марк Коренберг (socketpair) * Date: 2016-07-11 20:27
http://stackoverflow.com/questions/1964806/short-read-from-filesystem-when-can-it-happen

Disk-based filesystems generally use uninterruptible reads, which means that the read operation generally cannot be interrupted by a signal. Network-based filesystems sometimes use interruptible reads, which can return partial data or no data. (In the case of NFS this is configurable using the intr mount option.) They sometimes also implement timeouts.

> can return partial data

It seems that reading a tar file from an NFS filesystem may trigger this bug.
History
Date                 User              Action  Args
2022-04-11 14:58:30  admin             set     github: 71064
2016-07-11 20:27:54  socketpair        set     messages: + msg270207
2016-05-20 21:58:30  socketpair        set     status: open -> closed; resolution: wont fix; messages: + msg265968
2016-04-30 18:11:53  lars.gustaebel    set     messages: + msg264570
2016-04-30 17:57:02  socketpair        set     messages: + msg264569
2016-04-29 10:08:05  martin.panter     set     messages: + msg264481
2016-04-29 09:15:57  socketpair        set     messages: + msg264479
2016-04-29 06:52:13  martin.panter     set     nosy: + martin.panter; messages: + msg264469; stage: test needed
2016-04-28 22:59:59  socketpair        set     messages: + msg264454
2016-04-28 22:18:33  serhiy.storchaka  set     nosy: + lars.gustaebel
2016-04-28 21:44:42  socketpair        create