classification
Title: Bug in file.read(), can access unknown data.
Type: behavior Stage:
Components: Documentation Versions: Python 2.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: docs@python Nosy List: Alexander.Steppke, docs@python, r.david.murray, terry.reedy, vstinner
Priority: normal Keywords:

Created on 2011-10-13 18:40 by Alexander.Steppke, last changed 2011-10-15 00:43 by terry.reedy.

Messages (8)
msg145477 - (view) Author: Alexander Steppke (Alexander.Steppke) Date: 2011-10-13 18:40
The tempfile module shows strange behavior under certain conditions. This might lead to data leaking or other problems. 

The test session looks as follows:

Python 2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import tempfile
>>> tmp = tempfile.TemporaryFile()
>>> tmp.read()
''
>>> tmp.write('test')
>>> tmp.read()
'P\xf6D\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\ [omitted]'

or similar behavior in text mode: 

Python 2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import tempfile
>>> tmp = tempfile.TemporaryFile('w+t')
>>> tmp.read()
''
>>> tmp.write('test')
>>> tmp.read()
'\x00\xa5\x8b\x02int or long, hash(a) is used instead.\n    i\x10 [omitted]'
>>> tmp.seek(0)
>>> tmp.readline()
'test\x00\xa5\x8b\x02int or long, hash(a) is used instead.\n'

This bug seems to be triggered by calling tmp.read() before tmp.seek(). I am running Python 2.7.2 on Windows 7 x64; other people have reproduced the problem on Windows XP, but not under Linux or Cygwin (see also http://stackoverflow.com/questions/7757663/python-tempfile-broken-or-am-i-doing-it-wrong).

Thank you for looking into this.
Alexander
msg145480 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-10-13 19:11
I wonder if it is a bug in Windows?  Have you tried similar experiments with regular files?  tempfile is really just about *where* the files are located (and what happens when they are closed), not about their fundamental nature as OS file objects.  (I could be wrong about that on Windows of course, I'm more familiar with Linux.)
msg145501 - (view) Author: Alexander Steppke (Alexander.Steppke) Date: 2011-10-14 09:13
Hi David,

I followed your suggestion and tried to reproduce the problem without the tempfile module. It turns out that there is indeed an underlying issue. I am not sure what the root cause is, but this is now an even bigger problem: read() returns data from some file or memory that it was never intended to access.

The session looks similar to the tempfile session:

>>> tmp = open('tmp', 'w+t')
>>> tmp.read()
''
>>> tmp.write('test')
>>> tmp.read()
'hp\'\x02\xe4\xb9>7\x80\x88\x81\x02\x01\x00\x00\x00\x00\x00\x00\x00\x12\x00\x00\
x00\xe86(\x02p\x11\x8d\x02\x01\x00\x00\x00@\xfd)\x02\xe7Y\x9aN\x01\x00\x00\x00\x
00\x00\x00\x00\x14\x00\x00\x00\x087(\x02\x00\x00\x00\x00\xe9Y\x0b\xa2\x00\x93+\x
02\x01\x00\x00\x00\x00\x00\x00\x00\x00\x9b,\x02\x02\x00\x00\x00\xe06(\x02\xc0W5\

At the moment the bug could only be reproduced using CPython 2.7.1 on Windows XP and Windows 7. 

Alexander
msg145502 - (view) Author: Alexander Steppke (Alexander.Steppke) Date: 2011-10-14 09:20
Additionally, after calling tmp.close(), the file 'tmp' contains the string 'test' followed by about 4 kB of binary data similar to the previous output of tmp.read().
msg145508 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-10-14 11:37
This issue is a duplicate of issue #1394612, which was closed as invalid. Read the following message:

http://bugs.python.org/issue1394612#msg27200

I suppose that Python 3 is not affected by this issue because it no longer uses fread/fwrite, but calls read/write directly (the low-level, unbuffered API).

It looks like Python cannot do anything about this issue except document this surprising behaviour. Would you like to write a patch for the documentation?
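
A minimal sketch of the low-level, unbuffered calls mentioned above, using the os module as a stand-in. This is only an illustration under that assumption, not a description of how Python 3's io layer is implemented. With no C stdio buffer involved, a read issued right after a write is well defined: the offset sits at end-of-file, so nothing is returned until the descriptor is repositioned. The scratch file created by tempfile.mkstemp() is just an example.

```python
import os
import tempfile

# Hypothetical scratch file; mkstemp() returns an OS-level file descriptor
# opened for both reading and writing.
fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"test")
    # Read directly after write: the offset is at end-of-file, so this
    # returns an empty string rather than stray buffer contents.
    at_eof = os.read(fd, 1024)
    # Reposition to the start and read back what was written.
    os.lseek(fd, 0, os.SEEK_SET)
    data = os.read(fd, 1024)
finally:
    os.close(fd)
    os.remove(path)

print(repr(at_eof), repr(data))
```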
msg145513 - (view) Author: Alexander Steppke (Alexander.Steppke) Date: 2011-10-14 12:37
Thank you for the update, Victor. It seems to me that this is exactly the same issue.

The current documentation says (http://docs.python.org/library/stdtypes.html#bltin-file-objects):

"Note: This function is simply a wrapper for the underlying fread() C function, and will behave the same in corner cases, such as whether the EOF value is cached."

This hints at the current behavior, but I would not infer from it that file.read() can return arbitrary data when called directly after file.write(). Maybe one could include a link to, or a snippet of, the C standard, which states that one shall not do this:

"When a file is opened with update mode ('+' as the second or third character in the above list of mode argument values), both input and output may be performed on the associated stream. However, output shall not be directly followed by input without an intervening call to the fflush function or to a file positioning function (fseek, fsetpos, or rewind), and input shall not be directly followed by output without an intervening call to a file positioning function, unless the input operation encounters end-of-file."
 
(from http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1124.pdf, page 272)
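
The rule quoted above could be followed in Python roughly as sketched below: a file opened with '+' in its mode gets a seek() between every switch from writing to reading and back. This is an illustrative sketch, not proposed documentation wording. The byte strings are arbitrary sample data.

```python
import os
import tempfile

tmp = tempfile.TemporaryFile()   # default mode is 'w+b' (update mode)

tmp.write(b"test")
tmp.seek(0)                      # positioning call between output and input
data = tmp.read()                # reads back only what was written

tmp.seek(0, os.SEEK_END)         # positioning call between input and output
tmp.write(b" more")

tmp.seek(0)                      # again before switching back to input
all_data = tmp.read()
tmp.close()

print(repr(data), repr(all_data))
```

The seek before the second write is not strictly required by the quoted text, because the preceding read() ran into end-of-file, but always repositioning keeps the pattern uniform.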
msg145541 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-10-14 15:55
On 14/10/2011 14:37, Alexander Steppke wrote:
> "When a file is opened with update mode ('+' as the second or third character in the above list of mode argument values),

You can just say " '+' in the file mode ".

> the fflush function or to a file positioning function (fseek, fsetpos, or rewind),

You should translate these names into Python method names:
  fflush -> file.flush()
  fseek/fsetpos -> file.seek()
  rewind -> (not exposed in Python)
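
Using the Python method names listed above, the rule could also be enforced automatically. The sketch below is a hypothetical helper (the class name UpdateModeFile and its methods are invented for illustration and are not part of any library): it remembers the last operation and calls seek() on the current position whenever the caller switches between reading and writing.

```python
import tempfile


class UpdateModeFile(object):
    """Hypothetical wrapper for a file opened with '+' in its mode."""

    def __init__(self, fileobj):
        self._f = fileobj
        self._last = None          # 'r', 'w', or None after an explicit seek

    def _switch(self, op):
        # When changing direction, seek to the current position. The
        # position does not move, but the call counts as a positioning
        # operation between output and input (or input and output).
        if self._last is not None and self._last != op:
            self._f.seek(self._f.tell())
        self._last = op

    def read(self, size=-1):
        self._switch('r')
        return self._f.read(size)

    def write(self, data):
        self._switch('w')
        return self._f.write(data)

    def seek(self, offset, whence=0):
        self._last = None
        return self._f.seek(offset, whence)

    def close(self):
        self._f.close()


f = UpdateModeFile(tempfile.TemporaryFile())
f.write(b"test")
tail = f.read()    # position is at end-of-file, so nothing is returned
f.seek(0)
head = f.read()    # the data that was written
f.close()
```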
msg145577 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2011-10-15 00:43
This issue has come up enough (tracker and python-list) that I think adding a mild adaptation of the C standard paragraph might be a good idea. Changing to a doc issue.
History
Date                 User               Action  Args
2011-10-15 00:43:19  terry.reedy        set     nosy: + terry.reedy, docs@python; messages: + msg145577; assignee: docs@python; components: + Documentation, - Library (Lib), Windows, IO
2011-10-14 15:55:20  vstinner           set     messages: + msg145541
2011-10-14 12:37:29  Alexander.Steppke  set     messages: + msg145513
2011-10-14 11:37:12  vstinner           set     messages: + msg145508
2011-10-14 09:30:09  vstinner           set     nosy: + vstinner
2011-10-14 09:20:44  Alexander.Steppke  set     messages: + msg145502
2011-10-14 09:15:11  Alexander.Steppke  set     components: + IO; title: Bug in tempfile module -> Bug in file.read(), can access unknown data.
2011-10-14 09:13:47  Alexander.Steppke  set     messages: + msg145501
2011-10-13 19:11:22  r.david.murray     set     nosy: + r.david.murray; messages: + msg145480
2011-10-13 18:40:10  Alexander.Steppke  create