classification
Title: Regression in file.writelines behavior
Type: behavior Stage: resolved
Components: IO Versions: Python 2.7
process
Status: closed Resolution: out of date
Dependencies: Superseder:
Assigned To: Nosy List: ZackerySpytz, pitrou, r.david.murray, serhiy.storchaka, snaury, socketpair, terry.reedy, xtreak
Priority: normal Keywords:

Created on 2016-05-29 17:41 by snaury, last changed 2020-07-06 07:43 by terry.reedy. This issue is now closed.

Messages (6)
msg266605 - (view) Author: Alexey Borzenkov (snaury) Date: 2016-05-29 17:41
There's a regression in file.writelines behavior for binary files when writing unicode strings, which seems to have first appeared in Python 2.7.7. When unicode strings are written, the internal representation (UCS2 or UCS4) ends up in the file instead of the actual text. This directly contradicts the documentation, which states "This is equivalent to calling write() for each string". On Python 2.7.7+ the two are no longer equivalent:

>>> open('testfile.bin', 'wb').writelines([u'Hello, world!'])
>>> open('testfile.bin', 'rb').read()
'H\x00e\x00l\x00l\x00o\x00,\x00 \x00w\x00o\x00r\x00l\x00d\x00!\x00'
>>> open('testfile.bin', 'wb').write(u'Hello, world!')
>>> open('testfile.bin', 'rb').read()
'Hello, world!'

This code worked correctly on Python 2.7.6.
msg266628 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2016-05-29 20:21
Any chance you could bisect to figure out which changeset caused the problem? I'm surprised that something like this would happen; we aren't in general making changes at that level to python2 any more.
msg266630 - (view) Author: Alexey Borzenkov (snaury) Date: 2016-05-29 20:28
Didn't need to bisect, it's very easy to find the problematic commit, since writelines doesn't change that often:

https://hg.python.org/releases/2.7.11/rev/db842f730432

The old code was buggy in the sense that it always called PyObject_AsCharBuffer due to the way the condition is structured, but this bugginess was what allowed it to work correctly with unicode objects. After the commit, unicode objects are treated like any other buffer, and that's why the internal UCS2 or UCS4 representation gets written to the file.
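
For code that has to run on an affected 2.7.x release, one way to sidestep the buffer path described above is to encode text explicitly before it reaches the file object. The following is only a minimal sketch under stated assumptions: the helper names writelines_encoded and roundtrip are hypothetical, and UTF-8 is an assumed encoding choice (it is not taken from the report, and may differ from the default encoding that write() would apply on Python 2).

```python
import os
import tempfile


def writelines_encoded(fileobj, lines, encoding="utf-8"):
    """Write each line to a binary file object, encoding text by hand.

    Hypothetical helper; the UTF-8 default is an assumption.
    """
    for line in lines:
        # Text (unicode on Python 2, str on Python 3) is converted to bytes
        # first, so the file object never sees it through the buffer interface.
        if not isinstance(line, bytes):
            line = line.encode(encoding)
        fileobj.write(line)


def roundtrip(lines):
    """Write lines to a temporary binary file and return the raw bytes."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "wb") as f:
            writelines_encoded(f, lines)
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)
```

Calling roundtrip([u'Hello, ', u'world!']) should give back the plain encoded text rather than a UCS2 or UCS4 dump, because only byte strings are ever handed to write().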
msg266636 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2016-05-29 20:40
Thanks.
msg372535 - (view) Author: Zackery Spytz (ZackerySpytz) * (Python triager) Date: 2020-06-28 21:36
Python 2 is EOL, so I think this issue should be closed.
msg373073 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2020-07-06 07:43
With the 'b' and 'u' removed, writelines([s]) and write(s) now both read back as s.
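
The check described in the last message can be sketched roughly as follows. This assumes Python 3 and a text-mode file; the helper name read_back, the use of a temporary file, and the explicit UTF-8 encoding are illustrative choices, not details from the thread.

```python
import os
import tempfile


def read_back(write_func, text):
    """Call write_func(f, text) on a text-mode file and return what was stored.

    Hypothetical helper for illustration; Python 3 only.
    """
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        # newline="" disables newline translation in both directions.
        with open(path, "w", encoding="utf-8", newline="") as f:
            write_func(f, text)
        with open(path, "r", encoding="utf-8", newline="") as f:
            return f.read()
    finally:
        os.remove(path)


s = "Hello, world!"
via_writelines = read_back(lambda f, t: f.writelines([t]), s)
via_write = read_back(lambda f, t: f.write(t), s)
```

Both via_writelines and via_write should compare equal to s, which matches the "equivalent to calling write() for each string" wording quoted in the original report.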
History
Date User Action Args
2020-07-06 07:43:51 terry.reedy set status: open -> closed; nosy: + terry.reedy; messages: + msg373073; resolution: out of date; stage: resolved
2020-06-28 21:36:05 ZackerySpytz set nosy: + ZackerySpytz; messages: + msg372535
2018-09-23 15:16:52 xtreak set nosy: + xtreak
2016-05-29 21:02:52 serhiy.storchaka set nosy: + serhiy.storchaka
2016-05-29 20:40:11 r.david.murray set nosy: + pitrou; messages: + msg266636
2016-05-29 20:28:46 snaury set messages: + msg266630
2016-05-29 20:21:36 r.david.murray set nosy: + r.david.murray; messages: + msg266628
2016-05-29 18:33:57 socketpair set nosy: + socketpair
2016-05-29 17:41:19 snaury create