classification
Title: "codecs" module on Windows uses incorrect end-of-line, writing broken Unicode (UTF-8) files
Type: Stage:
Components: Unicode Versions: Python 2.5
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: Technologov, georg.brandl, lemburg, pitrou
Priority: normal Keywords:

Created on 2008-02-16 21:30 by Technologov, last changed 2008-02-17 11:51 by lemburg. This issue is now closed.

Messages (7)
msg62470 - Author: Technologov (Technologov) Date: 2008-02-16 21:30
The "codecs" module on Windows writes an incorrect end-of-line sequence,
making it impossible to write Unicode files.

See below for how to reproduce the bug (Python 2.5.1 on Windows XP):
===================================================================

# buggy Unicode support module:
import codecs
filewr = codecs.open('myfile.txt', 'w', 'utf-8')
filewr.write("abc" + "\n")
filewr.close()  # flush to disk before opening the file in Notepad
===================================================================
Now try opening 'myfile.txt' in Windows Notepad.
The bug is perfectly visible.

The code below, however, gives correct results:
===================================================================

filewr = open('myfile.txt', 'w')
filewr.write("abc" + "\n")
filewr.close()
===================================================================
Basically, this bug _prevents_ me from writing Unicode text files.

NOTE: I'm not sure whether this bug belongs to the "Windows" or the "Unicode"
component.
NOTE: This bug is reproducible even without writing Unicode characters.

-Technologov, 16.02.2008.
msg62471 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-02-16 22:13
Could you explain what exactly is wrong with the end-of-line on Windows?

Note that "Unicode text files" on Windows are generally interpreted as
UTF-16 encoded files. Perhaps that's what makes you think there's a bug.
msg62473 - Author: Technologov (Technologov) Date: 2008-02-16 22:27
OK: try
filewr.write("abc"+"\n"+"abc")

The file will be generated with 7 bytes in it (it should be 8, because
Windows uses a two-byte line ending).

Without the "codecs" module, everything works fine and the file has
8 bytes in it (see the 2nd example).

Plus, the text will be corrupted when opened with Windows Notepad.

-Technologov
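The 7-versus-8 byte count reported above can be checked directly. The following is a minimal sketch and not part of the original report: the helper name `size_written_by_codecs` and the temporary file are illustrative assumptions. It writes the same string through codecs.open() and compares the resulting file size with the encoded length of the LF and CRLF variants.

```python
import codecs
import os
import tempfile


def size_written_by_codecs(text, encoding="utf-8"):
    """Write text with codecs.open() and return the file size in bytes."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        writer = codecs.open(path, "w", encoding)
        try:
            writer.write(text)
        finally:
            writer.close()
        return os.path.getsize(path)
    finally:
        os.remove(path)


text = u"abc\nabc"
# Encoded length when the newline stays a single LF byte.
lf_size = len(text.encode("utf-8"))
# Encoded length when the newline is written as CR LF.
crlf_size = len(text.replace(u"\n", u"\r\n").encode("utf-8"))
# Size actually produced by codecs.open(), which never translates newlines.
codecs_size = size_written_by_codecs(text)

print(lf_size, crlf_size, codecs_size)
```

Because codecs.open() opens the file in binary mode, `codecs_size` matches `lf_size` on every platform, Windows included.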
msg62474 - Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-02-16 23:15
As stated in the codecs.open() docstring: """Files are always opened in
binary mode, even if no binary mode was specified. This is done to avoid
data loss due to encodings using 8-bit values""". This certainly means
you have to insert "\r\n" yourself (instead of just "\n") if you want
the file contents to respect the end-of-line convention under Windows...
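A minimal sketch of the workaround described above, assuming the text uses "\n" internally and should reach the disk with Windows line endings. The helper name `write_text_crlf` is hypothetical and only serves as an illustration.

```python
import codecs


def write_text_crlf(path, text, encoding="utf-8"):
    """Write text through codecs.open() with explicit CRLF line endings."""
    # Normalise any existing CRLF first so it is not turned into CR CR LF.
    normalised = text.replace(u"\r\n", u"\n")
    crlf_text = normalised.replace(u"\n", u"\r\n")
    # codecs.open() uses binary mode, so the bytes are written unchanged.
    writer = codecs.open(path, "w", encoding)
    try:
        writer.write(crlf_text)
    finally:
        writer.close()
```

Calling `write_text_crlf('myfile.txt', u"abc\nabc")` should then produce the 8-byte file the reporter expected.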
msg62485 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-02-17 10:44
As Antoine already pointed out: the codecs.open() function does not
support the C lib's text mode. As a result, no magical conversion of a
single newline to a CRLF takes place.

Closing as invalid.
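For readers who want the newline translation that codecs.open() does not perform, one possible alternative is the `io` module's text mode. This is only a sketch under stated assumptions: `io.open()` is not available in the Python 2.5 discussed in this issue (it needs Python 2.6 or later), and the helper name `write_text_translated` is hypothetical.

```python
import io


def write_text_translated(path, text, encoding="utf-8", newline="\r\n"):
    """Write text in text mode so that each LF is translated on output."""
    # newline="\r\n" forces CRLF on every platform; newline=None would use
    # the platform default (os.linesep) instead.
    with io.open(path, "w", encoding=encoding, newline=newline) as writer:
        writer.write(text)
```

Unlike the explicit replacement approach, the translation here happens inside the text layer, so the caller keeps writing plain "\n".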
msg62488 - Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2008-02-17 11:34
The note in the docstring wasn't in the documentation. Fixed this in r60873.
msg62490 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-02-17 11:51
Thanks, Georg.
History
Date                 User          Action  Args
2008-02-17 11:51:03  lemburg       set     messages: + msg62490
2008-02-17 11:34:04  georg.brandl  set     nosy: + georg.brandl; messages: + msg62488
2008-02-17 10:44:39  lemburg       set     status: open -> closed; resolution: not a bug; messages: + msg62485
2008-02-16 23:15:12  pitrou        set     nosy: + pitrou; messages: + msg62474
2008-02-16 22:27:35  Technologov   set     messages: + msg62473
2008-02-16 22:13:13  lemburg       set     nosy: + lemburg; messages: + msg62471
2008-02-16 21:30:45  Technologov   create