classification
Title: msgfmt cannot cope with BOM
Type: behavior Stage: needs patch
Components: Demos and Tools, Unicode Versions: Python 3.3, Python 3.2, Python 3.1, Python 2.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: loewis Nosy List: cito, eric.araujo, haypo, loewis, rhettinger
Priority: normal Keywords: patch

Created on 2007-04-10 20:58 by cito, last changed 2011-01-10 22:19 by haypo.

Files
File name Uploaded Description Edit
msgfmt.diff cito, 2007-04-10 20:58 review
Messages (7)
msg31755 - (view) Author: Christoph Zwerschke (cito) Date: 2007-04-10 20:58
If a .po file has a BOM (byte order mark) at the beginning, as is often the case for utf-8 files created on Windows, msgfmt.py complains about a syntax error.

The attached patch fixes this problem.
msg31756 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2007-04-11 16:07
Martin, is this your code?
msg31757 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2007-04-11 22:13
It's my code, but I will need to establish first whether it's a bug. That depends on what the PO specification says, and, if it is silent on the matter, what GNU gettext does.
msg31758 - (view) Author: Christoph Zwerschke (cito) Date: 2007-04-12 09:10
It may well be that GNU gettext also chokes on a BOM, because they aren't used under Linux. But I think as a Python tool it should be more Windows-tolerant. The annoying thing is that you get a syntax error, but everything looks right because the BOM is usually invisible. Such error messages are really frustrating. Either the BOM should be silently ignored (as in the patch) or the users should get a friendly error message asking them to save the file without BOM. If GNU behaves badly to Windows users, that's not an excuse to do the same. They are already suffering enough because of their (or their bosses') bad choice of OS ;-)
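The "friendly error message" alternative could look roughly like the sketch below. It assumes the .po file has been read as raw bytes; `check_no_bom` is a hypothetical helper name, not something that exists in msgfmt.py or in the attached patch.

```python
import codecs


def check_no_bom(data, filename):
    """Raise a readable error if raw .po bytes start with a UTF-8 BOM.

    Hypothetical helper for illustration; not part of msgfmt.py.
    """
    if data.startswith(codecs.BOM_UTF8):
        raise ValueError(
            "%s starts with a UTF-8 byte order mark (BOM); "
            "please save the file without a BOM" % filename
        )
    # No BOM found: hand the bytes back unchanged.
    return data
```

With this approach the user sees the cause of the failure instead of a syntax error on a line that looks correct.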

msg70042 - (view) Author: Christoph Zwerschke (cito) Date: 2008-07-19 16:17
Small improvement of the patch: Instead of hardcoding the BOM as
'\xef\xbb\xbf', we should use codecs.BOM_UTF8.
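A minimal sketch of that suggestion, assuming the file content is available as raw bytes. `strip_utf8_bom` is a hypothetical name used only for illustration and is not taken from msgfmt.diff.

```python
import codecs


def strip_utf8_bom(data):
    """Return data without a leading UTF-8 BOM, if one is present.

    Hypothetical helper for illustration; not the code from the patch.
    """
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```

`codecs.BOM_UTF8` is the same three bytes as the hardcoded literal, so behavior should not change; the constant only makes the intent explicit.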
msg125940 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-01-10 22:18
Extract of the Unicode standard: "Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature".

See also the following section explaining issues with UTF-8 BOM:
http://en.wikipedia.org/wiki/Byte_order_mark#UTF-8

I agree that Python should handle (UTF-8) BOM to read a CSV file (#7185), because the file format is common on Windows.

But msgfmt is a UNIX tool: I would expect Python to behave like the original msgfmt tool and fail with a fatal error on the BOM "invisible character". How do you explain to a user that msgfmt fails but msgfmt.py does not?

About the patch: *ignoring* the BOM is not a good idea. The BOM announces the encoding (e.g. UTF-8): if a Content-Type header announces another encoding, you should raise an error.
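One way to read this requirement is sketched below. It assumes the raw bytes of the .po file are available and that the charset can be found with a simple regular expression over the header entry. Both `_CHARSET_RE` and `check_bom_matches_charset` are hypothetical names, and this is not how msgfmt.py or GNU msgfmt actually parse the header.

```python
import codecs
import re

# Assumed pattern for the charset in the PO header entry, e.g.
# "Content-Type: text/plain; charset=UTF-8\n"
_CHARSET_RE = re.compile(rb'charset=([A-Za-z0-9_.:-]+)')


def check_bom_matches_charset(data):
    """Strip a UTF-8 BOM, but refuse files whose header disagrees with it.

    Hypothetical helper for illustration only.
    """
    if not data.startswith(codecs.BOM_UTF8):
        return data
    match = _CHARSET_RE.search(data)
    if match is not None:
        declared = match.group(1).decode('ascii')
        # codecs.lookup() normalizes aliases such as "UTF8" and "utf-8";
        # it raises LookupError for an encoding name it does not know.
        if codecs.lookup(declared).name != 'utf-8':
            raise ValueError(
                "file starts with a UTF-8 BOM but the Content-Type "
                "header declares charset=%s" % declared
            )
    return data[len(codecs.BOM_UTF8):]
```

A file with a BOM and no charset declaration is accepted in this sketch; whether that case should also be an error is left open here.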
msg125941 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-01-10 22:19
See also issue #7651: "Python3: guess text file charset using the BOM".
History
Date User Action Args
2011-01-10 22:19:48 haypo set nosy: loewis, rhettinger, cito, haypo, eric.araujo
messages: + msg125941
2011-01-10 22:18:38 haypo set nosy: loewis, rhettinger, cito, haypo, eric.araujo
messages: + msg125940
2011-01-06 17:03:44 pitrou set nosy: + haypo
stage: test needed -> needs patch
versions: + Python 2.7, Python 3.2, Python 3.3, - Python 2.6
2010-06-11 14:58:50 eric.araujo set nosy: + eric.araujo
2009-05-15 02:21:09 ajaksu2 set versions: + Python 2.6, Python 3.1, - Python 2.5
nosy: loewis, rhettinger, cito
components: + Unicode
keywords: + patch
type: behavior
stage: test needed
2008-07-19 16:17:29 cito set messages: + msg70042
2007-04-10 20:58:04 cito create