Created on 2012-03-30 11:12 by zmk, last changed 2014-07-29 12:38 by pocock. This issue is now closed.
|msg157135 - (view)||Author: marko kreen (zmk)||Date: 2012-03-30 11:12|
SysLogHandler converts message to utf8 and adds BOM, supposedly to conform with RFC5424, but the implementation is broken: the RFC specifies that the BOM should prefix only unstructured message part, but current Python implementation puts it in the middle of structured part, thus confusing RFC-compliant receivers. Simplest fix would be to just remove the BOM adding.
|msg157147 - (view)||Author: Vinay Sajip (vinay.sajip) *||Date: 2012-03-30 18:32|
> the RFC specifies that the BOM should prefix only unstructured > message part, but current Python implementation puts it in the > middle of structured part, thus confusing RFC-compliant receivers. How do you work that out? The implementation puts the BOM before the message part: prio = '<%d>' % self.encodePriority(...) prio = prio.encode('utf-8') msg = msg.encode('utf-8') if codecs: msg = codecs.BOM_UTF8 + msg msg = prio + msg Perhaps I've misunderstood you or the RFC5424, but I don't quite see how. I'll mark this as "pending" and "invalid" awaiting a clarification from you (ideally with a specific example).
|msg157572 - (view)||Author: marko kreen (zmk)||Date: 2012-04-05 11:55|
The 'msg' in SysLogHandler does not correspond to MSG in RFC. Nor to %(message)s in log record. RFC: ------------------------ SYSLOG-MSG = HEADER SP STRUCTURED-DATA [SP MSG] HEADER = PRI VERSION SP TIMESTAMP SP HOSTNAME SP APP-NAME SP PROCID SP MSGID PRI = "<" PRIVAL ">" [...] MSG = MSG-ANY / MSG-UTF8 MSG-ANY = *OCTET ; not starting with BOM MSG-UTF8 = BOM UTF-8-STRING BOM = %xEF.BB.BF -------------------------------- logging: The SysLogHandler does not provide any fields after PRI. Which is OK, those can be added via format string. We are using following formatter to get RFC-like structure to message: [formatter_syslog] format=%(hostname)s %(service_name)s %(job_name)s %(message)s That's not fully compliant, but that's beside the point. The problem is that SysLogHandler puts BOM before hostname not before %(message)s. In Python 2.6 it put it even before PRI, which was even more annoying. I see 2 solutions to this: 1) Stop putting BOM at all, its not necessary. 2) Put the BOM into %(message)s somehow, thus working with whatever format is given.
|msg157597 - (view)||Author: Vinay Sajip (vinay.sajip) *||Date: 2012-04-05 14:53|
Ok, I see what the problem is. I could go for option 1 - leave the BOM out, encode the string as UTF-8 but send it as just a bunch of bytes, i.e. the MSG-ANY variant of the spec. However, this could break any existing code that doesn't use structured data before the message, as you are doing, and relies on the MSG-UTF8 variant. So while agreeing with you that the situation isn't ideal, I don't see how I can change things while preserving backward compatibility, other than: Introduce a class-level _insert_BOM attribute, defaulting to True but which can be set to False on a per-instance level. The BOM would only be inserted if this were True, so the current behaviour is preserved, but you could set the attribute to False and leave the BOM out. However, isn't ideal either, as it requires you to be sensitive to the exact version of Python in use - not easy if you're developing a library rather than an application. I will consider a different approach in 3.3, e.g. provide one or more overridable methods to construct the message sent across the socket. Thoughts? If these methods won't suffice, you can always resort to subclassing the handler and overriding the emit method (not that that's ideal, either).
|msg157598 - (view)||Author: marko kreen (zmk)||Date: 2012-04-05 15:08|
Note additional brokenness in BOM addition: * It does add BOM even when msg is ascii, just because it was in unicode() string. * It does not add BOM to UTF8 string if it is already encoded to str(). So highly doubt anybody actually relies on it. And keeping such addition just increases chance that anybody starts to rely on it, so it would be good to just drop it. For 3.3, I think it would be best to move syslog packet formatting out from emit(), because emit() already contains noticeable amount of unrelated code, so full override it is annoying. I suggested changing BOM addition so that it only adds it to %(message)s, but such change could cause backwards incompatibility. So it does not look like good idea. But dropping BOM even in old Python *does* look like good idea, as anybody using SysLoghandler needs to deal with BOM-less UTF8 anyway, so dropping it will not create backwards incompatibility.
|msg158030 - (view)||Author: Vinay Sajip (vinay.sajip) *||Date: 2012-04-11 13:12|
I have a possible suggestion about how to resolve this issue: The SysLogHandler will not do BOM insertion unless the message is Unicode. If it is Unicode, it will add the attribute 'UTF8BOM' to the LogRecord, with the value u'\ufeff'. The record will then be formatted; if the format string contains the UTF8BOM placeholder, it will be replaced with the value u'\ufeff', which, when encoded, results in the UTF-8 BOM value '\xef\xbb\xbf'. The user of the format string is responsible for ensuring that: 1. If there's no UTF8BOM placeholder in the format string, everything in the formatted result must always encode to plain ASCII when encoded using UTF-8. 2. If there is a UTF8BOM placeholder in the format string, everything in the formatted result prior to the placeholder must always encode to plain ASCII when encoded using UTF-8. The stuff following can, of course, be free of that restriction. 3. The end result of encoding should be a prefix which is bytes of pure ASCII, then the BOM (if the placeholder is present in the format string), then bytes of UTF-8 encoded Unicode. In any case, a Unicode string will be encoded using UTF-8. If no UTF8BOM placeholder was present, no BOM will be; the message can be considered to just be a set of octets, which just happens to be UTF-8 encoded Unicode. If the placeholder was present, the BOM should appear at the appropriate place to comply with RFC 5424. On 3.2, the message will always be Unicode, and the above processing will take place (whereas on 2.x it will be conditional on the type of the formatted message string being Unicode). This seems to provide a resolution to the issue which can be solved without API changes, and with changes to the format string if BOM insertion is needed. With no UTF8BOM placeholder, the BOM will simply not be inserted. Can you comment on this suggestion?
|msg158046 - (view)||Author: Guido van Rossum (gvanrossum) *||Date: 2012-04-11 15:41|
But why on earth would one want a BOM in UTF-8-encoded data? It is byte-order independent!
|msg158047 - (view)||Author: Tim Golden (tim.golden) *||Date: 2012-04-11 15:46|
It's used by some systems (Windows Notepad does this if you save as UTF8) to indicate that the byte stream *is* UTF8-encoded. It's not so much a BOM as a magic cookie. I can't speak for syslog, I'm afraid TJG
|msg158048 - (view)||Author: Vinay Sajip (vinay.sajip) *||Date: 2012-04-11 15:52|
> But why on earth would one want a BOM in UTF-8-encoded data? It is byte-order independent! Lord only knows, but the RFC does call for it - msg157572 has an actual excerpt from RFC 5424.
|msg158108 - (view)||Author: Gregory P. Smith (gregory.p.smith) *||Date: 2012-04-12 07:03|
Your three step approach makes sense... But it _is_ still technically a new API though in that the UTF8BOM placeholder for LogRecord's is being introduced. What would the behavior be when run on an older version without support for that placeholder be? I'm okay with adding this but wouldn't be surprised if release managers are not. But I personally have no need for syslog logging from Python so it'd be better to have someone who needs this to work properly chime in. Perhaps just fixing it nicely in 3.3 is sufficient while documenting the misbehavior as a known issue for 2.7 and 3.2.
|msg158117 - (view)||Author: Vinay Sajip (vinay.sajip) *||Date: 2012-04-12 09:22|
> What would the behavior be when run on an older version without support for that placeholder be? Then it would fail when the format string contained e.g. %(UTF8BOM)s and there was no corresponding attribute in the LogRecord - but that's true of any feature which is introduced in newer Pythons. I get that it would be unexpected in a point release, and that's why I've posted about this on c.l.py (no feedback from there so far). > I'm okay with adding this but wouldn't be surprised if release managers are not. Hence the post to python-dev, but no release manager has expressed an opinion yet. > Perhaps just fixing it nicely in 3.3 is sufficient while documenting the misbehavior as a known issue for 2.7 and 3.2. This is doable at a user level, except for the fact that the BOM insertion is currently unconditional. If I just remove the BOM insertion in 2.7 and 3.2, then I don't need to do the UTF8BOM placeholder thing in the stdlib; I could just add a cookbook recipe telling users how to do it. I am thinking about a different solution for 3.3 anyway, i.e. adding one or more overridable methods to SysLogHandler. Since no one has objected on c.l.py about the proposed change (which implied that if there were no objections, the change would happen) Marko may be right that not many people are affected, or care. I'll wait a little while longer, and if no objections are forthcoming I'll remove the BOM insertion in 2.7 and 3.2, add a cookbook recipe for those who need a BOM and leave it at that for those versions.
|msg158447 - (view)||Author: Roundup Robot (python-dev)||Date: 2012-04-16 13:44|
New changeset af46a001d5ec by Vinay Sajip in branch '2.7': Issue #14452: remove BOM insertion code. http://hg.python.org/cpython/rev/af46a001d5ec New changeset 89ab589f6fa7 by Vinay Sajip in branch '3.2': Closes #14452: remove BOM insertion code. http://hg.python.org/cpython/rev/89ab589f6fa7 New changeset 372aa4267a43 by Vinay Sajip in branch 'default': Closes #14452: remove BOM insertion code. http://hg.python.org/cpython/rev/372aa4267a43
|msg158458 - (view)||Author: R. David Murray (r.david.murray) *||Date: 2012-04-16 14:30|
|msg158464 - (view)||Author: Roundup Robot (python-dev)||Date: 2012-04-16 15:22|
New changeset 603301cfb194 by Vinay Sajip in branch 'default': Closes #14452: brought tests in line with removal of BOM insertion code. http://hg.python.org/cpython/rev/603301cfb194
|msg224223 - (view)||Author: Daniel Pocock (pocock)||Date: 2014-07-29 12:38|
As a workaround, Python 3.2 users can clobber the global variable codecs like this from their own init code: logging.handlers.codec = None There is a more complete example here: https://github.com/dpocock/python-rfc5424-logging-formatter
messages: + msg224223
|2012-07-27 02:33:25||r.david.murray||link||issue15462 superseder|
|2012-04-16 15:22:47||python-dev||set||status: open -> closed|
messages: + msg158464
|2012-04-16 14:30:36||r.david.murray||set||status: closed -> open|
nosy: + r.david.murray
messages: + msg158458
|2012-04-16 13:44:45||python-dev||set||status: open -> closed|
nosy: + python-dev
messages: + msg158447
|2012-04-12 09:22:28||vinay.sajip||set||messages: + msg158117|
messages: + msg158108
|2012-04-11 15:52:04||vinay.sajip||set||messages: + msg158048|
messages: + msg158047
messages: + msg158046
|2012-04-11 13:12:22||vinay.sajip||set||messages: + msg158030|
|2012-04-06 21:39:41||vinay.sajip||set||assignee: vinay.sajip|
|2012-04-05 15:08:30||zmk||set||messages: + msg157598|
|2012-04-05 14:53:51||vinay.sajip||set||resolution: not a bug -> (no value)|
messages: + msg157597
|2012-04-05 11:55:39||zmk||set||status: pending -> open|
messages: + msg157572
|2012-03-30 18:32:55||vinay.sajip||set||status: open -> pending|
resolution: not a bug
messages: + msg157147
|2012-03-30 14:20:05||r.david.murray||set||versions: - Python 2.6, Python 3.1, Python 3.4|
|2012-03-30 11:13:24||zmk||set||type: behavior|