classification
Title: SysLogHandler sends invalid messages when using unicode
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.2, Python 3.3, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: vinay.sajip Nosy List: gregory.p.smith, gvanrossum, pocock, python-dev, r.david.murray, roysmith, scjody, tim.golden, vinay.sajip, zmk
Priority: normal Keywords:

Created on 2012-03-30 11:12 by zmk, last changed 2014-07-29 12:38 by pocock. This issue is now closed.

Messages (15)
msg157135 - (view) Author: marko kreen (zmk) Date: 2012-03-30 11:12
SysLogHandler converts message to utf8 and adds BOM, supposedly
to conform with RFC5424, but the implementation is broken:
the RFC specifies that the BOM should prefix only unstructured
message part, but current Python implementation puts it in the
middle of structured part, thus confusing RFC-compliant receivers.

Simplest fix would be to just remove the BOM adding.
msg157147 - (view) Author: Vinay Sajip (vinay.sajip) * (Python committer) Date: 2012-03-30 18:32
> the RFC specifies that the BOM should prefix only unstructured
> message part, but current Python implementation puts it in the
> middle of structured part, thus confusing RFC-compliant receivers.

How do you work that out? The implementation puts the BOM before the message part:

        prio = '<%d>' % self.encodePriority(...)
        prio = prio.encode('utf-8')
        msg = msg.encode('utf-8')
        if codecs:
            msg = codecs.BOM_UTF8 + msg
        msg = prio + msg

Perhaps I've misunderstood you or the RFC5424, but I don't quite see how. I'll mark this as "pending" and "invalid" awaiting a clarification from you (ideally with a specific example).
msg157572 - (view) Author: marko kreen (zmk) Date: 2012-04-05 11:55
The 'msg' in SysLogHandler does not correspond to MSG in RFC.
Nor to %(message)s in log record.

RFC:
------------------------
SYSLOG-MSG      = HEADER SP STRUCTURED-DATA [SP MSG]
HEADER          = PRI VERSION SP TIMESTAMP SP HOSTNAME
                  SP APP-NAME SP PROCID SP MSGID
PRI             = "<" PRIVAL ">"
[...]
MSG             = MSG-ANY / MSG-UTF8
MSG-ANY         = *OCTET ; not starting with BOM
MSG-UTF8        = BOM UTF-8-STRING
BOM             = %xEF.BB.BF
--------------------------------

logging:

The SysLogHandler does not provide any fields after PRI.
Which is OK, those can be added via format string.

We are using following formatter to get RFC-like structure
to message:

  [formatter_syslog]
  format=%(hostname)s %(service_name)s %(job_name)s %(message)s

That's not fully compliant, but that's beside the point.
The problem is that SysLogHandler puts BOM before hostname
not before %(message)s.  In Python 2.6 it put it even before PRI,
which was even more annoying.

I see 2 solutions to this:

1) Stop putting BOM at all, its not necessary.
2) Put the BOM into %(message)s somehow, thus working with whatever
format is given.
msg157597 - (view) Author: Vinay Sajip (vinay.sajip) * (Python committer) Date: 2012-04-05 14:53
Ok, I see what the problem is. I could go for option 1 - leave the BOM out, encode the string as UTF-8 but send it as just a bunch of bytes, i.e. the MSG-ANY variant of the spec. However, this could break any existing code that doesn't use structured data before the message, as you are doing, and relies on the MSG-UTF8 variant. So while agreeing with you that the situation isn't ideal, I don't see how I can change things while preserving backward compatibility, other than:

Introduce a class-level _insert_BOM attribute, defaulting to True but which can be set to False on a per-instance level. The BOM would only be inserted if this were True, so the current behaviour is preserved, but you could set the attribute to False and leave the BOM out. However, isn't ideal either, as it requires you to be sensitive to the exact version of Python in use - not easy if you're developing a library rather than an application.

I will consider a different approach in 3.3, e.g. provide one or more overridable methods to construct the message sent across the socket.

Thoughts? If these methods won't suffice, you can always resort to subclassing the handler and overriding the emit method (not that that's ideal, either).
msg157598 - (view) Author: marko kreen (zmk) Date: 2012-04-05 15:08
Note additional brokenness in BOM addition:
* It does add BOM even when msg is ascii, just because it was in unicode() string.
* It does not add BOM to UTF8 string if it is already encoded to str().

So highly doubt anybody actually relies on it.  And keeping such addition just increases chance that anybody starts to rely on it, so it would be good to just drop it.

For 3.3, I think it would be best to move syslog packet formatting out from emit(), because emit() already contains noticeable amount of unrelated code, so full override it is annoying.

I suggested changing BOM addition so that it only adds it to %(message)s, but such change could cause backwards incompatibility. So it does not look like good idea.

But dropping BOM even in old Python *does* look like good idea, as anybody using SysLoghandler needs to deal with BOM-less UTF8 anyway,
so dropping it will not create backwards incompatibility.
msg158030 - (view) Author: Vinay Sajip (vinay.sajip) * (Python committer) Date: 2012-04-11 13:12
I have a possible suggestion about how to resolve this issue:

The SysLogHandler will not do BOM insertion unless the message is Unicode. If it is Unicode, it will add the attribute 'UTF8BOM' to the LogRecord, with the value u'\ufeff'. The record will then be formatted; if the format string contains the UTF8BOM placeholder, it will be replaced with the value u'\ufeff', which, when encoded, results in the UTF-8 BOM value '\xef\xbb\xbf'. The user of the format string is responsible for ensuring that:

1. If there's no UTF8BOM placeholder in the format string, everything in the formatted result must always encode to plain ASCII when encoded using UTF-8.

2. If there is a UTF8BOM placeholder in the format string, everything in the formatted result prior to the placeholder must always encode to plain ASCII when encoded using UTF-8. The stuff following can, of course, be free of that restriction.

3. The end result of encoding should be a prefix which is bytes of pure ASCII, then the BOM (if the placeholder is present in the format string), then bytes of UTF-8 encoded Unicode.

In any case, a Unicode string will be encoded using UTF-8. If no UTF8BOM placeholder was present, no BOM will be; the message can be considered to just be a set of octets, which just happens to be UTF-8 encoded Unicode. If the placeholder was present, the BOM should appear at the appropriate place to comply with RFC 5424.

On 3.2, the message will always be Unicode, and the above processing will take place (whereas on 2.x it will be conditional on the type of the formatted message string being Unicode).

This seems to provide a resolution to the issue which can be solved without API changes, and with changes to the format string if BOM insertion is needed. With no UTF8BOM placeholder, the BOM will simply not be inserted. Can you comment on this suggestion?
msg158046 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2012-04-11 15:41
But why on earth would one want a BOM in UTF-8-encoded data?  It is byte-order independent!
msg158047 - (view) Author: Tim Golden (tim.golden) * (Python committer) Date: 2012-04-11 15:46
It's used by some systems (Windows Notepad does this
if you save as UTF8) to indicate that the byte stream
*is* UTF8-encoded. It's not so much a BOM as a magic cookie.

I can't speak for syslog, I'm afraid

TJG
msg158048 - (view) Author: Vinay Sajip (vinay.sajip) * (Python committer) Date: 2012-04-11 15:52
> But why on earth would one want a BOM in UTF-8-encoded data?  It is byte-order independent!

Lord only knows, but the RFC does call for it - msg157572 has an actual excerpt from RFC 5424.
msg158108 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2012-04-12 07:03
Your three step approach makes sense... But it _is_ still technically a new API though in that the UTF8BOM placeholder for LogRecord's is being introduced.  What would the behavior be when run on an older version without support for that placeholder be?

I'm okay with adding this but wouldn't be surprised if release managers are not.  But I personally have no need for syslog logging from Python so it'd be better to have someone who needs this to work properly chime in.

Perhaps just fixing it nicely in 3.3 is sufficient while documenting the misbehavior as a known issue for 2.7 and 3.2.
msg158117 - (view) Author: Vinay Sajip (vinay.sajip) * (Python committer) Date: 2012-04-12 09:22
> What would the behavior be when run on an older version without support for that placeholder be?

Then it would fail when the format string contained e.g. %(UTF8BOM)s and there was no corresponding attribute in the LogRecord - but that's true of any feature which is introduced in newer Pythons. I get that it would be unexpected in a point release, and that's why I've posted about this on c.l.py (no feedback from there so far).

> I'm okay with adding this but wouldn't be surprised if release managers are not.

Hence the post to python-dev, but no release manager has expressed an opinion yet.

> Perhaps just fixing it nicely in 3.3 is sufficient while documenting the misbehavior as a known issue for 2.7 and 3.2.

This is doable at a user level, except for the fact that the BOM insertion is currently unconditional. If I just remove the BOM insertion in 2.7 and 3.2, then I don't need to do the UTF8BOM placeholder thing in the stdlib; I could just add a cookbook recipe telling users how to do it. I am thinking about a different solution for 3.3 anyway, i.e. adding one or more overridable methods to SysLogHandler.

Since no one has objected on c.l.py about the proposed change (which implied that if there were no objections, the change would happen) Marko may be right that not many people are affected, or care. I'll wait a little while longer, and if no objections are forthcoming I'll remove the BOM insertion in 2.7 and 3.2, add a cookbook recipe for those who need a BOM and leave it at that for those versions.
msg158447 - (view) Author: Roundup Robot (python-dev) Date: 2012-04-16 13:44
New changeset af46a001d5ec by Vinay Sajip in branch '2.7':
Issue #14452: remove BOM insertion code.
http://hg.python.org/cpython/rev/af46a001d5ec

New changeset 89ab589f6fa7 by Vinay Sajip in branch '3.2':
Closes #14452: remove BOM insertion code.
http://hg.python.org/cpython/rev/89ab589f6fa7

New changeset 372aa4267a43 by Vinay Sajip in branch 'default':
Closes #14452: remove BOM insertion code.
http://hg.python.org/cpython/rev/372aa4267a43
msg158458 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-04-16 14:30
This appears to be failing on the buildbots:

http://www.python.org/dev/buildbot/all/builders/x86%20OpenIndiana%203.x/builds/3358/steps/test/logs/stdio
http://www.python.org/dev/buildbot/all/builders/x86%20Gentoo%20Non-Debug%203.x/builds/2037/steps/test/logs/stdio
msg158464 - (view) Author: Roundup Robot (python-dev) Date: 2012-04-16 15:22
New changeset 603301cfb194 by Vinay Sajip in branch 'default':
Closes #14452: brought tests in line with removal of BOM insertion code.
http://hg.python.org/cpython/rev/603301cfb194
msg224223 - (view) Author: Daniel Pocock (pocock) Date: 2014-07-29 12:38
As a workaround, Python 3.2 users can clobber the global variable codecs like this from their own init code:

    logging.handlers.codec = None

There is a more complete example here:

https://github.com/dpocock/python-rfc5424-logging-formatter
History
Date User Action Args
2014-07-29 12:38:21pococksetnosy: + pocock
messages: + msg224223
2013-01-10 02:22:57roysmithsetnosy: + roysmith
2012-08-21 14:56:52scjodysetnosy: + scjody
2012-07-27 02:33:25r.david.murraylinkissue15462 superseder
2012-04-16 15:22:47python-devsetstatus: open -> closed

messages: + msg158464
2012-04-16 14:30:36r.david.murraysetstatus: closed -> open
nosy: + r.david.murray
messages: + msg158458

2012-04-16 13:44:45python-devsetstatus: open -> closed

nosy: + python-dev
messages: + msg158447

resolution: fixed
stage: resolved
2012-04-12 09:22:28vinay.sajipsetmessages: + msg158117
2012-04-12 07:03:51gregory.p.smithsetnosy: + gregory.p.smith
messages: + msg158108
2012-04-11 15:52:04vinay.sajipsetmessages: + msg158048
2012-04-11 15:46:32tim.goldensetnosy: + tim.golden
messages: + msg158047
2012-04-11 15:41:50gvanrossumsetnosy: + gvanrossum
messages: + msg158046
2012-04-11 13:12:22vinay.sajipsetmessages: + msg158030
2012-04-06 21:39:41vinay.sajipsetassignee: vinay.sajip
2012-04-05 15:08:30zmksetmessages: + msg157598
2012-04-05 14:53:51vinay.sajipsetresolution: not a bug -> (no value)
messages: + msg157597
2012-04-05 11:55:39zmksetstatus: pending -> open

messages: + msg157572
2012-03-30 18:32:55vinay.sajipsetstatus: open -> pending
resolution: not a bug
messages: + msg157147
2012-03-30 14:20:05r.david.murraysetversions: - Python 2.6, Python 3.1, Python 3.4
2012-03-30 14:19:42r.david.murraysetnosy: + vinay.sajip
2012-03-30 11:13:24zmksettype: behavior
2012-03-30 11:12:21zmkcreate