classification
Title: UTF8 BOM incorrectly prepended to syslog messages when using rsyslog
Type: behavior
Stage: resolved
Components: Unicode
Versions: Python 2.7

process
Status: closed
Resolution: duplicate
Dependencies:
Superseder: SysLogHandler sends invalid messages when using unicode (issue 14452)
Assigned To:
Nosy List: Aimon.Bustardo, ezio.melotti, r.david.murray, vinay.sajip
Priority: normal
Keywords:

Created on 2012-07-26 21:06 by Aimon.Bustardo, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (2)
msg166520 - Author: Aimon Bustardo (Aimon.Bustardo) Date: 2012-07-26 21:06
Ubuntu 12.04 LTS 64-bit
python2.7-minimal 2.7.3-0ubuntu3
rsyslog 5.8.6-1ubuntu8

Python's logging syslog handler encodes Unicode messages as UTF-8 before sending them to syslog, and it also prepends the UTF-8 Byte Order Mark (BOM). The prepended BOM shows up as stray characters in the logs when using rsyslog (I have not verified this with standard syslog or syslog-ng).

Example log line:

Jul 25 13:36:03 mc ï»¿2012-07-25 13:36:03 INFO nova.api.openstack.wsgi [req-48a555a5-6d2a-4a38-8384-3b4684357e72 19f932a5b0b34655989f4cb761522bb3 2617e657fdf84569a6be7977318e46c8] http://MASKED:8774/v1.1/2617e657fdf84569a6be7977318e46c8/os-hosts/MASKED.json?ignore_awful_caching1343248563 returned with HTTP 200

Note the 'ï»¿' before the second date field (2012-07-25).

An explanation of these characters that I found on another site:

"Yes, "ï»¿" is the Byte Order Mark (BOM) of the Unicode Standard. Specifically it is the hex bytes EF BB BF, which form the UTF-8 representation of the BOM, misinterpreted as ISO 8859/1 text instead of UTF-8."
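
The misreading described in that quote can be reproduced in a few lines. This is only a minimal sketch, written to run on Python 3 and not taken from the logging module: it decodes `codecs.BOM_UTF8` as ISO 8859-1, which is presumably what happens somewhere between rsyslog and whatever displays the log. The variable names are illustrative.

```python
import codecs

# The UTF-8 encoding of U+FEFF, the byte order mark: bytes EF BB BF.
bom = codecs.BOM_UTF8

# Decoded correctly as UTF-8, the three bytes are one (invisible) character.
as_utf8 = bom.decode("utf-8")

# Decoded as ISO 8859-1, each byte becomes its own visible character.
misread = bom.decode("iso-8859-1")

# Print code points rather than the characters themselves, so the output
# does not depend on the terminal's encoding.
print([hex(b) for b in bytearray(bom)])
print([hex(ord(ch)) for ch in as_utf8])
print([hex(ord(ch)) for ch in misread])
```

The three code points in the last line (0xEF, 0xBB, 0xBF) are the Latin-1 characters that make up the stray sequence seen in front of the date field.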

If I patch the code in /usr/lib/python2.7/logging/handlers.py:
------------------------------------------
@@ -797,9 +797,10 @@
                                             self.mapPriority(record.levelname))
         # Message is a string. Convert to bytes as required by RFC 5424
         if type(msg) is unicode:
             msg = msg.encode('utf-8')
-            if codecs:
-                msg = codecs.BOM_UTF8 + msg
+            #if codecs:
+            #    msg = codecs.BOM_UTF8 + msg
         msg = prio + msg
         try:
             if self.unixsocket:

----------------------------------------

With this patch applied, the logs appear normally. What is the purpose of the 'codecs' condition? Is this controllable through configuration? Or is this a bug in rsyslog?
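
To make the effect of the two commented-out lines concrete, here is a rough sketch of the bytes that end up on the wire with and without the BOM. It is an illustration modelled on the snippet quoted above, not the handler's actual code: `build_payload` is a hypothetical helper, and the priority value 14 (facility "user", severity "info") is just an example.

```python
import codecs


def build_payload(prio, text, prepend_bom):
    """Hypothetical helper mirroring the quoted logic: encode, optionally
    prepend the UTF-8 BOM, then prepend the '<PRI>' field."""
    data = text.encode("utf-8")
    if prepend_bom:
        data = codecs.BOM_UTF8 + data
    return ("<%d>" % prio).encode("ascii") + data


message = u"2012-07-25 13:36:03 INFO example"

# 14 = facility 1 (user) * 8 + severity 6 (info); an illustrative value.
with_bom = build_payload(14, message, True)
without_bom = build_payload(14, message, False)

# repr() keeps the output ASCII-only, so the BOM bytes are visible as escapes.
print(repr(with_bom))
print(repr(without_bom))
```

In the first payload the bytes EF BB BF sit between the priority field and the timestamp, which matches where the stray characters appear in the example log line.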

Related tickets:

https://bugs.launchpad.net/openstack-common/+bug/1029116
https://bugs.launchpad.net/ubuntu/+source/python2.7/+bug/1029640
http://bugzilla.adiscon.com/show_bug.cgi?id=346
msg166534 - Author: R. David Murray (r.david.murray) (Python committer) Date: 2012-07-27 02:33
I believe this is a duplicate of issue 14452.
History
Date User Action Args
2022-04-11 14:57:33 admin set github: 59667
2012-07-27 02:33:25 r.david.murray set status: open -> closed
superseder: SysLogHandler sends invalid messages when using unicode
components: - Library (Lib), IO
nosy: + vinay.sajip, r.david.murray
messages: + msg166534
resolution: duplicate
stage: resolved
2012-07-26 21:06:21 Aimon.Bustardo create