classification
Title: Logging - Inconsistent behaviour when handling unicode
Type: enhancement Stage: resolved
Components: Windows Versions: Python 3.9
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: SilentGhost, inada.naoki, jonathan-lp, paul.moore, steve.dower, tim.golden, vinay.sajip, zach.ware
Priority: normal Keywords: patch

Created on 2019-05-31 09:57 by jonathan-lp, last changed 2019-06-17 16:43 by vinay.sajip. This issue is now closed.

Files
File name Uploaded Description Edit
my_log.log jonathan-lp, 2019-05-31 10:54 File that won't take the unicode
Pull Requests
URL Status Linked Edit
PR 13932 closed vinay.sajip, 2019-06-09 16:24
PR 14008 merged vinay.sajip, 2019-06-12 06:35
Messages (20)
msg344053 - (view) Author: Jonathan (jonathan-lp) Date: 2019-05-31 09:57
Python is inconsistent in how it handles errors that have some unicode characters. It works to screen but fails to log

This works:
```
>>> import logging
>>> logging.error('จุด1')
ERROR:root:จุด1
```

The following breaks:

```
>>> import logging
>>> logging.basicConfig(filename='c:\\my_log.log')
>>> logging.error('จุด1')
```

This raises a unicode error:
UnicodeEncodeError: 'charmap' codec can't encode characters in position 11-13: character maps to <undefined>

Python 3.6.3

Given that the file created by the logger is utf-8, it's unclear why it doesn't work.

I found a workaround by using a Handler, but surely the loggers should all work the same way so that people don't get unpleasant surprises that even more painful to debug when things only break in certain logging modes?
msg344059 - (view) Author: SilentGhost (SilentGhost) * (Python triager) Date: 2019-05-31 10:27
> Given that the file created by the logger is utf-8

I don't think this is true. Default file handler created using filename argument is opened with encoding None (meaning that it relies on system-default encoding, which is on Windows is not utf-8). And it also explains why your work-around works, provided you've opened file handler with explicit utf-8 encoding.
msg344060 - (view) Author: Jonathan (jonathan-lp) Date: 2019-05-31 10:54
It definitely claims to be "utf-8" in NotePad++. I've attached it if you want to double-check.

(Windows 7)
msg344081 - (view) Author: Vinay Sajip (vinay.sajip) * (Python committer) Date: 2019-05-31 15:37
> Given that the file created by the logger is utf-8, it's unclear why it doesn't work ... I found a workaround by using a Handler

Loggers don't create files - handlers do. The file that you attached seems to be just a text file containing ASCII text. Any ASCII is also utf-8, so Notepad++'s assertion doesn't mean anything in this situation.

> so that people don't get unpleasant surprises that even more painful to debug when things only break in certain logging modes?

People have been using logging, on Windows, without problems, for years, often using utf-8 to encode their log files. Perhaps you need to read the documentation and look at examples more carefully - I'm not intending to be rude, but say that because using a handler isn't some kind of workaround, and you thinking that it was indicates that you're not sufficiently familiar with the documentation.

I'd suggest posting questions on Stack Overflow or the python-list mailing list, to establish whether there really seems to be a Python bug, before actually logging a Python issue. I appreciate that you're trying to improve Python.
msg344106 - (view) Author: Jonathan (jonathan-lp) Date: 2019-05-31 19:00
Thank you for your comments but this wasn't a question and I maintain this is a bug, or at least undesirable behaviour. I'd consider it a bug in my own software.

Reasoning:
* It's an inconsistent default with the logging to screen. This causes more complexity for users when their bug is intermittent.
* Despite your assertion, it's not documented anywhere on the logging docs (I did check before creating this bug when trying to figure out what's going on) - The word "utf" or "unicode" doesn't appear on the logging page, or any of the two tutorials, or the logging.handlers page. There's something in the cookbook but that's about BOM's.
* Much of the world's native characters won't log to ASCII

Per this page: https://docs.python.org/3/howto/unicode.html
"UTF-8 is one of the most commonly used encodings, and Python often defaults to using it."

> People have been using logging, on Windows, without problems, for years, often using utf-8 to encode their log files.
I'm afraid this line of reasoning is suffering from selection bias, cherry picking, confirmation bias, and probably some others too. Clearly people have had problems before because it was from one of those folks I took the solution.

Doing something as basic as logging unicode shouldn't require knowledge of "handlers" - that's failing "simple is better than complex".
msg344110 - (view) Author: Vinay Sajip (vinay.sajip) * (Python committer) Date: 2019-05-31 19:33
> Doing something as basic as logging unicode shouldn't require knowledge of "handlers" - that's failing "simple is better than complex".

I reiterate my statement that it appears that you aren't sufficiently familiar with how logging is designed to work. Logging is more complex than print() or open() because it has to work flexibly in lots of different scenarios. Handlers are part of this, as are Formatters and Filters.

See my 2009 blog post

https://plumberjack.blogspot.com/2009/09/python-logging-101.html

for more information for why logging is designed the way it is.

If you want to configure a stream with utf-8 encoding, without explicitly using handlers, this is easily doable, as follows:

C:\Users\Vinay> \python37\python
Python 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 22:22:05) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import logging
>>> logging.basicConfig(stream=open(r'c:\temp\my_log.log', 'w', encoding='utf-8'))
>>> logging.error('จุด1')
>>> ^Z

and, lo and behold, the file c:\temp\my_log.log contains

ERROR:root:จุด1

The use of the stream keyword parameter is clearly documented at

https://docs.python.org/3/library/logging.html#logging.basicConfig

Did you look at the basicConfig documentation before raising this issue?
msg344113 - (view) Author: Jonathan (jonathan-lp) Date: 2019-05-31 19:55
> Did you look at the basicConfig documentation before raising this issue?

This seems like an attempt at victim blaming. But yes, I did. In fact, this is now the third time I've looked at that page - once before raising this issue, once before my previous reply, and now. I note that your example and nothing like your example is anywhere on that page. The word "encoding" doesn't appear anywhere on the page either. Sure "stream" is on there, but then you need to know about streams and make the association with logging which I apparently don't. You have to remember not everyone has your level of proficiency in the language. In fact, most Python users don't.

Lets put this another way - is there a reason NOT to have Unicode logging as the default? Clearly Unicode was important enough for Guido-et-al to decide to throw Python 2 under the bus.
I've listed the advantages of changing it, what are the disadvantages?
msg344164 - (view) Author: Vinay Sajip (vinay.sajip) * (Python committer) Date: 2019-06-01 09:50
> This seems like an attempt at victim blaming.

Er, no, it was a straight question about whether you'd read the documentation.

> I'm afraid this line of reasoning is suffering from selection bias, cherry picking, confirmation bias, and probably some others too.

Er, no, it was stating that lots of people have successfully used logging on Windows with utf-8, as a way of justifying my opinion that there's no deficiency here that needs addressing.

> Sure "stream" is on there, but then you need to know about streams and make the association with logging which I apparently don't. You have to remember not everyone has your level of proficiency in the language. In fact, most Python users don't.

Sure, and that's fine. Experienced developers learn stuff every day too. But responding to issues takes time (notice how many people will have been notified about this issue from the "nosy list" above; each would have to spend at least some time looking at it), so raising issues should be done only after exhausting other resources to make sure that the problem isn't in one's own level of knowledge or understanding, but an actual problem with the software. This is why I earlier suggested Stack Overflow and python-list as useful resources for improving one's knowledge and understanding.

> Lets put this another way - is there a reason NOT to have Unicode logging as the default?

I have no idea what you mean by this. As my example illustrated, it's quite easy to log Unicode in log files.

A simple Internet search for "basicConfig encoding" yields for me as the second result this Stack Overflow question

https://stackoverflow.com/questions/21402530

for which the accepted answer is a slight variation on what I suggested earlier.
msg344198 - (view) Author: Jonathan (jonathan-lp) Date: 2019-06-01 19:03
> I have no idea what you mean by this.

I don't see how I can be clearer. What are the reasons for NOT implementing logging to file be unicode as a default? Logging to screen is unicode as a default. What are the reasons for not wanting consistency?

> A simple Internet search for "basicConfig encoding" yields for me as the second result this Stack Overflow question 

Indeed, and it was from that question I got my solution in fact. The problem was the 30-60 minutes I wasted before that trying to figure out why my program was crashing and why it was only crashing *sometimes*. I'd written the logging part of the program a year ago and not really touched it since, so the logging module being a possible culprit was not even in my mind when the program crashed.

> As my example illustrated, it's quite easy to log Unicode in log files.

Yes, things are easy when you know it's necessary. It's the process of discovery that's an unnecessary waste of people's time. That's why I raised this and that's why I would consider this a bug in my own software. It's inconsistent, it invites problems, and it wastes peoples time.
msg344203 - (view) Author: Vinay Sajip (vinay.sajip) * (Python committer) Date: 2019-06-01 19:13
Learning is not a waste of time. You're entitled to your opinion, but this is not a bug in logging. We'll have to agree to disagree.
msg344204 - (view) Author: Jonathan (jonathan-lp) Date: 2019-06-01 19:17
> Learning is not a waste of time. You're entitled to your opinion, but this is not a bug in logging. We'll have to agree to disagree.

I agree and value learning a great deal. However learning should happen on your own time, NOT when a program crashes randomly and tries taking you down the rabbit hole. I like learning but not about unrelated things when I'm trying to do useful work.

Fine, if you don't consider this a bug, consider it a feature request. "User would like Python logging of Unicode characters to be consistent" is not an unreasonable request.
msg344208 - (view) Author: Vinay Sajip (vinay.sajip) * (Python committer) Date: 2019-06-01 19:24
Different people have different ideas of what a default should be. You can pretty much guarantee that whatever a default is, someone will think it should be something else. The basicConfig functionality has been around in its present form since 2003, and people in general haven't had a problem with the current defaults. There is no case for changing a default unless there is a consensus among lots of people that a default needs changing.
msg344212 - (view) Author: Paul Moore (paul.moore) * (Python committer) Date: 2019-06-01 19:55
> User would like Python logging of Unicode characters to be consistent

It is consistent. The encoding of

    logging.basicConfig(filename='c:\\my_log.log')

is consistent with the encoding of

    open('c:\\my_log.log')

Both use the system default encoding, which is not UTF-8. There is some discussion going on right now, as to whether it *should* be, but it isn't, and I wouldn't consider changing the behaviour of loging *without* changing the behaviour of open() to be consistent.

Logging to the console is consistent with the standard IO streams, and it was PEP 528 (https://www.python.org/dev/peps/pep-0528/) that made that change - before that, the standard IO streams, and logging to the console, used the console codepage.

So, while I agree that the behaviour takes a bit of work to understand, it's not (IMO) inconsistent, nor is it (IMO) a bug.
msg344456 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2019-06-03 18:16
I agree with Paul. Consistency with open() is the most important aspect here, and when we investigated changing that to UTF-8 it was decided to be too disruptive.

That said, I'm not opposed to adding encoding/errors to basicConfig if someone wants to do that (and the logging specialists are okay with it).
msg345319 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2019-06-12 11:43
Hmm,, about encoding, I agree that default encoding of open() should be used.
If we change it, encoding of log files are changed unexpectedly after upgrading Python.

On the other hand, couldn't we use different default error handler?
"replace" or "backslashescape" seems better default error handler for the logging.
Is this change affects too many users?
msg345413 - (view) Author: Vinay Sajip (vinay.sajip) * (Python committer) Date: 2019-06-12 21:40
> On the other hand, couldn't we use different default error handler?
"replace" or "backslashescape" seems better default error handler for the logging.

I've left it as is using the same rationale as I guess open() has at the moment - "Errors should never pass silently. Unless explicitly silenced".
msg345430 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2019-06-12 23:52
> I've left it as is using the same rationale as I guess open() has at the moment - "Errors should never pass silently. Unless explicitly silenced".

But open() is general purpose, and logging is for logging.

Note that stderr uses 'backslashreplace', even when PYTHONIOENCODING=ascii:strict is used.

And if we don't log something due to encoding error, doesn't it mean "errors may ignored completely silently by default"?
What logging does when UnicodeEncodeError is happened?
msg345432 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2019-06-13 00:10
> What logging does when UnicodeEncodeError is happened?

Hm, it seems logging does best job, show enough information to stderr.

But sometime, Python is embedded in web servers and there is no stderr.  pythonw.exe may not have stderr too....

I still feel backslashreplace is better default error handler.  But it's not strong opinion and it shouldn't block this issue.  Go ahead.


$ cat x.py
import logging

logging.basicConfig(filename="mylog.txt")
msg = "hello, はろー"
logging.error(msg)

$ LC_ALL=en_US.US-ASCII python3 x.py
--- Logging error ---
Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/logging/__init__.py", line 1037, in emit
    stream.write(msg + self.terminator)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 18-20: ordinal not in range(128)
Call stack:
  File "x.py", line 5, in <module>
    logging.error(msg)
Message: 'hello, \u306f\u308d\u30fc'
Arguments: ()
msg345523 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2019-06-13 14:51
I think Vinay gets to make the call, but I also agree that some sort of error handling by default makes sense for logging. Especially now that it's (about to be) easier to override the default.

FWIW, I'd leave the default arguments as None and just coerce it as late as possible, before TextIOWrapper does it.
msg345877 - (view) Author: Vinay Sajip (vinay.sajip) * (Python committer) Date: 2019-06-17 16:40
New changeset ca7b504a4d4c3a5fde1ee4607b9501c2bab6e743 by Vinay Sajip in branch 'master':
bpo-37111: Add 'encoding' and 'errors' parameters to logging.basicCon… (GH-14008)
https://github.com/python/cpython/commit/ca7b504a4d4c3a5fde1ee4607b9501c2bab6e743
History
Date User Action Args
2019-06-17 16:43:59vinay.sajipsetstatus: open -> closed
stage: patch review -> resolved
resolution: fixed
versions: + Python 3.9, - Python 3.7, Python 3.8
2019-06-17 16:40:56vinay.sajipsetmessages: + msg345877
2019-06-13 14:51:41steve.dowersetmessages: + msg345523
2019-06-13 00:10:25inada.naokisetmessages: + msg345432
2019-06-12 23:52:58inada.naokisetmessages: + msg345430
2019-06-12 21:40:18vinay.sajipsetmessages: + msg345413
2019-06-12 11:43:46inada.naokisetnosy: + inada.naoki
messages: + msg345319
2019-06-12 06:35:29vinay.sajipsetpull_requests: + pull_request13871
2019-06-09 16:24:48vinay.sajipsetkeywords: + patch
stage: test needed -> patch review
pull_requests: + pull_request13798
2019-06-03 18:16:17steve.dowersetresolution: not a bug -> (no value)
messages: + msg344456
stage: test needed
2019-06-01 19:55:22paul.mooresetmessages: + msg344212
2019-06-01 19:24:13vinay.sajipsetmessages: + msg344208
stage: resolved -> (no value)
2019-06-01 19:17:36jonathan-lpsetstatus: closed -> open
type: behavior -> enhancement
messages: + msg344204
2019-06-01 19:13:04vinay.sajipsetmessages: + msg344203
2019-06-01 19:03:10jonathan-lpsetmessages: + msg344198
2019-06-01 09:50:59vinay.sajipsetmessages: + msg344164
2019-05-31 19:55:14jonathan-lpsetmessages: + msg344113
2019-05-31 19:33:50vinay.sajipsetmessages: + msg344110
2019-05-31 19:00:14jonathan-lpsetmessages: + msg344106
2019-05-31 15:37:20vinay.sajipsetstatus: open -> closed
resolution: not a bug
messages: + msg344081

stage: resolved
2019-05-31 11:12:12SilentGhostsetnosy: + paul.moore, tim.golden, zach.ware, steve.dower
type: behavior
components: + Windows
2019-05-31 10:54:49jonathan-lpsetfiles: + my_log.log

messages: + msg344060
2019-05-31 10:28:08SilentGhostsetversions: + Python 3.7, Python 3.8, - Python 3.6
2019-05-31 10:27:59SilentGhostsetnosy: + vinay.sajip, SilentGhost
messages: + msg344059
2019-05-31 09:57:34jonathan-lpcreate