This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Different behavior of C and Python impls of datetime.strftime with non-UTF-8-encodable strings
Type: behavior Stage: patch review
Components: Extension Modules Versions: Python 3.8, Python 3.7, Python 3.6
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: belopolsky, brett.cannon, izbyshev, p-ganssle, pitrou, serhiy.storchaka, taleinat, vstinner
Priority: normal Keywords: patch

Created on 2018-08-23 17:40 by izbyshev, last changed 2022-04-11 14:59 by admin.

Pull Requests
URL Status Linked Edit
PR 8983 closed p-ganssle, 2018-08-28 23:50
Messages (8)
msg323963 - (view) Author: Alexey Izbyshev (izbyshev) * (Python triager) Date: 2018-08-23 17:40
The C datetime implementation uses PyUnicode_AsUTF8AndSize() in wrap_strftime() and rejects strings containing surrogate code points (0xD800 - 0xDFFF) since they can't be encoded in UTF-8. On the other hand, the pure-Python datetime implementation doesn't have this restriction:

>>> import sys
>>> sys.modules['_datetime'] = None # block C implementation
>>> from datetime import time
>>> time().strftime('\ud800')
'\ud800'
>>> del sys.modules['datetime']
>>> del sys.modules['_datetime']
>>> from datetime import time
>>> time().strftime('\ud800')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed
msg324329 - (view) Author: Paul Ganssle (p-ganssle) * (Python committer) Date: 2018-08-29 16:04
I'm finding it very difficult to reconcile these things. I'm not entirely sure, but we may need to take a performance hit in normal strftime if we want to make this work with surrogate characters, which really does not appeal to me (though we can certainly improve to some degree).

One major question here: Is anyone (@vstinner, @belopolsky?) know why time's strftime opportunistically uses wcsftime instead of strftime? It makes the code *way* more complicated and difficult to read / maintain - are there platforms that provide wcstrftime and not strftime?

Also, related, it seems according to https://bugs.python.org/issue10653#msg243660 that there may have been a regression in issue #10653, which may be related here.

Either way, some note should probably be made in the code to clarify exactly *why* these choices were made in the code, in case the situation has changed.
msg324336 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-08-29 17:15
Because caring of surrogates, I would prefer to first fix time.strftime() and time.strptime() for legit time zone names. Currently, Python uses the C function strfime() because the string produced by wcsftime() cannot be parsed later. We should use the native Windows API to get the time zone name properly encoded. I don't recall the details, sorry. There is maybe an open issue for that one.
msg324337 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-08-29 17:17
> Because caring of surrogates,

Before*
msg324356 - (view) Author: Alexey Izbyshev (izbyshev) * (Python triager) Date: 2018-08-29 23:05
@p-ganssle

> I'm finding it very difficult to reconcile these things.

Paul, this issue was only about reconciling C and Python implementations of datetime module -- not fixing time.strftime(), which both of them delegate to. While the latter is certainly more important, it's also much more difficult, and IMO needs another issue report (maybe merged with #34512). As for the former, it seems that you've already fixed it in your PR.
msg324397 - (view) Author: Paul Ganssle (p-ganssle) * (Python committer) Date: 2018-08-30 18:08
@izbyshev That's totally fair and I wouldn't want to make it a condition of merging the existing fixes - I've already made great progress in fixing the time.strftime part as well.

The main reason it relates here is that I generally find the tests to be among the hardest part about writing a good PR, and if we can't make assertions about the behavior of strftime outputs, I think it makes it hard to prevent regressions. I figured if I can solve the problem all the way down the stack in one go, I might as well.

That said, Victor makes an *extremely* good point that this is an outsized effort for the bug it's fixing. No one really *needs* support for unpaired surrogates in their strftime as far as I can tell. The main reason I'm still working on it is that I'm curious to see if it's even possible to fix.
msg324401 - (view) Author: Alexey Izbyshev (izbyshev) * (Python triager) Date: 2018-08-30 23:10
> if we can't make assertions about the behavior of strftime outputs, I think it makes it hard to prevent regressions.

Ironically, some of the changes we may have to make to fix time.strftime() inconsistencies may appear like regressions to some users. For example, regarding dropping wcsftime() on all platforms, 'man strftime' on macOS contains the following warning in BUGS section: "The strftime() function does not correctly handle multibyte characters in the format argument.". (I'm not sure what "incorrect handling" means here). That's not to discourage you, just to point out that we should be extra careful at this point when there might be users relying on every variety of the current behavior. And that's also why I don't want to mix the fix for C vs. Python issue (low risk) with changes in time.strftime() (high risk).

Also, while I certainly agree with "the outsized effort" point, it seems that your PR goes far beyond supporting surrogates because falling back to escaping allows one to bypass checks in wcsftime()/strftime() and round-trip any code point regardless of the current locale.
msg324630 - (view) Author: Paul Ganssle (p-ganssle) * (Python committer) Date: 2018-09-05 12:29
I've left a mostly finished PR on #8983, but I've decided it's not really worth continuing to work on. Anyone can feel free to take it and run with it if they want to implement the fix for this.

As Alexey mentions, that PR indeed already fixes this bug, I mainly blocked on getting consistent behavior across all platforms. Currently that PR has consistent behavior across all platforms if you disable wcsftime *except* Ubuntu 14.0, which seems to have some kind of funky bug in wcstombs, though for the life of me I don't know how to consistently trigger it. I suspect that using strftime instead of wcsftime on all platforms would lead to simpler code that works most consistently, but I am guessing wcsftime is faster.

Another option is to use wsftime but fall back to strftime on MacOS in the event that the error condition (blank output from non-blank input) is detected.
History
Date User Action Args
2022-04-11 14:59:05adminsetgithub: 78662
2018-09-05 15:38:31rhettingersetnosy: + brett.cannon
2018-09-05 12:29:35p-gansslesetmessages: + msg324630
2018-08-30 23:10:49izbyshevsetmessages: + msg324401
2018-08-30 18:08:40p-gansslesetmessages: + msg324397
2018-08-29 23:05:12izbyshevsetmessages: + msg324356
2018-08-29 17:17:19vstinnersetmessages: + msg324337
2018-08-29 17:15:17vstinnersetmessages: + msg324336
2018-08-29 16:04:00p-gansslesetmessages: + msg324329
2018-08-29 00:01:51vstinnersetnosy: + vstinner
2018-08-28 23:50:34p-gansslesetkeywords: + patch
stage: patch review
pull_requests: + pull_request8457
2018-08-23 18:36:03p-gansslesetnosy: + p-ganssle
2018-08-23 17:40:57izbyshevcreate