datetime: NULL dereference in fromisoformat() on PyUnicode_AsUTF8AndSize() failure #78635

izbyshev · 2018-08-21T21:38:30Z

BPO	34454
Nosy	@abalkin, @vstinner, @taleinat, @serhiy-storchaka, @pganssle, @izbyshev, @miss-islington
PRs	bpo-34454: datetime: Fix crash on PyUnicode_AsUTF8AndSize() failure #8850 bpo-34454: fix crash in .fromisoformat() methods when given inputs with surrogate code points #8859 bpo-34454: Fix issue with non-UTF8 separator strings #8862 [3.7] bpo-34454: fix .fromisoformat() methods crashing on inputs with surrogate code points (GH-8862) #8877 bpo-34482: Add tests for proper handling of non-UTF-8-encodable strin… #8878 bpo-34454: Clean up datetime.fromisoformat surrogate handling #8959 [3.7] bpo-34454: Clean up datetime.fromisoformat surrogate handling (GH-8959) #10041

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = <Date 2018-10-23.07:06:56.538>
created_at = <Date 2018-08-21.21:38:29.737>
labels = ['extension-modules', '3.7', '3.8', 'type-crash']
title = 'datetime: NULL dereference in fromisoformat() on PyUnicode_AsUTF8AndSize() failure'
updated_at = <Date 2018-10-23.09:19:02.520>
user = 'https://github.com/izbyshev'

bugs.python.org fields:

activity = <Date 2018-10-23.09:19:02.520>
actor = 'vstinner'
assignee = 'none'
closed = True
closed_date = <Date 2018-10-23.07:06:56.538>
closer = 'taleinat'
components = ['Extension Modules']
creation = <Date 2018-08-21.21:38:29.737>
creator = 'izbyshev'
dependencies = []
files = []
hgrepos = []
issue_num = 34454
keywords = ['patch']
message_count = 12.0
messages = ['323844', '323846', '323849', '323856', '323875', '323954', '323958', '323960', '328263', '328272', '328278', '328294']
nosy_count = 7.0
nosy_names = ['belopolsky', 'vstinner', 'taleinat', 'serhiy.storchaka', 'p-ganssle', 'izbyshev', 'miss-islington']
pr_nums = ['8850', '8859', '8862', '8877', '8878', '8959', '10041']
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'crash'
url = 'https://bugs.python.org/issue34454'
versions = ['Python 3.7', 'Python 3.8']

izbyshev · 2018-08-21T21:38:30Z

A failure of PyUnicode_AsUTF8AndSize() in various fromisoformat() functions in Modules/_datetimemodule.c leads to NULL dereference due to the missing check, e.g.:

>>> from datetime import date
>>> date.fromisoformat('\ud800')
Segmentation fault (core dumped)

This is similar to msg123474. The missing NULL check was reported by Svace static analyzer.

While preparing tests for this issue, I've discovered a deeper problem. The C datetime implementation uses PyUnicode_AsUTF8AndSize() in several places, making some functions reject strings containing surrogate code points (0xD800 - 0xDFFF) since they can't be encoded in UTF-8. On the other hand, the pure-Python datetime implementation doesn't have this restriction. For example:

>>> import sys
>>> sys.modules['_datetime'] = None # block C implementation
>>> from datetime import time
>>> time().strftime('\ud800')
'\ud800'
>>> del sys.modules['datetime']
>>> del sys.modules['_datetime']
>>> from datetime import time
>>> time().strftime('\ud800')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed

My PR (coming soon) doesn't address this difference but focuses on fixing the immediate problem instead. Suggestions are appreciated.

pganssle · 2018-08-21T22:05:29Z

So this is related to something I was actually meaning to fix. When I wrote this code I didn't understand the way PyUnicode works, there's actually no need to call PyUnicode_AsUTF8AndSize() on the entire unicode string.

My understanding is that each glyph in a given PyUnicode object is the same size, which means that this section of the code can go: https://github.com/python/cpython/blob/master/Modules/_datetimemodule.c#L4862

Instead we can just break the string up as glyphs 0-10 and 11+ and pass them on. Since by the contract of the function glyphs 0-10 and 11+ *must* be ASCII, we no longer need to worry about *valid* use cases where a character un-representable by UTF-8 will lead to anything except an error.

Obviously the null pointer error needs to be fixed since it should raise an error and not segfault.

I'd be happy to do the part where the string is broken up *before* being passed to PyUnicode_AsUTF8AndSize() if it would make it easier to implement your PR (which seems to touch a lot of other parts of the code as well).

izbyshev · 2018-08-21T22:34:32Z

I will be glad to rebase my PR and remove the try/except from the test if you remove the dependency of separator searching code on PyUnicode_AsUTF8AndSize() as you suggest. Or we can go the other way and merge mine first -- whatever you prefer.

Note that technically a difference between C and Python implementation of fromisoformat() will still remain: if a part of the input string before or after the separator contains surrogates, the C code will throw a UnicodeEncodeError while the Python code -- ValueError. But since the former error is a subclass of the latter, I guess it's OK, what do you think?

Also, note that the other discovered C/Python impl difference (for strftime, handled by another try/catch in tests) won't go away, of course, unless someone is ready to fix that as well.

pganssle · 2018-08-22T00:00:13Z

Note that technically a difference between C and Python implementation of fromisoformat() will still remain: if a part of the input string before or after the separator contains surrogates, the C code will throw a UnicodeEncodeError while the Python code -- ValueError. But since the former error is a subclass of the latter, I guess it's OK, what do you think?

I think the fact that the unicode string is decoded is an implementation detail. I would suggest swallowing the decode error and raising a standard ValueError.

serhiy-storchaka · 2018-08-22T12:21:25Z

I agree, that raising UnicodeEncodeError instead of ValueError is acceptable, since the former is a subclass of the latter. But since the default implementation raises more specialized subclass, it is probably that the user code will catch more narrow exception type. Seems, it is not hard to make the same exception be raised in all cases. For example, in date_fromisoformat():

    if (!dt_ptr || len != 10
        || parse_isoformat_date(dt_ptr, &year, &month, &day) < 0)
    {
        PyErr_Format(PyExc_ValueError, "Invalid isoformat string: %R", dstr);
        return NULL;
    }

taleinat · 2018-08-23T15:06:23Z

New changeset 096329f by Tal Einat (Paul Ganssle) in branch 'master':
bpo-34454: fix .fromisoformat() methods crashing on inputs with surrogate code points (GH-8862)
096329f

miss-islington · 2018-08-23T15:54:52Z

New changeset 89b1654 by Miss Islington (bot) in branch '3.7':
bpo-34454: fix .fromisoformat() methods crashing on inputs with surrogate code points (GH-8862)
89b1654

taleinat · 2018-08-23T16:47:49Z

Thanks for the report and initial patch, Alexey!

Thanks for the followup and final PR, Paul!

vstinner · 2018-10-22T16:14:14Z

I reopen the issue since there are still 2 open PRs: PR 8878 and PR 8959.

vstinner · 2018-10-22T19:37:27Z

New changeset 3df8540 by Victor Stinner (Paul Ganssle) in branch 'master':
bpo-34454: Clean up datetime.fromisoformat surrogate handling (GH-8959)
3df8540

miss-islington · 2018-10-22T22:35:21Z

New changeset 18450be by Miss Islington (bot) in branch '3.7':
bpo-34454: Clean up datetime.fromisoformat surrogate handling (GH-8959)
18450be

vstinner · 2018-10-23T09:19:02Z

Thanks Paul Ganssle for your fixes!

izbyshev mannequin added 3.7 (EOL) end of life 3.8 only security fixes extension-modules C modules in the Modules dir type-crash A hard crash of the interpreter, possibly with a core dump labels Aug 21, 2018

taleinat closed this as completed Aug 23, 2018

vstinner reopened this Oct 22, 2018

taleinat closed this as completed Oct 23, 2018

ezio-melotti transferred this issue from another repository Apr 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

datetime: NULL dereference in fromisoformat() on PyUnicode_AsUTF8AndSize() failure #78635

datetime: NULL dereference in fromisoformat() on PyUnicode_AsUTF8AndSize() failure #78635

izbyshev mannequin commented Aug 21, 2018

izbyshev mannequin commented Aug 21, 2018

pganssle commented Aug 21, 2018

izbyshev mannequin commented Aug 21, 2018

pganssle commented Aug 22, 2018

serhiy-storchaka commented Aug 22, 2018

taleinat commented Aug 23, 2018

miss-islington commented Aug 23, 2018

taleinat commented Aug 23, 2018

vstinner commented Oct 22, 2018

vstinner commented Oct 22, 2018

miss-islington commented Oct 22, 2018

vstinner commented Oct 23, 2018

datetime: NULL dereference in fromisoformat() on PyUnicode_AsUTF8AndSize() failure #78635

datetime: NULL dereference in fromisoformat() on PyUnicode_AsUTF8AndSize() failure #78635

Comments

izbyshev mannequin commented Aug 21, 2018

izbyshev mannequin commented Aug 21, 2018

pganssle commented Aug 21, 2018

izbyshev mannequin commented Aug 21, 2018

pganssle commented Aug 22, 2018

serhiy-storchaka commented Aug 22, 2018

taleinat commented Aug 23, 2018

miss-islington commented Aug 23, 2018

taleinat commented Aug 23, 2018

vstinner commented Oct 22, 2018

vstinner commented Oct 22, 2018

miss-islington commented Oct 22, 2018

vstinner commented Oct 23, 2018