classification
Title: Don't allow datetime parsing to accept non-Ascii digits
Type: security Stage: resolved
Components: Library (Lib) Versions: Python 3.9, Python 3.8, Python 3.7
process
Status: closed Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: cool-RR, mark.dickinson, methane, p-ganssle, steven.daprano, vstinner
Priority: normal Keywords: patch

Created on 2020-01-09 21:10 by cool-RR, last changed 2020-07-17 17:11 by cool-RR. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 17931 closed cool-RR, 2020-01-09 22:14
Messages (10)
msg359695 - (view) Author: Ram Rachum (cool-RR) * Date: 2020-01-09 21:10
I've been doing some research into the use of `\d` in regular expressions in CPython, and any security vulnerabilities that might happen as a result of the fact that it accepts non-Ascii digits like ٢ and 5.

In most places in the CPython codebase, the `re.ASCII` flag is used for such cases, thus ensuring the `re` module prohibits these non-Ascii digits. Personally, my preference is to never use `\d` and always use `[0-9]`. I think that it's rule that's more easy to enforce and less likely to result in a slipup, but that's a matter of personal taste.

I found a few places where we don't use the `re.ASCII` flag and we do accept non-Ascii digits.

The first and less interesting place is platform.py, where we define patterns used for detecting versions of PyPy and IronPython. I don't know how anyone would exploit that, but personally I'd change that to a [0-9] just to be safe. I've opened bpo-39279 for that. 

The more sensitive place is the `datetime` module. 

Happily, the `datetime.datetime.fromisoformat` function rejects non-Ascii digits. But the `datetime.datetime.strptime` function does not: 

    from datetime import datetime
    
    time_format = '%Y-%m-%d'
    parse = lambda s: datetime.strptime(s, time_format)
       
    x = '٢019-12-22'
    y = '2019-12-22'
    assert x != y
    assert parse(x) == parse(y)
    print(parse(x))
    # Output: 2019-12-22 00:00:00

If user code were to check for uniqueness of a datetime by comparing it as a string, this is where an attacker could fool this logic, by using a non-Ascii digit.

Two more interesting points about this: 

1. If you'd try the same trick, but you'd insert ٢ in the day section instead of the year section, Python would reject that. So we definitely have inconsistent behavior.
2. In the documentation for `strptime`, we're referencing the 1989 C standard. Since the first version of Unicode was published in 1991, it's reasonable not to expect the standard to support digits that were introduced in Unicode.

If you'd scroll down in that documentation, you'll see that we also implement the less-known ISO 8601 standard, where `%G-%V-%u` represents a year, week number, and day of week. The `%G` is vulnerable:
    
    from datetime import datetime
    
    time_format = '%G-%V-%u'
    parse = lambda s: datetime.strptime(s, time_format)
   
    x = '٢019-53-4'
    y = '2019-53-4'
    assert x != y
    assert parse(x) == parse(y)
    print(parse(x))
    # Output: 2020-01-02 00:00:00

I looked at the ISO 8601:2004 document, and under the "Fundamental principles" chapter, it says:

    This International Standard gives a set of rules for the representation of
        time points
        time intervals
        recurring time intervals.
        Both accurate and approximate representations can be identified by means of unique and unambiguous expressions specifying the relevant dates, times of day and durations.  

Note the "unique and unambiguous". By accepting non-Ascii digits, we're breaking the uniqueness requirement of ISO 8601.
msg359696 - (view) Author: Paul Ganssle (p-ganssle) * (Python committer) Date: 2020-01-09 22:36
I don't love the inconsistency, but can you elaborate on the actual *danger* posed by this? What security vulnerabilities involve parsing a datetime using a non-ascii digit?

The reason that `fromisoformat` doesn't accept non-ASCII digits is actually because it's the inverse of `datetime.isoformat`, which never *emits* non-ASCII digits. For `strptime`, we're really going more for a general specifier for parsing datetime strings in a given format. I'll note that we do accept any valid unicode character for the date/time separator.

From my perspective, there are a few goals, some of which may be in conflict with the others:

1. Mitigating security vulnerabilities, if they exist.
2. Supporting international locales if possible.
3. Improving consistency in the API.

If no one ever actually specifies datetimes in non-ascii locales (and this gravestone that includes the date in both Latin and Chinese/Japanese characters seems to suggest otherwise: https://jbnewall.com/wp-content/uploads/2017/02/LEE-MONUMENT1-370x270.jpg ), then I don't see a problem dropping our patchy support, but I think we need to carefully consider the backwards compatibility implications if we go through with that.

One bit of evidence in favor of "no one uses this anyway" is that no one has yet complained that apparently this doesn't work for "%d" even if it works for "%y", so presumably it's not heavily used. If our support for this sort of thing is so broken that no one could possibly be using it, I suppose we may as well break it all the way, but it would be nice to try and identify some resources that the documentation can point to for how to handle international date parsing.


> Note the "unique and unambiguous". By accepting non-Ascii digits, we're breaking the uniqueness requirement of ISO 8601.

I think in this case "but the standard says X" is probably not a very strong argument. 
 Even if we were coding to the ISO 8601 standard (I don't think we claim to be, we're just using that convention), I don't really know how to interpret the "unique" portion of that claim, considering that ISO 8601 specifies dozens of ways to represent the same datetime. Here's an example from [my `dateutil.parse.isoparse` test suite](https://github.com/dateutil/dateutil/blob/110a09b4ad46fb87ae858a14bfb5a6b92557b01d/dateutil/test/test_isoparser.py#L150):

```
    '2014-04-11T00',
    '2014-04-10T24',
    '2014-04-11T00:00',
    '2014-04-10T24:00',
    '2014-04-11T00:00:00',
    '2014-04-10T24:00:00',
    '2014-04-11T00:00:00.000',
    '2014-04-10T24:00:00.000',
    '2014-04-11T00:00:00.000000',
    '2014-04-10T24:00:00.000000'
```

All of these represent the exact same moment in time, and this doesn't even get into using the week-number/day-number configurations or anything with time zones. They also allow for the use of `,` as the subsecond-component separator (so add 4 more variants for that) and they allow you to leave out the dashes between the date components and the colons between time components, so you can multiply the possible variants by 4.

Just a random aside - I think there may be strong arguments for doing this even if we don't care about coding to a specific standard.
msg359697 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2020-01-09 22:38
> If user code were to check for uniqueness of a datetime by comparing it as a string, this is where an attacker could fool this logic, by using a non-Ascii digit.

To me, this seems like a pretty thin justification for calling this a security vulnerability.

Using the exact same reasoning, one could argue that "If user code were to check for uniqueness of a float by comparing it as a string, this is where an attacker could fool this logic, by using leading or trailing spaces, extra non-significant digits, upper- or lowercase 'E', etc."

py> float(" +00012.145000000000000099999e00 ") == float("12.145")
True

Referring specifically to strptime(), there are many format codes which break uniqueness by allowing optional leading zeroes, and month names are case insensitive e.g. %b accepts 'jAn' as well as 'Jan'.

https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior

As far as the inconsistency, I think that's an argument for being less strict, not more, and allowing non-ASCII digits in more places not just the first. Why shouldn't (let's say) a Bengali user specify the day of the month using Bengali digits?
msg359698 - (view) Author: Ram Rachum (cool-RR) * Date: 2020-01-09 23:00
> To me, this seems like a pretty thin justification for calling this a security vulnerability.

I know that lots of exploits were made because of bugs in Python's URL parsing, and security releases of Python were made because of that. It's possible that similar exploits could be done for bugs in parsing datetimes. It's different, and it's a thin justification, I agree, so if it's treated as a non-security bug, I'm okay with that. 


> ISO 8601 specifies dozens of ways to represent the same datetime

Yes, but not within the same format. If someone were to choose the format '2014-04-10T24:00:00', they would have a reasonable expectation that there is only one unique string that corresponds with that datetime.

> As far as the inconsistency, I think that's an argument for being less strict, not more, and allowing non-ASCII digits in more places not just the first. Why shouldn't (let's say) a Bengali user specify the day of the month using Bengali digits?

That's an interesting direction. It might be a good thing to allow users in different locales to specify datetime with their native digits. But, if we were to go in that direction, we would have to: 

1. Define that use case more rigorously, consider whether it has any effects we need to consider, like clashes with how the local culture expresses dates. Maybe we accept the year 2020 in Elbonian digits, but the Elbonian people count their years from the birth of Saint Elbon, and for them the year 2020 in Elbonian digits means -1553 BC for us?
2. Enable it in other fields except year.
3. Add tests for it. 

Since we're not going to do that ever, I don't think there's a point in leaving a 10% implementation of non-Ascii datetime parsing.
msg359707 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2020-01-10 07:58
> > ISO 8601 specifies dozens of ways to represent the same datetime
> 
> Yes, but not within the same format. If someone were to choose the 
> format '2014-04-10T24:00:00', they would have a reasonable expectation 
> that there is only one unique string that corresponds with that 
> datetime.

There are no format code in '2014-04-10T24:00:00', it cannot be a 
format. Without knowing the format, there is no way of telling what that 
string represents or whether it is a unique representation.

- Is the leading 0 in '04' mandatory or optional?
- Could the '00's be replaced by '0' or even left out altogether?
- Is 24:00:00 the exact same time as 00:00:00?
- If the final '00' represents seconds, does it accept microseconds '.000'?

> > As far as the inconsistency, I think that's an argument for being 
> > less strict, not more, and allowing non-ASCII digits in more places 
> > not just the first. Why shouldn't (let's say) a Bengali user specify 
> > the day of the month using Bengali digits?
> 
> That's an interesting direction. It might be a good thing to allow 
> users in different locales to specify datetime with their native 
> digits. But, if we were to go in that direction, we would have to:
> 
> 1. Define that use case more rigorously, consider whether it has any 
> effects we need to consider, like clashes with how the local culture 
> expresses dates.

I don't think so. Supporting different calendars is a seperate, and far 
more complex issue, than just supporting dumb transliteration of 
Hindu-Arabic digits 0...9 into other character sets.

This makes datetime behave more like int(), float(), etc, all of which 
accept non-ASCII digits including (e.g.) East-Asian double-width digits.

    py> int('5432')
    5432

Supporting non-Western calendars is a worthy thing, but probably best 
left for a third-party library.

> 2. Enable it in other fields except year.

That part is easy: replace `[0-9]` in regexes with `\d`.

> 3. Add tests for it. 

Aye, that would need to be done. Adding the tests would be the hard 
part.
msg359715 - (view) Author: Ram Rachum (cool-RR) * Date: 2020-01-10 09:38
Okay, since it seems like I'm the only one who wants this change, I'll let it go. Thanks for your input.
msg359729 - (view) Author: Paul Ganssle (p-ganssle) * (Python committer) Date: 2020-01-10 15:15
> Yes, but not within the same format. If someone were to choose the format '2014-04-10T24:00:00', they would have a reasonable expectation that there is only one unique string that corresponds with that datetime

That's a particularly bad example, because it's exactly the same as another string with the exact same format:

  2014-04-11T00:00:00

Since ISO 8601 allows you to specify midnight (and only midnight) using previous day + 24:00. Admittedly, that is the only ambiguity I know of offhand (though it's a huge spec) *for a given format*, but also ISO 8601 does not really have a concept of format specifiers, so it's not like there's a way to unambiguously specify the format you are intending to use.

Either way, I think we can explicitly dispense with "there will be an exact mapping between a given (format_str, datetime_str) pair and the datetime it produces" as a goal here. I can't think of any good reason you'd want that property, nor have we made any indication that I can see that we provide it (probably the opposite, since there are some formats that explicitly ignore whitespace).

> Okay, since it seems like I'm the only one who wants this change, I'll let it go. Thanks for your input.

I wouldn't go that far. I think I am +0 or +1 on this change, I just wanted to be absolutely clear *why* we're doing this. I don't want someone pointing at this thread in the future and saying, "Core dev says that it's a bug in their code if they don't follow X standard / if more than one string produces the same datetime / etc".

I think the strongest argument for making this or a similar change is that I'm fairly certain that we don't have the bandwidth to handle internationalized dates and I don't think we have much to gain by doing a sort of half-assed version of that by accepting unicode transliterations of numerals and calling it a day. I think there are tons of edge cases here that could bite people, and if we don't support this *now* I'd rather give people an error message early in the process and try to point people at a library that is designed to handle datetime localization issues. If all we're going to do is switch [0-9] to \d (which won't work for the places where it's actually [1-9], mind you), I think people will get a better version of that with something like:

  def normalize_dt_str(dt_str):
      return "".join(str(int(x)) if x.isdigit() else x
                     for x in dt_str)

There are probably more robust and/or faster versions of this, but it's probably roughly equivalent to what we'd be doing here *anyway*, and at least people would have to opt-in to this.

I am definitely open to us supporting non-ASCII digits in strptime if it would be useful at the level of support we could provide, but given that it's currently broken for any reasonable use case and as far as I know no one has complained, we're better off resolving the inconsistency by requiring ASCII digits and considering non-ASCII support to be a separate feature request.

CC-ing Inada on this as unicode guru and because he might have some intuition about how useful non-ASCII support might be. The only place I've seen non-ASCII dates is in Japanese graveyards, and those tend to use Chinese numerals (which don't match \d anyway), though Japanese and Korean also tends to make heavier use of "full-width numerals" block, so maybe parsing something like "2020-02-02" is an actual pain point that would be improved by this change (though, again, I suspect that this is just the beginning of the required changes and we may never get a decent implementation that supports unicode numerals).
msg359730 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2020-01-10 15:40
> If all we're going to do is 
> switch [0-9] to \d (which won't work for the places where it's 
> actually [1-9], mind you)

Ah, that's a good point.

[...]
> we're better off resolving the 
> inconsistency by requiring ASCII digits and considering non-ASCII 
> support to be a separate feature request.

This seems reasonable.
msg361123 - (view) Author: Ram Rachum (cool-RR) * Date: 2020-01-31 17:38
Hi guys,

Can we please have a decision on whether we want to move forward with this issue or not? I have a patch in the PR, and I'll be happy to continue working on it, or close this issue if that's the decision.


Thanks,
Ram.
msg361271 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-02-03 08:31
> Hi guys,

sidenote: please avoid "hi guys" which is not inclusive, see https://heyguys.cc/ Thanks ;-)
History
Date User Action Args
2020-07-17 17:11:22cool-RRsetstatus: open -> closed
stage: patch review -> resolved
2020-02-03 21:40:21mark.dickinsonsetnosy: + mark.dickinson
2020-02-03 08:31:04vstinnersetmessages: + msg361271
2020-01-31 17:38:36cool-RRsetmessages: + msg361123
2020-01-10 15:40:49steven.dapranosetmessages: + msg359730
2020-01-10 15:15:51p-gansslesetnosy: + methane
messages: + msg359729
2020-01-10 09:38:43cool-RRsetmessages: + msg359715
2020-01-10 08:12:49vstinnersetnosy: + vstinner
2020-01-10 07:58:47steven.dapranosetmessages: + msg359707
2020-01-09 23:00:24cool-RRsetmessages: + msg359698
2020-01-09 22:38:43steven.dapranosetnosy: + steven.daprano
messages: + msg359697
2020-01-09 22:36:38p-gansslesetnosy: + p-ganssle
messages: + msg359696
2020-01-09 22:14:58cool-RRsetkeywords: + patch
stage: patch review
pull_requests: + pull_request17337
2020-01-09 21:10:55cool-RRcreate