Message 359695 - Python tracker


Author	cool-RR
Recipients	cool-RR
Date	2020-01-09.21:10:55
Message-id	<1578604255.88.0.113689913662.issue39280@roundup.psfhosted.org>
In-reply-to

Content

I've been researching the use of `\d` in regular expressions in CPython, and any security vulnerabilities that might result from the fact that it accepts non-ASCII digits like ٢ and ５.

In most places in the CPython codebase, the `re.ASCII` flag is used for such cases, ensuring that the `re` module rejects these non-ASCII digits. Personally, my preference is to never use `\d` and always use `[0-9]`. I think it's a rule that's easier to enforce and less likely to result in a slip-up, but that's a matter of personal taste.
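To make the difference concrete, here is a small sketch comparing the three spellings. It uses only the standard `re` module; the helper name `matches` and the sample strings are mine, picked for illustration.

```python
import re

UNICODE_DIGITS = re.compile(r'\d+')          # str patterns are Unicode-aware by default
ASCII_FLAG = re.compile(r'\d+', re.ASCII)    # re.ASCII restricts \d to [0-9]
EXPLICIT_RANGE = re.compile(r'[0-9]+')       # no flag needed

def matches(pattern, text):
    """Return True if the whole string matches the pattern."""
    return pattern.fullmatch(text) is not None

ascii_text = '25'
arabic_indic = '\u0662' + '5'   # ARABIC-INDIC DIGIT TWO followed by ASCII 5
fullwidth = '\uff15'            # FULLWIDTH DIGIT FIVE

for text in (ascii_text, arabic_indic, fullwidth):
    print(
        ascii(text),
        matches(UNICODE_DIGITS, text),
        matches(ASCII_FLAG, text),
        matches(EXPLICIT_RANGE, text),
    )
```

Only the plain `\d+` pattern accepts the two non-ASCII samples; the other two spellings accept just `'25'`.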

I found a few places where we don't use the `re.ASCII` flag and we do accept non-ASCII digits.

The first and less interesting place is `platform.py`, where we define patterns used for detecting versions of PyPy and IronPython. I don't know how anyone would exploit that, but personally I'd change that to a `[0-9]` just to be safe. I've opened bpo-39279 for that.

The more sensitive place is the `datetime` module. 

Happily, the `datetime.datetime.fromisoformat` function rejects non-ASCII digits. But the `datetime.datetime.strptime` function does not:

    from datetime import datetime
    
    time_format = '%Y-%m-%d'
    parse = lambda s: datetime.strptime(s, time_format)
       
    x = '٢019-12-22'
    y = '2019-12-22'
    assert x != y
    assert parse(x) == parse(y)
    print(parse(x))
    # Output: 2019-12-22 00:00:00

If user code were to check datetimes for uniqueness by comparing them as strings, an attacker could fool that logic by using a non-ASCII digit.
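Until this is settled, user code can defend itself. Here is a minimal sketch of one way to do that, assuming the input is expected to be pure ASCII. `strict_strptime` is a hypothetical helper of mine, not something in the stdlib, and it would be too strict for formats like `%B` in locales with non-ASCII month names.

```python
from datetime import datetime

def strict_strptime(date_string, time_format):
    """Hypothetical helper: refuse any non-ASCII input before parsing."""
    if not date_string.isascii():   # str.isascii() needs Python 3.7+
        raise ValueError(
            f'non-ASCII characters in date string: {date_string!r}'
        )
    return datetime.strptime(date_string, time_format)

time_format = '%Y-%m-%d'
print(strict_strptime('2019-12-22', time_format))

try:
    strict_strptime('\u0662019-12-22', time_format)   # leading Arabic-Indic two
except ValueError as error:
    print('rejected:', error)
```

The other option is to check uniqueness on the parsed `datetime` objects rather than on the raw strings.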

Two more interesting points about this: 

1. If you try the same trick but insert ٢ in the day section instead of the year section, Python rejects it. So we definitely have inconsistent behavior.
2. In the documentation for `strptime`, we're referencing the 1989 C standard. Since the first version of Unicode was published in 1991, it's reasonable not to expect the standard to support digits that were introduced in Unicode.
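The inconsistency in point 1 can be probed with a short script. This is only a sketch: the helper `accepts` and the choice of positions are mine, and since the result may differ between Python versions I'm not listing an expected output.

```python
from datetime import datetime

ARABIC_INDIC_TWO = '\u0662'   # the non-ASCII digit used in the example above
TIME_FORMAT = '%Y-%m-%d'

def accepts(date_string):
    """Return True if strptime parses the string, False if it raises ValueError."""
    try:
        datetime.strptime(date_string, TIME_FORMAT)
    except ValueError:
        return False
    return True

baseline = '2019-12-22'
# Index of an ASCII '2' inside each field of the baseline string.
positions = {'year': 0, 'month': 6, 'day': 8}

results = {
    field: accepts(baseline[:index] + ARABIC_INDIC_TWO + baseline[index + 1:])
    for field, index in positions.items()
}
print(results)
```

Each entry in `results` says whether `strptime` accepted the string with ٢ swapped into that field.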

If you scroll down in that documentation, you'll see that we also implement the lesser-known ISO 8601 standard, where `%G-%V-%u` represents a year, week number, and day of week. The `%G` is vulnerable:
    
    from datetime import datetime
    
    time_format = '%G-%V-%u'
    parse = lambda s: datetime.strptime(s, time_format)
   
    x = '٢019-53-4'
    y = '2019-53-4'
    assert x != y
    assert parse(x) == parse(y)
    print(parse(x))
    # Output: 2020-01-02 00:00:00

I looked at the ISO 8601:2004 document, and under the "Fundamental principles" chapter, it says:

    This International Standard gives a set of rules for the representation of
        time points
        time intervals
        recurring time intervals.
        Both accurate and approximate representations can be identified by means of unique and unambiguous expressions specifying the relevant dates, times of day and durations.  

Note the "unique and unambiguous". By accepting non-ASCII digits, we're breaking the uniqueness requirement of ISO 8601.

History
Date	User	Action	Args
2020-01-09 21:10:55	cool-RR	set	recipients: + cool-RR
2020-01-09 21:10:55	cool-RR	set	messageid: <1578604255.88.0.113689913662.issue39280@roundup.psfhosted.org>
2020-01-09 21:10:55	cool-RR	link	issue39280 messages
2020-01-09 21:10:55	cool-RR	create