Message 359696 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	p-ganssle
Recipients	cool-RR, p-ganssle
Date	2020-01-09.22:36:36
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1578609398.15.0.316165218675.issue39280@roundup.psfhosted.org>
In-reply-to

Content
I don't love the inconsistency, but can you elaborate on the actual danger posed by this? What security vulnerabilities involve parsing a datetime using a non-ascii digit? The reason that `fromisoformat` doesn't accept non-ASCII digits is actually because it's the inverse of `datetime.isoformat`, which never emits non-ASCII digits. For `strptime`, we're really going more for a general specifier for parsing datetime strings in a given format. I'll note that we do accept any valid unicode character for the date/time separator. From my perspective, there are a few goals, some of which may be in conflict with the others: 1. Mitigating security vulnerabilities, if they exist. 2. Supporting international locales if possible. 3. Improving consistency in the API. If no one ever actually specifies datetimes in non-ascii locales (and this gravestone that includes the date in both Latin and Chinese/Japanese characters seems to suggest otherwise: https://jbnewall.com/wp-content/uploads/2017/02/LEE-MONUMENT1-370x270.jpg ), then I don't see a problem dropping our patchy support, but I think we need to carefully consider the backwards compatibility implications if we go through with that. One bit of evidence in favor of "no one uses this anyway" is that no one has yet complained that apparently this doesn't work for "%d" even if it works for "%y", so presumably it's not heavily used. If our support for this sort of thing is so broken that no one could possibly be using it, I suppose we may as well break it all the way, but it would be nice to try and identify some resources that the documentation can point to for how to handle international date parsing. > Note the "unique and unambiguous". By accepting non-Ascii digits, we're breaking the uniqueness requirement of ISO 8601. I think in this case "but the standard says X" is probably not a very strong argument. Even if we were coding to the ISO 8601 standard (I don't think we claim to be, we're just using that convention), I don't really know how to interpret the "unique" portion of that claim, considering that ISO 8601 specifies dozens of ways to represent the same datetime. Here's an example from [my `dateutil.parse.isoparse` test suite](https://github.com/dateutil/dateutil/blob/110a09b4ad46fb87ae858a14bfb5a6b92557b01d/dateutil/test/test_isoparser.py#L150): ``` '2014-04-11T00', '2014-04-10T24', '2014-04-11T00:00', '2014-04-10T24:00', '2014-04-11T00:00:00', '2014-04-10T24:00:00', '2014-04-11T00:00:00.000', '2014-04-10T24:00:00.000', '2014-04-11T00:00:00.000000', '2014-04-10T24:00:00.000000' ``` All of these represent the exact same moment in time, and this doesn't even get into using the week-number/day-number configurations or anything with time zones. They also allow for the use of `,` as the subsecond-component separator (so add 4 more variants for that) and they allow you to leave out the dashes between the date components and the colons between time components, so you can multiply the possible variants by 4. Just a random aside - I think there may be strong arguments for doing this even if we don't care about coding to a specific standard.

I don't love the inconsistency, but can you elaborate on the actual *danger* posed by this? What security vulnerabilities involve parsing a datetime using a non-ascii digit?

The reason that `fromisoformat` doesn't accept non-ASCII digits is actually because it's the inverse of `datetime.isoformat`, which never *emits* non-ASCII digits. For `strptime`, we're really going more for a general specifier for parsing datetime strings in a given format. I'll note that we do accept any valid unicode character for the date/time separator.

From my perspective, there are a few goals, some of which may be in conflict with the others:

1. Mitigating security vulnerabilities, if they exist.
2. Supporting international locales if possible.
3. Improving consistency in the API.

If no one ever actually specifies datetimes in non-ascii locales (and this gravestone that includes the date in both Latin and Chinese/Japanese characters seems to suggest otherwise: https://jbnewall.com/wp-content/uploads/2017/02/LEE-MONUMENT1-370x270.jpg ), then I don't see a problem dropping our patchy support, but I think we need to carefully consider the backwards compatibility implications if we go through with that.

One bit of evidence in favor of "no one uses this anyway" is that no one has yet complained that apparently this doesn't work for "%d" even if it works for "%y", so presumably it's not heavily used. If our support for this sort of thing is so broken that no one could possibly be using it, I suppose we may as well break it all the way, but it would be nice to try and identify some resources that the documentation can point to for how to handle international date parsing.

> Note the "unique and unambiguous". By accepting non-Ascii digits, we're breaking the uniqueness requirement of ISO 8601.

I think in this case "but the standard says X" is probably not a very strong argument.
Even if we were coding to the ISO 8601 standard (I don't think we claim to be, we're just using that convention), I don't really know how to interpret the "unique" portion of that claim, considering that ISO 8601 specifies dozens of ways to represent the same datetime. Here's an example from [my `dateutil.parse.isoparse` test suite](https://github.com/dateutil/dateutil/blob/110a09b4ad46fb87ae858a14bfb5a6b92557b01d/dateutil/test/test_isoparser.py#L150):

```
'2014-04-11T00',
'2014-04-10T24',
'2014-04-11T00:00',
'2014-04-10T24:00',
'2014-04-11T00:00:00',
'2014-04-10T24:00:00',
'2014-04-11T00:00:00.000',
'2014-04-10T24:00:00.000',
'2014-04-11T00:00:00.000000',
'2014-04-10T24:00:00.000000'
```

All of these represent the exact same moment in time, and this doesn't even get into using the week-number/day-number configurations or anything with time zones. They also allow for the use of `,` as the subsecond-component separator (so add 4 more variants for that) and they allow you to leave out the dashes between the date components and the colons between time components, so you can multiply the possible variants by 4.

Just a random aside - I think there may be strong arguments for doing this even if we don't care about coding to a specific standard.

History
Date	User	Action	Args
2020-01-09 22:36:38	p-ganssle	set	recipients: + p-ganssle, cool-RR
2020-01-09 22:36:38	p-ganssle	set	messageid: <1578609398.15.0.316165218675.issue39280@roundup.psfhosted.org>
2020-01-09 22:36:38	p-ganssle	link	issue39280 messages
2020-01-09 22:36:36	p-ganssle	create