
Author p-ganssle
Recipients belopolsky, eric.araujo, eryksun, lemburg, p-ganssle, terry.reedy, zzzeek
Date 2021-03-20.00:05:31
Content
> That it allows creating the datetime instance looks like a bug to me, i.e. a time before 0001-01-01 00:00 UTC is invalid. What am I misunderstanding?

`datetime.datetime(1, 1, 1, tzinfo=timezone(timedelta(hours=1)))` is a valid datetime; it just cannot be converted into every other time zone, because in some time zones the same absolute time is out of datetime's range.
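
To make this concrete, here is a minimal sketch (stdlib only) of the behavior in question: construction succeeds, but converting the same instant to UTC overflows, because the equivalent UTC time would fall before `datetime.min`:

    from datetime import datetime, timedelta, timezone

    dt = datetime(1, 1, 1, tzinfo=timezone(timedelta(hours=1)))
    print(dt)                    # 0001-01-01 00:00:00+01:00 -- constructs fine
    dt.astimezone(timezone.utc)  # OverflowError: the UTC equivalent would be
                                 # 0000-12-31 23:00, before datetime.min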

`datetime.datetime` is a representation of an abstract datetime, and it can also be annotated with a time zone to basically tag the civil time with a function for converting it into other representations of the same *absolute* time. The range of valid `datetime.datetime` objects is based entirely on the naïve portion of the datetime, and has nothing to do with the absolute time. So this is indeed a natural consequence of the chosen design.
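
To illustrate (a small sketch, stdlib only): the extremes of the range accept any fixed offset at construction time, because only the naïve fields are validated:

    from datetime import datetime, timedelta, timezone

    # Only the naive fields (year through microsecond) are range-checked, so
    # both of these construct fine without the offset ever being consulted:
    datetime(1, 1, 1, tzinfo=timezone(timedelta(hours=23)))
    datetime(9999, 12, 31, tzinfo=timezone(timedelta(hours=-23)))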

If we wanted to change things, it would cause a number of problems, and the cure would be much worse than the "disease". For one thing, UTC offsets are computed lazily, so `.utcoffset()` is not currently called during `datetime` construction. The datetime documentation lays out that this is consistent with the raison d'être of `datetime`: "While date and time arithmetic is supported, the focus of the implementation is on efficient attribute extraction for output formatting and manipulation." To determine whether a given `datetime` can always be converted to an equivalent datetime in any time zone, we would have to eagerly compute its UTC offset, which would be a major performance regression in creating aware datetimes.

We could avoid most of that cost by only doing the `.utcoffset()` check when the datetime is within 2 days of `datetime.min` or `datetime.max`, but while that would be a more minor performance regression, it would add new edge cases where `.utcoffset()` is sometimes called in the constructor and sometimes not, which is not ideal. And if we were ever to widen the allowed return values for `.utcoffset()`, the logic could get hairier still (depending on the nature of the allowed values).
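
The laziness is easy to observe directly. In this sketch, `LoudTZ` is a hypothetical stand-in (not anything in the stdlib) that announces when its offset is computed; note that the constructor never calls it:

    from datetime import datetime, timedelta, tzinfo

    class LoudTZ(tzinfo):
        """Hypothetical tzinfo that reports when its offset is computed."""
        def utcoffset(self, dt):
            print("utcoffset() called")
            return timedelta(hours=1)
        def dst(self, dt):
            return timedelta(0)
        def tzname(self, dt):
            return "LOUD"

    dt = datetime(2021, 3, 20, tzinfo=LoudTZ())  # prints nothing
    dt.utcoffset()                               # prints "utcoffset() called"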

Another issue with "fixing" this is that it would turn currently-valid datetimes into invalid ones, which breaks backwards compatibility. I imagine such values mostly show up in test suites, since TZ-aware datetimes near 1 CE and 10,000 CE are anachronistic and not likely to be of much instrumental value, but the same can be said of these potentially "invalid" dates in the first place.

Additionally (and worse), even naïve datetimes can be converted to UTC or other time zones, so if we want to add a new constraint that `some_datetime.astimezone(some_timezone)` must always work, you wouldn't even be able to *construct* `datetime.min` or `datetime.max`: `datetime.min.astimezone(timezone(timedelta(hours=-23)))` would fail essentially everywhere (an offset of a full ±24 hours is already rejected by `timezone` itself), and, worse still, the minimum datetime value you could construct would depend on your system's time zone! Again, the alternative would be to make an exception for naïve datetimes, but given that this change is of dubious value to start with, I don't think it is worth it.
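
As a sketch of that failure mode: `astimezone()` on a naïve datetime first interprets it as local time, and the intermediate arithmetic falls outside the supported range on any system whose local offset is greater than -23:00 (i.e., virtually all of them):

    from datetime import datetime, timedelta, timezone

    # A naive datetime is treated as local time by astimezone(); converting
    # datetime.min to a large negative fixed offset fails with OverflowError
    # (or possibly OSError, depending on how the platform reports the local
    # offset for a year-1 date).
    datetime.min.astimezone(timezone(timedelta(hours=-23)))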

> So I'm pretty sure this is "not a bug" but it's a bit of a problem and I have a user suggesting the "security vulnerability" bell on this one, and to be honest I don't even know what any library would do to "prevent" this.

I don't really know why it would be a "security vulnerability", but presumably a library could convert datetimes to UTC as soon as it gets them from the user, if it wants to treat them as UTC later; or it could simply refuse any datetime outside the range `datetime.datetime.min + timedelta(hours=48) < dt.replace(tzinfo=None) < datetime.datetime.max - timedelta(hours=48)`; or, if the concern is only about UTC, refuse datetimes outside the range `datetime.datetime.min.replace(tzinfo=timezone.utc) < dt < datetime.datetime.max.replace(tzinfo=timezone.utc)`.
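
A minimal sketch of the first range check (the name `in_safe_range` is just illustrative):

    from datetime import datetime, timedelta, timezone

    def in_safe_range(dt):
        """Accept only datetimes whose naive portion is at least 48 hours
        away from datetime's bounds, so that converting between any two
        allowed fixed offsets stays in range."""
        naive = dt.replace(tzinfo=None)
        return (datetime.min + timedelta(hours=48)
                < naive <
                datetime.max - timedelta(hours=48))

    assert in_safe_range(datetime(2021, 3, 20, tzinfo=timezone.utc))
    assert not in_safe_range(datetime(1, 1, 1, tzinfo=timezone.utc))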

> Why's this a security problem? ish? Because PostgreSQL has a data type "TIMESTAMP WITH TIME ZONE", and if you take said date and INSERT it into your database, then SELECT it back using any Python DBAPI that returns datetime() objects, like psycopg2, if your server is in a timezone with zero or negative offset compared to the given date, you get an error. So the mischievous user can create that datetime for some reason, and now they've broken your website, which can't SELECT that table anymore without crashing.

Can you clarify why this crashes? Is it because it always returns the datetime value in UTC?

> So, suppose you maintain the database library that helps people send data in and out of psycopg2. We have the end user's application, we have the database abstraction library, we have the psycopg2 driver, we have Python's datetime() object with MINYEAR, and finally we have PostgreSQL with the TIMESTAMP WITH TIME ZONE datatype that I've never liked.
>
> Of those five roles, whose bug is this? I'd like to say it's the end user's, for letting untrusted input put unusual timezones and timestamps in their system. But at the same time it's a little weird that Python is letting me create this date that lacks the ability to convert into UTC. Thanks for reading!

It sounds like the fundamental issue here is that PostgreSQL supports a different range of datetimes than Python does, regardless of the question of whether any Python datetime can be converted into another time zone. That's a mismatch between PostgreSQL's data type and Python's data type, and I'm not sure how they are squared up now. If the database abstraction layer guarantees that anything you can INSERT into the database can be SELECTed back out (meaning that it artificially restricts the range of valid values to the intersection of those supported by Postgres and Python), and it lets you choose what time zone applies when you read values out, then it should probably make sure it doesn't write anything that can't be represented in every such time zone.

If the abstraction layer's guarantee is that it can read out anything that can be stored in a Postgres database, then I still think it might be the abstraction layer's bug, since apparently Postgres lets you store values that cannot be represented in all of the supported time zones you could read them out as; the abstraction layer may need to use a different type (possibly a subclass) for "out-of-range" values.

I'll note that I don't actually understand the difference between the "abstraction layer" and the "psycopg2 driver", so I may be conflating the two, but at the end of the day I think one of them is the right place to fix this. If there are good reasons for not fixing it in either of those layers, then obviously the buck stops in the user's application, though it would be better to fix it lower in the stack, since it's difficult for users to keep track of minutiae like this. I think it likely that neither CPython nor Postgres is the right level of abstraction to fix it at, since neither is in the business of making its datetime type compatible with the other's (and each presumably has a thousand other integrations to worry about).

Hopefully that's helpful information! Understandable if the answer is somewhat disappointing.