Message 373270 - Python tracker


Author	zzzeek
Recipients	Rokas K. (rku), crusaderky, djarb, jab, jcea, martin.panter, njs, yselivanov, zzzeek
Date	2020-07-08.03:44:20
Message-id	<1594179862.3.0.875058244526.issue22239@roundup.psfhosted.org>
In-reply-to

Content

> Oh, I thought the primary problem for SQLAlchemy supporting async is that the ORM needs to do IO from inside __getattr__ methods. So I assumed that the reason you were so excited about greenlets was that it would let you use await_() from inside those __getattr__ calls, which would involve exposing your use of greenlets as part of your public API.


The primary problem is that people want to execute() a SQL statement using await, and then they want to use a non-blocking database driver (basically asyncpg; I'm not sure there are any others, maybe there's one for MySQL also) on the back end. Tools like aiopg have provided partial SQLAlchemy-like front-ends to accomplish this, but they can't do ORM support. That's not because the ORM has lazy loading, but because even explicit operations like query.all() or session.flush() can sometimes require a lot of front-to-back database operations to complete, and it would be very involved to rewrite all that code using async/await.

Then there's the secondary problem of ORMs doing lazy loading, which is what you refer to as "IO inside __getattr__ methods". SQLAlchemy is not actually as dependent on lazy loading as other ORMs, as we support a wide range of ways to "eagerly" load data up front. With the SQLAlchemy 2.0-style ORM API, which has a clear spot for "await" to occur, they can call "await session.execute(select(SomeObject))" and get a whole traversable graph of things loaded up front. We even have a loader called "raiseload" that is specifically anti-lazy loading: it's a loader that raises an error if you try to access something that wasn't explicitly loaded already. So for a lot of cases we are already there.

But then, towards your example of "something.b = x", or more commonly in ORMs a get operation like "something.b" emitting SQL, the extension I'm building will very likely include some kind of feature that lets them do this with an explicit call. At the moment, with the preliminary code that's in there, this might look like:

   await greenlet_spawn(getattr, something, "b")

Not very pretty at the moment, but that's the general idea.

But the thing is, greenlet_spawn() can naturally apply to anything. So it remains to be seen both how I would want to present this notion and whether people are going to be interested in it or not, but as a totally extra thing beyond the "await session.execute()" API that is the main thing, someone could do something like this:

   await greenlet_spawn(my_business_orm_method)

and then in "my_business_orm_method()", all the blocking-style ORM things that async advocates warn against could be happening in there. I'm certainly not going to tell people they have to be doing that, but I don't think I should discourage it either, because if the above business method is written "reasonably" (see next paragraph), there really is no problem introduced by implicit IO.

By "written reasonably" I'm referring to the fact that in this whole situation, 90% of everything people are doing here is in the context of HTTP services. The problem of "something.a now creates state that other tasks might see" is not a real "problem" that is solved by using IO-only explicit context switching. This is because in a CRUD-style application, "something" is not going to be a process-local yet thread-global object that had to be created specifically for the application (there are things like the database connection pool and some registries that the ORM uses, but those problems are taken care of and aren't specific to one particular application). There is certainly going to be global mutable state with the CRUD/HTTP application, which is the database itself. Event-based programming doesn't save you from concurrency issues here, because any number of processes may be accessing the database at the same time. There are well-established concurrency patterns one uses with relational databases, which include first and foremost transaction isolation, but also things like compare-and-swap, "select for update", ensuring MVCC is turned on (SQL Server), table locks, etc. These techniques are independent of the concurrency pattern used within the application, and they are arguably better suited to blocking-style code, because on the database side we must emit our commands within a transaction serially in any case. The major convenience of "async", that we can fire off a bunch of web service requests in parallel, does not apply to the CRUD-style business methods within our web service request, because we can only do things in our ACID transaction one at a time.

The problem of "something.a" emitting IO needs to be made sane against other processes also viewing or altering "something.a", assuming "something" is a database-bound object like a row in a table, using traditional database concurrency constructs such as choosing an appropriate isolation mode, using atomically-composed SQL statements, things like that.   The problem of two greenlets or coroutines seeing "something" before it's been fully altered would happen across two processes in any case, but if "something" is a database row, that second greenlet would not see "something.a / something.b" in mid-flight because the isolation level is going to be at least "read committed".
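To make the "compare-and-swap" and "atomically-composed SQL statement" constructs mentioned above concrete, here is a minimal sketch. It is not SQLAlchemy code: the stdlib sqlite3 module stands in for a real database, and the table layout, the `version` column and the `compare_and_swap()` helper are all hypothetical names chosen for illustration.

```python
import sqlite3


def compare_and_swap(conn, row_id, expected_version, new_value):
    """Update the row only if its version is still the one we read earlier.

    The check and the write happen in a single UPDATE statement, so the
    database decides atomically whether this writer wins.
    """
    cur = conn.execute(
        "UPDATE something SET a = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_value, row_id, expected_version),
    )
    conn.commit()
    # rowcount is 1 if our expected version matched, 0 if someone got there first
    return cur.rowcount == 1


# Hypothetical table: one row with an explicit version counter.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE something (id INTEGER PRIMARY KEY, a TEXT, version INTEGER NOT NULL)"
)
conn.execute("INSERT INTO something (id, a, version) VALUES (1, 'old', 1)")
conn.commit()

# Two writers both read version 1; only the first update is accepted.
first = compare_and_swap(conn, 1, 1, "writer-1")
second = compare_and_swap(conn, 1, 1, "writer-2")
final_row = conn.execute("SELECT a, version FROM something WHERE id = 1").fetchone()
```

The point of the sketch is that the guard lives in the SQL statement itself, so it behaves the same whether the two writers are threads, greenlets, coroutines or separate processes.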

In the realm of Python HTTP/CRUD applications, async is actually very popular; however, it is in the form of gevent and sometimes eventlet monkeypatching, often because people are using async web servers like gunicorn. I don't see much explicit async at all because, as mentioned before, there are very few async database drivers and there are also very few async database abstraction layers. I've sort of made a side business at work out of helping people with the problems of gevent-enabled HTTP services. There are two problems that I see. The main one is that they configure their workers for 1000 greenlets, they set their database connection pool to only allow 20 database connections, and then their processes get totally hung as all the requests pile up in one process that is advertising that it still has 980 more requests it can service. The other one is that their application is completely CPU bound, and sometimes so badly that we see database timeouts because their greenlets can't respond to a database ping or authentication challenge within 30 seconds. I have never seen any issues related to the fact that IO is implicit or that lazy loading confused someone. Maybe this is a thing if they had some kind of microservice-parallel HTTP request spawning monster of some kind, but we don't have that kind of thing in CRUD applications.
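The first problem (1000 workers against a pool of 20 connections) can be reproduced in miniature with nothing but asyncio. This is only a sketch under stated assumptions: an `asyncio.Semaphore` stands in for the connection pool, `asyncio.sleep()` stands in for the database round trip, and `handle_request()`, `simulate()` and the `stats` dictionary are hypothetical names.

```python
import asyncio


async def handle_request(pool, stats):
    """One simulated request: wait for a connection, then 'use' it briefly."""
    stats["waiting"] += 1
    stats["peak_waiting"] = max(stats["peak_waiting"], stats["waiting"])
    async with pool:  # blocks here once every connection is checked out
        stats["waiting"] -= 1
        stats["in_db"] += 1
        stats["peak_in_db"] = max(stats["peak_in_db"], stats["in_db"])
        await asyncio.sleep(0.001)  # stand-in for the database round trip
        stats["in_db"] -= 1


async def simulate(workers, pool_size):
    pool = asyncio.Semaphore(pool_size)  # stand-in for the connection pool
    stats = {"waiting": 0, "in_db": 0, "peak_waiting": 0, "peak_in_db": 0}
    await asyncio.gather(*(handle_request(pool, stats) for _ in range(workers)))
    return stats


stats = asyncio.run(simulate(workers=1000, pool_size=20))
print(stats)
```

However many requests the process accepts, the number actually talking to the database never exceeds the pool size; the rest simply queue up inside the process. The same shape applies to greenlets under gevent, which is why the issue is independent of whether IO is explicit or implicit.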

The two aforementioned problems, too many greenlets or coroutines vs. what the application can actually handle, would occur just as much with an explicit async driver, and that's fine; I know how to debug these cases. But in any case, people are already writing huge CRUD apps that run under gevent. As for my secondary idea, that someone can run their app using asyncio and then on an *as needed* basis put some of the more CRUD-like methods into greenlets with blocking-style code: this is an *improvement* over the current state of affairs, where everything everywhere is implicit IO. Not only that, but they can use this already common programming style while interacting with a database driver that is *designed for async*. Right now everyone uses pymysql because it is pure Python and therefore can have all of its socket / IO related code monkeypatched by gevent. It's bad. Whether or not one thinks writing HTTP services using greenlets is a good idea, it is definitely better to do it using a database driver that is designed for async and talks to the database without any monkeypatching. My approach makes this possible where it has previously not been possible at all, so I think this represents a big improvement to an already popular programming pattern, while at the same time introducing the notion of a single application using both explicit and implicit approaches simultaneously.

I think it is a good thing that someone who really wants to use async/await in order to carefully schedule how they communicate with other web services and resources, which often need to be loaded in parallel, can then write their transactional CRUD code, which is necessarily serial in any case, in blocking style. This style of code is already prevalent, and here we'd be giving an application the ability to use both styles simultaneously. I had always hoped that Python's move towards asyncio would allow this programming paradigm to flourish, as it seems inherently useful.


> If you're just talking about using greenlets internally and then writing both sync and async shims to be your public API, then obviously that reduces the risks. Maybe greenlets will cause you problems, maybe not, but either way you know what you're getting into and the decision only affects you :-). But, if that's all you're using them for, then I'm not sure that they have a significant advantage over the edgedb-style synchronous wrapper or the unasync-style automatically generated sync code.

W.r.t. the issue of writing everything as async and then using the coroutine primitives to convert to "sync" as a means of maintaining both facades, I don't think that covers the fact that most DBAPI drivers are sync only (and not monkeypatchable either, but I think we all agree here that monkeypatching is terrible in any case). To suit the much more common use case of sync front end -> agnostic middle -> sync driver, going from an async event loop to a blocking IO database driver requires a thread executor of some kind. The other way around, where the library code is written in "sync" and you attach "async" to both ends of it using greenlets in the middle, is a much more lightweight transition than the transition of async internals out to a sync-only driver.
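For reference, the "thread executor" route from an async event loop out to a blocking driver looks roughly like the sketch below. It uses only the standard library: sqlite3 plays the part of a sync-only DBAPI driver, and `blocking_query()` and `main()` are hypothetical names. It illustrates the heavier direction of the two transitions described above, not the greenlet-based one.

```python
import asyncio
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def blocking_query(db_path):
    """Plain synchronous DBAPI code; it blocks whichever thread runs it."""
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute("SELECT 1 + 1").fetchone()[0]
    finally:
        conn.close()


async def main():
    loop = asyncio.get_running_loop()
    # Each blocking call occupies a worker thread until the driver returns,
    # so the event loop itself stays free to run other coroutines.
    with ThreadPoolExecutor(max_workers=4) as executor:
        return await loop.run_in_executor(executor, blocking_query, ":memory:")


result = asyncio.run(main())
print(result)
```

Every call that crosses from the event loop to the driver pays for a thread hand-off here, which is the cost being contrasted with switching between greenlets inside a single thread.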

History
Date	User	Action	Args
2020-07-08 03:44:22	zzzeek	set	recipients: + zzzeek, jcea, djarb, njs, jab, martin.panter, yselivanov, Rokas K. (rku), crusaderky
2020-07-08 03:44:22	zzzeek	set	messageid: <1594179862.3.0.875058244526.issue22239@roundup.psfhosted.org>
2020-07-08 03:44:22	zzzeek	link	issue22239 messages
2020-07-08 03:44:20	zzzeek	create