I'm not experienced with the import machinery. Here is a preliminary patch which
implements my idea for a particular case.

The performance effect is almost as good as manual caching in a global.

>>> import timeit
>>> def f():
...      import locale
>>> min(timeit.repeat(f, number=100000, repeat=10))

Of course, the patch breaks tests.

> It's possible there is room for other optimisations that don't break the
> import override semantics (such as a fast path for when __import__ is the
> standard import function).

Good idea.
