classification
Title: Enable non-ASCII extension module names
Type: enhancement Stage: resolved
Components: Interpreter Core Versions: Python 3.5
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Arfrever, Suzumizaki, amaury.forgeotdarc, brett.cannon, eric.snow, jkloth, ncoghlan, scoder, serhiy.storchaka, vstinner
Priority: normal Keywords: patch

Created on 2014-02-02 14:20 by Suzumizaki, last changed 2017-06-13 23:10 by ncoghlan. This issue is now closed.

Files
File name Uploaded Description
20140202_patch_for_python_default_branch_88885.patch Suzumizaki, 2014-02-02 14:20 Enable "import <NON-ASCII>.pyd" patch for default branch
20140202_patch_for_python3.3_88884.patch Suzumizaki, 2014-02-02 14:21 Enable "import <NON-ASCII>.pyd" patch for 3.3 branch
sample_importing_non_ascii_pyd.zip Suzumizaki, 2014-02-02 14:27 Samples
Messages (20)
msg209988 - (view) Author: Suzumizaki (Suzumizaki) Date: 2014-02-02 14:20
Currently, the names of .pyd modules are limited to 7-bit US-ASCII.

I want "import X" to import X.pyd when X contains non-ASCII Unicode characters.
I wrote a patch and it seems to work.
I am attaching the patch to this issue; what should I do next?

About the solution:
To keep the export entry 'PyInit_xxx' within 7-bit ASCII, the patch uses a simple encoding for Unicode, which it calls 'szm62'.

1) 'szm62' is used once per module (.pyd), only for 'PyInit_xxx'.
2) 'szm62' is used only when the module name contains non-ASCII characters.
3) 'szm62' produces strings about as short as UTF-8, except that 0-9A-Za-z are encoded as 2 bytes.
4) 'szm62' is very simple, much easier than UTF-8.
5) I tested it only with MS VC++, but I believe it is highly compatible with other environments.
6) 'szm62' can also decode its 7-bit output back to Unicode, but only encoding is used for this issue.

Notes:
Simplicity is important for projects like Cython, which generates .pyd files.
Code points above 16 bits are also supported simply: they are encoded as 4 alnum (0-9A-Za-z) characters.
0-9A-Za-z are (always) encoded as 2 alnums: they are simply prefixed with '0' (U+0030).
When building 'Non-ASCII.pyd' with the MSVC toolkit, the linker reports 'LINK : warning LNK4232'. This is not a problem. The warning says "When someone tries to load the library with the LoadLibraryEx'A' function, it may or may not fail, depending on the user locale." Our Python.exe always uses LoadLibraryEx'W', so it never fails because of the locale.

If you have any questions, please ask.

Regards,
Suzumizaki-Kimitaka
msg209998 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-02-02 16:14
I think that if you need a module with a non-ASCII name, you can wrap it in a Python module.

=== späm.py ===
from _spam import *
from _spam import __all__, __doc__
=== späm.py ===
msg210002 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2014-02-02 17:53
I'd use a much simpler encoding.
Maybe something like
    name.encode('unicode-escape').replace(b'\\', b'_')
As you said, simplicity is important for tools which generate code!
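As a rough illustration of the one-liner above, this is how it behaves on a few names. The helper name `escape_name` is made up for this sketch and is not part of any patch.

```python
def escape_name(name: str) -> bytes:
    # Hypothetical helper: apply the suggested expression unchanged.
    return name.encode('unicode-escape').replace(b'\\', b'_')

print(escape_name('spam'))      # b'spam'
print(escape_name('späm'))      # b'sp_xe4m'
print(escape_name('\u3010'))    # b'_u3010'
```

Non-ASCII code points below U+0100 take 4 characters, other BMP code points take 6, and code points beyond the BMP take 10.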
msg210059 - (view) Author: Suzumizaki (Suzumizaki) Date: 2014-02-03 00:22
Thank you for the replies.

The hack in msg209998 is interesting, but how should submodules be named in non-Latin languages while keeping them readable to native speakers? X(

The reason I don't use something like "name.encode('unicode-escape').replace(b'\\', b'_')" is the length limit on identifiers.

In fact, Visual C++ accepts 2047 chars (bytes) and gcc has no logical limit. But PEP 7 says we should use C89, and even C99 only guarantees that the first 63 bytes are significant. I don't know what C89 says, and my C99 reference is the draft below, so the real C99 may differ:
http://www.open-std.org/JTC1/SC22/wg14/www/docs/n1124.pdf

If we have to respect the C99 rule above, 63 chars are too few for a 'unicode-escape'-like encoding. 'PyInit_' takes 7, which leaves 56. If each character is encoded as 5 chars like '_3010', we can use only 11 Unicode code points; with 6 chars each, only 9.
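
The arithmetic above can be checked with a small sketch. The 63-character figure and the 'PyInit_' prefix come from this message; the function and constant names are invented for illustration.

```python
SIGNIFICANT_CHARS = 63   # limit assumed in the message above
PREFIX = 'PyInit_'       # 7 characters

def max_code_points(chars_per_code_point: int) -> int:
    # Characters left for the module name once the prefix is spent.
    budget = SIGNIFICANT_CHARS - len(PREFIX)
    return budget // chars_per_code_point

print(max_code_points(5))   # 11, e.g. '_3010' per code point
print(max_code_points(6))   # 9, e.g. '_u3010' per code point
```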

a) If we can break C99, or if the real C89/C99 have no 63-char rule, we can simply do as Amaury Forgeot d'Arc says.
b) If we must keep the 63-char rule, encodings longer than 'szm62' are not acceptable.
c) Of course, if there is no reason the entry point has to be 'PyInit_modulename', using a fixed name such as 'PyInit' or 'PyInitUnicodeNamedModule', without the individual module name, is another acceptable idea.
msg210064 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-02-03 00:59
The PyInit_NAME symbol is not the only place where NAME is used. The NAME is also present in the PyModuleDef structure. It looks like Python expects UTF-8 here. You encode it to UTF-8 and use "\xHH\xHH\xHH..." syntax to keep ASCII encoding for the C file? The NAME may also be mentioned in docstrings, C comments, type names, etc.

I don't like the idea of a new encoding just for one very specific function in C. There are already too many encodings in the world :-( The C language supports non-ASCII identifiers, but I don't know how they are encoded in the symbol table. I would prefer to rely on the C compiler if you would like to play in the playground of non-ASCII identifiers.

In Python/dynload_win.c, _PyImport_GetDynLoadWindows() uses GetProcAddressA().

Is it a theoretical feature request, or do you really have a Python module with a non-ASCII name?

I'm not sure that it's really useful to support non-ASCII module names for C modules, even though I spent many months supporting non-ASCII module names for Python modules :-)
msg210076 - (view) Author: Suzumizaki (Suzumizaki) Date: 2014-02-03 03:28
Thank you for the reply, STINNER.

> You encode it to UTF-8 and use "\xHH\xHH\xHH..." syntax to keep ASCII
> encoding for the C file? The NAME may also be mentioned in docstrings,
> C comments, type names, etc.

The main purpose of this issue is "I want to use Cython like Python without any trouble." You don't have to worry about the points mentioned above. I have made a patch for Cython that converts them automatically, and I will fix it as needed.

The sample C code I posted uses UTF-8 directly only OUTSIDE of "quotations", but that can be changed if we really have to.

> I don't like the idea of a new encoding just for one very specific
> function in C. There are already too many encodings in the world :-(

Of course I will accept any encoding and/or any solution that resolves this issue. I made a new encoding only to stay within the constraints as far as I can, and to avoid limiting names to be too short when they use non-ASCII characters.

The patch doesn't include an encoding module of any kind. For this issue, decoding is not required inside Python.

> The C language supports non-ASCII identifiers, but I don't know how they
> are encoded in the symbol table. 

That's why we should resolve this problem, shouldn't we? Also, the standards don't define anything about the symbol table.

> I would prefer to rely on the C compiler if you would like to play in the
> playground of non-ASCII identifiers.

The problem is that we CAN'T do as you say. Or at least, if you really think so, every ASCII restriction on dynamic loading should be removed.

>In Python/dynload_win.c, _PyImport_GetDynLoadWindows() uses GetProcAddressA().

_PyImport_GetDynLoadWindows() seems to be called only to resolve the PyInit_xxx entry, from _PyImport_LoadDynamicModule() in Python/importdl.c. I have already handled that in the patch I posted.

>Is it a theoretical feature request, or do you really have a Python module with
>a non-ASCII name?

As I said, NO to the first, YES to the second. I have many '<non-ASCII>.py' files that I want to convert to '<non-ASCII>.pyd' files using Cython.

> I'm not sure that it's really useful to support non-ASCII module names
> for C modules, even though I spent many months supporting non-ASCII module
> names for Python modules :-)

That is because you are an expert in both English and Python. Thanks a lot for your daily work on Python!

Thank you for reading this long description.
msg210082 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-02-03 07:09
> The hack in msg209998 is interesting, but how should submodules be named in non-Latin languages while keeping them readable to native speakers? X(

It is left to your discretion. You can use idna, punycode, utf-7, szm62 or romaji.
msg210096 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2014-02-03 08:31
Updating the C extension loading API to take advantage of PEP 451 is on the
to do list for 3.5, so I'll see if we can do something about this as well.
However, as Victor noted, it will depend on whether or not we can figure
out a compiler independent cross platform way to look up a non-ASCII symbol
in the extension module's symbol table.
msg210120 - (view) Author: Suzumizaki (Suzumizaki) Date: 2014-02-03 12:39
Thanks for taking this issue into account for PEP 451.

To be honest, I can't imagine why or how this issue (or my patches) would cause any problems, especially compatibility problems. If someone can point them out, I will try to resolve them.

Note that I extend only the definition of "PyInit_xxxx". I don't touch the code for loading modules.
msg210124 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2014-02-03 13:14
As Victor noted, inventing our own encoding scheme just for this use case
isn't desirable, although it's certainly a good fallback option that will
ensure the feature remains feasible even if trying to handle the Unicode
issues at the C compiler level proves too challenging.

The other aspect is that changes to the extension module initialisation API
always need to go into a PEP regardless, since we need to ensure such
changes are usable for both handwritten extensions and extension module
generators like Cython, cffi and SWIG.
msg210125 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-02-03 13:34
Oh, the topic was already discussed some years ago. Start:
https://mail.python.org/pipermail/python-dev/2011-May/111279.html
msg210177 - (view) Author: Suzumizaki (Suzumizaki) Date: 2014-02-04 04:24
Thank you, Nick. I understand that the behavior for this issue should be specified in a PEP.

By the way, can I continue the discussion here, or is there a more suitable place for the PEP?
msg210209 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2014-02-04 11:50
import-sig@python.org would be the appropriate list for this one. 

However, we can't do anything about it until Python 3.5 next year at the earliest, and I'm already planning to write a follow-up to http://www.python.org/dev/peps/pep-0451/ that adapts the extension module import mechanism to support those APIs (addressing a number of longstanding feature requests from the Cython developers).

That said, this is an independent proposal, so if you were willing to write it up as a separate PEP, that would probably be a good idea. Our two choices to consider would be:

1. Using a custom 7-bit ASCII compatible encoding to support this on arbitrary C compilers (at the cost of making the identifiers unintuitive). (i.e. the approach in your patch)

2. Using the "Universal Character Name" support originally specified in C99, but retained in C11 (these are the \Uxxxxxxxx and \uxxxx escapes familiar from Python text literals). Note that *CPython* still won't need to be compiled with a compiler that supports UCN for this to work - we'll just need the dynamic linking API to support us looking for a symbol containing such a name.

Option 2 is what I think we *should* do, but there will be some research involved in figuring out how good the current support for UCN C identifiers is in at least gcc, clang and Visual Studio 2013, as well as what the dynamic linker APIs support in terms of passing identifiers containing Unicode escapes to be looked up in the exported symbols.
msg210211 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-02-04 11:59
> we *should* do, but there will be some research involved in figuring out how good the current support for UCN C identifiers is in at least gcc, clang and Visual Studio 2013

Python 3.4 uses Visual Studio 2010. I'm not sure that you can build an
extension with VS 2013 if Python was built with VS 2010.
msg210220 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2014-02-04 12:30
Oh, you're right - I temporarily forgot that the C runtime compatibility was compiler version specific on Windows. So such an approach *would* require updating the CPython compiler on Windows to at least VS2013 for 3.5. Still, we're likely to want to do that anyway - VS2010 will be as old in 2015 as VS2008 is now, and the latter is already causing hassles for building 2.7 extension modules.
msg210221 - (view) Author: Suzumizaki (Suzumizaki) Date: 2014-02-04 12:50
Neither Visual Studio 2012 nor 2013 can be installed on Windows Vista. Is that OK with you, even though Vista is supported until April 2017?
msg210292 - (view) Author: Suzumizaki (Suzumizaki) Date: 2014-02-05 08:52
Thank you, Victor, for msg210125. I read the May 2011 discussion on the mailing list.

Those messages point to an earlier discussion on the tracker:
"On Windows, don't encode filenames in the import machinery"
http://bugs.python.org/issue11619

Here are my notes, which might be helpful when reviewing the discussions.

-- About Windows CE --
* The Windows CE series had only GetProcAddress() at first.
* GetProcAddressA() was added in Windows CE 3.0.
* The Python community chose the 'A' version to support Windows CE.
* Windows CE continues as Windows Embedded Compact today.
* But Python 3 for Windows CE does not seem to be distributed.

-- About Windows Desktop and Servers --
* Windows Desktops and Servers have only GetProcAddress(), with neither an A nor a W suffix.
* GetProcAddress() on Windows Desktop and Servers takes LPCSTR as the 2nd parameter.
* But the parameter, in this case, is a null-terminated block of bytes, interpreted as neither MBCS nor UTF-8.
* Visual C++ 2010 encodes non-ASCII export symbols as UTF-8.
* For the two reasons in the two lines above, we can pass a UTF-8 encoded string to GetProcAddress().
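
A minimal sketch of what "a UTF-8 encoded string" means here, assuming the exported symbol is simply 'PyInit_' followed by the module name. The helper name is hypothetical, and the actual GetProcAddress() call is left out because it only runs on Windows.

```python
def utf8_symbol_name(module_name: str) -> bytes:
    # Bytes that would be handed to GetProcAddress() as the LPCSTR argument.
    return ('PyInit_' + module_name).encode('utf-8')

print(utf8_symbol_name('spam'))   # b'PyInit_spam'
print(utf8_symbol_name('späm'))   # b'PyInit_sp\xc3\xa4m'
```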

I checked the last fact on my Japanese editions of Windows:
* XP Home Edition (32bit)
* Vista Home Premium (64bit)
* Windows 8.1 Pro (64bit)

GetProcAddress (Windows CE)
The type of the 2nd parameter is LPC"W"STR, and the documentation says an LPCSTR version was added in CE 3.0.
http://msdn.microsoft.com/en-us/library/ms885634.aspx

GetProcAddress (Windows Desktop/Server)
The type of the 2nd parameter is LPCSTR, neither LPC"T"STR nor LPC"W"STR.
Note that the example there seems to use the TEXT macro incorrectly.
http://msdn.microsoft.com/en-us/library/windows/desktop/ms683212(v=vs.85).aspx

PythonCE (seems to have stopped at Python 2.5 compatibility)
http://pythonce.sourceforge.net/

Symbols seem to be encoded as UTF-8 inside Windows executables
https://mail.python.org/pipermail/python-dev/2011-May/111325.html

-- About C/C++ Standards --
* C99 says the significant length of identifiers is 63.
* C99 allows Unicode in identifiers.
* But it does not define how the \uNNNN or \UNNNNNNNN forms used in "quotations" are translated.
* C++11 defines u8"" literals: we can build a UTF-8 char* string inside u8"quotes" with \u escapes.
* But the encoding of source files is platform dependent.
* Also, how symbols are exported is platform dependent.

-- About C/C++ tool kits --
* Windows executables can contain 2048 chars per exported symbol.
* Visual C++ 2010 seems to encode exported symbols as UTF-8.
* gcc has no logical limit on the length of identifiers.
* Currently, Visual C++ 2010 and LLVM/Clang support using UTF-8 throughout the source code.
* gcc supports only the \uNNNN or \UNNNNNNNN forms.
* For the GetProcAddress() functions, see the notes about Windows above.
msg210293 - (view) Author: Suzumizaki (Suzumizaki) Date: 2014-02-05 09:11
Thank you, Nick, for msg210209.

I would like to try writing a PEP, but the work looks somewhat difficult. It may take some time.

BTW, the C/C++ standards leave the encoding of source code platform dependent. They don't define a "standard encoding of source code"...

This means we have to choose between two ways of resolving this issue: one gives up readability, the other allows a platform-dependent feature, namely writing the C code in UTF-8.
msg242049 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2015-04-26 07:26
PEP 489 (Redesigning extension module loading) includes the proposal to fix this by using punycode.
msg295968 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2017-06-13 23:10
PEP 489 was accepted and implemented, so Python 3.5+ supports non-ASCII extension module names as described in https://www.python.org/dev/peps/pep-0489/#export-hook-name
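
For reference, the export hook name rule from PEP 489 can be sketched in Python as below: ASCII names keep the 'PyInit_' prefix, while non-ASCII names use 'PyInitU_' followed by the punycode-encoded name with hyphens replaced by underscores. The function name is only illustrative.

```python
def initfunc_name(name: str) -> bytes:
    try:
        # ASCII-only module names keep the classic hook name.
        suffix = b'_' + name.encode('ascii')
    except UnicodeEncodeError:
        # Non-ASCII names: punycode, with '-' mapped to '_' so the
        # result is still a valid C identifier.
        suffix = b'U_' + name.encode('punycode').replace(b'-', b'_')
    return b'PyInit' + suffix

print(initfunc_name('spam'))   # b'PyInit_spam'
print(initfunc_name('späm'))   # an ASCII-only name starting with b'PyInitU_'
```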
History
Date User Action Args
2017-06-13 23:10:49 ncoghlan set status: open -> closed
    resolution: fixed
    messages: + msg295968
    stage: test needed -> resolved
2015-04-26 07:26:40 scoder set messages: + msg242049
2014-02-05 17:01:56 jkloth set nosy: + jkloth
2014-02-05 17:00:41 scoder set nosy: + scoder
2014-02-05 16:40:38 Arfrever set nosy: + Arfrever
2014-02-05 16:24:34 brett.cannon set title: Enable 'import <Non-ASCII>.pyd' -> Enable non-ASCII extension module names
    stage: test needed
2014-02-05 09:11:36 Suzumizaki set messages: + msg210293
2014-02-05 08:52:28 Suzumizaki set messages: + msg210292
2014-02-04 12:50:05 Suzumizaki set messages: + msg210221
2014-02-04 12:30:20 ncoghlan set messages: + msg210220
2014-02-04 11:59:40 vstinner set messages: + msg210211
2014-02-04 11:50:49 ncoghlan set messages: + msg210209
2014-02-04 04:24:11 Suzumizaki set messages: + msg210177
2014-02-03 13:34:36 vstinner set messages: + msg210125
2014-02-03 13:14:04 ncoghlan set messages: + msg210124
2014-02-03 12:39:26 Suzumizaki set messages: + msg210120
2014-02-03 08:31:43 ncoghlan set messages: + msg210096
2014-02-03 07:09:36 serhiy.storchaka set messages: + msg210082
2014-02-03 03:28:30 Suzumizaki set messages: + msg210076
2014-02-03 00:59:26 vstinner set type: behavior -> enhancement
    messages: + msg210064
    versions: - Python 3.3, Python 3.4
2014-02-03 00:22:24 Suzumizaki set messages: + msg210059
2014-02-02 17:53:05 amaury.forgeotdarc set nosy: + amaury.forgeotdarc
    messages: + msg210002
2014-02-02 16:14:05 serhiy.storchaka set nosy: + serhiy.storchaka
    messages: + msg209998
2014-02-02 15:11:35 pitrou set nosy: + brett.cannon, ncoghlan, vstinner, eric.snow
2014-02-02 14:27:47 Suzumizaki set files: + sample_importing_non_ascii_pyd.zip
2014-02-02 14:21:37 Suzumizaki set files: + 20140202_patch_for_python3.3_88884.patch
2014-02-02 14:20:28 Suzumizaki create