PyUnicode_FromWideChar incorrect for characters outside the BMP (unix only) #48724
On systems (Linux, OS X) where sizeof(wchar_t) is 4 and wchar_t arrays are Note that PyUnicode_FromWideChar is used to process command-line Here's an OS X 10.5 Terminal session (current directory is the root of the dickinsm$ cat test𐅭.py (In case the character after 'test' and before '.py' isn't showing up |
Comments from MvL in bpo-4388:
|
it is fine on linux (tested with UTF-8 codeset for locale): |
s/USC-4/UCS-4/g |
Interesting. Which version of Python is that? And is PyUNICODE 2 bytes |
Just to be clear, the defect in PyUnicode_FromWideChar is present both in The problem with command-line arguments only occurs in Python 3.x, since I can reproduce the 'No such file or directory' error on both OS X and |
This is due to the function downcasting the wchar_t values to Most Unixes ship with UCS4 builds, so you don't see the problem there. UCS2 builds are also the default build on Unix, so if you compile Python |
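The downcast described above can be illustrated with a small standalone sketch. It is not the code from unicodeobject.c: the type names unit16 and wide32 and both helper functions are hypothetical stand-ins for a 16-bit Py_UNICODE and a 32-bit wchar_t, and U+1010D is only an illustrative code point outside the BMP.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins: a 16-bit code unit (like Py_UNICODE on a
   UCS2 build) and a 32-bit wide character (like wchar_t on Linux/OS X). */
typedef uint16_t unit16;
typedef uint32_t wide32;

/* The defect: a plain cast keeps only the low 16 bits, so U+1010D
   silently becomes U+010D. */
static unit16 truncate_to_unit(wide32 ch)
{
    return (unit16)ch;
}

/* The usual remedy: encode one code point as UTF-16.
   Returns the number of 16-bit units written to out (1 or 2). */
static size_t encode_utf16(wide32 ch, unit16 out[2])
{
    if (ch > 0xFFFF) {
        ch -= 0x10000;
        out[0] = (unit16)(0xD800 | (ch >> 10));   /* high surrogate */
        out[1] = (unit16)(0xDC00 | (ch & 0x3FF)); /* low surrogate */
        return 2;
    }
    out[0] = (unit16)ch;
    return 1;
}
```

With these helpers, truncate_to_unit(0x1010D) yields 0x010D, while encode_utf16(0x1010D, out) writes the pair 0xD800, 0xDD0D, which preserves the original character.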
Marc-Andre explains it all. For the record, my version is from trunk. In the report, the output from Python shows character 010d (UCS-2). Maybe the issue does not affect versions before 3.0. |
Patch fixing PyUnicode_FromWideChar() for UCS-2 build: create |
Note: I wrote my patch against py3k r68646. |
Thanks for the patch, Victor! Looks pretty good at first glance, except that it seems that the UTF-32 to A test would be good, too. |
#ifdef HAVE_USABLE_WCHAR_T
memcpy(unicode->str, w, size * sizeof(wchar_t));
#else
...
#endif
I understand this code as: sizeof(wchar_t) == sizeof(Py_UNICODE). If I
PyUnicode_FromWideChar() is not a public API. Should I write a function in |
Yep, sorry. You're right.
I was actually thinking of a test for the "No such file or directory" |
On 2009-01-17 14:00, STINNER Victor wrote:
It is a public C API. Regardless of this aspect, we should always |
On 2009-01-17 14:00, STINNER Victor wrote:
If HAVE_USABLE_WCHAR_T is defined, Py_UNICODE is defined as wchar_t, That said, if Py_UNICODE is the same as wchar_t, no conversion is |
Updated patch including a test in _testcapi module: create two |
I ran my test on py3k on Linux with a 32-bit wchar_t:
Can someone test with a 16-bit wchar_t (e.g. Windows)? I think that the |
(with the full patch, all tests pass with 16 or 32 bits Py_UNICODE) |
Looks good to me. I'm not in a position to test with 16-bit wchar_t, but I can't see why Some minor whitespace issues in the unicodeobject.c part of the patch Marc-André, is it okay with you to check this in? |
On 2009-01-18 22:59, Mark Dickinson wrote:
I'd structure the patch differently, ie. put the whole support code
into a single #ifndef Py_UNICODE_WIDE section as part of the
#ifdef HAVE_USABLE_WCHAR_T pre-processor statement. Also note that on platforms with 16-bit wchar_t, the comparison
BTW: Please always use upper-case hex literals, or at least don't
Thanks, Marc-Andre Lemburg |
Yes, but the right test is (SIZEOF_WCHAR_T > 2). I wrote a new test:
#if (Py_UNICODE_SIZE == 2) && (SIZEOF_WCHAR_T > 2)
#define USE_WCHAR_SURROGATE
const wchar_t *orig_w;
#endif
I try, but it would be easier if the rule was already respected: they
Patch version 3:
|
@marketdickinson, @lemburg: ping! I updated the patch, does it look |
On 2009-01-26 17:56, STINNER Victor wrote:
Yes, but there are a few things that still need fixing:
* SIZEOF_WCHAR_T is not defined for Windows builds, so needs
to be added to PC/pyconfig.h (OTOH wchar_t is 2 bytes on Windows)
* USE_WCHAR_SURROGATE should be #defined just before the
function and #undef'ed right after it; I'd also use a more
accurate name
* please use pre-processor indents, e.g.
#ifdef ...
# define ...
#endif
I'd write
#if (Py_UNICODE_SIZE == 2) && defined(SIZEOF_WCHAR_T) && (SIZEOF_WCHAR_T > 2)
# define CONVERT_WCHAR_TO_SURROGATES
#endif
...
#undef CONVERT_WCHAR_TO_SURROGATES
Thanks. |
For lemburg, updated patch:
|
Updated Victor's patch:
I find the patched version of PyUnicode_FromWideChar quite hard to follow |
On 2009-02-24 20:39, Mark Dickinson wrote:
Same here. It would be better to have a single #ifdef #else #endif No need for a new helper function. |
Yes, of course it would. :) |
New patch, with two separate versions of PyUnicode_FromWideChar. |
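As a rough sketch of the "two separate versions" layout discussed above (not the committed patch), the function can be selected once by a single pre-processor guard. Everything here is an assumption for illustration: UNIT_SIZE and WIDE_SIZE stand in for Py_UNICODE_SIZE and SIZEOF_WCHAR_T, unit_t and wide_t for Py_UNICODE and wchar_t, and from_wide is a hypothetical helper that writes into a caller-supplied buffer instead of creating a unicode object.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical configuration, standing in for Py_UNICODE_SIZE and
   SIZEOF_WCHAR_T as they would come from pyconfig.h. */
#define UNIT_SIZE 2
#define WIDE_SIZE 4

typedef uint16_t unit_t; /* stand-in for a 16-bit Py_UNICODE */
typedef uint32_t wide_t; /* stand-in for a 32-bit wchar_t */

#if (UNIT_SIZE == 2) && defined(WIDE_SIZE) && (WIDE_SIZE > 2)
#  define CONVERT_WIDE_TO_SURROGATES
#endif

#ifdef CONVERT_WIDE_TO_SURROGATES

/* Version 1: wide characters may need two output units.
   Pass out == NULL to count the units needed; pass a buffer of that
   size to perform the conversion. Returns the number of units. */
static size_t from_wide(const wide_t *w, size_t n, unit_t *out)
{
    size_t i, used = 0;
    for (i = 0; i < n; i++) {
        if (w[i] > 0xFFFF) {
            if (out != NULL) {
                wide_t ch = w[i] - 0x10000;
                out[used] = (unit_t)(0xD800 | (ch >> 10));
                out[used + 1] = (unit_t)(0xDC00 | (ch & 0x3FF));
            }
            used += 2;
        }
        else {
            if (out != NULL)
                out[used] = (unit_t)w[i];
            used += 1;
        }
    }
    return used;
}

#else

/* Version 2: every wide character fits in one output unit,
   so a plain element-wise copy is enough. */
static size_t from_wide(const wide_t *w, size_t n, unit_t *out)
{
    size_t i;
    if (out != NULL) {
        for (i = 0; i < n; i++)
            out[i] = (unit_t)w[i];
    }
    return n;
}

#endif

#undef CONVERT_WIDE_TO_SURROGATES
```

A caller would typically call from_wide(w, n, NULL) first to size the buffer, then call it again with the allocated buffer. Keeping each version whole avoids interleaving #ifdef blocks inside one function body, which is the readability concern raised in the review.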
On 2009-02-24 21:50, Mark Dickinson wrote:
Thanks, much better :-) |
I don't understand why SIZEOF_WCHAR_T could be unset, but the patch |
Good catch! Added defined(SIZEOF_WCHAR_T) to the testcapi code as well, |
Committed to py3k, r70452. Since this is partway between a bugfix and a new feature, I suggest that |
Backported to the trunk in r70454. Thanks, all! |