msg19177 - (view) |
Author: Mark Hammond (mhammond) * |
Date: 2003-11-29 01:24 |
The following snippet:
>>> u'@test-\u5171'.encode("mbcs", "strict")
'@test-?'
Should raise a UnicodeError. The errors param is
completely ignored, and the function always works as
though errors='replace'.
Attaching a test case, and the start of a patch. The
patch has a number of issues:
* I'm not sure what errors are considered 'mandatory'.
I have handled 'strict', 'ignore' and 'replace' -
however, 'ignore' and 'replace' currently are exactly
the same (ie, replace)
* The Windows functions don't tell us exactly what
character failed in the conversion. Thus, the
exception I raise implies the first character is the
one that failed. For the same reason, I have made no
attempt to support error callbacks.
Comments/guidance appreciated.
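
For readers without a Windows box, the error-handler semantics being asked for can be sketched with a stand-in codec. This is only an illustration under stated assumptions: `cp1252` replaces `mbcs` (which exists only on Windows), the code is Python 3, and `encode_or_error` is a hypothetical helper, not part of the patch.

```python
# 'mbcs' exists only on Windows, so 'cp1252' is used here purely as a
# stand-in to show what the standard error handlers are expected to do.
text = '@test-\u5171'

def encode_or_error(s, errors, codec='cp1252'):
    """Encode s, or describe the UnicodeEncodeError that was raised."""
    try:
        return s.encode(codec, errors)
    except UnicodeEncodeError as exc:
        return 'UnicodeEncodeError at position %d' % exc.start

print(encode_or_error(text, 'strict'))   # error: U+5171 is not in cp1252
print(encode_or_error(text, 'replace'))  # b'@test-?'
print(encode_or_error(text, 'ignore'))   # b'@test-'
```

With the behavior reported above, mbcs gives the 'replace' result for all three calls.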
|
msg19178 - (view) |
Author: Mark Hammond (mhammond) * |
Date: 2003-11-29 01:31 |
Attaching a patch. This patch also attempts to handle
Encode, but I haven't worked out how to exercise this
code-path - ie, what mbcs encoded string can I pass that can
not be converted to unicode?
As I mentioned, the patch has a few issues.
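
One possible way to exercise the decode path, sketched with a stand-in codec: in a double-byte code page, a lead byte with no trail byte cannot be decoded. This assumes Python 3 and uses `cp932` instead of `mbcs`; whether the Windows API rejects the same input for a given ANSI code page is not verified here. `try_decode` is a hypothetical helper.

```python
def try_decode(data, codec='cp932'):
    """Decode data strictly, or describe the UnicodeDecodeError raised."""
    try:
        return data.decode(codec, 'strict')
    except UnicodeDecodeError as exc:
        return 'UnicodeDecodeError: %s' % exc.reason

# A complete two-byte sequence decodes normally.
print(try_decode(b'\x82\xa0'))
# 0x81 is a lead byte in cp932; alone it is an incomplete sequence.
print(try_decode(b'\x81'))
```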
|
msg19179 - (view) |
Author: Thomas Heller (theller) * |
Date: 2003-11-29 15:18 |
No idea why this was assigned to me - unicode is certainly
not one of my strengths.
|
msg19180 - (view) |
Author: Martin v. Löwis (loewis) * |
Date: 2003-12-01 21:25 |
The conventional semantics of "ignore" would be "remove the
failing characters from the output". This would be difficult
to implement if the Microsoft API provides no detailed error
indication.
You could try to get more detailed error indication by
re-encoding the resulting string with a NULL buffer,
counting the number of characters that have successfully
been encoded, at least in the .decode case.
In the .encode case, you could try using \0 as the default
char. To my knowledge, no ACP ever uses \0 in a multi-byte
string.
What is the meaning of the WC_DEFAULTCHAR flag, in
WideCharToMultiByte, and why are you not using it?
I'm somewhat concerned with backwards compatibility, since
the mbcs codec has never returned errors. So this should be
applied to 2.4 only, and listed in whatsnew.tex.
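
The \0-as-default-character idea for the .encode case can be sketched in Python 3 as follows. Everything here is an assumption made for illustration: `lossy_encode` stands in for a converter that substitutes a default character without reporting positions (the real call would be WideCharToMultiByte), `cp932` stands in for the ANSI code page, and the handler name `nul_default_sketch` is made up.

```python
import codecs

def _nul_default(exc):
    # Mimic a converter whose default character is NUL.
    if isinstance(exc, UnicodeEncodeError):
        return ('\0' * (exc.end - exc.start), exc.end)
    raise exc

codecs.register_error('nul_default_sketch', _nul_default)

def lossy_encode(text, codec='cp932'):
    """Stand-in for a converter that reports no error positions."""
    return text.encode(codec, 'nul_default_sketch')

def encode_with_policy(text, errors='strict', codec='cp932'):
    """Derive strict/ignore/replace from the NUL markers in the output."""
    if '\0' in text:
        raise ValueError('sketch assumes the input contains no NUL')
    raw = lossy_encode(text, codec)
    if b'\0' not in raw:
        return raw                      # everything was encodable
    if errors == 'strict':
        raise UnicodeError('some character could not be encoded')
    if errors == 'ignore':
        return raw.replace(b'\0', b'')  # drop the failing characters
    if errors == 'replace':
        return raw.replace(b'\0', b'?')
    raise ValueError('unsupported error handler: %r' % errors)

print(encode_with_policy('a\u3042\u011f', 'ignore'))
print(encode_with_policy('a\u3042\u011f', 'replace'))
```

The sketch relies on the assumption stated above that no multi-byte sequence contains a \0 byte, and it rejects input that already contains NUL.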
|
msg82015 - (view) |
Author: Daniel Diniz (ajaksu2) * |
Date: 2009-02-14 11:35 |
Is this behavior still present? If so, is it still interesting to change it?
|
msg82133 - (view) |
Author: Mark Hammond (mhammond) * |
Date: 2009-02-14 22:40 |
It is still present, but I'm not sure what problems can be seen due to
this, so I can't comment on its desirability. It would also introduce a
backwards compatibility concern, but I've not enough experience to know
how much of a problem that would be in practice either.
|
msg106277 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2010-05-22 01:36 |
I patched py3k with mbcs_errors.patch (only encode_mbcs, not the decoder function) and most tests pass: I opened #8784 for the test_tarfile failure.
I don't think it's a problem that mbcs only supports a few error handlers, e.g. 'strict', 'replace' and 'errors' (but neither 'ignore' nor 'surrogateescape'). mbcs should be avoided anyway :-) It is kept for backward compatibility (with Python2). Python3 tries to avoid it by using the Unicode functions of the Windows API.
I don't know exactly where mbcs is still used in Python3. If mbcs becomes more strict and raises new errors, I would say that the problem comes from the program, not from the encoding, and the program should be fixed (especially if the "program" is the Python standard library).
About the backward compatibility with Python < 3.2: I don't know exactly if this change would be a problem or not. I bet that few people use (directly or indirectly) mbcs with Python 3.1 (on Windows), and few people (or nobody) would notice this change. And as I wrote, if someone notices a problem, it should be fixed in the function using mbcs, not in the codec.
|
msg106278 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2010-05-22 01:38 |
Since this change breaks backward compatibility, it's a very bad idea to change the mbcs codec in Python 2.7: removing this version from this issue.
|
msg106407 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2010-05-24 23:11 |
Updated version of the patch for py3k:
- don't accept "ignore" error handler anymore
- there is a FIXME near "mbcs_decode_error:"
The whole test suite passes with this patch.
|
msg107513 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2010-06-11 00:54 |
I worked again on the patch. I opened new issues to prepare the new mbcs codec:
- #8966: ctypes: remove implicit conversion between unicode and bytes
- #8967: Create PyErr_GetWindowsMessage() function
- #8969: Windows: use (mbcs in) strict mode to encode/decode filenames, and enable os.fsencode()
#8967 can be used to get the translated message of an mbcs encode error. PyErr_GetWindowsMessage() returns a PyUnicodeObject, whereas make_translate_exception() and PyUnicodeTranslateError_SetReason() expect a "char*". Another patch is required: translate_reason_unicode.patch (attached to this issue, not tested). But I don't think that the message is very important for now :-)
#8784 (tarfile/Windows: Don't use mbcs as the default encoding) is still open.
|
msg107517 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2010-06-11 01:28 |
New version of the patch:
- decode_mbcs() calls raise_translate_exception() to set the error (in the previous patch, I'm not sure that the error was set)
- include #8784 patch (tarfile uses utf-8 as the default encoding)
- ctypes: use mbcs in strict mode instead of ignore mode. This is just a workaround, the real fix is to remove the implicit conversion between bytes and characters: see #8966
The patch requires #8969 patch (use mbcs in strict mode to encode/decode filenames).
|
msg107611 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2010-06-12 00:01 |
Tim: are you interested in testing this patch?
|
msg107612 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2010-06-12 00:06 |
Updated the patch (I committed the patch on the tarfile module): version 3.
|
msg107957 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2010-06-16 22:21 |
Patch version 4:
- encode_mbcs() uses the WC_NO_BEST_FIT_CHARS flag in strict mode. Examples: ğ and ł are no longer replaced by g and l
- encode_mbcs() doesn't set *repr to NULL on encode error: the caller does anyway destroy it
- write more documentation about mbcs, especially about the error handlers and the changes in Python 3.2
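
To try the effect of WC_NO_BEST_FIT_CHARS outside the interpreter, a ctypes sketch along these lines may help. It is not the patch's code: `encode_acp` is a hypothetical helper, it only works on Windows, and it assumes a legacy (non-UTF-8) ANSI code page, since the API restricts flags and the used-default pointer for UTF-8.

```python
import ctypes
import sys

CP_ACP = 0                     # system ANSI code page, the one mbcs uses
WC_NO_BEST_FIT_CHARS = 0x0400  # value from the Windows SDK headers

def encode_acp(text, no_best_fit=True):
    """Return (encoded bytes, used_default flag) for the ANSI code page."""
    if sys.platform != 'win32':
        raise OSError('WideCharToMultiByte is only available on Windows')
    if not text:
        return b'', False
    from ctypes import wintypes
    w2m = ctypes.windll.kernel32.WideCharToMultiByte
    w2m.argtypes = [wintypes.UINT, wintypes.DWORD, wintypes.LPCWSTR,
                    ctypes.c_int, ctypes.c_char_p, ctypes.c_int,
                    ctypes.c_char_p, ctypes.POINTER(wintypes.BOOL)]
    w2m.restype = ctypes.c_int
    flags = WC_NO_BEST_FIT_CHARS if no_best_fit else 0
    nwide = len(text.encode('utf-16-le')) // 2  # length in UTF-16 units
    used_default = wintypes.BOOL(False)
    # First call: NULL output buffer, ask for the required size.
    size = w2m(CP_ACP, flags, text, nwide, None, 0, None, None)
    if size == 0:
        raise ctypes.WinError()
    buf = ctypes.create_string_buffer(size)
    # Second call: convert, and learn whether the default character
    # had to be substituted for anything.
    size = w2m(CP_ACP, flags, text, nwide, buf, size,
               None, ctypes.byref(used_default))
    if size == 0:
        raise ctypes.WinError()
    return buf.raw[:size], bool(used_default.value)
```

Calling `encode_acp('\u011f', no_best_fit=False)` and then with `no_best_fit=True` should show the difference described in the first bullet; the exact bytes depend on the machine's ANSI code page.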
|
msg107965 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2010-06-16 23:35 |
I committed the last patch to py3k: r82037. Let's see how the buildbots react :-)
|
msg108018 - (view) |
Author: Tim Golden (tim.golden) * |
Date: 2010-06-17 14:21 |
I'm unlikely to get to it soon. If there's no urgency I can
look at it later. FWIW, it's not something I'm especially
familiar with.
|
msg108149 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2010-06-18 23:24 |
Closing this issue: nothing special on the buildbots.
|
|
Date | User | Action | Args
2022-04-11 14:56:01 | admin | set | github: 39624
2010-06-18 23:24:39 | vstinner | set | status: open -> closed; resolution: fixed; messages: + msg108149
2010-06-17 14:21:56 | tim.golden | set | messages: + msg108018
2010-06-16 23:35:09 | vstinner | set | messages: + msg107965
2010-06-16 22:21:04 | vstinner | set | files: + mbcs_errors-py3k-4.patch; messages: + msg107957
2010-06-12 00:07:08 | vstinner | set | files: - mbcs_errors-py3k-2.patch
2010-06-12 00:07:02 | vstinner | set | files: - mbcs_errors-py3k.patch
2010-06-12 00:06:53 | vstinner | set | files: + mbcs_errors-py3k-3.patch; messages: + msg107612
2010-06-12 00:01:58 | vstinner | set | nosy: + tim.golden; messages: + msg107611
2010-06-11 01:28:51 | vstinner | set | files: + mbcs_errors-py3k-2.patch; messages: + msg107517
2010-06-11 00:54:17 | vstinner | set | files: + translate_reason_unicode.patch; messages: + msg107513
2010-05-24 23:11:38 | vstinner | set | files: + mbcs_errors-py3k.patch; messages: + msg106407
2010-05-22 01:38:50 | vstinner | set | messages: + msg106278; versions: - Python 2.7
2010-05-22 01:36:15 | vstinner | set | messages: + msg106277
2010-05-22 01:01:38 | vstinner | set | nosy: + vstinner
2010-02-05 16:57:18 | ezio.melotti | set | nosy: + ezio.melotti; versions: + Python 2.7, Python 3.2
2009-02-14 22:40:35 | mhammond | set | messages: + msg82133
2009-02-14 12:14:14 | theller | set | nosy: - theller
2009-02-14 11:35:45 | ajaksu2 | set | nosy: + ajaksu2; messages: + msg82015; components: + Unicode; keywords: + patch; type: enhancement; stage: test needed
2003-11-29 01:24:21 | mhammond | create |