Message31810
Using U+FEFF as ZWNBSP, long a matter of debate, is deprecated nowadays (http://www.unicode.org/unicode/faq/utf_bom.html#24 suggests "U+2060 WORD JOINER" instead of ZWNBSP). However, I can understand that backwards compatibility is always a valid concern, and that is why StreamReader seems reluctant to change.
In practice, a ZWNBSP inside a file is rarely intended (see also the topic "Q: What should I do with U+FEFF in the middle of a file?" at the same URL). IMHO, it is most likely caused by appending to the file several times, or something similar. At the very least, the asymmetric "what you write is NOT what you read" effect between "codecs.open(filename, 'a', 'UTF-16')" and "codecs.open(filename, 'r', 'UTF-16')" is not elegant.
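For anyone who wants to reproduce the asymmetry, here is a minimal sketch. It assumes Python 3 and uses codecs.getwriter() / codecs.getreader() over binary files as stand-ins for codecs.open(); the helper names append_utf16 and read_utf16 are made up for illustration.

```python
import codecs
import os
import tempfile


def append_utf16(path, text):
    # Hypothetical helper. Every call builds a fresh UTF-16 StreamWriter,
    # and a fresh writer emits its own BOM before the first character.
    with open(path, "ab") as raw:
        codecs.getwriter("utf-16")(raw).write(text)


def read_utf16(path):
    # Hypothetical helper. The reader consumes only the leading BOM;
    # any later U+FEFF is returned as an ordinary character.
    with open(path, "rb") as raw:
        return codecs.getreader("utf-16")(raw).read()


fd, path = tempfile.mkstemp()
os.close(fd)
try:
    append_utf16(path, "abc")
    append_utf16(path, "def")
    result = read_utf16(path)
finally:
    os.remove(path)

print(repr(result))  # 'abc\ufeffdef': the second BOM survives as U+FEFF
```

Two appends of plain text come back with a stray U+FEFF between them, which is the "what you write is NOT what you read" effect.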
To address this asymmetry, I finally came up with a wrapper function for codecs.open() which solves (or, you may say, "bypasses") the problem well in my case. I'll post the code as an attachment.
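The attachment itself is not reproduced in this message. As a rough idea of what such a wrapper could look like (this is only a sketch and not necessarily what _codecs.py does), one option is to write a BOM only when the file is new or empty, and otherwise append with the BOM-less codec that matches the BOM already in the file. The name open_utf16_append and the native-byte-order fallback are assumptions.

```python
import codecs
import os
import sys
import tempfile


def open_utf16_append(path):
    """Hypothetical helper: append UTF-16 text without writing a second BOM."""
    encoding = "utf-16"  # new or empty file: let the codec write the BOM
    if os.path.exists(path) and os.path.getsize(path) > 0:
        with open(path, "rb") as raw:
            head = raw.read(2)
        if head == codecs.BOM_UTF16_LE:
            encoding = "utf-16-le"
        elif head == codecs.BOM_UTF16_BE:
            encoding = "utf-16-be"
        else:
            # Assumption: existing BOM-less content is in native byte order.
            encoding = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"
    return codecs.getwriter(encoding)(open(path, "ab"))


fd, path = tempfile.mkstemp()
os.close(fd)
try:
    for chunk in ("abc", "def"):
        # StreamWriter is a context manager; leaving the block closes the file.
        with open_utf16_append(path) as writer:
            writer.write(chunk)
    with open(path, "rb") as raw:
        round_trip = raw.read().decode("utf-16")
finally:
    os.remove(path)

print(repr(round_trip))  # 'abcdef'
```

With this approach the file carries exactly one BOM at the start, so reading it back with the plain "UTF-16" codec returns what was written.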
BTW, even the official Python 2.4 documentation, chapter "7.3.2.1 Built-in Codecs", says that the function:
PyObject* PyUnicode_DecodeUTF16(const char *s, int size, const char *errors, int *byteorder)
"switches according to all byte order marks (BOM) it finds in the input data. BOMs are not copied into the resulting Unicode string". I don't know whether this is the BOM-less decoder we have been talking about for so long. //shrug
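To make that quoted sentence concrete, below is a pure-Python sketch of a decoder that behaves the way the sentence reads: it switches byte order at every BOM found on a 2-byte boundary and drops the BOMs from the result. This is not CPython's implementation and I am not claiming PyUnicode_DecodeUTF16 really works like this; the function name decode_utf16_all_boms and the choice of native byte order before the first BOM are assumptions.

```python
import sys

BOM_LE = b"\xff\xfe"
BOM_BE = b"\xfe\xff"


def decode_utf16_all_boms(data):
    """Hypothetical decoder: honour and strip every aligned BOM in data."""
    # Assumption: native byte order applies until the first BOM is seen.
    codec = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"
    pieces = []
    start = 0
    for i in range(0, len(data) - 1, 2):
        unit = data[i:i + 2]
        if unit in (BOM_LE, BOM_BE):
            # Decode what came before the BOM with the current byte order,
            # then switch to the order the BOM announces and skip the BOM.
            pieces.append(data[start:i].decode(codec))
            codec = "utf-16-le" if unit == BOM_LE else "utf-16-be"
            start = i + 2
    pieces.append(data[start:].decode(codec))
    return "".join(pieces)


# Two appended chunks, each with its own BOM and a different byte order.
mixed = BOM_LE + "abc".encode("utf-16-le") + BOM_BE + "def".encode("utf-16-be")
decoded = decode_utf16_all_boms(mixed)
print(repr(decoded))  # 'abcdef'
```

Note that this treats any aligned FF FE or FE FF pair as a BOM, so a deliberately written U+FEFF (or the noncharacter U+FFFE) is dropped as well.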
I hope the information above can serve as a kind of recipe for anyone who runs into the same problem. That's it. Thanks for your patience.
Best regards,
Iceberg
File Added: _codecs.py

| Date | User | Action | Args |
| 2007-08-23 14:53:09 | admin | link | issue1701389 messages |
| 2007-08-23 14:53:09 | admin | create | |