Issue 534304

This issue tracker has been migrated to GitHub, and is currently read-only. For more information, see the GitHub FAQs in Python's Developer Guide.
Created on 2002-03-24 13:52 by suzuki_hisao, last changed 2022-04-10 16:05 by admin. This issue is now closed.
Files

File name | Uploaded | Description
---|---|---
pep0263-2.2.1c2-03.tar.bz2 | suzuki_hisao, 2002-03-31 16:16 | for Python 2.2.1c2
pep263.diff | loewis, 2002-05-09 13:42 | for CVS of 20020509
Messages (11)
msg39337 - Author: SUZUKI Hisao (suzuki_hisao) - Date: 2002-03-24 13:52

This is a sample implementation of PEP 263 phase 2. This implementation behaves just as normal Python does if no coding hints are given, so it does not hurt anyone who uses Python now. Note that it is strictly compatible with the PEP, in that every program valid under the PEP is also valid under this implementation.

This implementation also accepts files in UTF-16 with a BOM; they are read as UTF-8 internally. Please try the included "utf16sample.py".
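The phase-2 behaviour described above survives in modern Python, where an encoding declaration in the first or second line controls how the rest of the source is decoded. A minimal sketch (the latin-1 sample is illustrative, not taken from the patch):

```python
# When source is handed to compile() as bytes, CPython applies the
# PEP 263 rules: the declared coding governs how the bytes are decoded.
src = b"# -*- coding: latin-1 -*-\ns = '\xe9'\n"  # 0xE9 is e-acute in latin-1

ns = {}
exec(compile(src, "<sample>", "exec"), ns)
print(ns["s"])  # the latin-1 byte was decoded to the character U+00E9
```

Changing the declaration (or omitting it) changes how the same byte is interpreted, which is exactly the point of the PEP.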
msg39338 - Author: Martin v. Löwis (loewis) - Date: 2002-03-25 13:23

The patch looks good, but needs a number of improvements.

1. I have problems building this code. When trying to build pgen, I get an error message of:

       Parser/parsetok.c: In function `parsetok':
       Parser/parsetok.c:175: `encoding_decl' undeclared

   The problem here is that graminit.h hasn't been built yet, but parsetok refers to the symbol.

2. For some reason, error printing for incorrect encodings does not work - it appears that it prints the wrong line in the traceback.

3. The escape processing in Unicode literals is incorrect. For example, u"\<non-ascii character>" should denote only the non-ASCII character. However, your implementation replaces the non-ASCII character with \u<hex>, resulting in \\u<hex>, so the first backslash unescapes the second one.

4. I believe the escape processing in byte strings is also incorrect for encodings that allow \ in the second byte. Before processing escape characters, you convert back into the source encoding. If this produces a backslash character, escape processing will misinterpret that byte as an escape character.
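Point 3 can be reproduced in miniature. The `naive_transform` below is a hypothetical stand-in for the substitution the review describes, not code from the patch: rewriting a non-ASCII character as its `\uXXXX` escape text *before* escape processing runs lets a preceding backslash swallow the inserted one.

```python
def naive_transform(src: str) -> str:
    # Hypothetical stand-in: rewrite every non-ASCII character as its
    # \uXXXX escape text before the literal's escapes are processed.
    return "".join(c if ord(c) < 128 else "\\u%04x" % ord(c) for c in src)

literal_body = "\\\u00e9"                  # source text: a backslash, then e-acute
rewritten = naive_transform(literal_body)  # backslash + \u00e9 -> doubled backslash

# Escape processing now sees \\ first and collapses it, leaving the six
# literal characters \u00e9 instead of the intended character.
result = rewritten.encode("ascii").decode("unicode_escape")
print(result)  # the text "\u00e9", with no e-acute anywhere in it
```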
msg39339 - Author: Michael Hudson (mwh) - Date: 2002-03-30 11:27

Not going into 2.2.x.
msg39340 - Author: SUZUKI Hisao (suzuki_hisao) - Date: 2002-03-31 16:16

Thank you for your review. Now 1. and 3. are fixed, and 2. is improved. (4. is not true.)
msg39341 - Author: Guido van Rossum (gvanrossum) - Date: 2002-04-23 21:26

I haven't looked at this very carefully, but it looks like it's well thought-out. Suzuki, can you prepare a patch relative to current CVS? I get several patch failures now. (Fortunately I have a checkout of 2.2, so I can still review and test the patch.) I don't know what the patch failures are about (I haven't investigated), but I imagine they might have to do with the PEP 278 (universal newlines) changes checked in by Jack Jansen, which replace the tokenizer's fgets() calls with calls to Py_UniversalNewlineFgets().

Also, I can't read the README file (it's in Japanese :-). What is the expected output from the samples? For me, sjis_sample.py gives:

    SyntaxError: 'unknown encoding'

Martin, I'm unclear on how you intend to use this code. Do you intend to go straight to phase 2 of the PEP using this patch, or do you intend to implement phase 1 of the PEP by modifying this code? Also, does the PEP describe the UTF-16 support as implemented by Suzuki's patch?
msg39342 - Author: Martin v. Löwis (loewis) - Date: 2002-04-26 19:41

I've updated the PEP to describe how this approach should be used: Python 2.3 should still generate warnings only for using non-ASCII without a declared encoding. I, too, hope that Mr Suzuki will update the patch to match the PEP, and for the CVS tree.

As for supporting UTF-16: the stream reader currently has the .readline method disabled, since it won't work reliably for little-endian files. So I think this should be an undocumented feature at the moment; I see no other technical problems with the approach taken in the patch.
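For reference, the kind of BOM sniffing under discussion can be sketched with the standard codecs module. The function name is made up for illustration; the actual patch did this inside the C tokenizer:

```python
import codecs

def sniff_bom(raw: bytes):
    """Guess a source encoding from a leading byte-order mark.

    Returns (encoding, bom_length), or (None, 0) if no BOM is present.
    """
    for bom, enc in ((codecs.BOM_UTF8, "utf-8"),
                     (codecs.BOM_UTF16_BE, "utf-16-be"),
                     (codecs.BOM_UTF16_LE, "utf-16-le")):
        if raw.startswith(bom):
            return enc, len(bom)
    return None, 0

enc, n = sniff_bom(codecs.BOM_UTF16_LE + "print(1)".encode("utf-16-le"))
# enc is "utf-16-le"; decoding raw[n:] with enc recovers the source text
```

The little-endian readline problem Martin mentions does not arise here because the whole buffer is decoded at once rather than line by line.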
msg39343 - Author: Martin v. Löwis (loewis) - Date: 2002-05-09 13:42

I have now updated this patch to the current CVS, and to be a complete PEP 263 implementation; it will issue warnings when it finds non-ASCII characters but no encoding declaration.
msg39344 - Author: Neal Norwitz (nnorwitz) - Date: 2002-07-16 18:50

I reviewed the patch. I don't like the usage of `enc` (and `str` to a lesser extent). In particular, there is an `encoding` field which is generally used, while `enc` is used as a temporary from the callback. I don't have a solution, so perhaps it would be best to document the purpose, usage, and interaction of `enc` & `str`.

There are some differences between the standard formatting and that used in the patch (`return` on the same line as `if`, among others), but these aren't too bad, although I don't love the line `do t++; while (...);`. I didn't see any problems with the patch.
msg39345 - Author: Martin v. Löwis (loewis) - Date: 2002-08-04 17:42

I have now implemented Neal's suggestions: I documented the processing of encodings, and fixed a number of formatting problems. I have disabled the detection of UTF-16 BOMs, since they are not backed by the PEP.

I have committed the changes as:

    Makefile.pre.in 1.93
    ref2.tex 1.38
    Grammar 1.48
    errcode.h 2.15
    graminit.h 2.20
    NEWS 1.451
    parsetok.c 2.33
    tokenizer.c 2.55
    tokenizer.h 2.17
    tokenizer_pgen.c 2.1
    compile.c 2.250
    graminit.c 2.34
    pythonrun.c 2.165

The change to bltinmodule.c was there by mistake, so I have removed it. SUZUKI Hisao, how would you like to be listed in Misc/ACKS?
msg39346 - Author: Jean Jordaan (neaj) - Date: 2003-08-06 11:35

This is a very sensible enhancement to Python. I would just like to ask a niggling question. In the PEP, and below, everyone always talks of "encoding"; that is also the terminology I'm familiar with from all over. So why on earth is the magic comment string just "coding", and not "encoding"? Perhaps the recognizing regexp could match both, preferring "encoding" and deprecating "coding"? My apologies if I'm missing something or repeating someone else.
msg39347 - Author: Michael Hudson (mwh) - Date: 2003-08-06 11:39

Compatibility with Emacs.
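The Emacs connection explains the spelling: Emacs file-variable lines use `-*- coding: ... -*-`, and the pattern PEP 263 specifies for the first two lines of a file (essentially `coding[=:]\s*([-\w.]+)`) only requires the word "coding", so Vim's `fileencoding=` modelines happen to match as well. A quick check:

```python
import re

# Essentially the recognizing pattern given in PEP 263.
CODING_RE = re.compile(r"coding[=:]\s*([-\w.]+)")

print(CODING_RE.search("# -*- coding: utf-8 -*-").group(1))            # utf-8
print(CODING_RE.search("# vim: set fileencoding=latin-1 :").group(1))  # latin-1
```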
History

Date | User | Action | Args
---|---|---|---
2022-04-10 16:05:08 | admin | set | github: 36321
2002-03-24 13:52:25 | suzuki_hisao | create |