Message120982
haypo> See also #2382: I wrote patches two years ago for this issue.
Yes, this is the same issue. I don't want to close this as a duplicate because #2382 contains a much more ambitious set of patches. What I am trying to achieve here is similar to the adjust_offset.patch there.
I am attaching a patch that takes an alternative approach and computes the number of characters in the parser. I strongly believe that the buffer in the tokenizer always contains UTF-8-encoded text. If that is not already the case, I would consider making it so by replacing the call to _PyUnicode_AsDefaultEncodedString() with a call to PyUnicode_AsUTF8String(), if that matters.
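For illustration, here is a minimal sketch of how such a character count could be derived from a UTF-8 buffer; the helper name is hypothetical and not part of the attached patch. It relies on the property that in valid UTF-8 every byte except a continuation byte (top bits 10) starts a code point:

#include <stddef.h>

/* Hypothetical sketch: count code points in a valid UTF-8 buffer by
 * counting the bytes that start a code point, i.e. every byte except
 * continuation bytes of the form 0b10xxxxxx. */
static size_t
utf8_char_count(const char *buf, size_t nbytes)
{
    size_t chars = 0;
    for (size_t i = 0; i < nbytes; i++) {
        if (((unsigned char)buf[i] & 0xC0) != 0x80)
            chars++;
    }
    return chars;
}

Converting a byte offset into a character offset would then amount to calling utf8_char_count(line_start, byte_offset).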
The patch still needs unit tests and may have some off-by-one issues, but I would first like to reach agreement that this is the right level at which to fix the problem.