Message 129685 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	lemburg
Recipients	belopolsky, dangra, ezio.melotti, lemburg, pitrou, sjmachin, vstinner
Date	2011-02-28.09:18:06
SpamBayes Score	5.551115e-17
Marked as misclassified	No
Message-id	<4D6B684D.3090804@egenix.com>
In-reply-to	<1298833138.36.0.0123208666019.issue8271@psf.upfronthosting.co.za>

Content
Ezio Melotti wrote: > > Ezio Melotti <ezio.melotti@gmail.com> added the comment: > > The patch turned out to be less trivial than I initially thought. > > The current algorithm checks for invalid continuation bytes in 4 places: > 1) before the switch/case statement in Objects/unicodeobject.c when it checks if there are enough bytes in the string (e.g. if the current byte is a start byte of a 4-bytes sequence and there are only 2 more bytes in the string, the sequence is invalid); > 2) in the "case 2" of the switch, where it's checked if the second byte is a valid continuation byte; > 3) in the "case 3" of the switch, where it's checked if the second and third bytes are valid continuation bytes, including additional invalid cases for the second bytes; > 3) in the "case 4" of the switch, where it's checked if the second, third, and fourth bytes are valid continuation bytes, including additional invalid cases for the second bytes; > > The correct algorithm should determine the maximum valid subpart of the sequence determining the position of the first invalid continuation byte. Continuation bytes are all in range 80..BF except for the second byte of 3-bytes sequences that start with E0 or ED and second byte of 4-bytes sequences that start with F0 or F4 (3rd and 4th continuation bytes are always in range 80..BF). > This means that the above 4 cases should be changed in this way: > 1) if there aren't enough bytes left to complete the sequence check for valid continuation bytes considering the special cases for the second bytes (E0, ED, F0, F4) instead of using the naive algorithm that checks only for continuation bytes in range 80..BF; > 2) the "case 2" is fine as is, because the second byte is always in range 80..BF; > 3) the "case 3" should check (pseudocode): > if (second_byte_is_not_valid) max_subpart_len = 1 > else if (third_byte not in 80..BF) max_subpart_len = 2 > else # the sequence is valid > the "second_byte_is_not_valid" part should consider the two special cases for E0 and ED. > 4) the "case 4" should check (pseudocode): > if (second_byte_is_not_valid) max_subpart_len = 1 > else if (third_byte not in 80..BF) max_subpart_len = 2 > else if (fourth_byte not in 80..BF) max_subpart_len = 3 > else # the sequence is valid > here the "second_byte_is_not_valid" part should consider the two special cases for F0 and E4. > > In order to avoid duplication of code I was thinking to add 2 macros (something like IS_VALID_3_SEQ_2ND_BYTE, IS_VALID_4_SEQ_2ND_BYTE) that will be used in the cases 1) and 3), and 1) and 4) respectively. > The change shouldn't affect the decoding speed, but it will increase the lines of code and complexity of the function. > Is this OK? Sure. It would be great if you could time the difference in performance. In tight loops like the codec ones, small changes in the way you write things can often make a big difference. Please also include a Misc/NEWS entry to point to the change and possibly consequences for existing code relying on the previous behavior. Thanks, -- Marc-Andre Lemburg eGenix.com ________________________________________________________________________ ::: Try our new mxODBC.Connect Python Database Interface for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/

Ezio Melotti wrote:
> 
> Ezio Melotti <ezio.melotti@gmail.com> added the comment:
> 
> The patch turned out to be less trivial than I initially thought.
> 
> The current algorithm checks for invalid continuation bytes in 4 places:
> 1) before the switch/case statement in Objects/unicodeobject.c when it checks if there are enough bytes in the string (e.g. if the current byte is a start byte of a 4-bytes sequence and there are only 2 more bytes in the string, the sequence is invalid);
> 2) in the "case 2" of the switch, where it's checked if the second byte is a valid continuation byte;
> 3) in the "case 3" of the switch, where it's checked if the second and third bytes are valid continuation bytes, including additional invalid cases for the second bytes;
> 3) in the "case 4" of the switch, where it's checked if the second, third, and fourth bytes are valid continuation bytes, including additional invalid cases for the second bytes;
> 
> The correct algorithm should determine the maximum valid subpart of the sequence determining the position of the first invalid continuation byte. Continuation bytes are all in range 80..BF except for the second  byte of 3-bytes sequences that start with E0 or ED and second byte of 4-bytes sequences that start with F0 or F4 (3rd and 4th continuation bytes are always in range 80..BF).
> This means that the above 4 cases should be changed in this way:
> 1) if there aren't enough bytes left to complete the sequence check for valid continuation bytes considering the special cases for the second bytes (E0, ED, F0, F4) instead of using the naive algorithm that checks only for continuation bytes in range 80..BF;
> 2) the "case 2" is fine as is, because the second byte is always in range 80..BF;
> 3) the "case 3" should check (pseudocode):
>   if (second_byte_is_not_valid) max_subpart_len = 1
>   else if (third_byte not in 80..BF) max_subpart_len = 2
>   else  # the sequence is valid
> the "second_byte_is_not_valid" part should consider the two special cases for E0 and ED.
> 4) the "case 4" should check (pseudocode):
>   if (second_byte_is_not_valid) max_subpart_len = 1
>   else if (third_byte not in 80..BF) max_subpart_len = 2
>   else if (fourth_byte not in 80..BF) max_subpart_len = 3
>   else  # the sequence is valid
> here the "second_byte_is_not_valid" part should consider the two special cases for F0 and E4.
> 
> In order to avoid duplication of code I was thinking to add 2 macros (something like IS_VALID_3_SEQ_2ND_BYTE, IS_VALID_4_SEQ_2ND_BYTE) that will be used in the cases 1) and 3), and 1) and 4) respectively.
> The change shouldn't affect the decoding speed, but it will increase the lines of code and complexity of the function.
> Is this OK?

Sure.

It would be great if you could time the difference in
performance. In tight loops like the codec ones, small changes
in the way you write things can often make a big difference.

Please also include a Misc/NEWS entry to point to the change
and possibly consequences for existing code relying on the
previous behavior.

Thanks,
-- 
Marc-Andre Lemburg
eGenix.com

________________________________________________________________________

::: Try our new mxODBC.Connect Python Database Interface for free ! ::::

   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
    D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
           Registered at Amtsgericht Duesseldorf: HRB 46611
               http://www.egenix.com/company/contact/

History
Date	User	Action	Args
2011-02-28 09:18:08	lemburg	set	recipients: + lemburg, sjmachin, belopolsky, pitrou, vstinner, ezio.melotti, dangra
2011-02-28 09:18:07	lemburg	link	issue8271 messages
2011-02-28 09:18:06	lemburg	create