
Author ezio.melotti
Recipients belopolsky, dangra, ezio.melotti, lemburg, pitrou, sjmachin, vstinner
Date 2011-02-27.18:58:55
Message-id <1298833138.36.0.0123208666019.issue8271@psf.upfronthosting.co.za>
Content
The patch turned out to be less trivial than I initially thought.

The current algorithm checks for invalid continuation bytes in 4 places:
1) before the switch/case statement in Objects/unicodeobject.c, when it checks whether there are enough bytes left in the string (e.g. if the current byte is the start byte of a 4-byte sequence and only 2 more bytes follow, the sequence is invalid);
2) in the "case 2" of the switch, where it checks that the second byte is a valid continuation byte;
3) in the "case 3" of the switch, where it checks that the second and third bytes are valid continuation bytes, including the additional invalid cases for the second byte;
4) in the "case 4" of the switch, where it checks that the second, third, and fourth bytes are valid continuation bytes, including the additional invalid cases for the second byte.

The correct algorithm should determine the maximal valid subpart of the sequence by finding the position of the first invalid continuation byte. Continuation bytes are all in range 80..BF, except for the second byte of 3-byte sequences starting with E0 (valid range A0..BF) or ED (valid range 80..9F), and the second byte of 4-byte sequences starting with F0 (valid range 90..BF) or F4 (valid range 80..8F); the 3rd and 4th continuation bytes are always in range 80..BF.
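As a sketch (a hypothetical helper, not the actual unicodeobject.c code), the second-byte restrictions follow the well-formed byte sequence table of the UTF-8 spec (RFC 3629) and could be written as:

```c
/* Hypothetical helper: is b2 a valid second byte for a sequence
 * starting with b1?  Ranges per RFC 3629 / the Unicode standard:
 *   E0 -> A0..BF  (excludes overlong 3-byte forms)
 *   ED -> 80..9F  (excludes surrogates D800..DFFF)
 *   F0 -> 90..BF  (excludes overlong 4-byte forms)
 *   F4 -> 80..8F  (excludes code points above U+10FFFF)
 * all other start bytes -> plain continuation range 80..BF */
int
is_valid_second_byte(unsigned char b1, unsigned char b2)
{
    switch (b1) {
    case 0xE0: return 0xA0 <= b2 && b2 <= 0xBF;
    case 0xED: return 0x80 <= b2 && b2 <= 0x9F;
    case 0xF0: return 0x90 <= b2 && b2 <= 0xBF;
    case 0xF4: return 0x80 <= b2 && b2 <= 0x8F;
    default:   return 0x80 <= b2 && b2 <= 0xBF;
    }
}
```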
This means that the above 4 cases should change as follows:
1) if there aren't enough bytes left to complete the sequence, check for valid continuation bytes considering the special cases for the second byte (E0, ED, F0, F4), instead of using the naive check that only accepts continuation bytes in range 80..BF;
2) the "case 2" is fine as is, because the second byte is always in range 80..BF;
3) the "case 3" should check (pseudocode):
  if (second_byte_is_not_valid) max_subpart_len = 1
  else if (third_byte not in 80..BF) max_subpart_len = 2
  else  # the sequence is valid
the "second_byte_is_not_valid" part should consider the two special cases for E0 and ED.
4) the "case 4" should check (pseudocode):
  if (second_byte_is_not_valid) max_subpart_len = 1
  else if (third_byte not in 80..BF) max_subpart_len = 2
  else if (fourth_byte not in 80..BF) max_subpart_len = 3
  else  # the sequence is valid
here the "second_byte_is_not_valid" check should consider the two special cases for F0 and F4.
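Putting cases 1), 3) and 4) together, the maximal-subpart computation could be sketched like this (hypothetical function, not the actual patch; `seq_len` is the length announced by the start byte, `avail` the bytes actually available):

```c
/* Sketch: return the length of the maximal valid subpart of the
 * sequence starting at s[0] (1, 2 or 3); returning seq_len means
 * the whole sequence is valid and complete. */
int
max_valid_subpart(const unsigned char *s, int seq_len, int avail)
{
    int second_ok;
    if (avail < 2)
        return 1;
    /* second byte: apply the special ranges for E0/ED/F0/F4 */
    switch (s[0]) {
    case 0xE0: second_ok = 0xA0 <= s[1] && s[1] <= 0xBF; break;
    case 0xED: second_ok = 0x80 <= s[1] && s[1] <= 0x9F; break;
    case 0xF0: second_ok = 0x90 <= s[1] && s[1] <= 0xBF; break;
    case 0xF4: second_ok = 0x80 <= s[1] && s[1] <= 0x8F; break;
    default:   second_ok = 0x80 <= s[1] && s[1] <= 0xBF; break;
    }
    if (!second_ok)
        return 1;
    if (avail < 3 || !(0x80 <= s[2] && s[2] <= 0xBF))
        return 2;
    if (seq_len == 3)
        return 3;   /* valid 3-byte sequence */
    if (avail < 4 || !(0x80 <= s[3] && s[3] <= 0xBF))
        return 3;
    return 4;       /* valid 4-byte sequence */
}
```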

In order to avoid code duplication, I was thinking of adding 2 macros (something like IS_VALID_3_SEQ_2ND_BYTE and IS_VALID_4_SEQ_2ND_BYTE) to be used in cases 1) and 3), and 1) and 4) respectively.
The change shouldn't affect the decoding speed, but it will increase the size and complexity of the function.
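The two macros might look like this (names taken from the message; the bodies are my assumption of the intended ranges, per the UTF-8 spec):

```c
/* Valid second byte of a 3-byte sequence whose start byte is ch,
 * excluding overlong forms (E0) and surrogates (ED). */
#define IS_VALID_3_SEQ_2ND_BYTE(ch, b) \
    ((ch) == 0xE0 ? (0xA0 <= (b) && (b) <= 0xBF) : \
     (ch) == 0xED ? (0x80 <= (b) && (b) <= 0x9F) : \
                    (0x80 <= (b) && (b) <= 0xBF))

/* Valid second byte of a 4-byte sequence whose start byte is ch,
 * excluding overlong forms (F0) and code points > U+10FFFF (F4). */
#define IS_VALID_4_SEQ_2ND_BYTE(ch, b) \
    ((ch) == 0xF0 ? (0x90 <= (b) && (b) <= 0xBF) : \
     (ch) == 0xF4 ? (0x80 <= (b) && (b) <= 0x8F) : \
                    (0x80 <= (b) && (b) <= 0xBF))
```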
Is this OK?