Message 376639 - Python tracker



Author	vstinner
Recipients	nowasky.jr, vstinner
Date	2020-09-09.14:12:10
Message-id	<1599660730.93.0.37644546306.issue41748@roundup.psfhosted.org>

Content
HTMLParser.check_for_whole_start_tag() uses the locatestarttagend_tolerant regular expression to find the end of the start tag. This regex cuts the string right after a comma (",") when the comma is directly followed by an attribute name, but not when the comma is followed by whitespace:

* '<div id="test" , color="blue">' => '<div id="test" , color="blue"': OK!
* '<div id="test" ,color="blue">' => '<div id="test" ,' => BUG

The regex is quite complex:

locatestarttagend_tolerant = re.compile(r"""
  <[a-zA-Z][^\t\n\r\f />\x00]*       # tag name
  (?:[\s/]*                          # optional whitespace before attribute name
    (?:(?<=['"\s/])[^\s/>][^\s/=>]*  # attribute name
      (?:\s*=+\s*                    # value indicator
        (?:'[^']*'                   # LITA-enclosed value
          |"[^"]*"                   # LIT-enclosed value
          |(?!['"])[^>\s]*           # bare value
         )
         (?:\s*,)*                   # possibly followed by a comma
       )?(?:\s|/(?!>))*
     )*
   )?
  \s*                                # trailing whitespace
""", re.VERBOSE)
endendtag = re.compile('>')

The problem is this part of the regex:

#(?:\s*,)*                   # possibly followed by a comma

It consumes the comma right after the attribute value, so the comma is not seen as part of the next attribute name.
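
The two cases can be reproduced with the regex alone, without going through HTMLParser. The following is a minimal sketch: the pattern text is copied from this message, while the names PATTERN, COMMA_CLAUSE, without_comma_clause and matched_prefix() are hypothetical helpers introduced only for this illustration. The variant without the comma clause is an experiment to show where the comma ends up, not a proposed patch.

```python
import re

# Pattern text copied verbatim from the message above.
PATTERN = r"""
  <[a-zA-Z][^\t\n\r\f />\x00]*       # tag name
  (?:[\s/]*                          # optional whitespace before attribute name
    (?:(?<=['"\s/])[^\s/>][^\s/=>]*  # attribute name
      (?:\s*=+\s*                    # value indicator
        (?:'[^']*'                   # LITA-enclosed value
          |"[^"]*"                   # LIT-enclosed value
          |(?!['"])[^>\s]*           # bare value
         )
         (?:\s*,)*                   # possibly followed by a comma
       )?(?:\s|/(?!>))*
     )*
   )?
  \s*                                # trailing whitespace
"""
locatestarttagend_tolerant = re.compile(PATTERN, re.VERBOSE)

# Hypothetical variant, for illustration only: the same pattern with the
# "possibly followed by a comma" clause removed.
COMMA_CLAUSE = r"(?:\s*,)*"
assert PATTERN.count(COMMA_CLAUSE) == 1
without_comma_clause = re.compile(PATTERN.replace(COMMA_CLAUSE, ""), re.VERBOSE)


def matched_prefix(regex, rawdata):
    """Return the part of rawdata that regex matches from position 0."""
    m = regex.match(rawdata)
    return m.group() if m else None


if __name__ == "__main__":
    for tag in ('<div id="test" , color="blue">',
                '<div id="test" ,color="blue">'):
        print(repr(tag))
        print("  original regex:      ", repr(matched_prefix(locatestarttagend_tolerant, tag)))
        print("  without comma clause:", repr(matched_prefix(without_comma_clause, tag)))
```

With the original regex, the second tag is matched only up to '<div id="test" ,': the comma clause consumes " ,", and the lookbehind (?<=['"\s/]) then rejects "color" as an attribute name because the preceding character is the comma. In the variant, the comma is left for the attribute-name part and both tags are matched up to the closing quote of "blue".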

History
Date	User	Action	Args
2020-09-09 14:12:10	vstinner	set	recipients: + vstinner, nowasky.jr
2020-09-09 14:12:10	vstinner	set	messageid: <1599660730.93.0.37644546306.issue41748@roundup.psfhosted.org>
2020-09-09 14:12:10	vstinner	link	issue41748 messages
2020-09-09 14:12:10	vstinner	create