Message 147763 - Python tracker


Author	kxroberto
Recipients	Neil Muller, ezio.melotti, fdrake, jjlee, kxroberto, orsenthil, r.david.murray, terry.reedy
Date	2011-11-16.10:16:50
Message-id	<1321438612.03.0.65576791437.issue1486713@psf.upfronthosting.co.za>

Content
The old patch already warned about the majority of real cases, except for missing whitespace between attributes.

"The tolerant regex will match both": 
locatestarttagend_tolerant: The main and frequent issue on the web here is missing whitespace between attributes (with enclosed values). And there is the newly tolerated comma between attributes, which however I have not seen anywhere so far (the old warning mechanism and attrfind.match would already have raised it as a "junk chars ..." event).
Both issues can easily be warned about (also/already) at almost no cost with the slightly extended regex below, by checking its 2 new capturing groups against None in check_for_whole_start_tag.
Or missing whitespace could be warned about (multiple times) at attrfind time.

attrfind_tolerant: I see no point in keeping the old/"strict" attrfind (my guess is that the difference affects 0.000% of real cases). attrfind_tolerant could become the only attrfind.
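The attrfind-time alternative could look roughly like the sketch below. It is only an illustration, not code from the patch: scan_attributes and tagname_re are hypothetical names, and the warning text is made up. It walks a start tag with attrfind_tolerant and records a warning whenever an attribute name begins exactly where the previous match ended, i.e. with no whitespace in front of it.

```python
import re

# Same attrfind_tolerant pattern as proposed in this message.
attrfind_tolerant = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[^>\s]*))?')

# Hypothetical helper pattern: matches "<" plus the tag name only.
tagname_re = re.compile(r'<[a-zA-Z][-.a-zA-Z0-9:_]*')


def scan_attributes(starttag):
    """Hypothetical helper: return (attrs, warnings) for one start tag.

    Attribute values are returned raw, i.e. still including their quotes.
    """
    attrs, warnings = [], []
    m = tagname_re.match(starttag)
    if m is None:
        return attrs, warnings
    pos = m.end()
    while pos < len(starttag):
        m = attrfind_tolerant.match(starttag, pos)
        if m is None:
            break  # reached ">" or something that is not an attribute
        if m.start(1) == pos:
            # The name starts right where the previous item ended.
            warnings.append(
                'whitespace missing before attribute %s' % m.group(1))
        attrs.append((m.group(1), m.group(3)))
        pos = m.end()
    return attrs, warnings


attrs, warnings = scan_attributes('<abc a="b"c="d" e=f>')
print(attrs)
print(warnings)
```

Because the check runs once per attribute, a tag with several glued attributes yields several warnings, which is the "multiple times" behaviour mentioned above. The sketch stops at a comma instead of skipping it, so it does not cover the comma case.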


--

import re

locatestarttagend_tolerant = re.compile(r"""
  <[a-zA-Z][-.a-zA-Z0-9:_]*          # tag name
  (?:(?:\s+|(\s*))                   # optional whitespace before attribute name
    (?:[a-zA-Z_][-.:a-zA-Z0-9_]*     # attribute name
      (?:\s*=\s*                     # value indicator
        (?:'[^']*'                   # LITA-enclosed value
          |\"[^\"]*\"                # LIT-enclosed value
          |[^'\">\s]+                # bare value
         )
         (?:\s*(,))*                   # possibly followed by a comma
       )?
     )
   )*
  \s*                                # trailing whitespace
""", re.VERBOSE)
attrfind_tolerant = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[^>\s]*))?')


#s='<abc a="b,+"c="d"e=f>text'
#s='<abc a="b,+" c="d"e=f>text'
s='<abc a="b,+",c="d" e=f>text'

m = locatestarttagend_tolerant.search(s)
print(m.group())    # <abc a="b,+",c="d" e=f
print(m.groups())   # ('', ',')
#if m.group(1) is not None: self.warning('space missing ...
#if m.group(2) is not None: self.warning('comma between attr...

m = attrfind_tolerant.search(s, 5)
print(m.group())    # a="b,+"
print(m.groups())   # ('a', '="b,+"', '"b,+"')
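
The two commented-out checks above could be wrapped up roughly as follows. This is a standalone sketch under assumptions, not the actual check_for_whole_start_tag code: start_tag_warnings is a hypothetical name, the warning texts are invented, and warnings are returned as a list instead of going through self.warning. The regex is repeated only so that the snippet runs on its own.

```python
import re

# Same pattern as above. Group 1 is set (to '') when the whitespace before
# an attribute name is missing; group 2 is set when a comma follows a value.
locatestarttagend_tolerant = re.compile(r"""
  <[a-zA-Z][-.a-zA-Z0-9:_]*          # tag name
  (?:(?:\s+|(\s*))                   # whitespace before attribute name
    (?:[a-zA-Z_][-.:a-zA-Z0-9_]*     # attribute name
      (?:\s*=\s*                     # value indicator
        (?:'[^']*'                   # LITA-enclosed value
          |\"[^\"]*\"                # LIT-enclosed value
          |[^'\">\s]+                # bare value
         )
         (?:\s*(,))*                 # possibly followed by a comma
       )?
     )
   )*
  \s*                                # trailing whitespace
""", re.VERBOSE)


def start_tag_warnings(rawdata, i=0):
    """Hypothetical helper: warnings for the start tag beginning at rawdata[i]."""
    m = locatestarttagend_tolerant.match(rawdata, i)
    if m is None:
        return []
    warnings = []
    if m.group(1) is not None:
        warnings.append('whitespace missing between attributes')
    if m.group(2) is not None:
        warnings.append('comma between attributes')
    return warnings


for tag in ('<abc a="b" c="d" e=f>text',
            '<abc a="b"c="d">text',
            '<abc a="b", c="d">text'):
    print(tag, start_tag_warnings(tag))
```

Note that a comma with no whitespace after it (as in the test string s above) triggers both warnings, since the attribute after the comma also lacks leading whitespace.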

History
Date	User	Action	Args
2011-11-16 10:16:52	kxroberto	set	recipients: + kxroberto, fdrake, terry.reedy, jjlee, orsenthil, ezio.melotti, Neil Muller, r.david.murray
2011-11-16 10:16:52	kxroberto	set	messageid: <1321438612.03.0.65576791437.issue1486713@psf.upfronthosting.co.za>
2011-11-16 10:16:51	kxroberto	link	issue1486713 messages
2011-11-16 10:16:50	kxroberto	create