
Author markm
Recipients cdavid, loewis, markm
Date 2011-03-26.12:24:18
Message-id <1301142259.03.0.117975383386.issue2694@psf.upfronthosting.co.za>
In-reply-to
Content
How about the following patch and tests...

Per: http://msdn.microsoft.com/en-us/library/aa369212(v=vs.85).aspx
"""The Identifier data type is a text string. Identifiers may contain the
ASCII characters A-Z (a-z), digits, underscores (_), or periods (.). However, every identifier must begin with either a letter or an underscore."""

So according to the spec, colons are NOT allowed. Editing some entries in the File table of an MSI (using Orca from the MSI SDK) and running the validation confirms that.

All the following were flagged as errors:
'KDiff3EXE;"ASDF@#$', 'chmFile-', 'pdfFile(', 'hgbook]', 'TortoisePlinkEXE]', 'Hg.Cämd'
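For illustration, the rule quoted from the spec can be written as a small check. This is only a sketch: the function name `is_valid_identifier` and the pattern are my own reading of the quoted text, not something taken from msilib or from the patch.

```python
import re

# Assumption: this pattern is one reading of the quoted rule, i.e. the first
# character is an ASCII letter or underscore, and the rest are ASCII letters,
# digits, underscores, or periods.
_IDENTIFIER_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_.]*")


def is_valid_identifier(name):
    """Return True if name matches the Identifier rule quoted above."""
    return _IDENTIFIER_RE.fullmatch(name) is not None


# The entries that Orca's validation flagged as errors.
flagged = ['KDiff3EXE;"ASDF@#$', 'chmFile-', 'pdfFile(', 'hgbook]',
           'TortoisePlinkEXE]', 'Hg.Cämd']

for name in flagged:
    print(repr(name), is_valid_identifier(name))
```

Under this reading, every flagged entry is rejected, and so is any name containing a colon.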

I also did some speed testing (just in case either the regex or the non-regex approach turned out to be slow):
Python 3.2 (r32:88445, Feb 20 2011, 21:29:02) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from timeit import timeit
>>> setup = 'import string\nidentifier_chars = string.ascii_letters + string.digits + "._"\ntmp_str = []'
>>> timeit("re.sub(r'[^a-zA-Z_\.]', '_', 'somefilename.txt')", setup = "import re")
4.434621757767205
>>> setup = 'import string\nidentifier_chars = string.ascii_letters + string.digits + "._"\ntmp_str = []'
>>> timeit('"".join([c if c in identifier_chars else "_" for c in "somefilename.txt"])', setup)
3.3757537425069906
>>>
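As a rough sketch of how the two approaches timed above could be wrapped up as functions: the names `sanitize_join`, `sanitize_regex` and `make_identifier` are hypothetical, and this is not necessarily what the patch does. Note that the regex here includes digits so that both variants agree, whereas the pattern in the timing run above leaves them out. The leading-character fix-up is my assumption, based on the spec's requirement that an identifier begin with a letter or underscore.

```python
import re
import string

# Characters the quoted spec allows anywhere in an identifier.
IDENTIFIER_CHARS = string.ascii_letters + string.digits + "._"


def sanitize_join(name):
    """Replace every disallowed character with '_' (non-regex variant)."""
    return "".join(c if c in IDENTIFIER_CHARS else "_" for c in name)


def sanitize_regex(name):
    """Same replacement using a regex; digits are included in the class."""
    return re.sub(r"[^A-Za-z0-9_.]", "_", name)


def make_identifier(name):
    """Hypothetical helper: sanitize, then make sure the first character
    is a letter or underscore by prefixing '_' where needed."""
    cleaned = sanitize_join(name)
    if not cleaned or cleaned[0] in string.digits + ".":
        cleaned = "_" + cleaned
    return cleaned


for sample in ["somefilename.txt", "a:b-c", "7zip.exe", "Hg.Cämd"]:
    print(repr(sample), "->", repr(make_identifier(sample)))
```

Both variants return the same string for any input, so the choice between them comes down to speed and readability.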
History
Date User Action Args
2011-03-26 12:24:19 markm set recipients: + markm, loewis, cdavid
2011-03-26 12:24:19 markm set messageid: <1301142259.03.0.117975383386.issue2694@psf.upfronthosting.co.za>
2011-03-26 12:24:18 markm link issue2694 messages
2011-03-26 12:24:18 markm create