Issue 18779: Misleading documentations and comments in regular expression HOWTO



This issue has been migrated to GitHub: https://github.com/python/cpython/issues/62979

classification

Title:	Misleading documentations and comments in regular expression HOWTO
Type:		Stage:	resolved
Components:	Documentation, Regular Expressions	Versions:	Python 3.3, Python 3.4

process

Status:	closed	Resolution:	wont fix
Dependencies:		Superseder:
Assigned To:	docs@python	Nosy List:	akuchling, docs@python, ezio.melotti, mrabarnett, pitrou, r.david.murray, vajrasky
Priority:	normal	Keywords:	patch

Created on 2013-08-19 07:45 by vajrasky, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
fix_alphanumeric_and_underscore_doc_in_regex.patch	vajrasky, 2013-08-19 07:45		review

Messages (5)
msg195611 - (view)	Author: Vajrasky Kok (vajrasky) *	Date: 2013-08-19 07:45
According to
http://oald8.oxfordlearnersdictionaries.com/dictionary/alphanumeric
http://en.wikipedia.org/wiki/Alphanumeric
"alphanumeric" is defined as [A-Za-z0-9]. The underscore (_) is not one of them.

One place in the Python documentation (Doc/tutorial/stdlib2.rst) differentiates them very clearly:

"The format uses placeholder names formed by ``$`` with valid Python identifiers (alphanumeric characters and underscores). Surrounding the placeholder with braces allows it to be followed by more alphanumeric letters with no intervening spaces. Writing ``$$`` creates a single escaped ``$``::"

Yet in the documentation, as well as in the comments in the regex code, we implicitly assume that the underscore belongs to the alphanumerics. Explicit is better than implicit!

Attached is a patch that differentiates alphanumerics and the underscore in the regex documentation and comments. This is important in case someone is confused by this code:

    >>> import re
    >>> re.split(r'\W', 'haha$hihi*huhu_hehe hoho')
    ['haha', 'hihi', 'huhu_hehe', 'hoho']

On a side note: in the Python code base, sometimes we write "alphanumerics" and "underscores", yet sometimes we write "alphanumeric characters" and "underscore characters". Which one is the true way?
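
A minimal sketch of the distinction raised in this message, assuming Python 3. It contrasts `str.isalnum()`, which follows the dictionary sense of "alphanumeric", with `\w`, which also accepts the underscore. The variable names are illustrative only and do not come from the attached patch.

```python
import re

sample = "haha$hihi*huhu_hehe hoho"

# str.isalnum() follows the dictionary sense: the underscore is not alphanumeric.
underscore_is_alnum = "_".isalnum()

# \w, by contrast, does match the underscore.
underscore_matches_w = re.fullmatch(r"\w", "_") is not None

# Splitting on \W (the complement of \w) keeps 'huhu_hehe' in one piece.
split_on_non_word = re.split(r"\W", sample)

# Splitting on an explicit "not a letter or digit" class also breaks at '_'.
split_on_non_alnum = re.split(r"[^A-Za-z0-9]", sample)

print(underscore_is_alnum, underscore_matches_w)
print(split_on_non_word)
print(split_on_non_alnum)
```

Spelling out the character class is one way to get the stricter behaviour when the underscore should act as a separator.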
msg195617 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2013-08-19 09:37
I was wondering which doc you were alluding to, before I noticed your patch is against the regex HOWTO. The HOWTO seems quite outdated with respect to Python 3. For example, "\w" is no longer equivalent to "[a-zA-Z0-9_]", except with the ASCII flag.
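
The point about the ASCII flag can be checked with a short sketch like the following, assuming Python 3 string patterns. The sample text is made up for illustration; the non-ASCII letters are written as escapes so the source stays unambiguous.

```python
import re

# "naïve_café 42", with the accented letters written as escapes.
text = "na\u00efve_caf\u00e9 42"

# Default for str patterns in Python 3: \w covers Unicode letters and digits.
unicode_words = re.findall(r"\w+", text)

# With re.ASCII, \w falls back to [a-zA-Z0-9_].
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

# The explicit class behaves like the re.ASCII case for this input.
explicit_words = re.findall(r"[a-zA-Z0-9_]+", text)

print(unicode_words)
print(ascii_words)
print(explicit_words)
```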
msg195618 - (view)	Author: Vajrasky Kok (vajrasky) *	Date: 2013-08-19 10:19
In Lib/re.py, starting from line 77 (Python 3.4):

    \w Matches any alphanumeric character; equivalent to [a-zA-Z0-9_] in bytes patterns or string patterns with the ASCII flag. In string patterns without the ASCII flag, it will match the range of Unicode alphanumeric characters (letters plus digits plus underscore). With LOCALE, it will match the set [0-9_] plus characters defined as letters for the current locale.

The prelude is "Matches any alphanumeric character;". Yet in every case (bytes patterns, string patterns with the ASCII flag, string patterns without the ASCII flag, patterns with LOCALE), the underscore is always included. So why don't we change the prelude to "Matches any alphanumeric character and underscore character;"?

The description already explains that, depending on whether the pattern is Unicode or not, "alphanumeric" can mean [A-Za-z0-9] or something wider. The description is fine as it stands, but the prelude misleads readers.
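
The claim that the underscore is included in every mode can be spot-checked with a sketch like this one, assuming Python 3. It covers string patterns with and without re.ASCII and bytes patterns; LOCALE is left out because its result depends on the current locale. The dictionary labels are illustrative names, not taken from Lib/re.py.

```python
import re

# One compiled \w pattern per mode, paired with an underscore of the matching type.
modes = {
    "str, default (Unicode)": (re.compile(r"\w"), "_"),
    "str, re.ASCII": (re.compile(r"\w", re.ASCII), "_"),
    "bytes": (re.compile(rb"\w"), b"_"),
}

# True for a mode means \w matched the underscore there.
underscore_matches = {
    label: pattern.fullmatch(subject) is not None
    for label, (pattern, subject) in modes.items()
}

# In a bytes pattern only ASCII letters, digits and '_' count as \w,
# so the UTF-8 bytes of the accented letter split the word.
bytes_words = re.findall(rb"\w+", "caf\u00e9_1".encode("utf-8"))

for label, matched in underscore_matches.items():
    print(f"{label}: {matched}")
print(bytes_words)
```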
msg195627 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2013-08-19 12:39
The answer to the question about "alphanumerics" versus "alphanumeric characters" is that it is most likely context-dependent, so I'd have to see particular examples to say which I thought read better. So there is no One True Answer to this question, I think.
msg287809 - (view)	Author: A.M. Kuchling (akuchling) *	Date: 2017-02-15 02:46
Unfortunately making the sentences pedantically correct also makes them ungainly, and I think people generally assume that underscores are treated as a letter.

History
Date	User	Action	Args
2022-04-11 14:57:49	admin	set	github: 62979
2017-02-15 02:46:23	akuchling	set	status: open -> closed nosy: + akuchling messages: + msg287809 resolution: wont fix stage: patch review -> resolved
2014-12-31 16:15:07	akuchling	set	nosy: - akuchling
2013-08-19 12:39:58	r.david.murray	set	nosy: + r.david.murray messages: + msg195627
2013-08-19 10:19:36	vajrasky	set	messages: + msg195618
2013-08-19 09:37:36	pitrou	set	messages: + msg195617 title: Misleading documentations and comments in regular expression about alphanumerics and underscore -> Misleading documentations and comments in regular expression HOWTO
2013-08-19 09:31:47	serhiy.storchaka	set	nosy: + ezio.melotti, pitrou, mrabarnett components: + Regular Expressions stage: patch review
2013-08-19 07:45:16	vajrasky	create