classification
Title: email.header uses re.IGNORECASE without re.ASCII
Type: Stage: resolved
Components: Regular Expressions Versions: Python 3.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: barry, ezio.melotti, methane, mrabarnett, serhiy.storchaka
Priority: normal Keywords: patch

Created on 2017-10-03 12:58 by methane, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Pull Requests
URL Status Linked
PR 3868 merged methane, 2017-10-03 13:03
PR 7856 closed hloeung, 2019-05-18 01:52
Messages (6)
msg303612 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2017-10-03 12:58
email.header has this pattern:

https://github.com/python/cpython/blob/85c0b8941f0c8ef3ed787c9d504712c6ad3eb5d3/Lib/email/header.py#L34-L43

# Match encoded-word strings in the form =?charset?q?Hello_World?=
ecre = re.compile(r'''
  =\?                   # literal =?
  (?P<charset>[^?]*?)   # non-greedy up to the next ? is the charset
  \?                    # literal ?
  (?P<encoding>[qb])    # either a "q" or a "b", case insensitive
  \?                    # literal ?
  (?P<encoded>.*?)      # non-greedy up to the next ?= is the encoded string
  \?=                   # literal ?=
  ''', re.VERBOSE | re.IGNORECASE | re.MULTILINE)


Since only 's' and 'i' have other (non-ASCII) lowercase characters that match them, and the pattern only uses 'q' and 'b', this is not a real bug.
But using re.ASCII is safer.

Additionally, email.utils has had the same pattern for 10 years, and it is not used anywhere.
It should be removed.
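
A minimal sketch of the behaviour described above, using only the standard `re` module. The two non-ASCII characters tested (long s and dotless i) are my assumption about which "other lower case" characters the message refers to; the variable names are illustrative only.

```python
import re

# str patterns use Unicode case folding unless re.ASCII is given.
unicode_s = re.compile(r's', re.IGNORECASE)
ascii_s = re.compile(r's', re.IGNORECASE | re.ASCII)

long_s = '\u017f'     # LATIN SMALL LETTER LONG S
dotless_i = '\u0131'  # LATIN SMALL LETTER DOTLESS I

unicode_matches_long_s = unicode_s.fullmatch(long_s) is not None
ascii_matches_long_s = ascii_s.fullmatch(long_s) is not None
unicode_matches_dotless_i = (
    re.fullmatch(r'i', dotless_i, re.IGNORECASE) is not None
)

# The encoding class in ecre contains only 'q' and 'b'. Scan every
# non-ASCII code point and collect any that the class would accept.
qb = re.compile(r'[qb]', re.IGNORECASE)
non_ascii = ''.join(map(chr, range(128, 0x110000)))
non_ascii_hits = qb.findall(non_ascii)

print(unicode_matches_long_s, ascii_matches_long_s, non_ascii_hits)
```

If the scan finds nothing, the existing pattern cannot accept a non-ASCII encoding letter, which is consistent with "not a real bug"; re.ASCII would only make that guarantee explicit.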
msg303613 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-10-03 13:09
Alternatively, re.IGNORECASE can be removed and [qb] replaced with [QqBb].
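
A sketch of that alternative, assuming the rest of the pattern stays as quoted in the first message. The name `ecre_explicit` and the sample headers are illustrative and not taken from the patch.

```python
import re

# Same structure as ecre, but without re.IGNORECASE: both cases of the
# encoding letter are listed explicitly in the character class.
ecre_explicit = re.compile(r'''
  =\?                   # literal =?
  (?P<charset>[^?]*?)   # non-greedy up to the next ? is the charset
  \?                    # literal ?
  (?P<encoding>[QqBb])  # either a "q" or a "b", in either case
  \?                    # literal ?
  (?P<encoded>.*?)      # non-greedy up to the next ?= is the encoded string
  \?=                   # literal ?=
  ''', re.VERBOSE | re.MULTILINE)

m_upper = ecre_explicit.search('=?ISO-8859-1?Q?Hello_World?=')
m_lower = ecre_explicit.search('=?iso-8859-1?q?Hello_World?=')

print(m_upper.group('charset', 'encoding', 'encoded'))
print(m_lower.group('charset', 'encoding', 'encoded'))
```

With no case-insensitive flag at all, Unicode case folding never comes into play, so re.ASCII is not needed for this variant.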
msg303615 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2017-10-03 13:40
I think using re.ASCII is a good addition since RFC 2047 says:

   Generally, an "encoded-word" is a sequence of printable ASCII
   characters that begins with "=?", ends with "?=", and has two "?"s in
   between.  It specifies a character set and an encoding method, and
   also includes the original text encoded as graphic ASCII characters,
   according to the rules for that encoding method.

It's better to keep the re.IGNORECASE since the RFC also says:

   Both 'encoding' and 'charset' names are case-independent.  Thus the
   charset name "ISO-8859-1" is equivalent to "iso-8859-1", and the
   encoding named "Q" may be spelled either "Q" or "q".
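
To illustrate the case-independence the RFC asks for, here is a small check using `email.header.decode_header` from the standard library. The two sample headers are made up for this example.

```python
from email.header import decode_header

# The same encoded-word, once with upper-case and once with lower-case
# charset and encoding names.
upper = decode_header('=?ISO-8859-1?Q?Hello_World?=')
lower = decode_header('=?iso-8859-1?q?Hello_World?=')

# Each result is a list of (decoded_bytes, charset) pairs.
print(upper)
print(lower)
print(upper == lower)
```

Both spellings should decode to the same result, whichever way the pattern implements case-insensitivity.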
msg303668 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2017-10-04 03:47
New changeset bf477a99e0c85258e6573f4ee9eda68fa1f98a31 by INADA Naoki in branch 'master':
bpo-31677: email: Remove re.IGNORECASE flag (GH-3868)
https://github.com/python/cpython/commit/bf477a99e0c85258e6573f4ee9eda68fa1f98a31
msg303669 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2017-10-04 03:51
> It's better to keep the re.IGNORECASE since the RFC also says:
>
>   Both 'encoding' and 'charset' names are case-independent.  Thus the
>   charset name "ISO-8859-1" is equivalent to "iso-8859-1", and the
>   encoding named "Q" may be spelled either "Q" or "q".

I'm sorry, I committed before reading this.
But I think it's not a problem, because re.IGNORECASE doesn't affect the
"(?P<charset>[^?]*?)" pattern.
msg303695 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2017-10-04 14:07
On Oct 3, 2017, at 23:51, INADA Naoki <report@bugs.python.org> wrote:
>> It's better to keep the re.IGNORECASE since the RFC also says:
>> 
>>  Both 'encoding' and 'charset' names are case-independent.  Thus the
>>  charset name "ISO-8859-1" is equivalent to "iso-8859-1", and the
>>  encoding named "Q" may be spelled either "Q" or "q".
> 
> I'm sorry, I've committed before reading this.
> But I think it's not problem, because re.IGNORECASE doesn't affect to
> "(?P<charset>[^?]*?)" pattern.

I think your change is fine, no need to revert or modify it.
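
The quoted point, that re.IGNORECASE has no effect on the charset group, can be checked with a short sketch. It isolates only the charset part of the pattern; the names `with_flag` and `without_flag` are illustrative.

```python
import re

# Only the leading part of ecre: "=?", the charset group, and the next "?".
charset_only = r'=\?(?P<charset>[^?]*?)\?'
with_flag = re.compile(charset_only, re.IGNORECASE)
without_flag = re.compile(charset_only)

samples = ['=?ISO-8859-1?Q?Hello_World?=', '=?iso-8859-1?q?Hello_World?=']

# [^?] accepts any character except "?", so case folding cannot change
# what the group captures.
captured_with = [with_flag.match(s).group('charset') for s in samples]
captured_without = [without_flag.match(s).group('charset') for s in samples]

print(captured_with)
print(captured_without)
```

The group captures the charset name in whatever case it was written, so any normalisation of the name has to happen after matching rather than in the regular expression.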
History
Date                 User              Action  Args
2022-04-11 14:58:53  admin             set     github: 75858
2019-05-18 01:52:04  hloeung           set     pull_requests: + pull_request13314
2017-10-04 14:07:33  barry             set     messages: + msg303695
2017-10-04 03:51:53  methane           set     status: open -> closed
                                               resolution: fixed
                                               messages: + msg303669
                                               stage: patch review -> resolved
2017-10-04 03:47:41  methane           set     messages: + msg303668
2017-10-03 13:40:03  barry             set     nosy: + barry
                                               messages: + msg303615
2017-10-03 13:09:49  serhiy.storchaka  set     nosy: + serhiy.storchaka
                                               messages: + msg303613
2017-10-03 13:03:35  methane           set     keywords: + patch
                                               stage: patch review
                                               pull_requests: + pull_request3848
2017-10-03 12:58:02  methane           create