This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: Expose regex bytecode of compiled pattern object
Type: enhancement Stage: resolved
Components: Library (Lib), Regular Expressions Versions: Python 3.7
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: Nosy List: JelleZijlstra, ezio.melotti, jcgoble3, mrabarnett, pitrou, serhiy.storchaka, terry.reedy
Priority: normal Keywords: easy, patch

Created on 2016-02-11 04:05 by jcgoble3, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name Uploaded Description
issue26336.patch JelleZijlstra, 2016-06-05 01:12 fixed patch without unrelated turtle changes
issue26336-cr.patch JelleZijlstra, 2016-06-05 05:50
issue26336-cr2.patch JelleZijlstra, 2016-06-05 07:35 patch addressing code review comments
Messages (13)
msg260072 - (view) Author: Jonathan Goble (jcgoble3) * Date: 2016-02-11 04:05
Once a regular expression is compiled with `obj = re.compile()`, it would be nice to have access to the raw bytecode, probably as `obj.code` or `obj.bytecode`, so it can be explored programmatically. Currently, regex bytecode is only stored in a C struct and not exposed to Python code; the only way to examine the compiled version is to pass the `re.DEBUG` flag to `re.compile()`, which prints only to stdout and outputs not the finished bytecode, but a "pretty-printed" intermediate representation useless for programmatic analysis.

This is basically requesting the equivalent of the `co_code` attribute of the code object returned by the built-in `compile()`, but for regular expression objects instead of Python code objects.

Given that the bytecode can actually be multi-byte integers, `regexobj.bytecode` should return a list (perhaps even just the same list passed to the C function?) or an `array.array()` instance, rather than a bytestring.
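For illustration, here is a minimal sketch of the contrast described above, using only the standard library. The helper name `debug_dump` is made up for this example. It shows that a Python code object exposes its bytecode as `co_code`, while for a compiled pattern the only built-in option is to capture what `re.DEBUG` prints to stdout.

```python
import contextlib
import io
import re


def debug_dump(pattern: str) -> str:
    """Return the text that re.DEBUG prints while compiling `pattern`.

    Hypothetical helper for this example; it is not part of the re module.
    """
    buffer = io.StringIO()
    # re.DEBUG writes to sys.stdout, so redirect stdout to capture the dump.
    with contextlib.redirect_stdout(buffer):
        re.compile(pattern, re.DEBUG)
    return buffer.getvalue()


# A Python code object exposes its bytecode directly as a bytes object.
code_obj = compile("x = 1", "<string>", "exec")
python_bytecode = code_obj.co_code

# A compiled pattern has no such attribute; only the printed dump is available.
dump = debug_dump("ab+")
```

The captured dump is plain text meant for human reading, so any programmatic use would have to parse it, and its layout may differ between Python versions.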
msg260397 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-02-17 12:39
Regex bytecode is an implementation detail. It was 16-bit in narrow builds, but was changed to at least 32-bit in bugfix releases. It could be changed to 64-bit, or to pack an argument with an opcode in one word. The implementation might not use bytecode at all, and use a tree instead.
msg260408 - (view) Author: Jonathan Goble (jcgoble3) * Date: 2016-02-17 20:24
It would indeed be marked as a CPython implementation detail, and with no guarantee of backward compatibility. Others (well, at least one other) have suggested the same on python-ideas. So a simple note in the accompanying documentation would suffice.
msg260569 - (view) Author: Jonathan Goble (jcgoble3) * Date: 2016-02-20 18:14
Noting for the record that, as I had brought up on python-ideas [1], in addition to simply exposing the raw code, it would be nice to have a public constructor for the compiled pattern type and a 'dis'-like module for support. The former would enable optimizers, and the latter would simplify programmatic analysis.

[1] https://mail.python.org/pipermail/python-ideas/2016-February/thread.html#38488
msg267356 - (view) Author: Jelle Zijlstra (JelleZijlstra) * (Python committer) Date: 2016-06-05 00:46
This patch exposes the bytecode as a __code__ attribute on pattern objects as a Unicode string (consistent with the internal representation as Py_UCS4 instances).
msg267381 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-06-05 05:04
__code__ is associated with Python bytecode. Regex bytecode can't be represented as a Unicode string, since it is a sequence of 32-bit integers that can exceed the sys.maxunicode limit.
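A small sketch of the limit mentioned above, assuming a made-up 32-bit value rather than a real regex opcode: a `str` can only hold code points up to `sys.maxunicode`, while a tuple of ints or an `array.array` holds larger values without trouble.

```python
import array
import sys

# Largest code point a Python str can hold (0x10FFFF on current CPython).
largest_code_point = sys.maxunicode

# Hypothetical 32-bit code word chosen for illustration; not a real opcode.
word = 0xFFFFFFFF

# chr() rejects values beyond sys.maxunicode, so the word cannot be a str element.
try:
    chr(word)
    fits_in_str = True
except (ValueError, OverflowError):
    fits_in_str = False

# A tuple of ints or an unsigned-long array stores the value unchanged.
as_tuple = (word,)
as_array = array.array("L", [word])
```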
msg267388 - (view) Author: Jelle Zijlstra (JelleZijlstra) * (Python committer) Date: 2016-06-05 05:50
Thanks for the feedback. This patch instead exposes the code as a tuple of integers named __pattern_code__. "Bytecode" is technically inaccurate since the code isn't limited to bytes but can contain larger integers.
msg267395 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-06-05 07:19
Added comments on Rietveld.

I still do not think this is a good idea.
msg267396 - (view) Author: Jelle Zijlstra (JelleZijlstra) * (Python committer) Date: 2016-06-05 07:35
Updated patch attached.

I don't feel strongly about whether this should be in Python, but it seems potentially useful at least as a tool to learn more about how re is implemented. If I have time I may write a tool using __pattern_code__ and the sre_constants module to provide a disassembly for regexes.
msg267467 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2016-06-05 19:51
I prefer 'rexcode' for the attribute name.

I share Serhiy's reservations. When people write code that depends on CPython implementation details, even though documented as such, the existence of such code becomes a drag on change, especially when details have been stable for a while. I just saw this used as an argument against one of the proposed bytecode/wordcode changes. "It would break current 3rd party code." It also came up a few years ago with randomizing hashes (and dict iteration order).

Jelle, can one access the 'rexcode' via ctypes? If so, I think an re disassembler with docs would be a good pypi module. Maybe you could also make it work with Barnett's regex module.
msg267483 - (view) Author: Jelle Zijlstra (JelleZijlstra) * (Python committer) Date: 2016-06-05 23:40
Yes, you can get at it with ctypes. I released a small (and virtually untested) library at https://github.com/JelleZijlstra/regdis that provides dis-like capabilities.
msg293199 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-05-07 15:41
See issue30299, which adds the output of decoded bytecode in debug mode. The format of the bytecode is an implementation detail: it is irregular, new opcodes can be added, and the format of existing opcodes can be changed. Thus it is hard to support a third-party disassembler.
msg293210 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2017-05-07 19:33
I changed the title because I believe a) the Python-level tuple of ints should be created on demand (I am not sure what the patch does); and b) the exposure should be done by an overt function rather than be an 'attribute', even if that is a front for a property getter. The former would have to come with re. A function does not need to be a method, and can therefore be provided in a 3rd party module that accesses the C attribute via ctypes. Such a function could be used with past versions of CPython.

Jelle, I suggest that you augment regdis with such a function, so it does not depend on this issue.


I am closing this for reasons stated and the following.

I reread the thread.  Only one person with commit privileges participated.  The proposal got only tepid support for stdlib inclusion and at least as much or more support for 3rd party activity.  The thread ended with Jonathan saying "I've decided to shelve this idea for the time being".  Given the opposition of the current re maintainer, this proposal lacks sufficient support.

Modules as a whole can be OS specific (several examples) or CPython-specific (dis).  We generally avoid adding features within modules whose existence is implementation-specific.  "This attribute [a tuple of ints] is not guaranteed to exist in all implementations of Python." is almost enough to kill the proposal.
History
Date User Action Args
2022-04-11 14:58:27 admin set github: 70524
2017-05-07 19:33:47 terry.reedy set status: open -> closed
    versions: + Python 3.7, - Python 3.6
    title: Expose regex bytecode as attribute of compiled pattern object -> Expose regex bytecode of compiled pattern object
    messages: + msg293210
    resolution: rejected
    stage: resolved
2017-05-07 15:41:59 serhiy.storchaka set messages: + msg293199
2016-06-05 23:40:48 JelleZijlstra set messages: + msg267483
2016-06-05 19:51:12 terry.reedy set nosy: + terry.reedy
    messages: + msg267467
2016-06-05 07:35:28 JelleZijlstra set files: + issue26336-cr2.patch
    messages: + msg267396
2016-06-05 07:19:02 serhiy.storchaka set messages: + msg267395
2016-06-05 05:50:09 JelleZijlstra set files: + issue26336-cr.patch
    messages: + msg267388
2016-06-05 05:04:12 serhiy.storchaka set messages: + msg267381
2016-06-05 01:12:59 JelleZijlstra set files: - issue26336.patch
2016-06-05 01:12:44 JelleZijlstra set files: + issue26336.patch
2016-06-05 00:46:11 JelleZijlstra set files: + issue26336.patch
    nosy: + JelleZijlstra
    messages: + msg267356
    keywords: + patch
2016-02-20 18:14:00 jcgoble3 set messages: + msg260569
2016-02-17 20:24:04 jcgoble3 set messages: + msg260408
2016-02-17 12:39:37 serhiy.storchaka set messages: + msg260397
2016-02-16 19:03:57 paul.moore set keywords: + easy
2016-02-11 04:05:27 jcgoble3 create