Title: Expose regex bytecode of compiled pattern object
Type: enhancement Stage: resolved
Components: Library (Lib), Regular Expressions Versions: Python 3.7
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: Nosy List: Jelle Zijlstra, ezio.melotti, jcgoble3, mrabarnett, pitrou, serhiy.storchaka, terry.reedy
Priority: normal Keywords: easy, patch

Created on 2016-02-11 04:05 by jcgoble3, last changed 2017-05-07 19:33 by terry.reedy. This issue is now closed.

File name | Uploaded | Description
issue26336.patch | Jelle Zijlstra, 2016-06-05 01:12 | fixed patch without unrelated turtle changes
issue26336-cr.patch | Jelle Zijlstra, 2016-06-05 05:50 |
issue26336-cr2.patch | Jelle Zijlstra, 2016-06-05 07:35 | patch addressing code review comments
Messages (13)
msg260072 - (view) Author: Jonathan Goble (jcgoble3) * Date: 2016-02-11 04:05
Once a regular expression is compiled with `obj = re.compile()`, it would be nice to have access to the raw bytecode, probably as `obj.code` or `obj.bytecode`, so it can be explored programmatically. Currently, regex bytecode is only stored in a C struct and not exposed to Python code; the only way to examine the compiled version is to pass the `re.DEBUG` flag to `re.compile()`, which prints only to stdout and outputs not the finished bytecode, but a "pretty-printed" intermediate representation useless for programmatic analysis.

This is basically requesting the equivalent of the `co_code` attribute of the code object returned by the built-in `compile()`, but for regular expression objects instead of Python code objects.

Given that the bytecode can actually be multi-byte integers, `regexobj.bytecode` should return a list (perhaps even just the same list passed to the C function?) or an `array.array()` instance, rather than a bytestring.
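To illustrate the gap described in this message, here is a minimal sketch of the workaround available today: capturing what `re.DEBUG` prints by redirecting stdout, next to the `co_code` attribute that Python code objects already expose. The helper name `debug_dump` is hypothetical, and the exact text printed by `re.DEBUG` differs between Python versions.

```python
import contextlib
import io
import re


def debug_dump(pattern, flags=0):
    """Return the text that re.DEBUG prints while compiling `pattern`.

    Hypothetical helper: re.DEBUG writes to stdout only, so the output
    has to be captured as text rather than read as structured data.
    """
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        re.compile(pattern, flags | re.DEBUG)
    return buffer.getvalue()


# The dump is human-readable text, not a sequence of code words.
dump = debug_dump(r"a+")
print(dump)

# For comparison, Python code objects expose their bytecode directly.
code_obj = compile("1 + 1", "<example>", "eval")
raw = code_obj.co_code
print(type(raw).__name__, len(raw))
```

The captured dump is a string that would have to be parsed again before any programmatic analysis, which is the limitation the request is about.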
msg260397 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-02-17 12:39
Regex bytecode is an implementation detail. It was 16-bit in narrow builds, but was changed to at least 32-bit in bugfix releases. It could be changed to 64-bit, or to pack an argument with an opcode in one word. An implementation might not use bytecode at all and use the tree instead.
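As a small illustration of the width of a code word on a given build, the following sketch reads `CODESIZE` from the private `_sre` module. This assumes CPython; `_sre` is not a public API, and both the module and the attribute may change or disappear, which is the point made above.

```python
import _sre  # private CPython module backing `re`; not a public API

# Size of one regex code word in bytes on this build (assumed CPython).
code_bytes = _sre.CODESIZE
code_bits = code_bytes * 8

# Largest value that fits in a single code word of that width.
max_code_word = (1 << code_bits) - 1

print(f"code word: {code_bits} bits, max value {max_code_word:#x}")
```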
msg260408 - (view) Author: Jonathan Goble (jcgoble3) * Date: 2016-02-17 20:24
It would indeed be marked as a CPython implementation detail, and with no guarantee of backward compatibility. Others (well, at least one other) have suggested the same on python-ideas. So a simple note in the accompanying documentation would suffice.
msg260569 - (view) Author: Jonathan Goble (jcgoble3) * Date: 2016-02-20 18:14
Noting for the record that, as I had brought up on python-ideas [1], in addition to simply exposing the raw code, it would be nice to have a public constructor for the compiled pattern type and a 'dis'-like module for support. The former would enable optimizers, and the latter would simplify programmatic analysis.

msg267356 - (view) Author: Jelle Zijlstra (Jelle Zijlstra) * (Python triager) Date: 2016-06-05 00:46
This patch exposes the bytecode as a __code__ attribute on pattern objects as a Unicode string (consistent with the internal representation as Py_UCS4 instances).
msg267381 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-06-05 05:04
__code__ is associated with Python bytecode. Regex bytecode can't be represented as a Unicode string, since it is a sequence of 32-bit integers that can exceed the sys.maxunicode limit.
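A minimal sketch of the objection: a 32-bit code word can be larger than the highest code point, so it cannot be turned into a character. The values in `code_words` are made up for illustration and are not real regex code.

```python
import sys

# Hypothetical code words; the second one does not fit in a code point.
code_words = (17, 0xFFFFFFFF)


def as_unicode_string(words):
    """Try to store each code word as one character of a str."""
    return "".join(chr(word) for word in words)


try:
    as_unicode_string(code_words)
    representable = True
except (ValueError, OverflowError):
    # chr() rejects values above sys.maxunicode; the exception type
    # depends on how far out of range the value is and on the version.
    representable = False

print(hex(sys.maxunicode), representable)

# A tuple of ints has no such limit.
as_tuple = tuple(code_words)
```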
msg267388 - (view) Author: Jelle Zijlstra (Jelle Zijlstra) * (Python triager) Date: 2016-06-05 05:50
Thanks for the feedback. This patch instead exposes the code as a tuple of integers named __pattern_code__. "Bytecode" is technically inaccurate since the code isn't limited to bytes but can contain larger integers.
msg267395 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-06-05 07:19
Added comments on Rietveld.

I still do not think this is a good idea.
msg267396 - (view) Author: Jelle Zijlstra (Jelle Zijlstra) * (Python triager) Date: 2016-06-05 07:35
Updated patch attached.

I don't feel strongly about whether this should be in Python, but it seems potentially useful at least as a tool to learn more about how re is implemented. If I have time I may write a tool using __pattern_code__ and the sre_constants module to provide a disassembly for regexes.
msg267467 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2016-06-05 19:51
I prefer 'rexcode' for the attribute name.

I share Serhiy's reservations.  When people write code that depends on CPython implementation details, even though documented as such, the existence of such code becomes a drag on change, especially when details have been stable for a while.  I just saw this used as an argument against one of the proposed bytecode/wordcode changes. "It would break current 3rd party code." It also came up a few years ago with randomizing hashes (and dict iteration order).

Jelle, can one access the 'rexcode' via ctypes?  If so, I think an re disassembler with docs would be a good pypi module.  Maybe you could also make it work with Barnett's regex module.
msg267483 - (view) Author: Jelle Zijlstra (Jelle Zijlstra) * (Python triager) Date: 2016-06-05 23:40
Yes, you can get at it with ctypes. I released a small (and virtually untested) library at that provides dis-like capabilities.
msg293199 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-05-07 15:41
See issue30299, which adds the output of decoded bytecode in debug mode. The format of the bytecode is an implementation detail: it is irregular, new opcodes can be added, and the format of existing opcodes can be changed. Thus it is hard to support a third-party disassembler.
msg293210 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2017-05-07 19:33
I changed the title because I believe a) the Python-level tuple of ints should be created on demand (I am not sure what the patch does); and b) the exposure should be done by an overt function rather than be an 'attribute', even if that is a front for a property getter.  An attribute would have to come with re.  A function does not need to be a method, and can therefore be provided in a 3rd party module that accesses the C attribute via ctypes.  Such a function could be used with past versions of CPython.

Jelle, I suggest that you augment regdis with such a function, so it does not depend on this issue.
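A rough sketch of what such an overt, third-party function could look like. Note the swap in technique: instead of reading the C attribute via ctypes, this version re-runs the private parser and code generator on the pattern's source, so the tuple is built on demand. The name `pattern_code` is hypothetical, the `_code` helper and the `re._compiler` / `sre_compile` modules are private to CPython, and any of them may change without notice.

```python
import re

try:
    # Python 3.11+: the compiler lives in private submodules of `re`.
    from re import _compiler as sre_compile, _parser as sre_parse
except ImportError:
    # Older versions ship them as top-level modules.
    import sre_compile
    import sre_parse


def pattern_code(compiled):
    """Return the regex code of a compiled pattern as a tuple of ints.

    Hypothetical helper: recompiles from `compiled.pattern` and
    `compiled.flags` using private CPython internals, so nothing is
    stored on the pattern object itself.
    """
    parsed = sre_parse.parse(compiled.pattern, compiled.flags)
    words = sre_compile._code(parsed, compiled.flags)
    return tuple(int(word) for word in words)


code = pattern_code(re.compile(r"a+b"))
print(len(code), code[:8])
```

Because it is a plain function taking a pattern object, it needs no change to `re` itself and could live in a separate package.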

I am closing this for reasons stated and the following.

I reread the thread.  Only one person with commit privileges participated.  The proposal got only tepid support for stdlib inclusion and at least as much or more support for 3rd party activity.  The thread ended with Jonathan saying "I've decided to shelve this idea for the time being".  Given the opposition of the current re maintainer, this proposal lacks sufficient support.

Modules as a whole can be OS specific (several examples) or CPython-specific (dis).  We generally avoid adding features within modules whose existence is implementation-specific.  "This attribute [a tuple of ints] is not guaranteed to exist in all implementations of Python." is almost enough to kill the proposal.
Date | User | Action | Args
2017-05-07 19:33:47 | terry.reedy | set | status: open -> closed; versions: + Python 3.7, - Python 3.6; title: Expose regex bytecode as attribute of compiled pattern object -> Expose regex bytecode of compiled pattern object; messages: + msg293210; resolution: rejected; stage: resolved
2017-05-07 15:41:59 | serhiy.storchaka | set | messages: + msg293199
2016-06-05 23:40:48 | Jelle Zijlstra | set | messages: + msg267483
2016-06-05 19:51:12 | terry.reedy | set | nosy: + terry.reedy; messages: + msg267467
2016-06-05 07:35:28 | Jelle Zijlstra | set | files: + issue26336-cr2.patch; messages: + msg267396
2016-06-05 07:19:02 | serhiy.storchaka | set | messages: + msg267395
2016-06-05 05:50:09 | Jelle Zijlstra | set | files: + issue26336-cr.patch; messages: + msg267388
2016-06-05 05:04:12 | serhiy.storchaka | set | messages: + msg267381
2016-06-05 01:12:59 | Jelle Zijlstra | set | files: - issue26336.patch
2016-06-05 01:12:44 | Jelle Zijlstra | set | files: + issue26336.patch
2016-06-05 00:46:11 | Jelle Zijlstra | set | files: + issue26336.patch; nosy: + Jelle Zijlstra; messages: + msg267356; keywords: + patch
2016-02-20 18:14:00 | jcgoble3 | set | messages: + msg260569
2016-02-17 20:24:04 | jcgoble3 | set | messages: + msg260408
2016-02-17 12:39:37 | serhiy.storchaka | set | messages: + msg260397
2016-02-16 19:03:57 | paul.moore | set | keywords: + easy
2016-02-11 04:05:27 | jcgoble3 | create |