Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

In codecs, function 'normalizestring' should convert both spaces and hyphens to underscores. #81932

Closed
qigangxu mannequin opened this issue Aug 3, 2019 · 31 comments
Closed
Labels
3.9 only security fixes topic-unicode type-bug An unexpected behavior, bug, or error

Comments

@qigangxu
Copy link
Mannequin

qigangxu mannequin commented Aug 3, 2019

BPO 37751
Nosy @malemburg, @gpshead, @vstinner, @mark-summerfield, @ezio-melotti, @methane, @miss-islington, @shihai1991, @qigangxu, @akdor1154, @bodograumann
PRs
  • bpo-37751: Fix normalizestring() with hyphens and spaces converted to underscores #15092
  • bpo-37751: Document the change in What's New in Python 3.9 #17997
  • bpo-37751: Document the change in What's New in Python 3.9 #23096
  • bpo-37751: Update codecs.register() doc. #25643
  • [3.9] bpo-37751: Document codecs.lookup() change in What's New in Python 3.9 (GH-23096) #25659
  • [3.9] bpo-37751: Update codecs.register() doc. (GH-25643) #25677
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2021-04-24.16:55:03.898>
    created_at = <Date 2019-08-03.11:34:13.329>
    labels = ['type-bug', '3.9', 'expert-unicode']
    title = "In codecs, function 'normalizestring' should convert both spaces and hyphens to underscores."
    updated_at = <Date 2022-01-25.00:29:03.446>
    user = 'https://github.com/qigangxu'

    bugs.python.org fields:

    activity = <Date 2022-01-25.00:29:03.446>
    actor = 'gregory.p.smith'
    assignee = 'none'
    closed = True
    closed_date = <Date 2021-04-24.16:55:03.898>
    closer = 'shihai1991'
    components = ['Unicode']
    creation = <Date 2019-08-03.11:34:13.329>
    creator = 'qigangxu'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 37751
    keywords = ['patch']
    message_count = 31.0
    messages = ['348953', '348954', '348956', '348959', '349448', '350086', '350087', '350155', '359970', '359971', '359972', '359973', '359974', '360005', '391648', '391652', '391653', '391654', '391664', '391666', '391671', '391785', '392077', '392159', '392161', '392178', '395730', '395764', '411536', '411537', '411538']
    nosy_count = 11.0
    nosy_names = ['lemburg', 'gregory.p.smith', 'vstinner', 'mark', 'ezio.melotti', 'methane', 'miss-islington', 'shihai1991', 'qigangxu', 'akdor1154', 'bodograumann']
    pr_nums = ['15092', '17997', '23096', '25643', '25659', '25677']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue37751'
    versions = ['Python 3.9']

    @qigangxu
    Copy link
    Mannequin Author

    qigangxu mannequin commented Aug 3, 2019

    In codecs.c, when _PyCodec_Lookup() call normalizestring(), both spaces and hyphens should be convered to underscores. Not convert spaces to hyphens.

    see:https://github.com/python/peps/blob/master/pep-0100.txt, Codecs (Coder/Decoders) Lookup

    @qigangxu qigangxu mannequin added 3.9 only security fixes topic-unicode type-bug An unexpected behavior, bug, or error labels Aug 3, 2019
    @qigangxu
    Copy link
    Mannequin Author

    qigangxu mannequin commented Aug 3, 2019

    and I will try to fix it.

    @shihai1991
    Copy link
    Member

    @qigangxu
    Copy link
    Mannequin Author

    qigangxu mannequin commented Aug 3, 2019

    @malemburg
    Copy link
    Member

    Jordon is right. Conversion has to be to underscores, not hyphens. I guess this bug was introduced when the normalization function was converted to C.

    @vstinner
    Copy link
    Member

    New changeset 20f59fe by Victor Stinner (Jordon Xu) in branch 'master':
    bpo-37751: Fix codecs.lookup() normalization (GH-15092)
    20f59fe

    @vstinner
    Copy link
    Member

    Thanks for the fix Jordon Xu.

    IMHO this change is not strictly a bugfix, but more like an enhancement. I close the issue.

    If you consider that a backport to Python 3.7 and 3.8 is needed, please say so.

    @qigangxu
    Copy link
    Mannequin Author

    qigangxu mannequin commented Aug 22, 2019

    Thanks vstinner. I also don't think it's necessary to backport to the old version. Close this issue is fine.

    @hroncok
    Copy link
    Mannequin

    hroncok mannequin commented Jan 14, 2020

    The change is backwards incompatible and a backport would break things. See for example how it breaks latexcodec:

    https://bugzilla.redhat.com/show_bug.cgi?id=1789613#c2

    @vstinner
    Copy link
    Member

    The change is backwards incompatible and a backport would break things. See for example how it breaks latexcodec:

    I reopen the issue. I proposed PR 17997 to *document* the incompatible change in What's New in Python 3.8. IMO it's a deliberate change and it's correct.

    I rely on Marc-Andre Lemburg who implemented codecs and encodings modules. He wrote: "Jordon is right. Conversion has to be to underscores, not hyphens.".

    @vstinner vstinner reopened this Jan 14, 2020
    @vstinner
    Copy link
    Member

    It seems quite easy to update latexcodec project to support Python 3.9. I proposed a solution there:
    https://bugzilla.redhat.com/show_bug.cgi?id=1789613#c6

    @malemburg
    Copy link
    Member

    Just to clarify: the change in the C implementation was the breaking change. The patch just restores the previous behavior: https://github.com/python/cpython/blob/master/Lib/encodings/__init__.py#L43

    Please note that external codec packages should not rely on the semantics of the Python stdlib encodings package's search function. They should really register their own search function: https://docs.python.org/3.9/library/codecs.html#codecs.register

    It's good practice to always only use ASCII lower case chars and the underscore for codec names.

    @vstinner
    Copy link
    Member

    Please note that external codec packages should not rely on the semantics of the Python stdlib encodings package's search function.

    latexcodec does register a search function.

    It's good practice to always only use ASCII lower case chars and the underscore for codec names.

    latexcodec uses encoding names like "latex+ascii" and their search function used "+" as a separator.

    Don't worry, I just fixed latexcodec, my fix is already merged upstream! I simply changed the search function to split on "_" if the name contains "_".

    @vstinner
    Copy link
    Member

    I created bpo-39337: codecs.lookup() ignores non-ASCII characters, whereas encodings.normalize_encoding() copies them.

    @akdor1154
    Copy link
    Mannequin

    akdor1154 mannequin commented Apr 23, 2021

    If I understand the target of this issue, this is a breaking change in python 3.9 .

    E.g. see SAP-archive/PyHDB#149

    Ideally if we did not intend to break libraries then can this be fixed?
    Or if it is acceptable to have such a breaking change, can it be documented as such? (maybe this is #23096 ? though I would not interpret that as a breaking change at first glance)

    @shihai1991
    Copy link
    Member

    Ideally if we did not intend to break libraries then can this be fixed?
    Or if it is acceptable to have such a breaking change.

    Hi, akdor1154, thanks for your notice. It was a bugfix in fact:) https://bugs.python.org/issue37751#msg349448

    maybe this is #23096 ? though I would not interpret that as a breaking change at first glance

    @victor ping :)

    @methane
    Copy link
    Member

    methane commented Apr 23, 2021

    I think it is too late. Python 3.9 has been released already. Reverting the change is also breaking change.

    PEP-100 says:
    "Search functions are expected to take one argument, the encoding name in all lower case letters and with hyphens and spaces converted to underscores"
    https://www.python.org/dev/peps/pep-0100/#codecs-coder-decoders-lookup

    But codecs.register() says:
    "Search functions are expected to take one argument, being the encoding name in all lower case letters".

    I don't know historical reason why two document are inconsistent.
    https://docs.python.org/3/library/codecs.html#codecs.register

    @methane
    Copy link
    Member

    methane commented Apr 23, 2021

    codecs.register() was added in this commit.
    e2d67f9

    And its docstring has been added in this commit.
    0ae2981

    Both commits doesn't describe why normalization was differ from PEP-100.

    @malemburg
    Copy link
    Member

    On 23.04.2021 03:37, akdor1154 wrote:

    akdor1154 <akdor1154@gmail.com> added the comment:

    If I understand the target of this issue, this is a breaking change in python 3.9 .

    E.g. see SAP-archive/PyHDB#149

    Ideally if we did not intend to break libraries then can this be fixed?
    Or if it is acceptable to have such a breaking change, can it be documented as such? (maybe this is #23096 ? though I would not interpret that as a breaking change at first glance)

    This patch only restored the behavior we had before (and for many many
    years). It's not breaking, it's in fact resolving a break which was
    caused by earlier:

    https://bugs.python.org/issue37751#msg349448

    Please note that search functions determine how to map codec names
    to codec implementations. The codec search function in the encodings
    package uses one way to do this (and depends on how the package
    is structured).

    The approach taken by the encodings search function is listed here:
    https://github.com/python/cpython/blob/master/Lib/encodings/__init__.py#L43

    Other search functions can work in different ways.

    Now, unfortunately, parts of this kind of normalization have also made
    its way into the codecs module itself and into the Unicode
    implementation and perhaps not always in a way which allows search
    functions to use a different approach or which is consistent.

    As I mentioned before, the safest way to go about this is to use
    alnum only names for codecs, with the addition of underscores to
    separate important parts.

    The Python implementation should make sure that such names continue
    to work when passed through any codec name normalization.

    @malemburg malemburg changed the title In codecs, function 'normalizestring' should convert both spaces and hyphens to underscores. In codecs, function 'normalizestring' should convert both spaces and hyphens to underscores. Apr 23, 2021
    @malemburg
    Copy link
    Member

    On 23.04.2021 07:47, Inada Naoki wrote:

    Inada Naoki <songofacandy@gmail.com> added the comment:

    I think it is too late. Python 3.9 has been released already. Reverting the change is also breaking change.

    PEP-100 says:
    "Search functions are expected to take one argument, the encoding name in all lower case letters and with hyphens and spaces converted to underscores"
    https://www.python.org/dev/peps/pep-0100/#codecs-coder-decoders-lookup

    But codecs.register() says:
    "Search functions are expected to take one argument, being the encoding name in all lower case letters".

    I don't know historical reason why two document are inconsistent.
    https://docs.python.org/3/library/codecs.html#codecs.register

    I guess just an oversight on my part.

    PEP-100 is certainly what I meant and implemented. I should have also
    made it clear in PEP-100 that I meant lower case ASCII letters.

    @vstinner
    Copy link
    Member

    New changeset 32980fb by Hai Shi in branch 'master':
    bpo-37751: Document codecs.lookup() change in What's New in Python 3.9 (GH-23096)
    32980fb

    @shihai1991
    Copy link
    Member

    Thanks Marc-Andre for your supplement of PEP-100.
    Thanks Inada, victor for your review and merge.

    After PR-23096 merged, I suggest to close this bpo.
    If there have any other questions, we can reopen it again.

    @vstinner
    Copy link
    Member

    New changeset 531c810 by Miss Islington (bot) in branch '3.9':
    bpo-37751: Document codecs.lookup() change in What's New in Python 3.9 (GH-23096) (GH-25659)
    531c810

    @methane
    Copy link
    Member

    methane commented Apr 28, 2021

    New changeset 5c84bb5 by Inada Naoki in branch 'master':
    bpo-37751: Update codecs.register() doc. (GH-25643)
    5c84bb5

    @methane
    Copy link
    Member

    methane commented Apr 28, 2021

    New changeset cf9d65c by Miss Islington (bot) in branch '3.9':
    bpo-37751: Update codecs.register() doc. (GH-25643)
    cf9d65c

    @vstinner
    Copy link
    Member

    Thanks Inada-san for documenting the change in codecs.register() doc!

    @bodograumann
    Copy link
    Mannequin

    bodograumann mannequin commented Jun 13, 2021

    Unfortunately this is not quite finished yet.

    First of all, the change is bigger than what is documented: “Changed in version 3.9: Hyphens and spaces are converted to underscore.“

    In reality, now
    | Normalization works as follows: all non-alphanumeric
    | characters except the dot used for Python package names are
    | collapsed and replaced with a single underscore, e.g. ' -;#'
    | becomes '_'. Leading and trailing underscores are removed.”
    Cf. encodings/init.py

    Secondly, this change breaks lots of iconv codecs with the python-iconv binding. E.g. ASCII//TRANSLIT is now normalized to ascii_translit, which iconv does not understand. Codec names which use hyphens also break and iinm not all of them have aliases in iconv without hyphens.
    Cf. python-iconv #4

    The codecs api feels extremely well-fitting for integrating iconv in python and any alternative I can think of seems unsatisfactory.
    Please advise.

    @vstinner
    Copy link
    Member

    The codecs api feels extremely well-fitting for integrating iconv in python and any alternative I can think of seems unsatisfactory.

    This issue is now closed, would you mind to open a new issue?

    @gpshead
    Copy link
    Member

    gpshead commented Jan 25, 2022

    https://bugs.python.org/issue46508 filed to track fixing the acceptance and use of garbage codec values regression that this caused.

    @gpshead
    Copy link
    Member

    gpshead commented Jan 25, 2022

    (note: this might not be the true cause of that issue; though it sounds potentially related - I haven't investigated far enough yet)

    @gpshead
    Copy link
    Member

    gpshead commented Jan 25, 2022

    note that Bodo's own followup issue about the breaking change for python-iconv was filed as https://bugs.python.org/issue44723

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.9 only security fixes topic-unicode type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    5 participants