Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UTF-8 incremental decoder doesn't support surrogatepass correctly #68402

Closed
RalfM mannequin opened this issue May 16, 2015 · 20 comments
Closed

UTF-8 incremental decoder doesn't support surrogatepass correctly #68402

RalfM mannequin opened this issue May 16, 2015 · 20 comments
Assignees
Labels
3.7 (EOL) end of life 3.8 only security fixes 3.9 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-unicode type-bug An unexpected behavior, bug, or error

Comments

@RalfM
Copy link
Mannequin

RalfM mannequin commented May 16, 2015

BPO 24214
Nosy @vstinner, @ned-deily, @ezio-melotti, @methane, @serhiy-storchaka, @miss-islington, @tirkarthi, @roufique7
PRs
  • bpo-24214: Fixed the UTF-8 incremental decoder. #12603
  • [3.7] bpo-24214: Fixed the UTF-8 incremental decoder. (GH-12603) #12627
  • bpo-24214: Fixed the UTF-8 and UTF-16 incremental decoders. #14304
  • [3.8] bpo-24214: Fixed the UTF-8 and UTF-16 incremental decoders. (GH-14304) #14368
  • [3.7] bpo-24214: Fixed the UTF-8 and UTF-16 incremental decoders. (GH-14304) #14369
  • Files
  • Demo.txt: File to demonstrate the issue
  • surrogatepass.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/serhiy-storchaka'
    closed_at = <Date 2019-06-25.10:30:35.570>
    created_at = <Date 2015-05-16.22:56:34.256>
    labels = ['interpreter-core', 'type-bug', '3.8', '3.9', '3.7', 'expert-unicode']
    title = "UTF-8 incremental decoder doesn't support surrogatepass correctly"
    updated_at = <Date 2019-07-02.22:34:03.466>
    user = 'https://bugs.python.org/RalfM'

    bugs.python.org fields:

    activity = <Date 2019-07-02.22:34:03.466>
    actor = 'ned.deily'
    assignee = 'serhiy.storchaka'
    closed = True
    closed_date = <Date 2019-06-25.10:30:35.570>
    closer = 'vstinner'
    components = ['Interpreter Core', 'Unicode']
    creation = <Date 2015-05-16.22:56:34.256>
    creator = 'RalfM'
    dependencies = []
    files = ['39400', '43911']
    hgrepos = []
    issue_num = 24214
    keywords = ['patch']
    message_count = 20.0
    messages = ['243376', '271412', '271461', '271839', '339036', '339177', '339199', '346152', '346248', '346273', '346494', '346496', '346503', '346504', '346822', '346838', '346839', '346842', '346844', '347163']
    nosy_count = 9.0
    nosy_names = ['vstinner', 'ned.deily', 'ezio.melotti', 'methane', 'serhiy.storchaka', 'RalfM', 'miss-islington', 'xtreak', 'roufique7']
    pr_nums = ['12603', '12627', '14304', '14368', '14369']
    priority = 'high'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue24214'
    versions = ['Python 3.7', 'Python 3.8', 'Python 3.9']

    @RalfM
    Copy link
    Mannequin Author

    RalfM mannequin commented May 16, 2015

    I have an utf-8 encoded file containing single surrogates. Reading this file, specifying surrgatepass, works fine when I read the whole file with .read(), but raises an UnicodeDecodeError when I read the file line by line:

    ----- start of demo -----

    Python 3.4.3 (v3.4.3:9b73f1c3e601, Feb 24 2015, 22:44:40) [MSC v.1600 64 bit (AM
    D64)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> with open("Demo.txt", encoding="utf-8", errors="surrogatepass") as f:
    ...   s = f.read()
    ...
    >>> "\ud900" in s
    True
    >>> with open("Demo.txt", encoding="utf-8", errors="surrogatepass") as f:
    ...   for line in f:
    ...     pass
    ...
    Traceback (most recent call last):
      File "<stdin>", line 2, in <module>
      File "C:\Python\34x64\lib\codecs.py", line 319, in decode
        (result, consumed) = self._buffer_decode(data, self.errors, final)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 8190: inval
    id continuation byte
    >>>
    ----- end of demo 

    I attached the file used for the demo such that you can reproduce the problem.

    If I change all 0xED bytes in the file to 0xEC (i.e. effectively change all surrogates to non-surrogates), the problem disappears.

    The original file I noticed the problem with was 73 MB. The demo file was derived from the original by removing data around the critical section, keeping the alignment to 16-k-blocks, and then replacing all printable ASCII characters by x.

    If I change the file length by adding or removing 16 bytes to / from the beginning of the demo file, the problem disappears, so alignment seems to be an issue.

    All this seems to indicate that the utf-8 decoder has problems when used for incremental decoding and a single surrogate appears around the block boundary.

    @RalfM RalfM mannequin added topic-unicode type-bug An unexpected behavior, bug, or error labels May 16, 2015
    @RalfM
    Copy link
    Mannequin Author

    RalfM mannequin commented Jul 26, 2016

    I just tested Python 3.6.0a3, and that (mis)behaves exactly like 3.4.3.

    @vstinner
    Copy link
    Member

    Attached patch fixes the UTF-8 decoder to support correctly incremental decoder using surrogatepass error handler.

    The bug occurs when b'\xed\xa4\x80' is decoded in two parts: the first two bytes (b'\xed\xa4'), and then the last byte (b'\x80').

    It works as expected if we decode the first byte (b'\xed') and then the two last bytes (b'\xa4\x80').

    My patch tries to keep best performances for the UTF-8/strict decoder.

    @serhiy: Would you mind to review my patch since you helped to design the fast UTF-8 decoder?

    @vstinner vstinner changed the title Exception with utf-8, surrogatepass and incremental decoding UTF-8 incremental decoder doesn't support surrogatepass correctly Jul 27, 2016
    @serhiy-storchaka
    Copy link
    Member

    The patch slows down decoding up to 20%.

    $ ./python -m timeit -s 'b = b"\xc4\x80"*10000' -- 'b.decode()'
    Unpatched:  10000 loops, best of 3: 50.8 usec per loop
    Patched:    10000 loops, best of 3: 63.3 usec per loop

    And I'm not sure that fixing only for the surrogatepass handler is enough. Other standard error handlers look working, but what if a user handler consumes more then one byte?

    @serhiy-storchaka serhiy-storchaka added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Aug 2, 2016
    @serhiy-storchaka serhiy-storchaka self-assigned this Mar 28, 2019
    @serhiy-storchaka serhiy-storchaka added 3.8 only security fixes 3.9 only security fixes 3.7 (EOL) end of life and removed 3.9 only security fixes labels Mar 28, 2019
    @serhiy-storchaka
    Copy link
    Member

    PR 12603 fixes this issue in more general way and does not affect performance.

    @serhiy-storchaka
    Copy link
    Member

    New changeset 7a465cb by Serhiy Storchaka in branch 'master':
    bpo-24214: Fixed the UTF-8 incremental decoder. (GH-12603)
    7a465cb

    @serhiy-storchaka
    Copy link
    Member

    New changeset bd48280 by Serhiy Storchaka (Miss Islington (bot)) in branch '3.7':
    bpo-24214: Fixed the UTF-8 incremental decoder. (GH-12603) (GH-12627)
    bd48280

    @tirkarthi
    Copy link
    Member

    This change seems to have caused test failure reported in python-hyper/wsproto#126

    from codecs import getincrementaldecoder
    decoder = getincrementaldecoder("utf-8")()
    print(decoder.decode(b'f\xf1\xf6rd', False))

    # With this commit 7a465cb

    ➜ cpython git:(7a465cb) ./python.exe /tmp/foo.py
    f

    Before 7a465cb

    ➜  cpython git:(38f4e468d4) ./python.exe /tmp/foo.py
    Traceback (most recent call last):
      File "/tmp/foo.py", line 3, in <module>
        print(decoder.decode(b'f\xf1\xf6rd', False))
      File "/Users/karthikeyansingaravelan/stuff/python/cpython/Lib/codecs.py", line 322, in decode
        (result, consumed) = self._buffer_decode(data, self.errors, final)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 1: invalid continuation byte

    @vstinner vstinner reopened this Jun 20, 2019
    @vstinner
    Copy link
    Member

    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 1: invalid continuation byte

    Python is right: b'f\xf1\xf6rd' is not a valid UTF-8 string:

    $ python3
    Python 3.7.3 (default, May 11 2019, 00:38:04) 
    >>> b'f\xf1\xf6rd'.decode()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 1: invalid continuation byte

    This change is deliberate: it makes UTF-8 incremental decoder correct (respect the UTF-8 standard). I close the issue.

    @serhiy-storchaka
    Copy link
    Member

    Victor, I think you misunderstood the issue. The problem is not that a decoding error is raised. The problem is that the incremental decoder no longer raises where it raised before.

    I think that both behavior may be correct, and that it is better to not rely on ability of the incremental decoder with final=False to detect an invalid encoded data, but I see now that it is possible to fix for the original issue more carefully, without changing that behavior. PR 14304 does this.

    It also change the UTF-16 incremental decoder with the surrogatepass error handler to return a non-empty data when decode a low surrogate with final=False. It is not necessary, but it makes all UTF-* decoders consistent and makes tests simpler.

    @serhiy-storchaka serhiy-storchaka added the 3.9 only security fixes label Jun 22, 2019
    @serhiy-storchaka
    Copy link
    Member

    New changeset 894263b by Serhiy Storchaka in branch 'master':
    bpo-24214: Fixed the UTF-8 and UTF-16 incremental decoders. (GH-14304)
    894263b

    @miss-islington
    Copy link
    Contributor

    New changeset d32594a by Miss Islington (bot) in branch '3.8':
    bpo-24214: Fixed the UTF-8 and UTF-16 incremental decoders. (GH-14304)
    d32594a

    @vstinner
    Copy link
    Member

    New changeset c755ca8 by Victor Stinner (Miss Islington (bot)) in branch '3.7':
    [3.7] bpo-24214: Fixed the UTF-8 and UTF-16 incremental decoders. (GH-14304) (GH-14369)
    c755ca8

    @vstinner
    Copy link
    Member

    Thanks Karthikeyan Singaravelan for the bug report of the regression. You're right that I misunderstood it. Thanks Serhiy for the second fix. The regression should now be fixed as well, I close the issue again.

    @ned-deily
    Copy link
    Member

    Since the latest fix (GH-14369) addresses a regression between 3.7.3 and 3.7.4rc1, I am going to cherry-pick it into 3.7.4 final. (In retrospect, this regression should have been marked as a "release blocker".)

    @tirkarthi
    Copy link
    Member

    Thanks Ned, is there a general policy on regressions to be marked as release blocker, like if a regression that was made in 3.7.3 then it acts as a release blocker for 3.7.4 or is it based on severity? ?

    @ned-deily
    Copy link
    Member

    "The only changes allowed to occur in a maintenance branch without debate are bug fixes. Also, a general rule for maintenance branches is that compatibility must not be broken at any point between sibling minor releases (3.5.1, 3.5.2, etc.). For both rules, only rare exceptions are accepted and must be discussed first."

    https://devguide.python.org/devcycle/#maintenance-branches

    The principle here is that we "promise" users that they can upgrade from any version of 3.n.x to the latest version of 3.n without needing to make any changes except in very rare cases when there is an overriding concern and which must be well-documented in the release materials. We're human so we sometimes slip up and inadvertently break that promise but that's the goal. And it's because of that promise that we can take the approach of immediately obsoleting previous older micro releases when a new micro release occurs, i.e. we don't provide fixes for 3.7.2 once 3.7.3 is released.

    So, from my perspective, pretty much *any* regression between micro releases is a release blocker but especially in a case like this where it can be addressed before a final release. That's basically why we do release candidates :)

    @tirkarthi
    Copy link
    Member

    Thanks Ned for the details, in future reports I will also try to add the respective release manager to the issue in case of regressions.

    @ned-deily
    Copy link
    Member

    @XTreak, If you set the priority to "release blocker", the appropriate release managers will be nosied automatically. I'd rather see too many potential "release blocker"s than too few.

    @ned-deily
    Copy link
    Member

    New changeset 30c2ae4 by Ned Deily (Miss Islington (bot)) in branch '3.7':
    [3.7] bpo-24214: Fixed the UTF-8 and UTF-16 incremental decoders. (GH-14304) (GH-14369)
    30c2ae4

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 (EOL) end of life 3.8 only security fixes 3.9 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-unicode type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    5 participants