classification
Title: Calling TextIOWrapper.tell() in the middle of reading a gb2312-encoded file causes UnicodeDecodeError
Type: crash Stage: resolved
Components: IO, Unicode Versions: Python 3.7
process
Status: closed Resolution: duplicate
Dependencies: Superseder: cjkcodecs missing getstate and setstate implementations
View: 33578
Assigned To: Nosy List: ezio.melotti, malin, methane, rmalouf, terry.reedy
Priority: normal Keywords:

Created on 2020-04-28 04:33 by rmalouf, last changed 2020-05-03 09:14 by terry.reedy. This issue is now closed.

Files
File name Uploaded Description
udhr-gb2312.txt rmalouf, 2020-04-28 04:33 GB2312-encoded file to demonstrate error
Messages (8)
msg367494 - (view) Author: Rob Malouf (rmalouf) * Date: 2020-04-28 04:33
Calling TextIOWrapper.tell() while reading the attached gb2312-encoded file like this:

with open('udhr-gb2312.txt', encoding='GB2312') as f:
    while True:
        line = f.readline()
        t = f.tell()
        if not line:
            break

gives this result:

Traceback (most recent call last):
  File "test.py", line 4, in <module>
    t = f.tell()
UnicodeDecodeError: 'gb2312' codec can't decode byte 0xb5 in position 0: illegal multibyte sequence

The file seems to be well-formed and can be read without any problem. Only the call to tell() raises an error.
msg367894 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2020-05-01 23:04
Which OS are you on? Asking in case it matters.
msg367896 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2020-05-01 23:07
Change the line to 'print(f.tell())'.  Are any lines printed before the error?
msg367942 - (view) Author: Rob Malouf (rmalouf) * Date: 2020-05-02 17:30
Same results on macOS 10.15.4 (both the system Python and the Intel/Anaconda version) and on CentOS 7.8.

Here's the output with print(...):

13
71
72
392
393
399
536
537
761
762
879
880
933
934
1146
1147
1254
1255
1359
1360
1760
1761
1772
1895
1897
1906
2105
2107
2338
2339
2348
2398
2399
2408
2509
2510
2519
2612
2614
2622
2682
2684
2693
2898
2900
2909
3050
3052
3061
3113
3115
3124
3295
3297
3309
3445
3632
3644
3814
3816
3828
3882
3967
3979
4048
4184
4196
4226
4308
4320
4492
4559
4641
4653
4728
4770
4782
4999
5001
5013
5202
5204
5216
5270
5318
5333
5411
5465
5672
5687
5953
5954
5969
6082
6137
6307
6373
6388
6494
6496
6511
6786
6913
6928
7148
7371
7447
7462
7569
7704
7719
7847
7848
7863
7972
8238
8342
Traceback (most recent call last):
  File "test.py", line 4, in <module>
    print(f.tell())
UnicodeDecodeError: 'gb2312' codec can't decode byte 0xb5 in position 0: illegal multibyte sequence
msg367953 - (view) Author: Ma Lin (malin) * Date: 2020-05-03 03:18
On Windows 10 with Python 3.7, I get the same error as in the reply above.

With Python 3.8, it works fine.
msg367955 - (view) Author: Ma Lin (malin) * Date: 2020-05-03 05:00
I did a git bisect; this commit fixed the bug:

https://github.com/python/cpython/commit/ac22f6aa989f18c33c12615af1c66c73cf75d5e7
msg367958 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2020-05-03 08:57
I think this is not a bug but a limitation of Python 3.7, which was improved in 3.8.
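
For context, the improvement in question concerns the state of the incremental decoder: to resume decoding from a saved position, the decoder needs to report any bytes of a partially read multibyte character. The following is a minimal sketch of that behaviour, assuming Python 3.8 or later; the sample character and the variable names are illustrative and are not taken from the attached file.

```python
import codecs

# One GB2312 character occupies two bytes; split it across two reads.
raw = "你".encode("gb2312")

decoder = codecs.getincrementaldecoder("gb2312")()
partial = decoder.decode(raw[:1])  # only the lead byte so far, so no text yet
state = decoder.getstate()         # on 3.8+: (pending bytes, integer state)

# A fresh decoder restored from the saved state can finish the character.
restored = codecs.getincrementaldecoder("gb2312")()
restored.setstate(state)
completed = restored.decode(raw[1:])

print(repr(partial), state, repr(completed))
```

On 3.8+ the pending lead byte should appear in the first element of the state tuple, so the position can be reconstructed later. Without that information the second byte of a character is decoded on its own, which is consistent with the "illegal multibyte sequence" error reported here.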
msg367961 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2020-05-03 09:14
The commit referenced above is for #33578.  The symptoms for that issue were very similar, including involving a cjk codec.  The change was not backported because it was seen as an enhancement.  Rob, if you try 3.8.2 or 3.8.3 (the release candidate was out Wednesday, the final probably next week or so) and still have the same problem, re-open this.
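
Since the change was not backported, anyone who has to stay on 3.7 may want a way to get positions without calling tell() on the text wrapper. One possible workaround, offered only as a sketch, is to read in binary mode and decode each line by hand. The helper name lines_with_offsets and the temporary test file are hypothetical. The approach assumes that the byte b"\n" never occurs inside a multibyte GB2312 character, which holds for Python's gb2312 codec because both bytes of a two-byte character are above 0xA0.

```python
import os
import tempfile


def lines_with_offsets(path, encoding="gb2312"):
    """Yield (text, byte_offset) pairs; the offset is the position after the line."""
    with open(path, "rb") as f:  # binary mode: tell() is a plain byte count
        while True:
            raw = f.readline()   # splits on b"\n"
            if not raw:
                break
            yield raw.decode(encoding), f.tell()


# Self-check against a throwaway file (not the attached udhr-gb2312.txt).
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write("你好\n世界\n".encode("gb2312"))
result = list(lines_with_offsets(tmp.name))
os.remove(tmp.name)
print(result)
```

Two differences from the text-mode loop are worth noting: the offsets are real byte positions usable with seek() on a binary file, rather than the opaque values that TextIOWrapper.tell() returns, and no newline translation is done, so "\r\n" line endings stay in the decoded text.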
History
Date User Action Args
2020-05-03 09:14:03 terry.reedy set status: open -> closed; superseder: cjkcodecs missing getstate and setstate implementations; messages: + msg367961; resolution: duplicate; stage: resolved
2020-05-03 08:57:48 methane set nosy: + methane; messages: + msg367958
2020-05-03 05:00:26 malin set messages: + msg367955
2020-05-03 03:18:53 malin set nosy: + malin; messages: + msg367953
2020-05-02 17:30:34 rmalouf set messages: + msg367942
2020-05-01 23:07:46 terry.reedy set messages: + msg367896
2020-05-01 23:04:00 terry.reedy set nosy: + terry.reedy; messages: + msg367894
2020-04-28 12:46:41 vstinner set nosy: - vstinner
2020-04-28 04:33:56 rmalouf create