Rietveld Code Review Tool

Side by Side Diff: Lib/codecs.py

Issue 20132: Many incremental codecs don’t handle fragmented data
Patch Set: Created 5 years, 5 months ago
OLD NEW
1 """ codecs -- Python Codec Registry, API and helpers. 1 """ codecs -- Python Codec Registry, API and helpers.
2 2
3 3
4 Written by Marc-Andre Lemburg (mal@lemburg.com). 4 Written by Marc-Andre Lemburg (mal@lemburg.com).
5 5
6 (c) Copyright CNRI, All Rights Reserved. NO WARRANTY. 6 (c) Copyright CNRI, All Rights Reserved. NO WARRANTY.
7 7
8 """#" 8 """#"
9 9
10 import builtins, sys 10 import builtins, sys
(...skipping 431 matching lines...)
442 self.linebuffer = None 442 self.linebuffer = None
443 443
444 def decode(self, input, errors='strict'): 444 def decode(self, input, errors='strict'):
445 raise NotImplementedError 445 raise NotImplementedError
446 446
447 def read(self, size=-1, chars=-1, firstline=False): 447 def read(self, size=-1, chars=-1, firstline=False):
448 448
449 """ Decodes data from the stream self.stream and returns the 449 """ Decodes data from the stream self.stream and returns the
450 resulting object. 450 resulting object.
451 451
452 chars indicates the number of decoded code points or bytes to 452 size indicates the approximate chunk size of decoded
453 return. read() will never return more data than requested,
454 but it might return less, if there is not enough available.
455
456 size indicates the approximate maximum number of decoded
457 bytes or code points to read for decoding. The decoder 453 bytes or code points to read for decoding. The decoder
458 can modify this setting as appropriate. The default value 454 can modify this setting as appropriate. The default value
459 -1 indicates to read and decode as much as possible. size 455 -1 indicates to read and decode as much as possible. size
460 is intended to prevent having to decode huge files in one 456 is intended to prevent having to decode huge files in one
461 step. 457 step.
462 458
459 chars indicates the number of decoded code points or bytes to
460 return. read() will never return more data than requested,
461 but it might return less, if the end of the stream is reached.
462
463 If firstline is true, and a UnicodeDecodeError happens 463 If firstline is true, and a UnicodeDecodeError happens
464 after the first line terminator in the input only the first line 464 after the first line terminator in the input, only the first line
465 will be returned, the rest of the input will be kept until the 465 will be returned; the rest of the input will be kept until the
466 next call to read(). 466 next call to read().
467 467
468 The method should use a greedy read strategy, meaning that 468 The method should use a greedy read strategy, meaning that
469 it should read as much data as is allowed within the 469 it should read as much data as is allowed within the
470 definition of the encoding and the given size, e.g. if 470 definition of the encoding and the given size, e.g. if
471 optional encoding endings or state markers are available 471 optional encoding endings or state markers are available
472 on the stream, these should be read too. 472 on the stream, these should be read too.
473 """ 473 """
474 # If we have lines cached, first merge them back into characters 474 # If we have lines cached, first merge them back into characters
475 if self.linebuffer: 475 if self.linebuffer:
476 self.charbuffer = self._empty_charbuffer.join(self.linebuffer) 476 self.charbuffer = self._empty_charbuffer.join(self.linebuffer)
477 self.linebuffer = None 477 self.linebuffer = None
478 478
479 # read until we get the required number of characters (if available) 479 # read until we get the required number of characters (if available)
480 while True: 480 while True:
(...skipping 111 matching lines...)
592 if not data or size is not None: 592 if not data or size is not None:
593 if line and not keepends: 593 if line and not keepends:
594 line = line.splitlines(keepends=False)[0] 594 line = line.splitlines(keepends=False)[0]
595 break 595 break
596 if readsize < 8000: 596 if readsize < 8000:
597 readsize *= 2 597 readsize *= 2
598 return line 598 return line
599 599
600 def readlines(self, sizehint=None, keepends=True): 600 def readlines(self, sizehint=None, keepends=True):
601 601
602 """ Read all lines available on the input stream 602 """ Read all lines from the input stream and return them as a list.
603 and return them as a list.
604 603
605 Line breaks are implemented using the codec's decoder 604 The universal newlines approach is used to determine line breaks.
606 method and are included in the list entries.
607 605
608 sizehint, if given, is ignored since there is no efficient 606 sizehint, if given, is ignored since there is no efficient
609 way to finding the true end-of-line. 607 way to finding the true end-of-line.
610 608
611 """ 609 """
612 data = self.read() 610 data = self.read()
613 return data.splitlines(keepends) 611 return data.splitlines(keepends)
614 612
615 def reset(self): 613 def reset(self):
616 614
(...skipping 482 matching lines...)
1099 1097
1100 ### Tests 1098 ### Tests
1101 1099
1102 if __name__ == '__main__': 1100 if __name__ == '__main__':
1103 1101
1104 # Make stdout translate Latin-1 output into UTF-8 output 1102 # Make stdout translate Latin-1 output into UTF-8 output
1105 sys.stdout = EncodedFile(sys.stdout, 'latin-1', 'utf-8') 1103 sys.stdout = EncodedFile(sys.stdout, 'latin-1', 'utf-8')
1106 1104
1107 # Have stdin translate Latin-1 input into UTF-8 input 1105 # Have stdin translate Latin-1 input into UTF-8 input
1108 sys.stdin = EncodedFile(sys.stdin, 'utf-8', 'latin-1') 1106 sys.stdin = EncodedFile(sys.stdin, 'utf-8', 'latin-1')
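
Note on the revised read() and readlines() docstrings: the following is a minimal sketch for checking the documented behaviour, assuming a UTF-8 StreamReader wrapped around an in-memory io.BytesIO. The sample text and variable names are illustrative and not part of the patch. It shows chars capping the number of decoded characters returned, and readlines() splitting the remaining text on several line-ending styles.

```python
import codecs
import io

# Illustrative sample: mixed \r\n and \n line endings plus non-ASCII characters.
text = "h\u00e9llo\r\nw\u00f6rld\nend"
raw = text.encode("utf-8")

# Wrap the byte stream in a UTF-8 StreamReader.
reader = codecs.getreader("utf-8")(io.BytesIO(raw))

# chars limits how many decoded characters are returned; the rest stays buffered.
first = reader.read(chars=3)

# readlines() reads everything that is left and splits it into lines,
# keeping the line endings by default.
rest = reader.readlines()

# With keepends=False the line endings are dropped.
reader2 = codecs.getreader("utf-8")(io.BytesIO(raw))
stripped = reader2.readlines(keepends=False)

print(repr(first))
print(rest)
print(stripped)
```

Because the current implementation of readlines() calls self.read() followed by splitlines(keepends), both "\r\n" and "\n" are treated as line breaks here, which matches the "universal newlines" wording in the new docstring.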