classification
Title: Unable to parse CSV in Latin ISO or binary mode
Type: behavior Stage: resolved
Components: Unicode Versions: Python 3.7
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: Yhojann Aguilera, ezio.melotti, rhettinger, vstinner
Priority: normal Keywords:

Created on 2019-08-29 22:17 by Yhojann Aguilera, last changed 2019-08-30 04:46 by Yhojann Aguilera. This issue is now closed.

Messages (5)
msg350836 - Author: Yhojann Aguilera (Yhojann Aguilera) Date: 2019-08-29 22:17
Unable to parse a CSV file with a Latin ISO charset.

import csv

with open('./exported.csv', newline='') as csvFileHandler:
    csvHandler = csv.reader(csvFileHandler, delimiter=';', quotechar='"')
    for line in csvHandler:
        ...  # loop body omitted in the original report

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd1 in position 1032: invalid continuation byte

I tried opening the file in binary mode, but open() says: binary mode doesn't take a newline argument. I then replaced the newline value with bytes, newline=b'', but it says: open() argument 6 must be str or None, not bytes. I then removed the newline argument and got: _csv.Error: iterator should return strings, not bytes (did you open the file in text mode?).

So the csv module does not support binary read mode. I then tried a Latin ISO encoding:

with open('./exported.csv', mode='r', encoding='ISO-8859', newline='') as csvFileHandler:

UnicodeDecodeError: 'charmap' codec can't decode byte 0xd1 in position 1032: character maps to <undefined>

But the charset is latin iso:

$ file exported.csv 
exported.csv: ISO-8859 text, with very long lines, with CRLF line terminators

I then changed the encoding to ISO-8859-8:

UnicodeDecodeError: 'charmap' codec can't decode byte 0xd1 in position 1032: character maps to <undefined>

I am unable to load the file. Why not provide an option to work in binary mode? The delimiters can be represented as byte values.
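Both decode errors above can be reproduced on a single byte string. The sketch below assumes the file holds Spanish text in which byte 0xD1 stands for "Ñ"; the sample bytes and the try_decode helper are invented for illustration and are not taken from exported.csv.

```python
# Made-up sample: byte 0xD1 followed by an ASCII letter, as in Latin-1 Spanish text.
raw = b"ESPA\xd1A;2019\r\n"


def try_decode(data, encoding):
    """Return the decoded text, or the exception class name if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return type(exc).__name__


# utf-8 treats 0xD1 as the start of a two-byte sequence; iso-8859-8 leaves 0xD1 unassigned;
# latin-1 assigns a character to every byte value, so it cannot fail.
results = {enc: try_decode(raw, enc) for enc in ("utf-8", "iso-8859-8", "latin-1")}
for enc, outcome in results.items():
    print(enc, "->", repr(outcome))
```

Under that assumption, only the latin-1 attempt yields text, which would explain why switching between the other two encodings did not help.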
msg350838 - Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-08-29 22:26
Try passing an "encoding" argument to open():

with open('./exported.csv', newline='', encoding='latin-1') as csvFileHandler:
    ...
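A fuller version of this suggestion might look like the sketch below. It writes a small temporary file as a stand-in for exported.csv (the sample rows are made up), then reads it row by row with the same delimiter and quotechar used in the report.

```python
import csv
import os
import tempfile

# Hypothetical sample data: latin-1 bytes with CRLF line endings.
sample = 'nombre;ciudad\r\n"Mu\xf1oz";"\xd1u\xf1oa"\r\n'.encode("latin-1")

with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

rows = []
# newline="" leaves line-ending handling to the csv module.
with open(path, newline="", encoding="latin-1") as csv_file:
    for row in csv.reader(csv_file, delimiter=";", quotechar='"'):
        rows.append(row)

os.remove(path)  # clean up the temporary stand-in file
print(rows)
```

The file is read incrementally through the text-mode handle, so the rows are produced one at a time rather than after loading the whole file.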
msg350839 - Author: Yhojann Aguilera (Yhojann Aguilera) Date: 2019-08-29 22:35
Thanks, that works fine. Still, why not provide an option to work in binary mode? The delimiters can be represented as byte values. In Python it is difficult to autodetect the character encoding of a file.
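On the autodetection point, one common heuristic is to try strict UTF-8 first and fall back to a single-byte encoding. The sketch below is only that heuristic, not real detection; guess_encoding and its candidate list are hypothetical names chosen for illustration.

```python
def guess_encoding(sample: bytes, candidates=("utf-8", "latin-1")) -> str:
    """Return the first candidate encoding that decodes the sample without error."""
    for enc in candidates:
        try:
            sample.decode(enc)
        except UnicodeDecodeError:
            continue  # this candidate cannot represent the bytes; try the next one
        return enc
    raise ValueError("no candidate encoding could decode the sample")


print(guess_encoding(b"a;\xd1;b"))             # bytes that are not valid UTF-8
print(guess_encoding("a;\u00d1".encode("utf-8")))  # bytes that are valid UTF-8
```

Because latin-1 accepts every byte value, it always succeeds as a last candidate, so a match says only that the bytes are decodable, not that the resulting text is right. A sample cut in the middle of a multi-byte UTF-8 sequence would also be rejected as UTF-8.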
msg350840 - Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-08-29 22:37
There is an option to work with binary:

    with open(filename, 'rb') as f:
        bin_data = f.read()
    str_data = bin_data.decode('latin-1')
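To get from the decoded string to csv rows, the text can be wrapped in io.StringIO. This is a minimal sketch that uses made-up bytes in place of a real file read.

```python
import csv
import io

# Made-up latin-1 bytes standing in for the result of f.read() on a file opened with 'rb'.
bin_data = b'id;name\r\n1;"Pe\xf1a"\r\n'
str_data = bin_data.decode("latin-1")

# newline="" keeps the original line endings so the csv module can interpret them itself.
text_stream = io.StringIO(str_data, newline="")
rows = list(csv.reader(text_stream, delimiter=";", quotechar='"'))
print(rows)
```

This approach holds the whole file in memory twice (bytes and text), so it suits small inputs only.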
msg350852 - Author: Yhojann Aguilera (Yhojann Aguilera) Date: 2019-08-30 04:46
For big files (for example, >= 1 GB), the whole string cannot be loaded into memory; a file stream opened with open() is needed.
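A binary stream can also be decoded lazily by wrapping it in io.TextIOWrapper, which avoids holding the full contents in memory. The sketch below assumes latin-1 data and uses an in-memory io.BytesIO in place of a large file; iter_rows is a hypothetical helper name.

```python
import csv
import io


def iter_rows(binary_stream, encoding="latin-1", delimiter=";"):
    """Yield csv rows from a binary stream, decoding it incrementally."""
    # TextIOWrapper decodes buffered chunks on demand instead of reading everything at once.
    text_stream = io.TextIOWrapper(binary_stream, encoding=encoding, newline="")
    yield from csv.reader(text_stream, delimiter=delimiter, quotechar='"')


# Stand-in for open(path, 'rb') on a large file.
fake_file = io.BytesIO(b'a;b\r\n"\xd1";2\r\n')
rows = list(iter_rows(fake_file))
print(rows)
```

With a real file, the handle from open(path, 'rb') would be passed in place of fake_file. Opening the file directly in text mode with encoding='latin-1' and newline='' streams in the same way.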
History
Date                 User              Action  Args
2019-08-30 04:46:09  Yhojann Aguilera  set     messages: + msg350852
2019-08-29 22:37:48  rhettinger        set     status: open -> closed; resolution: not a bug; messages: + msg350840; stage: resolved
2019-08-29 22:35:24  Yhojann Aguilera  set     messages: + msg350839
2019-08-29 22:26:57  rhettinger        set     nosy: + rhettinger; messages: + msg350838
2019-08-29 22:17:04  Yhojann Aguilera  create