Message 369877 - Python tracker


Author	remi.lapeyre
Recipients	remi.lapeyre, serhiy.storchaka, sidhant
Date	2020-05-25.14:05:31
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1590415532.57.0.540581335385.issue40762@roundup.psfhosted.org>
In-reply-to

Content

> in real-life that b-prefixed string is just not readable by another program in an easy way

If another program opens this CSV file, it will read the string "b'A'", which is what this field actually contains. Everything that is not a number or a string gets converted to a string with str():

In [1]: import collections, dataclasses, secrets, io, csv
   ...:
   ...: Point = collections.namedtuple('Point', 'x y')
   ...:
   ...: @dataclasses.dataclass
   ...: class Valar:
   ...:     name: str
   ...:     age: int
   ...:
   ...: a = Point(1, 2)
   ...: b = Valar('Melkor', 2900)
   ...: c = secrets.token_bytes(4)
   ...:
   ...: out = io.StringIO()
   ...: f = csv.writer(out)
   ...: f.writerow((a, b, c))
   ...:
   ...: out.seek(0)
   ...: print(out.read())
   ...:
"Point(x=1, y=2)","Valar(name='Melkor', age=2900)",b'\x95g6\xa2'

Here another program would find three fields, all strings: "Point(x=1, y=2)", "Valar(name='Melkor', age=2900)" and "b'\x95g6\xa2'". Would you expect to get actual objects instead of strings when reading the first two fields?
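
To make the reading side concrete, here is a minimal sketch of the round trip through csv.reader. The sample row (an int, a str and the bytes b'A') is my own choice for illustration. Every field comes back as a plain string, including the number:

```python
import csv
import io

# Write one row containing an int, a str and a bytes object.
out = io.StringIO()
csv.writer(out).writerow((1, "text", b"A"))

# Read the same row back: csv.reader only ever yields strings.
out.seek(0)
row = next(csv.reader(out))
print(row)  # ['1', 'text', "b'A'"]
```

The bytes field is not special here: the reader returns whatever text the writer produced, and it is up to the consumer to interpret it.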


> Incase it fails to decode using that, then it will throw a UnicodeDecodeError

I read your PR, but successfully decoding the bytes does not mean the result is correct:

   In [4]: b'r\xc3\xa9sum\xc3\xa9'.decode('latin')
   Out[4]: 'rÃ©sumÃ©'

It worked, but is it the appropriate encoding? Probably not:

   In [5]: b'r\xc3\xa9sum\xc3\xa9'.decode('utf8')
   Out[5]: 'résumé'
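
The same point as a small script, as a sketch rather than anything taken from the PR: Latin-1 assigns a character to every possible byte value, so decoding with it can never raise UnicodeDecodeError, whatever the data really is. The absence of an exception therefore says nothing about whether the encoding was the right one.

```python
# UTF-8 encoded text, as in the example above.
data = "résumé".encode("utf-8")

as_latin1 = data.decode("latin-1")  # succeeds, but produces mojibake
as_utf8 = data.decode("utf-8")      # succeeds and is the intended text
print(as_latin1, as_utf8)

# Latin-1 decodes every byte value, so it cannot signal a wrong guess.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("latin-1")

# UTF-8 rejects the same input, since not every byte sequence is valid UTF-8.
try:
    all_bytes.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True
```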



If you want to be able to save bytes, the best way is to use a format that can round-trip bytes, such as Parquet (here through pandas, imported as pd):

    In [18]: df = pd.DataFrame.from_dict({'a': [b'a']})                                                                                                                    

    In [19]: df.to_parquet('foo.parquet')                                                                                                                                  

    In [20]: type(pd.read_parquet('foo.parquet')['a'][0])                                                                                                                  
    Out[20]: bytes
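
If the data has to stay in CSV, one alternative is to pick an explicit text encoding for the bytes yourself and undo it when reading. This is not something the csv module does for you, and it is not what the PR proposes. It is only a sketch, assuming Base64 is acceptable to whoever consumes the file; the "blob" label and the payload are made up for the example.

```python
import base64
import csv
import io

payload = b"\x95g6\xa2"

# Encode the bytes to ASCII text explicitly before handing them to csv.
out = io.StringIO()
csv.writer(out).writerow(["blob", base64.b64encode(payload).decode("ascii")])

# The reader returns strings; decoding Base64 restores the original bytes.
out.seek(0)
name, encoded = next(csv.reader(out))
restored = base64.b64decode(encoded)
print(restored == payload)  # True
```

The consumer still has to know that the column holds Base64, which is exactly the kind of out-of-band agreement CSV cannot express by itself.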

History
Date	User	Action	Args
2020-05-25 14:05:32	remi.lapeyre	set	recipients: + remi.lapeyre, serhiy.storchaka, sidhant
2020-05-25 14:05:32	remi.lapeyre	set	messageid: <1590415532.57.0.540581335385.issue40762@roundup.psfhosted.org>
2020-05-25 14:05:32	remi.lapeyre	link	issue40762 messages
2020-05-25 14:05:31	remi.lapeyre	create