classification
Title: CSV sniffing falsely detects space as a delimiter
Type: behavior Stage: patch review
Components: Library (Lib) Versions: Python 3.8
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: pt12lol, python-dev, rhettinger
Priority: normal Keywords: patch

Created on 2021-07-19 17:40 by pt12lol, last changed 2021-07-22 02:15 by rhettinger.

Pull Requests
URL Status Linked
PR 27256 open python-dev, 2021-07-20 09:27
Messages (4)
msg397821 - Author: Piotr Tokarski (pt12lol) Date: 2021-07-19 17:40
Consider the following CSV content: "a|b\nc| 'd\ne|' f". The real delimiter here is the '|' character, but ' ' is sniffed instead. A verbose example is attached.

The problem lies in the following code in csv.py:

```
        matches = []
        for restr in (r'(?P<delim>[^\w\n"\'])(?P<space> ?)(?P<quote>["\']).*?(?P=quote)(?P=delim)', # ,".*?",
                      r'(?:^|\n)(?P<quote>["\']).*?(?P=quote)(?P<delim>[^\w\n"\'])(?P<space> ?)',   #  ".*?",
                      r'(?P<delim>[^\w\n"\'])(?P<space> ?)(?P<quote>["\']).*?(?P=quote)(?:$|\n)',   # ,".*?"
                      r'(?:^|\n)(?P<quote>["\']).*?(?P=quote)(?:$|\n)'):                            #  ".*?" (no delim, no space)
            regexp = re.compile(restr, re.DOTALL | re.MULTILINE)
            matches = regexp.findall(data)
            if matches:
                break
```

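For this sample, `matches` ends up non-empty, and further processing continues with the delimiter falsely set to ' '. Because of re.DOTALL, `.*?` can span the newline, so the first pattern matches " 'd\ne|' " and captures ' ' as the delimiter and "'" as the quote character.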
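To see which pattern fires, the four expressions can be run against the sample in isolation. The following is a minimal sketch that copies the patterns from the snippet above; the names `PATTERNS` and `first_match` are illustrative only and do not exist in csv.py.

```python
import re

# Patterns copied from the csv.py snippet quoted above.
PATTERNS = (
    r'(?P<delim>[^\w\n"\'])(?P<space> ?)(?P<quote>["\']).*?(?P=quote)(?P=delim)',
    r'(?:^|\n)(?P<quote>["\']).*?(?P=quote)(?P<delim>[^\w\n"\'])(?P<space> ?)',
    r'(?P<delim>[^\w\n"\'])(?P<space> ?)(?P<quote>["\']).*?(?P=quote)(?:$|\n)',
    r'(?:^|\n)(?P<quote>["\']).*?(?P=quote)(?:$|\n)',
)


def first_match(data):
    """Return (pattern index, match) for the first pattern that matches."""
    for index, restr in enumerate(PATTERNS):
        match = re.compile(restr, re.DOTALL | re.MULTILINE).search(data)
        if match:
            return index, match
    return None, None


data = "a|b\nc| 'd\ne|' f"
index, match = first_match(data)

# The very first pattern matches, with the space captured as the delimiter.
print(index)                   # 0
print(match.groupdict())       # {'delim': ' ', 'space': '', 'quote': "'"}
print(repr(match.group(0)))    # " 'd\ne|' "
```

The matched text starts at the space after `c|` and runs across the line break to the space after `e|'`, which is only possible because `.` is allowed to match `\n`.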
msg397859 - Author: Piotr Tokarski (pt12lol) Date: 2021-07-20 07:36
Test sample:

```
import csv
from io import StringIO


def csv_text():
    return StringIO("a|b\nc| 'd\ne|' f")


with csv_text() as input_file:
    print('The following text is going to be parsed:')
    print(input_file.read())
    print()


with csv_text() as input_file:
    dialect_params = [
        'delimiter',
        'quotechar',
        'escapechar',
        'lineterminator',
        'quoting',
        'doublequote',
        'skipinitialspace'
    ]
    dialect = csv.Sniffer().sniff(input_file.read())
    print('The following dialect has been detected:')
    for dialect_param in dialect_params:
        print(f'- {dialect_param}: {repr(getattr(dialect, dialect_param))}')
    print()


with csv_text() as input_file:
    print('Parsed csv text:')
    for entry in csv.reader(input_file, dialect=dialect):
        print(f'- {entry}')
    print()
```

Actual output:

```
The following text is going to be parsed:
a|b
c| 'd
e|' f

The following dialect has been detected:
- delimiter: ' '
- quotechar: "'"
- escapechar: None
- lineterminator: '\r\n'
- quoting: 0
- doublequote: False
- skipinitialspace: False

Parsed csv text:
- ['a|b']
- ['c|', 'd\ne|', 'f']

```

Expected output:

```
The following text is going to be parsed:
a|b
c| 'd
e|' f

The following dialect has been detected:
- delimiter: '|'
- quotechar: '"'
- escapechar: None
- lineterminator: '\r\n'
- quoting: 0
- doublequote: False
- skipinitialspace: False

Parsed csv text:
- ['a', 'b']
- ['c', " 'd"]
- ['e', "' f"]

```
msg397860 - Author: Piotr Tokarski (pt12lol) Date: 2021-07-20 07:42
I think changing `(?P<quote>["\']).*?(?P=quote)` to `(?P<quote>["\'])[^\n]*?(?P=quote)` in all regexes does the trick, doesn't it?
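The effect of that substitution on the regex stage can be checked in isolation. This is a rough sketch, not the code from the linked PR: `ORIGINAL`, `PATCHED` and `find_all` are hypothetical names, and only the pattern-matching loop is reproduced, not the rest of the sniffer.

```python
import re

QUOTED_OLD = r'(?P<quote>["\']).*?(?P=quote)'
QUOTED_NEW = r'(?P<quote>["\'])[^\n]*?(?P=quote)'

# Patterns as quoted from csv.py earlier in this issue.
ORIGINAL = (
    r'(?P<delim>[^\w\n"\'])(?P<space> ?)(?P<quote>["\']).*?(?P=quote)(?P=delim)',
    r'(?:^|\n)(?P<quote>["\']).*?(?P=quote)(?P<delim>[^\w\n"\'])(?P<space> ?)',
    r'(?P<delim>[^\w\n"\'])(?P<space> ?)(?P<quote>["\']).*?(?P=quote)(?:$|\n)',
    r'(?:^|\n)(?P<quote>["\']).*?(?P=quote)(?:$|\n)',
)

# Apply the proposed substitution to every pattern.
PATCHED = tuple(p.replace(QUOTED_OLD, QUOTED_NEW) for p in ORIGINAL)


def find_all(patterns, data):
    """Mimic the loop in csv.py: return findall() of the first pattern that hits."""
    for restr in patterns:
        matches = re.compile(restr, re.DOTALL | re.MULTILINE).findall(data)
        if matches:
            return matches
    return []


sample = "a|b\nc| 'd\ne|' f"
print(find_all(ORIGINAL, sample))   # [(' ', '', "'")]
print(find_all(PATCHED, sample))    # []

# A conventionally quoted single-line field is still recognised.
print(find_all(PATCHED, 'a,"b c",d\n'))   # [(',', '', '"')]
```

With no regex match, the sniffer would have to rely on its other delimiter heuristics for this sample. One trade-off worth noting: with `[^\n]*?`, these patterns can no longer recognise a quoted field that legitimately contains a newline.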
msg397974 - Author: Raymond Hettinger (rhettinger) (Python committer) Date: 2021-07-22 02:15
Changing the sniffer logic is risky because it may break existing code that relies on the current predictions.

FWIW, in your example, the sniffer gets the desired result if given a delimiter hint:

>>> import pprint
>>> from csv import Sniffer
>>> s = "a|b\nc| 'd\ne|' f"
>>> pprint.pp(dict(vars(Sniffer().sniff(s, '|'))))
{'__module__': 'csv',
 '_name': 'sniffed',
 'lineterminator': '\r\n',
 'quoting': 0,
 '__doc__': None,
 'doublequote': False,
 'delimiter': '|',
 'quotechar': "'",
 'skipinitialspace': False}
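As a script, the same workaround might look like the sketch below. The `delimiters` argument is the documented hint parameter of `Sniffer.sniff`. Passing `quoting=csv.QUOTE_NONE` to the reader is an extra assumption that is not part of this thread: it presumes the data contains no quoted fields, so that the sniffed quotechar "'" is ignored while parsing.

```python
import csv
from io import StringIO

sample = "a|b\nc| 'd\ne|' f"

# Restrict the candidate delimiters so that ' ' cannot be chosen.
dialect = csv.Sniffer().sniff(sample, delimiters='|')
print(repr(dialect.delimiter))   # '|'

# Assumption: the data has no quoted fields, so quote handling is turned off.
rows = list(csv.reader(StringIO(sample), dialect=dialect, quoting=csv.QUOTE_NONE))
for row in rows:
    print(row)
```

Under that assumption the rows come out as `['a', 'b']`, `['c', " 'd"]` and `['e', "' f"]`, which matches the expected parse given earlier in the thread.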
History
Date User Action Args
2021-07-22 02:15:15 rhettinger set nosy: + rhettinger; messages: + msg397974
2021-07-20 09:27:36 python-dev set keywords: + patch; nosy: + python-dev; pull_requests: + pull_request25801; stage: patch review
2021-07-20 07:42:40 pt12lol set messages: + msg397860
2021-07-20 07:36:23 pt12lol set messages: + msg397859
2021-07-19 17:40:33 pt12lol create