msg329490 - (view) |
Author: Dragoljub (Dragoljub) |
Date: 2018-11-08 22:56 |
xref: https://github.com/pandas-dev/pandas/issues/23516
Example:
import io
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(1000000, 10), columns=('COL{}'.format(i) for i in range(10)))
csv = io.StringIO(df.to_csv(index=False))
df2 = pd.read_csv(csv) #3.5X slower on Python 3.7.1
pd.read_csv() reads data at 30MB/sec on Python 3.7.1 while at 100MB/sec on Python 3.6.7.
This issue seems to be only present on Windows 10 Builds both x86 & x64.
Possibly some IO changes in Python 3.7 could have contributed to this slowdown on Windows but not on Linux?
|
msg329545 - (view) |
Author: Dragoljub (Dragoljub) |
Date: 2018-11-09 18:29 |
After some more benchmarks I'm seeing this line of code called in Python 3.7 but not in Python 3.5:
{built-in method _thread.allocate_lock}
|
msg329573 - (view) |
Author: Dragoljub (Dragoljub) |
Date: 2018-11-10 00:12 |
After some more digging it appears that we see the 3.5x slowdown manifest in Python 3.7.0a4 and is not present in Python 3.7.0a3.
One guess is that
https://docs.python.org/3.7/whatsnew/changelog.html#python-3-7-0-alpha-4
bpo-29240: Add a new UTF-8 mode: implementation of the PEP 540
may contribute to this slowdown on windows. Is there a way to ensure we disable any native to UTF conversion that may be happening in Python 3.7.a4?
|
msg329575 - (view) |
Author: Karthikeyan Singaravelan (xtreak) * |
Date: 2018-11-10 00:19 |
From the PEP 540
This mode is off by default, but is automatically activated when using the "POSIX" locale.
Add the -X utf8 command line option and PYTHONUTF8 environment variable to control UTF-8 Mode.
https://docs.python.org/3.7/using/cmdline.html#envvar-PYTHONUTF8
I think you can set it to 0 from the docs yo see if it has any effect.
|
msg329587 - (view) |
Author: Dragoljub (Dragoljub) |
Date: 2018-11-10 02:20 |
I tried playing around with the UTF-8 mode settings but did not get a speed improvement.
After reading through the PEP it appears that on Windoes:
"To allow for better cross-platform binary portability and to adjust automatically to future changes in locale availability, these checks will be implemented at runtime on all platforms other than Windows, rather than attempting to determine which locales to try at compile time."
So if i'm understanding this correctly the locale coercion would not be controllable from Windows after Python is compiled?
|
msg329589 - (view) |
Author: Karthikeyan Singaravelan (xtreak) * |
Date: 2018-11-10 02:36 |
I have limited understanding of Windows and I don't have access to a Windows machine to check this out. I am adding Victor who implemented the PEP and might help here. There also seems to be PYTHONLEGACYWINDOWSFSENCODING for windows specific use case. Some more notes on the original issue https://bugs.python.org/issue29240#msg285278.
https://bugs.python.org/issue29240#msg285325
> Handle PYTHONLEGACYWINDOWSFSENCODING: this env var now disables the UTF-8 mode and has the priority over -X utf8 and PYTHONUTF8
|
msg329592 - (view) |
Author: Dragoljub (Dragoljub) |
Date: 2018-11-10 05:14 |
I tested this at runtime with sys._enablelegacywindowsfsencoding()
Also this was new in 3.6 and Py 3.6 does not have the slowdown issue.
New in version 3.6: See PEP 529 for more details.
|
msg329641 - (view) |
Author: Dragoljub (Dragoljub) |
Date: 2018-11-10 20:02 |
@cgohlke compared the statement df2 = pd.read_csv(csv) on Python 3.7.0a3 and a4 in the Visual Studio profiler. The culprit is the isdigit function called in the parsers extension module. On 3.7.0a3 the function is fast at ~8% of samples. On 3.7.0a4 the function is slow at ~64% samples because it calls the _isdigit_l function, which seems to update and restore the locale in the current thread every time...
3.7.0a3:
Function Name Inclusive Samples Exclusive Samples Inclusive Samples % Exclusive Samples % Module Name
+ [parsers.cp37-win_amd64.pyd] 705 347 28.52% 14.04% parsers.cp37-win_amd64.pyd
isdigit 207 207 8.37% 8.37% ucrtbase.dll
- _errno 105 39 4.25% 1.58% ucrtbase.dll
toupper 24 24 0.97% 0.97% ucrtbase.dll
isspace 21 21 0.85% 0.85% ucrtbase.dll
[python37.dll] 1 1 0.04% 0.04% python37.dll
3.7.0a4:
Function Name Inclusive Samples Exclusive Samples Inclusive Samples % Exclusive Samples % Module Name
+ [parsers.cp37-win_amd64.pyd] 8,613 478 83.04% 4.61% parsers.cp37-win_amd64.pyd
+ isdigit 6,642 208 64.04% 2.01% ucrtbase.dll
+ _isdigit_l 6,434 245 62.03% 2.36% ucrtbase.dll
+ _LocaleUpdate::_LocaleUpdate 5,806 947 55.98% 9.13% ucrtbase.dll
+ __acrt_getptd 2,121 1,031 20.45% 9.94% ucrtbase.dll
FlsGetValue 647 647 6.24% 6.24% KernelBase.dll
- RtlSetLastWin32Error 296 235 2.85% 2.27% ntdll.dll
_guard_dispatch_icall_nop 101 101 0.97% 0.97% ucrtbase.dll
GetLastError 46 46 0.44% 0.44% KernelBase.dll
+ __acrt_update_multibyte_info 1,475 246 14.22% 2.37% ucrtbase.dll
- __crt_state_management::get_current_state_index 1,229 513 11.85% 4.95% ucrtbase.dll
+ __acrt_update_locale_info 1,263 235 12.18% 2.27% ucrtbase.dll
- __crt_state_management::get_current_state_index 1,028 429 9.91% 4.14% ucrtbase.dll
_ischartype_l 383 383 3.69% 3.69% ucrtbase.dll
|
msg329743 - (view) |
Author: Dragoljub (Dragoljub) |
Date: 2018-11-12 16:34 |
@Vstinner,
Any way you can help test out a config setting to avoid the locale changes on Python 2.7.0a4+? It is currently causing the isdigit() low-level function to call the local-specific function on windows and update locals each call slowing down CSV Paring on Windows 3.5X
How can we configure python to not be different than 3.6.7 when it come to locale behavior?
|
msg329745 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2018-11-12 16:50 |
Can someone please try to write an example which only uses the stdlib?
> The culprit is the isdigit function called in the parsers extension module.
What are these extensions? Where do them come from?
> Any way you can help test out a config setting to avoid the locale changes on Python 3.7.0a4+?
(I fixed 2.7 => 3.7)
I don't understand which "locale changes" you are talking about. You can change the locale using locale.setlocale().
|
msg329786 - (view) |
Author: Christoph Gohlke (cgohlke) |
Date: 2018-11-13 01:26 |
> Can someone please try to write an example which only uses the stdlib?
The simplest is to compare performance of the `windll.LoadLibrary('API-MS-WIN-CRT-STRING-L1-1-0.DLL')` function on Python 3.7.0a3 and 3.7.0a4, but that will mostly measure Python/ctypes overhead. I will post a minimal C extension instead.
> What are these extensions? Where do them come from?
The `isdigit` function is from the UCRT. The `parsers` Cython/C extension is part of the pandas wheel on PyPI. The context for this issue is at https://github.com/pandas-dev/pandas/issues/23516
> I don't understand which "locale changes" you are talking about. You can change the locale using locale.setlocale().
The `UCRT.isdigit` function, when run on Python >=3.7.0a4, calls the `_isdigit_l` function, which calls `_LocaleUpdate::_LocaleUpdate` (see the VS profiler output).
|
msg329787 - (view) |
Author: Dragoljub (Dragoljub) |
Date: 2018-11-13 01:39 |
Here is a simple pure python example:
digits = ''.join([str(i) for i in range(10)]*10000000)
%timeit digits.isdigit() # --> 2X+ slower on python 3.7.1
Basically in Pandas C-code parser we call the isdigit() function for each number that is to be parsed. so 12345.6789 calls isdigt() 9 times to determine if this is a digit character that can be converted to a float. The problem is in the latest version of Python with locale updates isdigit() takes a locale argument that seems to be passed over and over slowing down this check. Is it possible that we disable any local passing from Python down to lower-level C code, or simply set the default locale to 'C' to keep it from thrashing?
|
msg329789 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2018-11-13 01:45 |
> digits = ''.join([str(i) for i in range(10)]*10000000)
> %timeit digits.isdigit() # --> 2X+ slower on python 3.7.1
This code calls:
* (Python) str.isdigit()
* unicode_isdigit_impl()
* _PyUnicode_IsDigit()
* _PyUnicode_ToDigit() which uses Python internal Unicode database
This code doesn't depend on locales at all. It's pure Unicode.
|
msg329790 - (view) |
Author: Christoph Gohlke (cgohlke) |
Date: 2018-11-13 01:45 |
I attached a minimal C extension module that can be used to demonstrate the performance degradation from Python 3.7.0a3 to 3.7.0a4.
Build the extension with `py setup.py build_ext --inplace`, then run the following code on Python 3.7.0a3 to 3.7.0a4:
```
import time
from test_isdigit import test_isdigit
start_time = time.time()
test_isdigit()
print(time.time() - start_time)
```
On my Windows 10 Pro WS system, the timings are:
Python 3.7.0a3: ~0.0156
Python 3.7.0a4: ~0.3281
I would expect that other locale aware functions in the UCRT are also affected but I have not tested that.
|
msg329791 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2018-11-13 01:47 |
test_isdigit.c: Can you try to call locale.setlocale(locale.LC_CTYPE, "") before running your benchmark on Python 3.7.0?
This issue may be related to bpo-34485.
|
msg329792 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2018-11-13 01:50 |
> This issue may be related to bpo-34485.
I'm thinking to:
New changeset 177d921c8c03d30daa32994362023f777624b10d by Victor Stinner in branch 'master':
bpo-34485, Windows: LC_CTYPE set to user preference (GH-8988)
https://github.com/python/cpython/commit/177d921c8c03d30daa32994362023f777624b10d
Oh, I only made this change in the future Python 3.8 (master branch). So this change may be unrelated.
Note: Right now, my Windows VM is broken, so I cannot investigate this performance issue which seems to be specific to the msvcrt (libc of Microsoft VisualStudio).
|
msg329793 - (view) |
Author: Christoph Gohlke (cgohlke) |
Date: 2018-11-13 01:53 |
> test_isdigit.c: Can you try to call locale.setlocale(locale.LC_CTYPE, "") before running your benchmark on Python 3.7.0?
Yes, that slows down Python 3.7.0a3 to the 3.7.0a4 level.
|
msg329828 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2018-11-13 11:42 |
> Yes, that slows down Python 3.7.0a3 to the 3.7.0a4 level.
Ok, so something calls setlocale(LC_CTYPE, "") or setlocale(LC_ALL, "") in Python 3.7.0. I'm not interested to dig the Git history. It doesn't really matter at this point.
Can you try to get the current LC_CTYPE locale on Python 3.6 or 3.7.0a3? Example:
$ python3 -c 'import _locale; print(_locale.setlocale(_locale.LC_CTYPE, None))'
fr_FR.utf8
Or using the locale module (it should give the same result):
$ python3 -c 'import locale; print(locale.setlocale(locale.LC_CTYPE, None))'
fr_FR.utf8
Can also also try on 3.7.0a4 and newer (ex: Python 3.7.1): locale.setlocale(locale.LC_ALL, "C"), to see it does workaround your performance issue? I don't recall if "C" or "POSIX" locales are supported on Windows.
|
msg329854 - (view) |
Author: Dragoljub (Dragoljub) |
Date: 2018-11-13 17:31 |
On Python 3.7.1 and Windows 10:
I attempted locale.setlocale(locale.LC_ALL, "POSIX") --> Errors Out
---------------------------------------------------------------------------
Error Traceback (most recent call last)
<ipython-input-8-f384c02ab238> in <module>
----> 1 locale.setlocale(locale.LC_ALL, "POSIX")
D:\Python37\lib\locale.py in setlocale(category, locale)
602 # convert to string
603 locale = normalize(_build_localename(locale))
--> 604 return _setlocale(category, locale)
605
606 def resetlocale(category=LC_ALL):
Error: unsupported locale setting
I was able to set the loacle to "C" but that does not improve the parsing performance.
locale.setlocale(locale.LC_ALL, "C") --> returns 'C'
|
msg329855 - (view) |
Author: Dragoljub (Dragoljub) |
Date: 2018-11-13 17:42 |
This is the default LC_CTYPE locale type I see on Windows10 and Python 3.7.1 vs 3.5.2: Same: 'C'
'3.7.1 (v3.7.1:260ec2c36a, Oct 20 2018, 14:57:15) [MSC v.1915 64 bit (AMD64)]'
import _locale
_locale.setlocale(_locale.LC_CTYPE, None)
'C'
'3.5.2 (v3.5.2:4def2a2901a5, Jun 25 2016, 22:18:55) [MSC v.1900 64 bit (AMD64)]'
import _locale
_locale.setlocale(_locale.LC_CTYPE, None)
'C'
|
msg330007 - (view) |
Author: Dragoljub (Dragoljub) |
Date: 2018-11-16 17:50 |
Do we know if its possible to prevent the initialize LC_CTYPE on startup? Is there some combination of ENV-Var or CMD-Args that can avoid this slowdown on Windows?
What are the next step to get the issue assigned?
|
msg341901 - (view) |
Author: Christoph Gohlke (cgohlke) |
Date: 2019-05-08 16:14 |
This is still an issue with Python 3.8.0a4
|
msg342702 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2019-05-17 09:56 |
I created bpo-36945: "Add _PyPreConfig.configure_locale: allow to leave LC_CTYPE unchanged when embedding Python".
|
|
Date |
User |
Action |
Args |
2022-04-11 14:59:07 | admin | set | github: 79376 |
2019-05-17 09:56:11 | vstinner | set | messages:
+ msg342702 |
2019-05-08 16:14:50 | cgohlke | set | versions:
+ Python 3.8 nosy:
+ paul.moore, tim.golden, zach.ware, steve.dower
messages:
+ msg341901
components:
+ Windows |
2018-11-16 17:50:36 | Dragoljub | set | messages:
+ msg330007 |
2018-11-13 17:42:08 | Dragoljub | set | messages:
+ msg329855 |
2018-11-13 17:31:32 | Dragoljub | set | messages:
+ msg329854 |
2018-11-13 11:43:46 | vstinner | set | title: Pandas read_csv() is 3.5X Slower on Python 3.7.1 vs Python 3.6.7 & 3.5.2 On Windows 10 -> [Windows] Python 3.7 initializes LC_CTYPE locale at startup, causing performance issue on msvcrt isdigit() |
2018-11-13 11:42:37 | vstinner | set | messages:
+ msg329828 |
2018-11-13 01:53:12 | cgohlke | set | messages:
+ msg329793 |
2018-11-13 01:50:16 | vstinner | set | messages:
+ msg329792 |
2018-11-13 01:47:22 | vstinner | set | messages:
+ msg329791 |
2018-11-13 01:45:32 | cgohlke | set | messages:
+ msg329790 |
2018-11-13 01:45:23 | vstinner | set | messages:
+ msg329789 |
2018-11-13 01:39:54 | Dragoljub | set | messages:
+ msg329787 |
2018-11-13 01:36:51 | cgohlke | set | files:
+ setup.py |
2018-11-13 01:33:34 | cgohlke | set | files:
+ test_isdigit.c |
2018-11-13 01:26:31 | cgohlke | set | nosy:
+ cgohlke messages:
+ msg329786
|
2018-11-12 16:50:20 | vstinner | set | messages:
+ msg329745 |
2018-11-12 16:34:13 | Dragoljub | set | messages:
+ msg329743 |
2018-11-10 20:02:15 | Dragoljub | set | messages:
+ msg329641 |
2018-11-10 05:14:39 | Dragoljub | set | messages:
+ msg329592 |
2018-11-10 02:36:27 | xtreak | set | nosy:
+ vstinner messages:
+ msg329589
|
2018-11-10 02:20:08 | Dragoljub | set | messages:
+ msg329587 |
2018-11-10 00:19:43 | xtreak | set | messages:
+ msg329575 |
2018-11-10 00:12:18 | Dragoljub | set | messages:
+ msg329573 |
2018-11-09 18:29:24 | Dragoljub | set | messages:
+ msg329545 |
2018-11-09 03:34:36 | xtreak | set | nosy:
+ xtreak
|
2018-11-08 22:56:21 | Dragoljub | create | |