Week #3: Improving stats.binned_statistic_dd, somersd and _tau_b

Xingyu-Liu
Published: 06/27/2021

What did you do this week?

As this project progressed, I came to realize that the diffculty of this project is to find good potential algorithms. This week I was searching potential algorithms in scipy.stats. At first, like the past, I used pytest --durations=50 to find the slowest 50 tests. I looked into all the functions but didn't find any suitable algorithms: some do not have obvious loops; some have loops but call another scipy method in the loop... Therefore, I began checking all the functions under scipy/stats one by one and finally found somersd and _tau_b are good candidates and I submitted a PR( gh-14308) to speedup them 4x~20x.

Besides, For the works that mentioned last week:

stats._moment: submitted an issue(pythran #1820) for keep_dims is not supported in np.mean()
stats._calc_binned_statistic: successfully improved this function and made the public function stats.binned_statistic_dd 3x-10x faster on min,max,std,median. I tried to improved the whole if-elif block but encountered some errors that I can't fix (see pythran #1819 )
stats._sum_abs_axis0: Thanks to Serge, the compliation errror due to variant type is fixed. I compiled and it is ~2x faster on _sum_abs_axis0 but do not have much gain on the public function onenormest. Moreover, actually there is no loop in _sum_abs_axis0 for input size smaller than 2**20(my bad!)
sparse.linalg.expm(_fragment_2_1): Last week I said this one is slower than the pure python version but it is not. My bad, actually with my input at that time, it will not get into _fragment_2_1 so I actually didn't test it. Only when the input size is larger than 9000 it will get into the function. Moreover, the input is csc matrix so it is not a suitable one for pythran.
the SciPy build error(see pythran #1815) mentioned last week: It is a really struggling problem. Serge and Ralf tried to help me fix that but it is still not working for now.

What is coming up next?

add benchmarks for somersd and _tau_b
consider merging the old benchmark PR?(gh-14228)
keep searching good potential algorithms to be improved.

Did you get stuck anywhere?

The problem I encountered when improving the whole if-elif block in stats.binned_statistic_dd (see pythran #1819 )

Week #3: Improving stats.binned_statistic_dd, somersd and _tau_b

What did you do this week?

What is coming up next?

Did you get stuck anywhere?

Versions

Time

Settings from gsoc.settings

Headers

Request

SQL queries from 1 connection

Static files (2312 found, 3 used)

Templates (11 rendered)

Cache calls from 1 backend

Signals

Log messages