Xingyu-Liu's Blog

Week #3: Improving stats.binned_statistic_dd, somersd and _tau_b

Xingyu-Liu
Published: 06/27/2021

What did you do this week?

As this project progressed, I came to realize that the diffculty of this project is to find good potential algorithms. This week I was searching potential algorithms in scipy.stats. At first, like the past, I used pytest --durations=50 to find the slowest 50 tests. I looked into all the functions but didn't find any suitable algorithms: some do not have obvious loops; some have loops but call another scipy method in the loop... Therefore, I began checking all the functions under scipy/stats one by one and finally found somersd and _tau_b are good candidates and I submitted a PR( gh-14308) to speedup them 4x~20x.

Besides, For the works that mentioned last week:

  1. stats._moment: submitted an issue(pythran #1820) for keep_dims is not supported in np.mean()
  2. stats._calc_binned_statistic: successfully improved this function and made the public function stats.binned_statistic_dd 3x-10x faster on min,max,std,median. I tried to improved the whole if-elif block but encountered some errors that I can't fix (see pythran #1819 )
  3. stats._sum_abs_axis0: Thanks to Serge, the compliation errror due to variant type is fixed. I compiled and it is ~2x faster on _sum_abs_axis0 but do not have much gain on the public function onenormest. Moreover, actually there is no loop in _sum_abs_axis0 for input size smaller than 2**20(my bad!)
  4. sparse.linalg.expm(_fragment_2_1): Last week I said this one is slower than the pure python version but it is not. My bad, actually with my input at that time, it will not get into _fragment_2_1 so I actually didn't test it. Only when the input size is larger than 9000 it will get into the function. Moreover, the input is csc matrix so it is not a suitable one for pythran.
  5. the SciPy build error(see pythran #1815) mentioned last week: It is a really struggling problem. Serge and Ralf tried to help me fix that but it is still not working for now.

What is coming up next?

  1. add benchmarks for somersd and _tau_b
  2. consider merging the old benchmark PR?(gh-14228)
  3. keep searching good potential algorithms to be improved.

Did you get stuck anywhere?

The problem I encountered when improving the whole if-elif block in stats.binned_statistic_dd (see pythran #1819 )
View Blog Post

Week #2: Improving stats.ks_2samp

Xingyu-Liu
Published: 06/21/2021

What did you do this week?

This week was quite struggling. My mentor Serge implemented supporting for `scipy.special.binom` quickly in Pythran and it shows a great improvement on the public function stats.ks_2samp( 2.62ms vs 88.2ms). However, when I was building scipy with the improved algroithm, we found that it would cause a loop problem. Serge made a PR to break the loop but in my computer it is still not working.

Then I turned to try other algorithms that I mentioned last week, however I encountered more problems:

  1. stats._moment: keep_dims is not supported in np.mean()
  2. stats._calc_binned_statistic: invalid pythran spec but I don't find anything wrong
  3. stats.rankdata: invalid pythran spec
  4. stats._sum_abs_axis0: compliation error
  5. sparse.linalg.expm(_fragment_2_1): much slower than the pure python one, will keep investigating it.

What is coming up next?

  • submit issue for keep_dims
  • submit issue for error in stats._sum_abs_axis0
  • continue improving stats._calc_binned_statistic
  • Find out why sparse.linalg.expm pythran version is slower.

Did you get stuck anywhere?

Got stuck in many problems, as is written in What did you do this week section.
View Blog Post

Week #1: Writing Benchmarks

Xingyu-Liu
Published: 06/14/2021

What did you do this week?

This week, I mainly focused on writing benchmarks and investigating potential slow algorithms.
  1. Wrote more benchmarks for inferential stats: my PR
    • KS test
    • MannWhitneyU
    • RankSums
    • BrunnerMunzel
    • chisqure
    • friedmanchisquare
    • epps_singleton_2samp
    • kruskal
  2. Modified to use new random API `rng = np.random.default_rng(12345678)`my PR
  3. Documented why some functions can’t be speedup via Pythran: my doc
  4. Found more potential algorithms that can be speedup via Pythran

What is coming up next?

Improve two of the following functions:
  • stats.friedmanchisquare: related to rankdata
  •     Line #      Hits         Time  Per Hit   % Time  Line Contents
        ==============================================================
        7970       501        351.0      0.7      0.5      for i in range(len(data)):
        7971       500      51417.0    102.8     75.8          data[i] = rankdata(data[i])
        
  • stats.binned_statistic_dd
  • sparse.linalg.onenormest
  • _fragment_2_1 in scipy/sparse/linalg/matfuncs.py

Did you get stuck anywhere?

When benchmarking, I found Mannwhitney is pretty slow. After profiling, it shows `p = _mwu_state.sf(U.astype(int), n1, n2)` occupys 100% time. Look into the function, `pmf` is the slowest part. @mdhaber mentioned that he would be interested in looking into these things himself later this summer.
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    25                                               @profile
    26                                               def pmf(self, k, m, n):
    27                                                   '''Probability mass function'''
    28         1      29486.0  29486.0      0.2          self._resize_fmnks(m, n, np.max(k))
    29                                                   # could loop over just the unique elements, but probably not worth
    30                                                   # the time to find them
    31      1384       1701.0      1.2      0.0          for i in np.ravel(k):
    32      1383   18401083.0  13305.2     99.8              self._f(m, n, i)
    33         1         71.0     71.0      0.0          return self._fmnks[m, n, k] / special.binom(m + n, m)
View Blog Post

Week #0: Community Building and Getting Started

Xingyu-Liu
Published: 06/08/2021

Introduction

Hi everyone! I’m Xingyu Liu, a first-year data science master student at Harvard University. I’m very excited to be accepted by SciPy and I will work on using Pythran to improve algorithms’ performance in SciPy! There are currently many algorithms that would be too slow as pure Python, and Pythran can be a good tool to accelerate them. My goal is to investigate and improve the slow algorithms, as well as write benchmarks for them.

What did you do this week?

In the community bonding period, I met with my mentors, Ralf Gommers and Serge Guelton. They are very kind, responsive and helpful. We discussed about my project and set up a chat and weekly sync. In the last week, I've started doing my project:

Issues:

  1. Pythran makes np.searchsorted much slower
  2. u_values[u_sorter].searchsort would cause "Function path is chained attributes and name" but np.search would not
  3. all_values.sort() would cause compilation error but np.sort(all_values) would not

Pull Requests:

  1. ENH: Pythran implementation of _cdf_distance
  2. BENCH: add benchmark for energy_distance and wasserstein_distance

Readings:

  1. Pythran tutorial.
  2. Profiling Cython code

What is coming up next?

  1. Write benchmarks for inferential stats
  2. Modify to use new random API `rng = np.random.default_rng(12345678)`(according to comments in BENCH: add benchmark for f_oneway )
  3. Finding more potential algorithms that can be speedup via Pythran
  4. Document why some functions can’t be speedup via Pythran

Did you get stuck anywhere?

For my first pull request, we found the Pythran version is not better than the orginal due to the indexing operations.
View Blog Post