Xingyu-Liu's Blog

Week #1: Writing Benchmarks

Xingyu-Liu
Published: 06/14/2021

What did you do this week?

This week, I mainly focused on writing benchmarks and investigating potential slow algorithms.
  1. Wrote more benchmarks for inferential stats: my PR
    • KS test
    • mannwhitneyu
    • ranksums
    • brunnermunzel
    • chisquare
    • friedmanchisquare
    • epps_singleton_2samp
    • kruskal
  2. Modified benchmarks to use the new random API `rng = np.random.default_rng(12345678)`: my PR (a benchmark sketch using this API follows the list)
  3. Documented why some functions can’t be sped up via Pythran: my doc
  4. Found more potential algorithms that could be sped up via Pythran
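
As an illustration of items 1 and 2, here is a minimal sketch of what an ASV-style benchmark using the new random API can look like. The class and method names below are illustrative only, not the exact code in the PRs.

    import numpy as np
    from scipy import stats

    class InferentialStats:
        # Illustrative ASV-style benchmark class; the real benchmarks in
        # scipy/benchmarks may differ in structure and sample sizes.
        def setup(self):
            # New Generator-based random API instead of np.random.seed(...)
            rng = np.random.default_rng(12345678)
            self.a = rng.random(1000)
            self.b = rng.random(1000)

        def time_ranksums(self):
            stats.ranksums(self.a, self.b)

        def time_kruskal(self):
            stats.kruskal(self.a, self.b)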

What is coming up next?

Improve two of the following functions:
  • stats.friedmanchisquare: related to rankdata; the per-row ranking loop dominates the runtime (see the sketch after this list)
        Line #      Hits         Time  Per Hit   % Time  Line Contents
        ==============================================================
        7970       501        351.0      0.7      0.5      for i in range(len(data)):
        7971       500      51417.0    102.8     75.8          data[i] = rankdata(data[i])
        
  • stats.binned_statistic_dd
  • sparse.linalg.onenormest
  • _fragment_2_1 in scipy/sparse/linalg/matfuncs.py
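
For stats.friedmanchisquare, the profile above shows the per-row rankdata loop taking about 76% of the time. One possible direction (my assumption, not a settled plan) is to rank all rows in a single call, since scipy.stats.rankdata accepts an axis argument in recent SciPy versions:

    import numpy as np
    from scipy.stats import rankdata

    rng = np.random.default_rng(0)
    data = rng.random((500, 20))

    # Current pattern in friedmanchisquare: rank each row in a Python loop.
    ranked_loop = np.array([rankdata(row) for row in data])

    # Possible vectorized alternative: rank along the last axis in one call.
    ranked_vec = rankdata(data, axis=1)

    assert np.allclose(ranked_loop, ranked_vec)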

Did you get stuck anywhere?

When benchmarking, I found that `mannwhitneyu` is pretty slow. Profiling shows that `p = _mwu_state.sf(U.astype(int), n1, n2)` takes nearly 100% of the time, and looking inside that function, `pmf` is the slowest part (a sketch of the profiling setup follows the output below). @mdhaber mentioned that he would be interested in looking into these things himself later this summer.
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    25                                               @profile
    26                                               def pmf(self, k, m, n):
    27                                                   '''Probability mass function'''
    28         1      29486.0  29486.0      0.2          self._resize_fmnks(m, n, np.max(k))
    29                                                   # could loop over just the unique elements, but probably not worth
    30                                                   # the time to find them
    31      1384       1701.0      1.2      0.0          for i in np.ravel(k):
    32      1383   18401083.0  13305.2     99.8              self._f(m, n, i)
    33         1         71.0     71.0      0.0          return self._fmnks[m, n, k] / special.binom(m + n, m)
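
For reference, the per-line output above comes from line_profiler; the @profile decorator was placed on pmf inside SciPy's source tree. Below is a minimal sketch of the same workflow on a small driver script, assuming line_profiler is installed and the script is run with `kernprof -l -v`:

    # mwu_profile_sketch.py -- illustrative driver, not the exact script I used
    import numpy as np
    from scipy import stats

    @profile  # this name is injected by kernprof at run time
    def bench_mannwhitneyu():
        rng = np.random.default_rng(12345678)
        x = rng.random(20)
        y = rng.random(20)
        # method="exact" forces the _mwu_state.sf / pmf code path
        return stats.mannwhitneyu(x, y, method="exact")

    bench_mannwhitneyu()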

Week #0: Community Building and Getting Started

Xingyu-Liu
Published: 06/08/2021

Introduction

Hi everyone! I’m Xingyu Liu, a first-year data science master’s student at Harvard University. I’m very excited to be accepted by SciPy, where I will work on using Pythran to improve the performance of algorithms in SciPy! Many algorithms are currently too slow as pure Python, and Pythran can be a good tool to accelerate them. My goal is to investigate and improve the slow algorithms, as well as write benchmarks for them.

What did you do this week?

During the community bonding period, I met with my mentors, Ralf Gommers and Serge Guelton. They are very kind, responsive, and helpful. We discussed my project and set up a chat and a weekly sync. In the last week, I started working on my project:

Issues:

  1. Pythran makes np.searchsorted much slower
  2. u_values[u_sorter].searchsorted would cause "Function path is chained attributes and name" but np.searchsorted would not
  3. all_values.sort() would cause a compilation error but np.sort(all_values) would not (see the sketch after this list)
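
As a concrete illustration of issues 2 and 3, the sketch below contrasts the method-call style that Pythran rejects with the free-function style it accepts. The function name and export signature are made up for the example; the real code lives in the _cdf_distance PR.

    # pythran export _sort_and_search(float64[:], float64[:])
    # Illustrative function, not the actual SciPy code.
    import numpy as np

    def _sort_and_search(all_values, u_values):
        # all_values.sort() triggered a compilation error in this context;
        # the free-function form np.sort(...) compiles fine.
        all_values = np.sort(all_values)

        # A chained attribute call such as
        #     u_values[u_sorter].searchsorted(all_values)
        # triggers "Function path is chained attributes and name";
        # np.searchsorted(u_values[u_sorter], all_values) does not.
        u_sorter = np.argsort(u_values)
        return np.searchsorted(u_values[u_sorter], all_values)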

Pull Requests:

  1. ENH: Pythran implementation of _cdf_distance
  2. BENCH: add benchmark for energy_distance and wasserstein_distance

Readings:

  1. Pythran tutorial.
  2. Profiling Cython code

What is coming up next?

  1. Write benchmarks for inferential stats
  2. Modify benchmarks to use the new random API `rng = np.random.default_rng(12345678)` (according to comments in BENCH: add benchmark for f_oneway)
  3. Find more potential algorithms that can be sped up via Pythran
  4. Document why some functions can’t be sped up via Pythran

Did you get stuck anywhere?

For my first pull request, we found that the Pythran version is not faster than the original, due to the indexing operations.