Week #1: Writing Benchmarks

Xingyu-Liu
Published: 06/14/2021

What did you do this week?

This week, I mainly focused on writing benchmarks and investigating potentially slow algorithms.
  1. Wrote more benchmarks for inferential stats: my PR
    • KS test
    • MannWhitneyU
    • RankSums
    • BrunnerMunzel
    • chisquare
    • friedmanchisquare
    • epps_singleton_2samp
    • kruskal
  2. Switched the benchmarks to the new random API, `rng = np.random.default_rng(12345678)`: my PR
  3. Documented why some functions can’t be sped up via Pythran: my doc
  4. Found more algorithms that could potentially be sped up via Pythran
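
As a rough illustration of items 1 and 2, a benchmark could be sketched as below, assuming the airspeed velocity (asv) convention of a `setup` method plus `time_*` methods. The class name, sample sizes, and choice of tests are hypothetical and not taken from the PR; only the seeded `np.random.default_rng(12345678)` call comes from the list above.

```python
import numpy as np
from scipy import stats


class InferentialStatsSketch:
    """Hypothetical asv-style benchmark; names and sizes are illustrative."""

    def setup(self):
        # Seeded Generator (new random API) so every run sees the same data.
        rng = np.random.default_rng(12345678)
        self.x = rng.random(1000)
        self.y = rng.random(1200) + 0.1

    def time_kruskal(self):
        # asv times the body of every method whose name starts with "time_".
        stats.kruskal(self.x, self.y)

    def time_ranksums(self):
        stats.ranksums(self.x, self.y)
```

Creating the generator inside `setup` rather than at module level keeps the inputs identical across runs, so timing differences reflect the code under test and not the data.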

What is coming up next?

Improve two of the following functions:
  • stats.friedmanchisquare: most of the time (75.8%) is spent in the per-row rankdata call
        Line #      Hits         Time  Per Hit   % Time  Line Contents
        ==============================================================
          7970       501        351.0      0.7      0.5      for i in range(len(data)):
          7971       500      51417.0    102.8     75.8          data[i] = rankdata(data[i])
  • stats.binned_statistic_dd
  • sparse.linalg.onenormest
  • _fragment_2_1 in scipy/sparse/linalg/matfuncs.py
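
To make the hot line in the friedmanchisquare profile concrete: `rankdata` assigns each value its rank, and by default tied values share the average of the ranks they span. The pure-Python sketch below shows that per-row work. `average_ranks` and `rank_rows` are hypothetical names used for illustration only; this is not how SciPy implements `rankdata`.

```python
def average_ranks(values):
    """Return 1-based ranks of `values`, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend `end` over the run of tied values beginning at `start`.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        # Sorted positions start..end share the mean of ranks start+1..end+1.
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def rank_rows(data):
    """Rank each row independently, mirroring the loop in the profile."""
    return [average_ranks(row) for row in data]
```

Each row costs one sort, so a loop over many short rows pays the per-call overhead once per row, which is consistent with the profile above.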

Did you get stuck anywhere?

When benchmarking, I found that the Mann-Whitney U test is pretty slow. Profiling shows that `p = _mwu_state.sf(U.astype(int), n1, n2)` takes up essentially 100% of the time. Looking into that function, `pmf` is the slowest part, and almost all of its time goes to the repeated `self._f(m, n, i)` calls. @mdhaber mentioned that he would be interested in looking into these things himself later this summer.
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    25                                               @profile
    26                                               def pmf(self, k, m, n):
    27                                                   '''Probability mass function'''
    28         1      29486.0  29486.0      0.2          self._resize_fmnks(m, n, np.max(k))
    29                                                   # could loop over just the unique elements, but probably not worth
    30                                                   # the time to find them
    31      1384       1701.0      1.2      0.0          for i in np.ravel(k):
    32      1383   18401083.0  13305.2     99.8              self._f(m, n, i)
    33         1         71.0     71.0      0.0          return self._fmnks[m, n, k] / special.binom(m + n, m)
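
The comment inside `pmf` notes that the loop could run over only the unique elements of `k`. A minimal sketch of that idea follows, assuming the counts come from the textbook Mann-Whitney recurrence f(m, n, k) = f(m-1, n, k-n) + f(m, n-1, k). The function names are hypothetical, and this is not SciPy's `_mwu_state` code, which may differ in its details.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_arrangements(m, n, k):
    """Number of orderings of m x's and n y's whose U statistic equals k."""
    if k < 0 or k > m * n:
        return 0
    if m == 0 or n == 0:
        return 1  # only k == 0 can reach this branch
    # The last element is either an x (adds n to U) or a y (adds nothing).
    return count_arrangements(m - 1, n, k - n) + count_arrangements(m, n - 1, k)


def pmf(ks, m, n):
    """P(U == k) under the null for each k, evaluating each unique k once."""
    total = comb(m + n, m)
    unique = {k: count_arrangements(m, n, k) / total for k in set(ks)}
    return [unique[k] for k in ks]
```

Memoization via `lru_cache` means repeated sub-problems are computed once, and the dictionary over `set(ks)` avoids re-evaluating duplicate values of `k`. The recursion depth grows with `m + n`, so this form is only suitable for small samples.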