Articles on Xingyu-Liu's Blog

Week #11: Writing Tests and Finished Submission

xingyuliu@g.harvard.edu (Xingyu-Liu) — Tue, 24 Aug 2021 12:16:38 +0000

What did you do this week?

Finished General implementation of supporting immediate arguments
GSoC submission: blog
Merge some cases WIP: TST: add tests for Pythran somersd

What is coming up next?

ENH: optimize min max and median scipy.stats.binned_statistic, by another contributor. His version performs even better than our Pythran improvement (ENH: improved binned_statistic_dd via Pythran): he vectorized the code and reduced two loops to one. Maybe we can try Pythran again based on his implementation?
Finish WIP: TST: add tests for Pythran somersd

Week #10: Supporting immediate arguments in Pythran

xingyuliu@g.harvard.edu (Xingyu-Liu) — Tue, 17 Aug 2021 03:38:52 +0000

What did you do this week?

What is coming up next?

Finished General implementation of supporting immediate arguments
Investigate more on WIP: TST: add tests for Pythran somersd
Prepare for the final evaluation

Did you get stuck anywhere?

In General implementation of supporting immediate arguments, I hit AttributeError: 'FunctionDef' object has no attribute 'immediate_arguments'. A potential solution is to hard-code a check for whether the object is a FunctionDef and, if so, skip it.
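The skip described above can be sketched roughly as follows. This is only an illustration using the standard `ast` module, not Pythran's actual code: `FakeIntrinsic` and `collect_immediate_arguments` are hypothetical names, and only `immediate_arguments` comes from the error message.

```python
import ast


class FakeIntrinsic:
    """Hypothetical stand-in for an object that does carry the attribute."""

    def __init__(self, name, immediate_arguments):
        self.name = name
        self.immediate_arguments = immediate_arguments


def collect_immediate_arguments(candidates):
    """Map each candidate's name to its immediate-argument indices."""
    collected = {}
    for obj in candidates:
        # User-defined functions show up as FunctionDef nodes, which have no
        # `immediate_arguments` attribute, so they are skipped explicitly.
        if isinstance(obj, ast.FunctionDef):
            continue
        collected[obj.name] = obj.immediate_arguments
    return collected


# A plain FunctionDef node next to a hypothetical annotated object.
user_func = ast.parse("def f(x):\n    return x\n").body[0]
candidates = [user_func, FakeIntrinsic("mean", [2])]
result = collect_immediate_arguments(candidates)
```

A less hard-coded variant would be `getattr(obj, "immediate_arguments", [])`, which treats any object without the attribute as having no immediate arguments.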

Week #9: Adding tests for Pythran functions, and review the opened PRs

xingyuliu@g.harvard.edu (Xingyu-Liu) — Mon, 09 Aug 2021 15:07:06 +0000

What did you do this week?

WIP: TST: add tests for Pythran somersd
WIP: support keepdims in numpy mean
Revisit and summarize the unsuccessful PRs about Pythran in SciPy

PR	Reason
ENH: Pythran implementation of _compute_prob_outside_square and _compute_prob_inside_method to speedup stats.ks_2samp	Failed some tests before but works now
ENH: Pythran implementation of _cdf_distance	Pythran version is slightly better than the Python one after fixing np.searchsorted. Could be better after SciPy began to use XSIMD. Hold it for now.
WIP: ENH: improve _count_paths_outside_method via pythran	Relates to bus error on Mac but works fine on Linux for _count_paths_outside_method pythran version
WIP: ENH: improve sort_vertices_of_regions via Pythran and made it more readable	Test failures: 1) test_spherical_voronoi: inplace sort. 2) test_region_types: the specified input regions type is int64 list list. When an element in self.regions is numpy.int64, Pythran automatically converts it to int

What is coming up next?

WIP: TST: add tests for Pythran somersd: keep working on this
WIP: support keepdims in numpy mean: make it more general
Test XSIMD for _cdf_distance

Did you get stuck anywhere?

I was stuck on supporting keepdims in numpy mean in Pythran. Thanks to Serge, who helped me fix many problems, this will be completed this week.

Week #8: Support keepdims in numpy mean, hunt potential algorithms to be improved

xingyuliu@g.harvard.edu (Xingyu-Liu) — Tue, 03 Aug 2021 17:07:31 +0000

What did you do this week?

ENH: improve siegelslopes via pythran: cleaned up the code; all checks passed.
ENH: improve cspline1d, qspline1d, and relative funcs via Pythran: only the private funcs are improved, and all checks pass. However, I found a potential problem: the array assignment res[cond1] = ax[cond1] works fine for int[], float[], or float[:,:] but not for int[:,:].
WIP: support keepdims in numpy mean: it passed all the checks after I changed it to use str(node.value).lower(). I added tests for keepdims=False, but there are some check failures.
ENH: improve _cplxreal, _falling_factorial, _bessel_poly, _arc_jac_sn…: the gain is small and seemed so insignificant that I opened the PR only in my own repo, since these are already fast algorithms. I am now stuck on finding potential algorithms to improve: it often takes ~10 hrs to find algorithms and ~2 hrs to improve them.
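For reference, the masked assignment pattern mentioned above behaves the same for every dtype and dimension in plain NumPy. The snippet below is a small reproducer sketch of that reference behavior, which a Pythran build would be expected to match; `copy_where` is a hypothetical helper name, not a function from the PR.

```python
import numpy as np


def copy_where(ax, cond):
    """Copy the entries of `ax` selected by the boolean mask `cond`."""
    res = np.zeros_like(ax)
    res[cond] = ax[cond]  # the assignment pattern in question
    return res


# 2-D integer input, the case reported as problematic under Pythran.
ax_int_2d = np.arange(6, dtype=np.int64).reshape(2, 3)
res_int_2d = copy_where(ax_int_2d, ax_int_2d > 2)

# 1-D float input, one of the cases reported as working.
ax_float_1d = np.array([0.5, 1.5, 2.5])
res_float_1d = copy_where(ax_float_1d, ax_float_1d > 1.0)
```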

What is coming up next?

Since it is not easy to find good algorithms anymore and we've already improved some, it is time to change the plan. Therefore, I will work on:

Use pytest and decorators to test Pythran-improved functions with different dtype inputs.
Revisit the algorithms we worked on and, if possible, draw a final conclusion.
Finish supporting keepdims in numpy mean in Pythran
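A minimal sketch of what decorator-driven dtype testing could look like is shown below. It is an assumption about the approach, not the code that ended up in SciPy: `for_dtypes`, `sum_abs`, and `check_sum_abs` are hypothetical names, and `sum_abs` merely stands in for a Pythran-compiled function.

```python
import functools

import numpy as np


def for_dtypes(*dtypes):
    """Run the decorated check once per dtype (hypothetical helper)."""
    def decorator(check):
        @functools.wraps(check)
        def wrapper():
            checked = []
            for dtype in dtypes:
                check(np.dtype(dtype))
                checked.append(np.dtype(dtype).name)
            return checked
        return wrapper
    return decorator


def sum_abs(values):
    # Stand-in for a Pythran-compiled kernel.
    return np.abs(values).sum()


@for_dtypes(np.int32, np.int64, np.float32, np.float64)
def check_sum_abs(dtype):
    values = np.array([-1, 2, -3], dtype=dtype)
    assert sum_abs(values) == 6


checked_dtypes = check_sum_abs()
```

With pytest itself, `@pytest.mark.parametrize("dtype", [...])` plays the same role and reports one test result per dtype.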

Did you get stuck anywhere?

I was stuck on supporting keepdims in numpy mean in Pythran and on finding potential algorithms.

Week #7: Support keepdims in Pythran's numpy mean

xingyuliu@g.harvard.edu (Xingyu-Liu) — Mon, 26 Jul 2021 18:02:28 +0000

What did you do this week?

[Merged] Review the PR DOC: clarify meaning of rvalue in stats.linregress
Document ENH: improve _sosfilt_float via Pythran
Leave the validation in the Python func: ENH: improve siegelslopes via pythran
ENH: improve cspline1d, qspline1d, and relative funcs via Pythran

In this case, I left the public functions cspline1d, qspline1d, cspline1d_eval, and qspline1d_eval, along with their docs, in Python.
How about 'cubic' and 'quadratic'? They also seem to be public functions.
Need to check whether we need to support more types even if it passes the checks.

WIP: support keepdims in numpy mean

What is coming up next?

Keep working on items 3, 4, and 5 above and hopefully merge them
Find more potential algorithms and improve them
Completed BENCH: add more benchmarks for inferential statistics tests

Did you get stuck anywhere?

While supporting keepdims in numpy mean, I added a function mean(E const &expr, types::none_type axis, dtype d, std::true_type keepdims), but I'm not sure how to declare the return type of this function. I think we need to calculate out_shape so that we can write -> decltype(numpy::functor::asarray{}(sum(expr) / typename dtype::type(expr.flat_size())).reshape(out_shape))
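To make the target behavior concrete, here is how keepdims shapes work on the NumPy side. This is a sketch of the reference semantics only, not of the Pythran C++ implementation; `keepdims_shape` is a hypothetical helper name.

```python
import numpy as np


def keepdims_shape(shape, axis=None):
    """Shape of a reduction result when keepdims=True."""
    if axis is None:
        # Reducing over everything keeps the rank but makes every axis 1.
        return tuple(1 for _ in shape)
    axis = axis % len(shape)  # normalise negative axes
    return tuple(1 if i == axis else n for i, n in enumerate(shape))


x = np.arange(24, dtype=float).reshape(2, 3, 4)

full = np.mean(x, keepdims=True)             # axis=None case
along_1 = np.mean(x, axis=1, keepdims=True)  # single-axis case

# Same idea as the decltype expression above: compute the scalar mean,
# turn it into an array, then reshape it to out_shape.
out_shape = keepdims_shape(x.shape)
reshaped = np.asarray(np.mean(x)).reshape(out_shape)
```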

Week #6: Improving siegelslopes, cspline1d, qspline1d, etc.

xingyuliu@g.harvard.edu (Xingyu-Liu) — Thu, 22 Jul 2021 02:47:09 +0000

What did you do this week?

Looked at the issue: Is the r-value outputted by scipy.stats.linregress always the Pearson correlation coefficient?
WIP: ENH: improve sort_vertices_of_regions via Pythran and made it more readable
- Tyler said test_spherical_voronoi may test inplace sort, and it is not recommended to remove a test. If so, we’ll never pass the test.
- For the type error, I can’t reproduce it on my computer. Is it similar to the issue BUG: RBFInterpolator fails when calling it with a slice of a (1, n) array? I encountered similar `reshaped` issues before and found that the type is often the problem, not `reshaped`: once I support that type, the error goes away. But in that case they do support that type.
Last week we concluded that _spectral.pyx and _sosfilt.pyx are easy to improve via Pythran, but later I found that _spectral.pyx already has a Pythran version. For _sosfilt.pyx, I improved _sosfilt_float and left _sosfilt_object in Cython. The performance of _sosfilt_float looks similar in Cython and Pythran, so I'm not sure whether I need to make a PR for it.
ENH: improve siegelslopes via pythran: 10x faster. If needed, I can also improve linregress and theilslopes in scipy/stats/_stats_mstats_common.py and put them in the same file as siegelslopes. But the other two functions do not have obvious loops, so here I only improved siegelslopes.
ENH: improve cspline1d, qspline1d, and relative funcs via Pythran: 10x faster.
- Segmentation fault on Azure pipelines. Is it because the function calls itself?
- A lot of signatures. Is there a more concise way?
- Actually, the functions that have lots of signatures and also cause the current segmentation faults, cspline1d_eval and qspline1d_eval, don’t have many loops. I improved them because they are used to evaluate cspline1d and qspline1d, so putting them in one file may look better. We could also leave them in the original file, which would avoid the two problems above.
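The signature count grows because each `#pythran export` comment declares one accepted combination of argument types, so every extra dtype or dimension adds a line. The sketch below illustrates this with a hypothetical kernel named `clip_below`; it is not one of the spline functions, and the file still runs as ordinary Python because the export lines are comments.

```python
# One export line per supported argument-type combination.
#pythran export clip_below(float64[:], float64)
#pythran export clip_below(int64[:], int64)
#pythran export clip_below(float64[:,:], float64)
import numpy as np


def clip_below(values, floor):
    """Hypothetical kernel: replace entries smaller than `floor` with `floor`."""
    return np.maximum(values, floor)


# Uncompiled, the function accepts anything NumPy accepts.
clipped_1d = clip_below(np.array([1.0, 5.0]), 2.0)
clipped_2d = clip_below(np.array([[0.0, 3.0], [4.0, 1.0]]), 2.0)
```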

What is coming up next?

Keep working on ENH: improve cspline1d, qspline1d, and relative funcs via Pythran
Find more potential algorithms and improve them
Make a PR for _sosfilt_float and comment on it
keepdims feature support in Pythran

Did you get stuck anywhere?

I once said that np.expand_dims() does not support dim as a keyword. I was wrong, because the keyword is axis, but I still got the following error. However, np.expand_dims(x, 1) works.


    (scipy-dev) charlotte@CHARLOTLIU-MB0 stats % pythran siegelslopes_pythran.py
    CRITICAL: I am in trouble. Your input file does not seem to match Pythran's constraints...
    siegelslopes_pythran.py:19:13 error: function uses an unknown (or unsupported) keyword argument `axis`
    ----
        deltax = np.expand_dims(x, axis=1) - x
                 ^~~~ (o_0)
    ----
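In plain NumPy, the keyword form, the positional form, and `np.newaxis` indexing all build the same pairwise-difference matrix, so the positional call is a drop-in workaround. The check below is a sketch on the NumPy side only; whether Pythran accepts the indexing form would still need to be verified.

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0])

# Keyword form: valid NumPy, but rejected by Pythran in the log above.
deltax_keyword = np.expand_dims(x, axis=1) - x

# Positional form: the workaround that compiled.
deltax_positional = np.expand_dims(x, 1) - x

# Equivalent indexing form in plain NumPy.
deltax_newaxis = x[:, np.newaxis] - x

# Entry [i, j] holds x[i] - x[j].
largest_gap = deltax_positional[2, 0]
```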

Week #5: Improving sort_vertices_of_regions, and write some tests

xingyuliu@g.harvard.edu (Xingyu-Liu) — Tue, 13 Jul 2021 15:40:13 +0000

What did you do this week?

Added unit test for BUG: fix stats.binned_statistic_dd issue with values close to bin edge
Added benchmarks for somersd: BENCH: add benchmark for somersd
Added tests in Pythran: Import test cases from scipy
Wrote the first evaluation; will submit it later.
WIP: ENH: improve sort_vertices_of_regions via Pythran and made it more readable. However, I got some weird type errors, and test_spherical_voronoi and test_region_types failed. Tyler suggested it may not be a good case for Pythran, and I'm still trying to find out why these errors occur.

What is coming up next?

Submit the first evaluations
Continue working on sort_vertices_of_regions(), try to fix the failures
Look into and maybe improve some of the following algorithms: _spectral.pyx and _sosfilt.pyx

Did you get stuck anywhere?

The WIP PR mentioned above: WIP: ENH: improve sort_vertices_of_regions via Pythran and made it more readable. It fails two tests: test_spherical_voronoi and test_region_types.

Week #4: Improving binned_statistic_dd and _voronoi, and fix some issues

xingyuliu@g.harvard.edu (Xingyu-Liu) — Tue, 06 Jul 2021 13:19:10 +0000

What did you do this week?

First, I came back to the old problem, the bus error. It turns out to be specific to Mac. We still don't know the cause of the problem. (bus error on Mac but works fine on Linux for _count_paths_outside_method pythran version)

Last week I said that the benchmark result is different from my timeit result. It is actually my mistake: I forgot to modify setup.py. After setting up correctly, the problem was fixed.

Also, I have made a PR for binned_statistic_dd, the algorithm I have been improving since last week. At first, I improved the whole if-elif block, and the benchmark shows this makes count, sum, mean 1.1x faster and std, median, min, max 3x-30x faster. However, I found that Pythran can't support object type input, so some tests failed. To support object type, we would need to keep all of the pure Python code, which would make the if-elif block duplicated and ugly. Since the benchmark shows little improvement for count, sum, mean, I also tried improving only std, median, min, max to keep the code clean and understandable. So in the end, I only improved a small inner function but still get std, median, min, max 3x-30x faster, with no changes for count, sum, mean. (ENH: improved binned_statistic_dd via Pythran)

While I was improving binned_statistic_dd, there happened to be an open issue about floating-point comparison. I looked into it and fixed it. (BUG: fix stats.binned_statistic_dd issue with values close to bin edge)
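As a general illustration of why values near a bin edge are tricky, the snippet below shows an exact comparison failing where a tolerance-based one succeeds. This is a generic sketch, not the fix applied in the PR; `edge` and `value` are made-up numbers.

```python
import math

edge = 0.3         # a nominal bin edge
value = 0.1 + 0.2  # mathematically on the edge, but stored as 0.30000000000000004

# Exact comparison treats the value as lying past the edge.
strictly_equal = value == edge
past_edge = value > edge

# A tolerance-based comparison treats it as being on the edge.
on_edge = math.isclose(value, edge, rel_tol=1e-9)
```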

Last but not least, I tried to speed up _voronoi, which we discussed last week, and the Pythran version is 3x faster than the Cython one!

What is coming up next?

Refer to the original Python version rather than the Cython one, and make the Pythran version of _voronoi more readable. After that, make a PR.
Add test for the binned_statistic_dd bug
Add benchmarks for somersd and _tau_b
Prepare for the first evaluations
In Pythran, import some scipy tests

Did you get stuck anywhere?

The bus error mentioned above, and build_docs failed on my PR recently.

Week #3: Improving stats.binned_statistic_dd, somersd and _tau_b

xingyuliu@g.harvard.edu (Xingyu-Liu) — Sun, 27 Jun 2021 15:30:01 +0000

What did you do this week?

As this project progressed, I came to realize that its main difficulty lies in finding good candidate algorithms. This week I searched for candidates in scipy.stats. At first, as before, I used pytest --durations=50 to find the slowest 50 tests. I looked into all the functions but didn't find any suitable algorithms: some do not have obvious loops; some have loops but call another scipy method in the loop... Therefore, I began checking all the functions under scipy/stats one by one, finally found that somersd and _tau_b are good candidates, and submitted a PR (gh-14308) to speed them up by 4x~20x.

Besides, for the work mentioned last week:

stats._moment: submitted an issue (pythran #1820) because keepdims is not supported in np.mean()
stats._calc_binned_statistic: successfully improved this function and made the public function stats.binned_statistic_dd 3x-10x faster on min, max, std, median. I tried to improve the whole if-elif block but encountered some errors that I can't fix (see pythran #1819)
stats._sum_abs_axis0: Thanks to Serge, the compilation error due to the variant type is fixed. I compiled it, and it is ~2x faster on _sum_abs_axis0, but there is not much gain on the public function onenormest. Moreover, there is actually no loop in _sum_abs_axis0 for input sizes smaller than 2**20 (my bad!)
sparse.linalg.expm(_fragment_2_1): Last week I said this one is slower than the pure Python version, but it is not. My bad: with my input at that time, the code never reached _fragment_2_1, so I didn't actually test it. Only when the input size is larger than 9000 does it reach the function. Moreover, the input is a csc matrix, so it is not a suitable one for Pythran.
the SciPy build error (see pythran #1815) mentioned last week: It is a really tough problem. Serge and Ralf tried to help me fix it, but it is still not working for now.

What is coming up next?

add benchmarks for somersd and _tau_b
consider merging the old benchmark PR?(gh-14228)
keep searching good potential algorithms to be improved.

Did you get stuck anywhere?

The problem I encountered when improving the whole if-elif block in stats.binned_statistic_dd (see pythran #1819).

Week #2: Improving stats.ks_2samp

xingyuliu@g.harvard.edu (Xingyu-Liu) — Mon, 21 Jun 2021 08:48:45 +0000

What did you do this week?

This week was quite a struggle. My mentor Serge quickly implemented support for `scipy.special.binom` in Pythran, and it shows a great improvement on the public function stats.ks_2samp (2.62ms vs 88.2ms). However, when I was building scipy with the improved algorithm, we found that it would cause a loop problem. Serge made a PR to break the loop, but on my computer it is still not working.

Then I turned to the other algorithms that I mentioned last week. However, I encountered more problems:

stats._moment: keepdims is not supported in np.mean()
stats._calc_binned_statistic: invalid pythran spec, but I can't find anything wrong
stats.rankdata: invalid pythran spec
stats._sum_abs_axis0: compilation error
sparse.linalg.expm(_fragment_2_1): much slower than the pure Python one; I will keep investigating it.

What is coming up next?

submit an issue for keepdims
submit an issue for the error in stats._sum_abs_axis0
continue improving stats._calc_binned_statistic
find out why the sparse.linalg.expm Pythran version is slower.

Did you get stuck anywhere?

I got stuck on many problems, as described in the "What did you do this week?" section.

Week #1: Writing Benchmarks

xingyuliu@g.harvard.edu (Xingyu-Liu) — Mon, 14 Jun 2021 08:43:28 +0000

What did you do this week?

This week, I mainly focused on writing benchmarks and investigating potential slow algorithms.

Wrote more benchmarks for inferential stats: my PR

KS test
MannWhitneyU
RankSums
BrunnerMunzel
chisquare
friedmanchisquare
epps_singleton_2samp
kruskal

Modified to use the new random API `rng = np.random.default_rng(12345678)`: my PR
Documented why some functions can’t be sped up via Pythran: my doc
Found more potential algorithms that can be sped up via Pythran
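For context, the new random API mentioned in the list above replaces the global seeded state with an explicit Generator object. The sketch below shows typical usage in a benchmark-style setup; the array sizes and variable names are arbitrary choices for illustration.

```python
import numpy as np

# A seeded Generator gives reproducible benchmark inputs.
rng = np.random.default_rng(12345678)
uniform_sample = rng.random(5)                 # replaces np.random.rand(5)
integer_sample = rng.integers(0, 10, size=5)   # replaces np.random.randint(0, 10, 5)
normal_sample = rng.standard_normal(5)         # replaces np.random.randn(5)

# A second Generator with the same seed reproduces the same stream.
rng_again = np.random.default_rng(12345678)
uniform_again = rng_again.random(5)
```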

What is coming up next?

Improve two of the following functions:

stats.friedmanchisquare: related to rankdata

    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
    7970       501        351.0      0.7      0.5      for i in range(len(data)):
    7971       500      51417.0    102.8     75.8          data[i] = rankdata(data[i])

stats.binned_statistic_dd
sparse.linalg.onenormest
_fragment_2_1 in scipy/sparse/linalg/matfuncs.py

Did you get stuck anywhere?

When benchmarking, I found that Mannwhitney is pretty slow. Profiling shows that `p = _mwu_state.sf(U.astype(int), n1, n2)` takes up 100% of the time. Looking into the function, `pmf` is the slowest part. @mdhaber mentioned that he would be interested in looking into these things himself later this summer.

    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
    25                                               @profile
    26                                               def pmf(self, k, m, n):
    27                                                   '''Probability mass function'''
    28         1      29486.0  29486.0      0.2          self._resize_fmnks(m, n, np.max(k))
    29                                                   # could loop over just the unique elements, but probably not worth
    30                                                   # the time to find them
    31      1384       1701.0      1.2      0.0          for i in np.ravel(k):
    32      1383   18401083.0  13305.2     99.8              self._f(m, n, i)
    33         1         71.0     71.0      0.0          return self._fmnks[m, n, k] / special.binom(m + n, m)

Week #0: Community Building and Getting Started

xingyuliu@g.harvard.edu (Xingyu-Liu) — Tue, 08 Jun 2021 15:11:14 +0000

Introduction

Hi everyone! I’m Xingyu Liu, a first-year data science master’s student at Harvard University. I’m very excited to be accepted by SciPy, and I will work on using Pythran to improve algorithms’ performance in SciPy! There are currently many algorithms that are too slow in pure Python, and Pythran can be a good tool to accelerate them. My goal is to investigate and improve the slow algorithms, as well as write benchmarks for them.

What did you do this week?

In the community bonding period, I met with my mentors, Ralf Gommers and Serge Guelton. They are very kind, responsive and helpful. We discussed my project and set up a chat and a weekly sync. In the last week, I started working on my project:

Issues:

Pull Requests:

Readings:

What is coming up next?

Write benchmarks for inferential stats
Modify to use the new random API `rng = np.random.default_rng(12345678)` (according to comments in BENCH: add benchmark for f_oneway)
Find more potential algorithms that can be sped up via Pythran
Document why some functions can’t be sped up via Pythran

Did you get stuck anywhere?

For my first pull request, we found that the Pythran version is not better than the original because of the indexing operations.