Articles on Xingyu-Liu's Blog
https://blogs.python-gsoc.org
Updates on different articles published on Xingyu-Liu's Blog
Tue, 24 Aug 2021 12:16:38 +0000

Week #11: Writing Tests and Finished Submission
https://blogs.python-gsoc.org/en/xingyu-lius-blog/week-11-writing-tests-and-finished-submission/
<h2>What did you do this week?</h2>
<ul>
<li> Finished <a href="https://github.com/serge-sans-paille/pythran/pull/1878#">General implementation of supporting immediate arguments</a>
</li><li> GSoC submission: <a href="http://serge-sans-paille.github.io/pythran-stories/gsoc21-improve-performance-through-the-use-of-pythran.html">blog</a> </li>
<li> Merged some cases in <a href="https://github.com/scipy/scipy/pull/14559">WIP: TST: add tests for Pythran somersd </a></li>
</ul>
<h2>What is coming up next?</h2>
<ul>
<li> <a href="https://github.com/scipy/scipy/pull/14625">ENH: optimize min max and median scipy.stats.binned_statistic</a> by another contributor.
His version performs even better than our Pythran improvement (<a href="https://github.com/scipy/scipy/pull/14345">ENH:
improved binned_statistic_dd via Pythran</a>).
He vectorized the code and reduced two loops into one.
Maybe we can try Pythran again based on his implementation?
</li>
<li>Finish <a href="https://github.com/scipy/scipy/pull/14559">WIP: TST: add tests for Pythran somersd </a> </li>
</ul>
xingyuliu@g.harvard.edu (Xingyu-Liu)
Tue, 24 Aug 2021 12:16:38 +0000
https://blogs.python-gsoc.org/en/xingyu-lius-blog/week-11-writing-tests-and-finished-submission/

Week #10: Supporting immediate arguments in Pythran
https://blogs.python-gsoc.org/en/xingyu-lius-blog/week-10-supporting-immediate-arguments-in-pythran/
<h2>What did you do this week?</h2>
<ul>
<li> <a href="https://github.com/serge-sans-paille/pythran/pull/1876#">Support boolean arguments in numpy unique</a></li>
<li> <a href="https://github.com/serge-sans-paille/pythran/pull/1878#">General implementation of supporting immediate arguments</a></li>
<li> <a href="https://github.com/scipy/scipy/pull/14559">WIP: TST: add tests for Pythran somersd</a></li>
</ul>
<h2>What is coming up next?</h2>
<ul>
<li> Finish <a href="https://github.com/serge-sans-paille/pythran/pull/1878#"> General implementation of supporting immediate arguments</a></li>
<li> Investigate more on <a href="https://github.com/scipy/scipy/pull/14559">WIP: TST: add tests for Pythran somersd</a></li>
<li> Prepare for the final evaluation</li>
</ul>
<h2>Did you get stuck anywhere?</h2>
In <a href="https://github.com/serge-sans-paille/pythran/pull/1878#">General implementation of supporting immediate arguments</a>,
I ran into an <code>AttributeError: 'FunctionDef' object has no attribute 'immediate_arguments'</code>.
A potential solution is to hard-code a check for <code>FunctionDef</code> objects and skip them.
xingyuliu@g.harvard.edu (Xingyu-Liu)
Tue, 17 Aug 2021 03:38:52 +0000
https://blogs.python-gsoc.org/en/xingyu-lius-blog/week-10-supporting-immediate-arguments-in-pythran/

Week #9: Adding tests for Pythran functions, and review the opened PRs
https://blogs.python-gsoc.org/en/xingyu-lius-blog/week-9-adding-tests-for-pythran-functions-and-review-the-opened-prs/
<h2>What did you do this week?</h2>
<ul>
<li><a href="https://github.com/scipy/scipy/pull/14559">WIP: TST: add tests for Pythran somersd </a></li>
<li><a href="https://github.com/serge-sans-paille/pythran/pull/1855">WIP: support keepdims in numpy mean </a></li>
<li>Revisit and summarize the unsuccessful PRs about Pythran in SciPy</li>
<table style="border: 1px solid black;">
<tbody><tr>
<th style="border: 1px solid black;">PR</th>
<th style="border: 1px solid black;">Reason</th>
</tr>
<tr>
<td style="border: 1px solid black;"><a href="https://github.com/scipy/scipy/pull/13957">ENH: Pythran implementation
of _compute_prob_outside_square and _compute_prob_inside_method to speedup stats.ks_2samp</a></td>
<td style="border: 1px solid black;">Failed some tests before but works now</td>
</tr>
<tr>
<td style="border: 1px solid black;"><a href="https://github.com/scipy/scipy/pull/14154">ENH: Pythran implementation of _cdf_distance </a></td>
<td style="border: 1px solid black;">The Pythran version is slightly faster than the Python one after np.searchsorted was fixed.
It could improve further once SciPy begins to use XSIMD. On hold for now.</td>
</tr>
<tr>
<td style="border: 1px solid black;"><a href="https://github.com/scipy/scipy/pull/14314">WIP: ENH: improve _count_paths_outside_method via pythran</a></td>
<td style="border: 1px solid black;">Relates to <a href="https://github.com/scipy/scipy/issues/14315">bus error on Mac but works fine on Linux for _count_paths_outside_method pythran version</a></td>
</tr>
<tr>
<td style="border: 1px solid black;"><a href="https://github.com/scipy/scipy/pull/14376">WIP: ENH: improve sort_vertices_of_regions via Pythran and made it more readable </a></td>
<td style="border: 1px solid black;">Test failures:
1) test_spherical_voronoi: in-place sort.
2) test_region_types: the specified input type for regions is int64 list list. When an element in self.regions is numpy.int64, Pythran automatically converts it to the int type.</td>
</tr>
</tbody></table>
</ul>
<h2>What is coming up next?</h2>
<ul>
<li><a href="https://github.com/scipy/scipy/pull/14559">WIP: TST: add tests for Pythran somersd </a>: keep working on this</li>
<li><a href="https://github.com/serge-sans-paille/pythran/pull/1855">WIP: support keepdims in numpy mean </a>: make it more general</li>
<li> Test XSIMD for _cdf_distance </li>
</ul>
<h2>Did you get stuck anywhere?</h2>
I got stuck on supporting keepdims in numpy mean in Pythran. Thanks to Serge, who helped me
fix many problems, this will be completed this week.
xingyuliu@g.harvard.edu (Xingyu-Liu)
Mon, 09 Aug 2021 15:07:06 +0000
https://blogs.python-gsoc.org/en/xingyu-lius-blog/week-9-adding-tests-for-pythran-functions-and-review-the-opened-prs/

Week #8: Support keepdims in numpy mean, hunt potential algorithms to be improved
https://blogs.python-gsoc.org/en/xingyu-lius-blog/week-8-support-keepdims-in-numpy-mean-hunt-potential-algorithms-to-be-improved/
<h2>What did you do this week?</h2>
<ul>
<li> <a href="https://github.com/scipy/scipy/pull/14430">ENH: improve siegelslopes via pythran</a>: cleaned up the code; all checks passed.</li>
<li> <a href="https://github.com/scipy/scipy/pull/14429">ENH: improve cspline1d, qspline1d, and relative funcs via Pythran</a>
Only the private functions are improved, and all checks pass. However, I found a potential problem:
<a href="https://github.com/serge-sans-paille/pythran/issues/1858">array assignment res[cond1] = ax[cond1] works fine for int[] or float[] or float[:,:] but not int[:,:] </a>
</li>
<li><a href="https://github.com/serge-sans-paille/pythran/pull/1855">WIP: support keepdims in numpy mean</a>
It passed all the checks after I changed it to use <code>str(node.value).lower()</code>. I added tests for <code>keepdims=False</code>, but there are some check failures.
</li>
<li><a href="https://github.com/charlotte12l/scipy/pull/2">ENH: improve _cplxreal, _falling_factorial, _bessel_poly, _arc_jac_sn… </a>
The improvement is small and not very meaningful because these are already fast algorithms, so I opened the PR only in my own repo.
I am now stuck on finding potential algorithms to improve: I often spend ~10 hours finding algorithms and ~2 hours improving them.
</li>
</ul>
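<p>For context on the potential problem above, the pattern in question is ordinary NumPy boolean-mask assignment. The following is a minimal sketch of the expected NumPy behaviour on a 2-D integer array; the function name <code>copy_where</code> and the sample data are illustrative assumptions, not code from the PR or from the Pythran issue.</p>

```python
import numpy as np


def copy_where(ax, threshold):
    """Copy the entries of ax that exceed threshold; leave the rest at zero.

    Hypothetical helper, used only to show the res[cond1] = ax[cond1] pattern.
    """
    res = np.zeros_like(ax)
    cond1 = ax > threshold      # boolean mask with the same shape as ax
    res[cond1] = ax[cond1]      # masked assignment, the pattern from the issue
    return res


ax = np.array([[1, 5], [7, 2]])   # a 2-D integer array, i.e. the int[:,:] case
out = copy_where(ax, 3)
```

<p>In plain NumPy this works the same way for 1-D and 2-D inputs and for integer and floating-point dtypes, which is the behaviour a compiled version would need to match.</p>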
<h2>What is coming up next?</h2>
Since it is not easy to find good algorithms anymore and we've already improved some, it is time to change the plan.
Therefore, I will work on:
<ul>
<li>Use pytest and decorators to test the Pythran-improved functions with inputs of different dtypes.</li>
<li>Revisit the algorithms we worked on and, if possible, draw a final conclusion.</li>
<li> Finish supporting keepdims in numpy mean in Pythran</li>
</ul>
<h2>Did you get stuck anywhere?</h2>
I got stuck on supporting keepdims in numpy mean in Pythran and on finding potential algorithms.
xingyuliu@g.harvard.edu (Xingyu-Liu)
Tue, 03 Aug 2021 17:07:31 +0000
https://blogs.python-gsoc.org/en/xingyu-lius-blog/week-8-support-keepdims-in-numpy-mean-hunt-potential-algorithms-to-be-improved/

Week #7: Support keepdims in Pythran's numpy mean
https://blogs.python-gsoc.org/en/xingyu-lius-blog/week-7-support-keepdims-in-pythran-s-numpy-mean/
<h2>What did you do this week?</h2>
<ul>
<li>[Merged] Reviewed the PR <a href="https://github.com/scipy/scipy/pull/14458">DOC: clarify meaning of rvalue in stats.linregress </a></li>
<li> <a href="https://github.com/scipy/scipy/pull/14473">Document ENH: improve _sosfilt_float via Pythran </a></li>
<li>Left the validation in the Python function: <a href="https://github.com/scipy/scipy/pull/14430">ENH: improve siegelslopes via pythran</a></li>
<li> <a href="https://github.com/scipy/scipy/pull/14429">ENH: improve cspline1d, qspline1d, and relative funcs via Pythran</a></li>
<ul>
<li> In this case, I left the public functions cspline1d, qspline1d, cspline1d_eval, and qspline1d_eval, along with their docs, in Python</li>
<li> How about 'cubic' and 'quadratic'? They also seem to be public functions.</li>
<li> Need to check whether we should support more types even if the checks pass</li>
</ul>
<li> <a href="https://github.com/serge-sans-paille/pythran/pull/1855">WIP: support keepdims in numpy mean</a></li>
</ul>
<h2>What is coming up next?</h2>
<ul>
<li> Keep working on items 3, 4, and 5 above, and hopefully merge them</li>
<li> Find more potential algorithms and improve them </li>
<li> Complete <a href="https://github.com/scipy/scipy/pull/14228#pullrequestreview-682448181">BENCH: add more benchmarks for inferential statistics tests</a> </li>
</ul>
<h2>Did you get stuck anywhere?</h2>
While supporting keepdims in numpy mean, I added a function <code>mean(E const &expr, types::none_type axis, dtype d, std::true_type keepdims)</code>, but I'm not sure how to declare the
return type of this function. I think we need to calculate <code>out_shape</code> so that we can write <code>-> decltype(numpy::functor::asarray{}(sum(expr) / typename dtype::type(expr.flat_size())).reshape(out_shape)) </code>
xingyuliu@g.harvard.edu (Xingyu-Liu)
Mon, 26 Jul 2021 18:02:28 +0000
https://blogs.python-gsoc.org/en/xingyu-lius-blog/week-7-support-keepdims-in-pythran-s-numpy-mean/

Week #6: Improving siegelslopes, cspline1d, qspline1d, etc.
https://blogs.python-gsoc.org/en/xingyu-lius-blog/week-6-improving-siegelslopes-cspline1d-qspline1d-etc/
<h2>What did you do this week?</h2>
<ol>
<li>Looked at the issue <a href="https://github.com/scipy/scipy/issues/14416"> Is the r-value outputted by scipy.stats.linregress always the Pearson correlation coefficient? </a></li>
<li> <a href="https://github.com/scipy/scipy/pull/14376"> WIP: ENH: improve sort_vertices_of_regions via Pythran and made it more readable </a>
<ul>
<li>Tyler said <code>test_spherical_voronoi</code> may test the in-place sort, and removing a test is not recommended. In that case, we will never pass the test.</li>
<li>I can’t reproduce the type error on my computer.
Is it similar to the issue <a href="https://github.com/scipy/scipy/issues/14420">BUG: RBFInterpolator fails when calling it with a slice of a (1, n) array</a>? I encountered similar <code>reshaped</code> issues before, and found that the type is often the problem while <code>reshaped</code> is not.
Once I support that type, the error goes away. But in that case, they do support that type.</li>
<pre><code>
TypeError: Invalid call to pythranized function `sort_vertices_of_regions(int32[:, :], int32 list list)'
Candidates are:
- sort_vertices_of_regions(int64[:,:], int64 list list)
- sort_vertices_of_regions(int32[:,:], int32 list list)
- sort_vertices_of_regions(int32[:,:], int64 list list)
- sort_vertices_of_regions(int[:,:], int list list)
</code></pre>
</ul></li>
<li>Last week we concluded that <code>_spectral.pyx</code> and <code>_sosfilt.pyx</code> would be easy to improve via Pythran, but I later found that <code>_spectral.pyx</code> already has a Pythran version. For <code>_sosfilt.pyx</code>,
I improved <code>_sosfilt_float</code> and left <code>_sosfilt_object</code> in Cython. The performance of <code>_sosfilt_float</code> looks similar in Cython and Pythran,
so I'm not sure whether I need to make a PR for it.</li>
<li> <a href="https://github.com/scipy/scipy/pull/14430"> ENH: improve siegelslopes via pythran </a>, 10x faster. If needed, I can also improve <code>linregress</code> and <code>theilslopes</code> from <code>scipy/stats/_stats_mstats_common.py</code>
and put them in the same file as <code>siegelslopes</code>. But the other two functions do not have obvious loops, so here I only improved <code>siegelslopes</code>.</li>
<li> <a href="https://github.com/scipy/scipy/pull/14429"> ENH: improve cspline1d, qspline1d, and relative funcs via Pythran </a>, 10x faster.
<ul>
<li>Segmentation fault on <a href="https://github.com/scipy/scipy/pull/14429/checks ">Azure pipelines</a>. Is it because the function calls itself? </li>
<li>A lot of signatures. Any more concise way?</li>
<li>Actually, the functions that need lots of signatures and currently cause the segmentation faults, <code>cspline1d_eval</code> and <code>qspline1d_eval</code>, don’t have many loops. I improved them because they are used to evaluate <code>cspline1d</code> and <code>qspline1d</code>, and putting them all in one file may look better.
We can also leave them in the original file so that we avoid the two problems above.</li>
</ul>
</li>
</ol>
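<p>To illustrate the kind of loop the siegelslopes item above refers to, here is a minimal pure-NumPy sketch of a repeated-median slope estimate built on pairwise differences. It is a simplified illustration under stated assumptions, not the code from the PR: the function name <code>repeated_median_slope</code> is hypothetical, the intercept is omitted, and the input is assumed to contain at least two distinct <code>x</code> values. The axis is passed positionally to <code>np.expand_dims</code>.</p>

```python
import numpy as np


def repeated_median_slope(x, y):
    """Median over points of the median pairwise slope through each point."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Pairwise differences: deltax[j, i] = x[j] - x[i]
    deltax = np.expand_dims(x, 1) - x
    deltay = np.expand_dims(y, 1) - y
    slopes = np.empty(len(x))
    for j in range(len(x)):
        nonzero = deltax[j] != 0          # skip pairs with identical x
        slopes[j] = np.median(deltay[j][nonzero] / deltax[j][nonzero])
    return np.median(slopes)


x = np.array([0.0, 1.0, 2.0, 3.0])
clean = repeated_median_slope(x, 2 * x + 1)
with_outlier = repeated_median_slope(x, np.array([1.0, 3.0, 5.0, 100.0]))
```

<p>The explicit loop over <code>j</code> is the part that a compiler such as Pythran can speed up; the single outlier in the second call does not move the estimate away from the slope of the remaining points.</p>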
<h2>What is coming up next?</h2>
<ol>
<li>Keep working on <a href="https://github.com/scipy/scipy/pull/14429"> ENH: improve cspline1d, qspline1d, and relative funcs via Pythran </a> </li>
<li> Find more potential algorithms and improve them </li>
<li>Make a PR for <code>_sosfilt_float</code> and comment on it</li>
<li> <code>keepdims</code> feature support in Pythran </li>
</ol>
<h2>Did you get stuck anywhere?</h2>
I once said that <code>np.expand_dims()</code> does not support dim as a keyword. I was wrong, because the keyword is axis, but I still got the following error when using it. However, the positional form
<code>np.expand_dims(x, 1) </code> works.
<pre><code>
(scipy-dev) charlotte@CHARLOTLIU-MB0 stats % pythran siegelslopes_pythran.py
CRITICAL: I am in trouble. Your input file does not seem to match Pythran's constraints...
siegelslopes_pythran.py:19:13 error: function uses an unknown (or unsupported) keyword argument `axis`
----
deltax = np.expand_dims(x, axis=1) - x
^~~~ (o_0)
----
</code></pre>
xingyuliu@g.harvard.edu (Xingyu-Liu)
Thu, 22 Jul 2021 02:47:09 +0000
https://blogs.python-gsoc.org/en/xingyu-lius-blog/week-6-improving-siegelslopes-cspline1d-qspline1d-etc/

Week #5: Improving sort_vertices_of_regions, and write some tests
https://blogs.python-gsoc.org/en/xingyu-lius-blog/week-5-improving-sort-vertices-of-regions-and-write-some-tests/
<h2>What did you do this week?</h2>
<ul>
<li> Added unit test for <a href="https://github.com/scipy/scipy/pull/14338">BUG: fix stats.binned_statistic_dd issue with values close to bin edge</a></li>
<li> Added benchmarks for somersd <a href="https://github.com/scipy/scipy/pull/14381">BENCH: add benchmark for somersd</a> </li>
<li> Added tests in Pythran: <a href="https://github.com/serge-sans-paille/pythran/pull/1830"> Import test cases from scipy</a> </li>
<li> Wrote the first evaluation; I will submit it later.</li>
<li><a href="https://github.com/scipy/scipy/pull/14376">WIP: ENH: improve sort_vertices_of_regions via Pythran and made it more readable </a>
However, I got some strange type errors, and the PR fails test_spherical_voronoi and test_region_types.
Tyler suggested that it may not be a good case for Pythran, and I'm still trying to find out why these errors occur.
</li>
</ul>
<h2>What is coming up next?</h2>
<ul>
<li>Submit the first evaluations</li>
<li>Continue working on sort_vertices_of_regions(), try to fix the failures</li>
<li>Look into and maybe improve some of the following algorithms: _spectral.pyx and _sosfilt.pyx </li>
</ul>
<h2>Did you get stuck anywhere?</h2>
The WIP PR mentioned above: <a href="https://github.com/scipy/scipy/pull/14376">WIP: ENH: improve sort_vertices_of_regions via Pythran and made it more readable </a>.
It fails two tests: test_spherical_voronoi and test_region_types.
xingyuliu@g.harvard.edu (Xingyu-Liu)
Tue, 13 Jul 2021 15:40:13 +0000
https://blogs.python-gsoc.org/en/xingyu-lius-blog/week-5-improving-sort-vertices-of-regions-and-write-some-tests/

Week #4: Improving binned_statistic_dd and _voronoi, and fix some issues
https://blogs.python-gsoc.org/en/xingyu-lius-blog/week-4-improving-binned-statistic-dd-and-voronoi-and-fix-some-issues/
<h2>What did you do this week?</h2>
<p> First, back to the old problem: the <code>bus error</code>. It turns out to be specific to Mac,
and we still don't know the cause (<a href="https://github.com/scipy/scipy/issues/14315">
bus error on Mac but works fine on Linux for _count_paths_outside_method pythran version</a>).</p>
<p> Last week I said that the benchmark result was different from my <code>timeit</code> result. It was actually
my mistake: I forgot to modify <code>setup.py</code>. After setting it up correctly, the problem was fixed.</p>
<p> Also, I have made a PR for <code>binned_statistic_dd</code>, the algorithm I have been improving since last week.
At first, I improved the whole <code>if-elif</code> block, and the benchmark shows it makes
<code>count, sum, mean</code> 1.1x faster and <code>std, median, min, max</code> 3x-30x faster. However, I found
that Pythran can't support <code>object</code> type input, so some tests failed. To support the <code>object</code> type,
we would need to keep all of the pure Python code, which would make the <code>if-elif</code> block duplicated and ugly.
Since the benchmark shows little improvement for <code>count, sum, mean</code>,
I also tried improving only <code>std, median, min, max</code> to keep the code cleaner and easier to understand.
So in the end, I only improved a small inner function but still get <code>std, median, min, max</code> 3x-30x faster, with
no changes for <code>count, sum, mean</code> (<a href="https://github.com/scipy/scipy/pull/14345">ENH: improved binned_statistic_dd via Pythran</a>).</p>
<p> When I was improving <code>binned_statistic_dd</code>, there happened to be an open issue about floating-point comparison.
I looked into that and fixed it.(<a href="https://github.com/scipy/scipy/pull/14338">
BUG: fix stats.binned_statistic_dd issue with values close to bin edge </a>)
</p><p> Last but not least, I tried to speed up <code>_voronoi </code> discussed last week, and the Pythran version is 3x faster
than the Cython one!</p>
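<p>To make the "small inner function" idea above concrete, here is a minimal sketch of a per-bin statistic loop. It is a simplified illustration, not the SciPy implementation: the function name <code>binned_stat</code>, its signature, and the sample data are assumptions, and bin numbers are taken to be integers in <code>[0, n_bins)</code>.</p>

```python
import numpy as np


def binned_stat(values, bin_numbers, n_bins, stat_func):
    """Apply stat_func to the values that fall in each occupied bin."""
    result = np.full(n_bins, np.nan)          # empty bins stay NaN
    for i in np.unique(bin_numbers):
        result[i] = stat_func(values[bin_numbers == i])
    return result


values = np.array([1.0, 2.0, 10.0, 4.0, 6.0])
bin_numbers = np.array([0, 0, 2, 1, 1])       # bin index of each value
medians = binned_stat(values, bin_numbers, 4, np.median)
maxima = binned_stat(values, bin_numbers, 4, np.max)
```

<p>Statistics such as the median, min, and max need a loop like this over the occupied bins, which may be why they benefit more from compilation than <code>count</code>, <code>sum</code>, and <code>mean</code>.</p>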
<h2>What is coming up next?</h2>
<ul>
<li> Refer to the original Python version rather than the Cython one to make the Pythran version of <code>_voronoi </code> more readable. After that, make a PR.</li>
<li> Add test for the <code>binned_statistic_dd</code> bug</li>
<li> Add benchmarks for somersd and _tau_b </li>
<li> Prepare for the first evaluations </li>
<li> In Pythran, import some scipy tests </li>
</ul>
<h2>Did you get stuck anywhere?</h2>
The <code> bus error</code> mentioned above, and <code>build_docs</code> failed on my PR recently.
xingyuliu@g.harvard.edu (Xingyu-Liu)
Tue, 06 Jul 2021 13:19:10 +0000
https://blogs.python-gsoc.org/en/xingyu-lius-blog/week-4-improving-binned-statistic-dd-and-voronoi-and-fix-some-issues/

Week #3: Improving stats.binned_statistic_dd, somersd and _tau_b
https://blogs.python-gsoc.org/en/xingyu-lius-blog/week-3-improving-stats-binned-statistic-dd-somersd-and-tau-b/
<h2>What did you do this week?</h2>
As this project progressed, I came to realize that the difficulty of this project lies in finding
good candidate algorithms. This week I was searching for candidates in <code>scipy.stats</code>.
At first, as before, I used <code>pytest --durations=50</code> to find the 50 slowest tests.
I looked into all the functions but didn't find any suitable algorithms: some do not have obvious loops;
some have loops but call another SciPy method in the loop... Therefore, I began checking all the functions
under <code> scipy/stats </code> one by one and finally found that <code>somersd</code> and <code>_tau_b</code>
are good candidates. I submitted a PR (<a href="https://github.com/scipy/scipy/pull/14308">gh-14308</a>) that speeds them up 4x-20x.
<p>Besides, for the work mentioned last week:</p>
<ol>
<li><code>stats._moment</code>: submitted an issue (<a href="https://github.com/serge-sans-paille/pythran/issues/1820">pythran #1820</a>)
because keepdims is not supported in np.mean()</li>
<li><code>stats._calc_binned_statistic</code>: successfully improved this function and made the public function
<code>stats.binned_statistic_dd</code> 3x-10x faster on <code>min,max,std,median</code>.
I tried to improve the whole <code>if-elif</code> block but encountered some errors that I can't fix
(see <a href="https://github.com/serge-sans-paille/pythran/issues/1819#issuecomment-869102923">pythran #1819 </a>)</li>
<li><code>stats._sum_abs_axis0</code>: Thanks to Serge, the compilation error due to the variant type is fixed.
I compiled it and <code>_sum_abs_axis0 </code> is ~2x faster, but there is not much gain for the public function
<code>onenormest</code>. Moreover, there is actually no loop in <code>_sum_abs_axis0 </code> for input sizes smaller than 2**20 (my bad!) </li>
<li><code>sparse.linalg.expm(_fragment_2_1)</code>: Last week I said this one is slower than the pure Python version, but
it is not. My bad: with the input I used at that time, the code never reached <code> _fragment_2_1</code>,
so I didn't actually test it. The function is only reached when the input size is larger than 9000.
Moreover, the input is a <code>csc matrix </code>, so it is not a suitable case for Pythran. </li>
<li>the SciPy build error (see <a href="https://github.com/serge-sans-paille/pythran/issues/1815">pythran #1815</a>) mentioned last week: it is a really tough problem.
Serge and Ralf tried to help me fix it, but it is still not working for now. </li>
</ol>
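<p>The remark above about <code>_sum_abs_axis0</code> having no real loop for small inputs can be seen in a sketch of a block-wise column sum. This is a simplified illustration modeled on the description in this post, not a copy of the SciPy source: the function name <code>sum_abs_axis0</code> and the default <code>block_size</code> argument are assumptions, and the input is assumed to be non-empty.</p>

```python
import numpy as np


def sum_abs_axis0(X, block_size=2**20):
    """Sum of absolute values down each column, processed in row blocks."""
    r = None
    for j in range(0, X.shape[0], block_size):
        # Each block is reduced with vectorized NumPy calls.
        y = np.sum(np.abs(X[j:j + block_size]), axis=0)
        r = y if r is None else r + y
    return r


X = np.array([[1, -2], [-3, 4], [5, -6]])
one_block = sum_abs_axis0(X)                 # the loop body runs once
many_blocks = sum_abs_axis0(X, block_size=2) # the loop body runs twice
n_iterations = len(range(0, X.shape[0], 2**20))
```

<p>With fewer than 2**20 rows the loop body executes a single time, so almost all of the work is already inside vectorized NumPy calls and there is little left for a compiler to gain.</p>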
<h2>What is coming up next?</h2>
<ol>
<li> add benchmarks for <code>somersd</code> and <code>_tau_b</code></li>
<li> consider merging the old benchmark PR?(<a href="https://github.com/scipy/scipy/pull/14228">gh-14228</a>)</li>
<li> keep searching for good candidate algorithms to improve. </li>
</ol>
<h2>Did you get stuck anywhere?</h2>
The problem I encountered when improving the whole <code>if-elif</code> block in <code>stats.binned_statistic_dd</code>
(see <a href="https://github.com/serge-sans-paille/pythran/issues/1819#issuecomment-869102923">pythran #1819 </a>)
xingyuliu@g.harvard.edu (Xingyu-Liu)
Sun, 27 Jun 2021 15:30:01 +0000
https://blogs.python-gsoc.org/en/xingyu-lius-blog/week-3-improving-stats-binned-statistic-dd-somersd-and-tau-b/

Week #2: Improving stats.ks_2samp
https://blogs.python-gsoc.org/en/xingyu-lius-blog/week-2-improving-stats-ks-2samp/
<h2>What did you do this week?</h2>
This week was quite a struggle. My mentor Serge quickly implemented support for <code>scipy.special.binom</code> in Pythran, and it shows a great improvement on the public function stats.ks_2samp (2.62ms vs 88.2ms).
However, when I was building SciPy with the improved algorithm, we found that it would cause a loop problem. Serge made a
<a href="https://github.com/serge-sans-paille/pythran/pull/1810">PR</a> to break the loop, but on my computer it is still not working.
<p>Then I turned to the other algorithms that I mentioned last week. However, I encountered more problems:</p>
<ol>
<li>stats._moment: keepdims is not supported in np.mean()</li>
<li>stats._calc_binned_statistic: invalid pythran spec, but I can't find anything wrong</li>
<li>stats.rankdata: invalid pythran spec</li>
<li>stats._sum_abs_axis0: compilation error</li>
<li>sparse.linalg.expm(_fragment_2_1): much slower than the pure Python one; I will keep investigating it.</li>
</ol>
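<p>For readers unfamiliar with the first problem in the list above, here is a short sketch of what the option does in plain NumPy, where the keyword is spelled <code>keepdims</code>. The array and variable names are illustrative assumptions; the manual reshape at the end is a generic NumPy equivalent, not necessarily what was used in the project.</p>

```python
import numpy as np

a = np.arange(6, dtype=float).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]

m_drop = np.mean(a, axis=1)                   # reduced axis is removed
m_keep = np.mean(a, axis=1, keepdims=True)    # reduced axis is kept with length 1

# Keeping the axis lets the result broadcast against the original array.
centered = a - m_keep

# Generic equivalent when the keyword is unavailable: reduce, then reshape.
m_manual = m_drop.reshape(2, 1)
```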
<h2>What is coming up next?</h2>
<ul>
<li> submit issue for <code>keepdims</code>
</li><li> submit issue for error in <code>stats._sum_abs_axis0</code>
</li><li> continue improving <code>stats._calc_binned_statistic</code>
</li><li> Find out why <code>sparse.linalg.expm</code> pythran version is slower.
</li></ul>
<h2>Did you get stuck anywhere?</h2>
I got stuck on many problems, as described in the <code>What did you do this week</code> section.
xingyuliu@g.harvard.edu (Xingyu-Liu)
Mon, 21 Jun 2021 08:48:45 +0000
https://blogs.python-gsoc.org/en/xingyu-lius-blog/week-2-improving-stats-ks-2samp/

Week #1: Writing Benchmarks
https://blogs.python-gsoc.org/en/xingyu-lius-blog/week-1-writing-benchmarks/
<h2>What did you do this week?</h2>
This week, I mainly focused on writing benchmarks and investigating potential slow algorithms.
<ol>
<li>Wrote more benchmarks for inferential stats: <a href="https://github.com/scipy/scipy/pull/14228">my PR </a></li>
<ul>
<li>KS test</li>
<li>MannWhitneyU</li>
<li>RankSums</li>
<li>BrunnerMunzel</li>
<li>chisquare</li>
<li>friedmanchisquare</li>
<li>epps_singleton_2samp</li>
<li>kruskal</li>
</ul>
<li> Modified to use the new random API <code>rng = np.random.default_rng(12345678)</code>: <a href="https://github.com/scipy/scipy/pull/14224">my PR </a></li>
<li> Documented why some functions can’t be sped up via Pythran: <a href="https://docs.google.com/document/d/1oByCzyTn9CDbNXBlE3V6Rv4yLz2Ltx5HpMg6YsooAO4/edit?usp=sharing">my doc </a></li>
<li> Found more potential algorithms that can be sped up via Pythran</li>
</ol>
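<p>As a small illustration of the new random API mentioned in the list above, the sketch below creates a seeded generator and draws benchmark-style inputs from it. The seed is the one quoted in this post; the array sizes and distribution parameters are arbitrary assumptions.</p>

```python
import numpy as np

# A seeded Generator gives reproducible inputs without touching global state.
rng = np.random.default_rng(12345678)
x = rng.random(100)                              # uniform samples in [0, 1)
y = rng.normal(loc=0.5, scale=1.0, size=100)     # normal samples

# Re-creating the generator with the same seed reproduces the same stream.
rng_again = np.random.default_rng(12345678)
same = np.array_equal(x, rng_again.random(100))
```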
<h2>What is coming up next?</h2>
Improve two of the following functions:
<ul>
<li>stats.friedmanchisquare: related to rankdata</li>
<pre>Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  7970       501        351.0      0.7      0.5  for i in range(len(data)):
  7971       500      51417.0    102.8     75.8      data[i] = rankdata(data[i])
</pre>
<li>stats.binned_statistic_dd</li>
<li>sparse.linalg.onenormest</li>
<li>_fragment_2_1 in scipy/sparse/linalg/matfuncs.py</li>
</ul>
<h2>Did you get stuck anywhere?</h2>
When benchmarking, I found that Mannwhitney is pretty slow. Profiling shows that <code>p = _mwu_state.sf(U.astype(int), n1, n2)</code> takes 100% of the time. Looking into the function, <code>pmf</code> is the slowest part. @mdhaber mentioned that he would be interested in looking into these things himself later this summer.
<pre>Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    25                                           @profile
    26                                           def pmf(self, k, m, n):
    27                                               '''Probability mass function'''
    28         1      29486.0  29486.0      0.2      self._resize_fmnks(m, n, np.max(k))
    29                                               # could loop over just the unique elements, but probably not worth
    30                                               # the time to find them
    31      1384       1701.0      1.2      0.0      for i in np.ravel(k):
    32      1383   18401083.0  13305.2     99.8          self._f(m, n, i)
    33         1         71.0     71.0      0.0      return self._fmnks[m, n, k] / special.binom(m + n, m)
</pre>
xingyuliu@g.harvard.edu (Xingyu-Liu)
Mon, 14 Jun 2021 08:43:28 +0000
https://blogs.python-gsoc.org/en/xingyu-lius-blog/week-1-writing-benchmarks/

Week #0: Community Building and Getting Started
https://blogs.python-gsoc.org/en/xingyu-lius-blog/week-0-community-building-and-getting-started/
<h2>Introduction</h2>
Hi everyone! I’m Xingyu Liu, a first-year data science master's student at Harvard University. I’m very excited to be accepted by SciPy, and I will work on using Pythran to improve the performance of algorithms in SciPy! There are currently many algorithms that would be too slow as pure Python, and Pythran can be a good tool to accelerate them. My goal is to investigate and improve the slow algorithms, as well as to write benchmarks for them.
<h2>What did you do this week?</h2>
In the community bonding period, I met with my mentors, Ralf Gommers and Serge Guelton. They are very kind, responsive and helpful. We discussed my project and set up a chat and a weekly sync. In the last week, I started working on my project:
<h4>Issues:</h4>
<ol>
<li><a href="https://github.com/serge-sans-paille/pythran/issues/1793">Pythran makes np.searchsorted much slower</a></li>
<li><a href="https://github.com/serge-sans-paille/pythran/issues/1792">u_values[u_sorter].searchsort would cause "Function path is chained attributes and name" but np.search would not</a></li>
<li> <a href="https://github.com/serge-sans-paille/pythran/issues/1791">all_values.sort() would cause compilation error but np.sort(all_values) would not</a> </li>
</ol>
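<p>The three issues above all concern sorting and <code>np.searchsorted</code>. As a hedged sketch of the kind of code involved, the function below computes a first-order CDF distance between two samples using only function-style NumPy calls (<code>np.sort</code>, <code>np.searchsorted</code>) rather than method calls on indexed arrays. The function name <code>cdf_distance_p1</code> is hypothetical, and this is a simplified unweighted version written for illustration, not the code from the PR.</p>

```python
import numpy as np


def cdf_distance_p1(u_values, v_values):
    """Area between the empirical CDFs of two samples (order p = 1)."""
    u_values = np.asarray(u_values, dtype=float)
    v_values = np.asarray(v_values, dtype=float)

    # Function-style sort instead of all_values.sort()
    all_values = np.sort(np.concatenate((u_values, v_values)))
    deltas = np.diff(all_values)

    # Function-style searchsorted instead of u_values[u_sorter].searchsorted(...)
    u_cdf = np.searchsorted(np.sort(u_values), all_values[:-1], side='right') / u_values.size
    v_cdf = np.searchsorted(np.sort(v_values), all_values[:-1], side='right') / v_values.size

    return np.sum(np.abs(u_cdf - v_cdf) * deltas)


shifted = cdf_distance_p1([0, 1, 3], [5, 6, 8])
identical = cdf_distance_p1([0, 1], [0, 1])
```

<p>Shifting a sample by a constant moves its empirical CDF by that constant, so the first call returns the size of the shift.</p>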
<h4> Pull Requests:</h4>
<ol>
<li> <a href="https://github.com/scipy/scipy/pull/14154">ENH: Pythran implementation of _cdf_distance</a></li>
<li> <a href="https://github.com/scipy/scipy/pull/14163">BENCH: add benchmark for energy_distance and wasserstein_distance</a></li>
</ol>
<h4> Readings:</h4>
<ol>
<li><a href="https://pythran.readthedocs.io/en/latest/MANUAL.html">Pythran tutorial</a>. </li>
<li><a href="https://cython.readthedocs.io/en/latest/src/tutorial/profiling_tutorial.html">Profiling Cython code </a>
</li>
</ol>
<h2>What is coming up next?</h2>
<ol>
<li>Write benchmarks for inferential stats</li>
<li>Modify to use the new random API <code>rng = np.random.default_rng(12345678)</code> (according to comments in <a href="https://github.com/scipy/scipy/pull/14018">BENCH: add benchmark for f_oneway </a>)</li>
<li> Find more potential algorithms that can be sped up via Pythran</li>
<li> Document why some functions can’t be sped up via Pythran</li>
</ol>
<h2>Did you get stuck anywhere?</h2>
For my first pull request, we found that the Pythran version is not better than the original due to the indexing operations.
xingyuliu@g.harvard.edu (Xingyu-Liu)
Tue, 08 Jun 2021 15:11:14 +0000
https://blogs.python-gsoc.org/en/xingyu-lius-blog/week-0-community-building-and-getting-started/