Articles on Xingyu-Liu's Blog
https://blogs.python-gsoc.org
Updates on different articles published on Xingyu-Liu's Blog
en
Tue, 24 Aug 2021 12:16:38 +0000
Week #11: Writing Tests and Finished Submission
https://blogs.python-gsoc.org/en/xingyu-lius-blog/week-11-writing-tests-and-finished-submission/
<h2>What did you do this week?</h2> <ul> <li> Finished <a href="https://github.com/serge-sans-paille/pythran/pull/1878#">General implementation of supporting immediate arguments</a> </li><li> GSoC submission: <a href="http://serge-sans-paille.github.io/pythran-stories/gsoc21-improve-performance-through-the-use-of-pythran.html">blog</a> </li> <li> Merged some cases in <a href="https://github.com/scipy/scipy/pull/14559">WIP: TST: add tests for Pythran somersd </a></li> </ul> <h2>What is coming up next?</h2> <ul> <li> <a href="https://github.com/scipy/scipy/pull/14625">ENH: optimize min max and median scipy.stats.binned_statistic</a>, by another contributor. Its performance is even better than our Pythran improvement (<a href="https://github.com/scipy/scipy/pull/14345">ENH: improved binned_statistic_dd via Pythran</a>). He vectorized the code and reduced two loops to one. Maybe we can try Pythran again based on his implementation?
</li> <li>Finish <a href="https://github.com/scipy/scipy/pull/14559">WIP: TST: add tests for Pythran somersd </a> </li> </ul>
xingyuliu@g.harvard.edu (Xingyu-Liu)
Tue, 24 Aug 2021 12:16:38 +0000
https://blogs.python-gsoc.org/en/xingyu-lius-blog/week-11-writing-tests-and-finished-submission/
Week #10: Supporting immediate arguments in Pythran
https://blogs.python-gsoc.org/en/xingyu-lius-blog/week-10-supporting-immediate-arguments-in-pythran/
<h2>What did you do this week?</h2> <ul> <li> <a href="https://github.com/serge-sans-paille/pythran/pull/1876#">Support boolean arguments in numpy unique</a></li> <li> <a href="https://github.com/serge-sans-paille/pythran/pull/1878#">General implementation of supporting immediate arguments</a></li> <li> <a href="https://github.com/scipy/scipy/pull/14559">WIP: TST: add tests for Pythran somersd</a></li> </ul> <h2>What is coming up next?</h2> <ul> <li> Finish <a href="https://github.com/serge-sans-paille/pythran/pull/1878#"> General implementation of supporting immediate arguments</a></li> <li> Investigate <a href="https://github.com/scipy/scipy/pull/14559">WIP: TST: add tests for Pythran somersd</a> further</li> <li> Prepare for the final evaluation</li> </ul> <h2>Did you get stuck anywhere?</h2> In <a href="https://github.com/serge-sans-paille/pythran/pull/1878#">General implementation of supporting immediate arguments</a>, I hit an <code>AttributeError: 'FunctionDef' object has no attribute 'immediate_arguments'</code>. A potential solution is to hard-code a check for <code>FunctionDef</code> objects and skip them.
xingyuliu@g.harvard.edu (Xingyu-Liu)
Tue, 17 Aug 2021 03:38:52 +0000
https://blogs.python-gsoc.org/en/xingyu-lius-blog/week-10-supporting-immediate-arguments-in-pythran/
Week #9: Adding tests for Pythran functions, and review the opened PRs
https://blogs.python-gsoc.org/en/xingyu-lius-blog/week-9-adding-tests-for-pythran-functions-and-review-the-opened-prs/
<h2>What did you do this week?</h2> <ul> <li><a 
href="https://github.com/scipy/scipy/pull/14559">WIP: TST: add tests for Pythran somersd </a></li> <li><a href="https://github.com/serge-sans-paille/pythran/pull/1855">WIP: support keepdims in numpy mean </a></li> <li>Revisited and summarized the unsuccessful Pythran PRs in SciPy</li>
<table style="border: 1px solid black;"> <tbody><tr> <th style="border: 1px solid black;">PR</th> <th style="border: 1px solid black;">Reason</th> </tr>
<tr> <td style="border: 1px solid black;"><a href="https://github.com/scipy/scipy/pull/13957">ENH: Pythran implementation of _compute_prob_outside_square and _compute_prob_inside_method to speedup stats.ks_2samp</a></td> <td style="border: 1px solid black;">Failed some tests before but works now</td> </tr>
<tr> <td style="border: 1px solid black;"><a href="https://github.com/scipy/scipy/pull/14154">ENH: Pythran implementation of _cdf_distance </a></td> <td style="border: 1px solid black;">The Pythran version is slightly better than the Python one after fixing np.searchsorted. It could improve further once SciPy begins to use XSIMD. On hold for now.</td> </tr>
<tr> <td style="border: 1px solid black;"><a href="https://github.com/scipy/scipy/pull/14314">WIP: ENH: improve _count_paths_outside_method via pythran</a></td> <td style="border: 1px solid black;">Related to <a href="https://github.com/scipy/scipy/issues/14315">bus error on Mac but works fine on Linux for _count_paths_outside_method pythran version</a></td> </tr>
<tr> <td style="border: 1px solid black;"><a href="https://github.com/scipy/scipy/pull/14376">WIP: ENH: improve sort_vertices_of_regions via Pythran and made it more readable </a></td> <td style="border: 1px solid black;">Test failures: 1) test_spherical_voronoi: in-place sort; 2) test_region_types: the specified input type for regions is int64 list list. 
When an element in self.regions is numpy.int64, Pythran automatically converts it to int</td> </tr> </tbody></table> </ul> <h2>What is coming up next?</h2> <ul> <li><a href="https://github.com/scipy/scipy/pull/14559">WIP: TST: add tests for Pythran somersd </a>: keep working on this</li> <li><a href="https://github.com/serge-sans-paille/pythran/pull/1855">WIP: support keepdims in numpy mean </a>: make it more general</li> <li> Test XSIMD for _cdf_distance </li> </ul> <h2>Did you get stuck anywhere?</h2> I was stuck on supporting keepdims in numpy mean in Pythran. Thanks to Serge, who helped me fix many problems, this will be completed this week.
xingyuliu@g.harvard.edu (Xingyu-Liu)
Mon, 09 Aug 2021 15:07:06 +0000
https://blogs.python-gsoc.org/en/xingyu-lius-blog/week-9-adding-tests-for-pythran-functions-and-review-the-opened-prs/
Week #8: Support keepdims in numpy mean, hunt potential algorithms to be improved
https://blogs.python-gsoc.org/en/xingyu-lius-blog/week-8-support-keepdims-in-numpy-mean-hunt-potential-algorithms-to-be-improved/
<h2>What did you do this week?</h2> <ul> <li> <a href="https://github.com/scipy/scipy/pull/14430">ENH: improve siegelslopes via pythran</a>: cleaned up the code; all checks passed.</li> <li> <a href="https://github.com/scipy/scipy/pull/14429">ENH: improve cspline1d, qspline1d, and relative funcs via Pythran</a>: only the private functions are improved, and all checks pass. However, I found a potential problem: <a href="https://github.com/serge-sans-paille/pythran/issues/1858">array assignment res[cond1] = ax[cond1] works fine for int[] or float[] or float[:,:] but not int[:,:] </a> </li> <li><a href="https://github.com/serge-sans-paille/pythran/pull/1855">WIP: support keepdims in numpy mean</a>: it passed all the checks after I switched to <code>str(node.value).lower()</code>. I then added tests for <code>keepdims=False</code>, but some checks fail. 
</li> <li><a href="https://github.com/charlotte12l/scipy/pull/2">ENH: improve _cplxreal, _falling_factorial, _bessel_poly, _arc_jac_sn… </a> This enhancement is small and seems of so little value that I opened the PR only in my own repo: these are already fast algorithms. I am now stuck on finding potential algorithms to improve: I often spend ~10 hrs finding an algorithm and ~2 hrs improving it. </li> </ul> <h2>What is coming up next?</h2> Since it is no longer easy to find good algorithms and we have already improved some, it is time to change the plan. Therefore, I will work on: <ul> <li>Use pytest and decorators to test the Pythran-improved functions with inputs of different dtypes.</li> <li>Revisit the algorithms we worked on and perhaps draw a final conclusion.</li> <li> Finish supporting keepdims in numpy mean in Pythran</li> </ul> <h2>Did you get stuck anywhere?</h2> Stuck on supporting keepdims in numpy mean in Pythran and on finding potential algorithms.
xingyuliu@g.harvard.edu (Xingyu-Liu)
Tue, 03 Aug 2021 17:07:31 +0000
https://blogs.python-gsoc.org/en/xingyu-lius-blog/week-8-support-keepdims-in-numpy-mean-hunt-potential-algorithms-to-be-improved/
Week #7: Support keepdims in Pythran's numpy mean
https://blogs.python-gsoc.org/en/xingyu-lius-blog/week-7-support-keepdims-in-pythran-s-numpy-mean/
<h2>What did you do this week?</h2> <ul> <li>[Merged] <a href="https://github.com/scipy/scipy/pull/14458">Review the PR DOC: clarify meaning of rvalue in stats.linregress </a></li> <li> <a href="https://github.com/scipy/scipy/pull/14473">Document ENH: improve _sosfilt_float via Pythran </a></li> <li>Left the validation in the Python function: <a href="https://github.com/scipy/scipy/pull/14430">ENH: improve siegelslopes via pythran</a></li> <li> <a href="https://github.com/scipy/scipy/pull/14429">ENH: improve cspline1d, qspline1d, and relative funcs via Pythran</a></li> <ul> <li> In this case, I left the public functions cspline1d, qspline1d, cspline1d_eval, and qspline1d_eval, along with their docs, in Python</li> 
<li> How about 'cubic' and 'quadratic'? They also seem to be public functions.</li> <li> Need to check whether we should support more types even if the checks pass</li> </ul> <li> <a href="https://github.com/serge-sans-paille/pythran/pull/1855">WIP: support keepdims in numpy mean</a></li> </ul> <h2>What is coming up next?</h2> <ul> <li> Keep working on items 3, 4, and 5 above, and hopefully merge them</li> <li> Find more potential algorithms and improve them </li> <li> Complete <a href="https://github.com/scipy/scipy/pull/14228#pullrequestreview-682448181">BENCH: add more benchmarks for inferential statistics tests</a> </li> </ul> <h2>Did you get stuck anywhere?</h2> While supporting keepdims in numpy mean, I added a function <code>mean(E const &amp;expr, types::none_type axis, dtype d, std::true_type keepdims)</code>, but I'm not sure how to declare its return type. I think we need to calculate <code>out_shape</code> so that we can write <code>-&gt; decltype(numpy::functor::asarray{}(sum(expr) / typename dtype::type(expr.flat_size())).reshape(out_shape)) </code>
xingyuliu@g.harvard.edu (Xingyu-Liu)
Mon, 26 Jul 2021 18:02:28 +0000
https://blogs.python-gsoc.org/en/xingyu-lius-blog/week-7-support-keepdims-in-pythran-s-numpy-mean/
Week #6: Improving siegelslopes, cspline1d, qspline1d, etc.
https://blogs.python-gsoc.org/en/xingyu-lius-blog/week-6-improving-siegelslopes-cspline1d-qspline1d-etc/
<h2>What did you do this week?</h2> <ol> <li>Looked at the issue <a href="https://github.com/scipy/scipy/issues/14416"> Is the r-value outputted by scipy.stats.linregress always the Pearson correlation coefficient? </a></li> <li> <a href="https://github.com/scipy/scipy/pull/14376"> WIP: ENH: improve sort_vertices_of_regions via Pythran and made it more readable </a> <ul> <li>Tyler said <code>test_spherical_voronoi</code> may test the in-place sort, and removing a test is not recommended. 
In that case, we’ll never pass the test.</li> <li>For the type error, I can’t reproduce it on my computer. Is it similar to the issue <a href="https://github.com/scipy/scipy/issues/14420">BUG: RBFInterpolator fails when calling it with a slice of a (1, n) array</a>? I encountered similar <code>reshaped</code> issues before and found that the type, not <code>reshaped</code>, is often the real problem: once I support that type, the error goes away. But in this case the type is already supported.</li>
<pre><code>TypeError: Invalid call to pythranized function `sort_vertices_of_regions(int32[:, :], int32 list list)'
Candidates are:
    - sort_vertices_of_regions(int64[:,:], int64 list list)
    - sort_vertices_of_regions(int32[:,:], int32 list list)
    - sort_vertices_of_regions(int32[:,:], int64 list list)
    - sort_vertices_of_regions(int[:,:], int list list)
</code></pre> </ul></li> <li>Last week we concluded that <code>_spectral.pyx </code> and <code>_sosfilt.pyx</code> would be easy to improve via Pythran, but I later found that <code>_spectral.pyx </code> already has a Pythran version. For <code>_sosfilt.pyx</code>, I improved <code>_sosfilt_float</code> and left <code>_sosfilt_object</code> in Cython. The performance of <code>_sosfilt_float</code> looks similar in Cython and Pythran, so I'm not sure whether I need to make a PR for it. </li> <li> <a href="https://github.com/scipy/scipy/pull/14430"> ENH: improve siegelslopes via pythran </a>, 10x faster. If needed, I can also improve <code>linregress</code> and <code>theilslopes</code> from <code>scipy/stats/_stats_mstats_common.py</code> and put them in the same file as <code>siegelslopes </code>. But those two functions do not have obvious loops, so here I only improved siegelslopes.</li> <li> <a href="https://github.com/scipy/scipy/pull/14429"> ENH: improve cspline1d, qspline1d, and relative funcs via Pythran </a>, 10x faster. <ul> <li>Segmentation fault on <a href="https://github.com/scipy/scipy/pull/14429/checks ">Azure pipelines</a>. 
Is it because the function calls itself? </li> <li>There are a lot of signatures. Is there a more concise way?</li> <li>Actually, the functions that have lots of signatures and also cause the current segmentation faults, <code>cspline1d_eval </code> and <code>qspline1d_eval </code>, don’t have many loops. I improved them because they are used to evaluate <code>cspline1d </code> and <code>qspline1d </code>, so putting them in one file may look better. We can also leave them in the original file, which would avoid the first two problems above. </li> </ul> </li> </ol> <h2>What is coming up next?</h2> <ol> <li>Keep working on <a href="https://github.com/scipy/scipy/pull/14429"> ENH: improve cspline1d, qspline1d, and relative funcs via Pythran </a> </li> <li> Find more potential algorithms and improve them </li> <li>Make a PR for <code>_sosfilt_float</code> and comment on it</li> <li> <code>keepdims</code> feature support in Pythran </li> </ol> <h2>Did you get stuck anywhere?</h2> I once said that <code>np.expand_dims()</code> does not support dim as a keyword. I was wrong, because the keyword is <code>axis</code>, but I still got the following error. However, <code>np.expand_dims(x, 1) </code> works. <pre><code>(scipy-dev) charlotte@CHARLOTLIU-MB0 stats % pythran siegelslopes_pythran.py
CRITICAL: I am in trouble. Your input file does not seem to match Pythran's constraints...
siegelslopes_pythran.py:19:13 error: function uses an unknown (or unsupported) keyword argument `axis`
----
    deltax = np.expand_dims(x, axis=1) - x
             ^~~~ (o_0)
----
</code></pre>
xingyuliu@g.harvard.edu (Xingyu-Liu)
Thu, 22 Jul 2021 02:47:09 +0000
https://blogs.python-gsoc.org/en/xingyu-lius-blog/week-6-improving-siegelslopes-cspline1d-qspline1d-etc/
Week #5: Improving sort_vertices_of_regions, and write some tests
https://blogs.python-gsoc.org/en/xingyu-lius-blog/week-5-improving-sort-vertices-of-regions-and-write-some-tests/
<h2>What did you do this week?</h2> <ul> <li> Added a unit test for <a href="https://github.com/scipy/scipy/pull/14338">BUG: fix stats.binned_statistic_dd issue with values close to bin edge</a></li> <li> Added benchmarks for somersd: <a href="https://github.com/scipy/scipy/pull/14381">BENCH: add benchmark for somersd</a> </li> <li> Added tests in Pythran: <a href="https://github.com/serge-sans-paille/pythran/pull/1830"> Import test cases from scipy</a> </li> <li> Wrote the first evaluation; will submit it later.</li> <li><a href="https://github.com/scipy/scipy/pull/14376">WIP: ENH: improve sort_vertices_of_regions via Pythran and made it more readable </a> However, I got a strange type error, and test_spherical_voronoi and test_region_types failed. Tyler suggested it may not be a good case for Pythran, and I'm still trying to find out why these errors occur. </li> </ul> <h2>What is coming up next?</h2> <ul> <li>Submit the first evaluation</li> <li>Continue working on sort_vertices_of_regions() and try to fix the failures</li> <li>Look into and maybe improve some of the following algorithms: _spectral.pyx and _sosfilt.pyx </li> </ul> <h2>Did you get stuck anywhere?</h2> The WIP PR mentioned above: <a href="https://github.com/scipy/scipy/pull/14376">WIP: ENH: improve sort_vertices_of_regions via Pythran and made it more readable </a>. 
It fails two tests: test_spherical_voronoi and test_region_types.
xingyuliu@g.harvard.edu (Xingyu-Liu)
Tue, 13 Jul 2021 15:40:13 +0000
https://blogs.python-gsoc.org/en/xingyu-lius-blog/week-5-improving-sort-vertices-of-regions-and-write-some-tests/
Week #4: Improving binned_statistic_dd and _voronoi, and fix some issues
https://blogs.python-gsoc.org/en/xingyu-lius-blog/week-4-improving-binned-statistic-dd-and-voronoi-and-fix-some-issues/
<h2>What did you do this week?</h2> <p> First, back to the old problem, the <code>bus error</code>. It turns out that it is specific to Mac. We still don't know the cause (<a href="https://github.com/scipy/scipy/issues/14315"> bus error on Mac but works fine on Linux for _count_paths_outside_method pythran version</a>).</p> <p> Last week I said that the benchmark result differed from my <code>timeit</code> result. That was my mistake: I forgot to modify <code>setup.py</code>. After setting it up correctly, the problem was fixed.</p> <p> Also, I have made a PR for <code>binned_statistic_dd</code>, the algorithm I have been improving since last week. At first, I improved the whole <code>if-elif</code> block, and the benchmark shows that this makes <code>count, sum, mean</code> 1.1x faster and <code>std, median, min, max</code> 3x-30x faster. However, I found that Pythran can't support <code>object</code> type input, so some tests failed. To support the <code>object</code> type, we would need to keep all of the pure Python code, which would make the <code>if-elif</code> block duplicated and ugly. 
Since the benchmark shows little improvement for <code>count, sum, mean</code>, I also tried improving only <code>std, median, min, max</code> to make the code cleaner and easier to understand. So in the end, I only improved a small inner function but still got <code>std, median, min, max</code> 3x-30x faster, with no change for <code>count, sum, mean</code> (<a href="https://github.com/scipy/scipy/pull/14345">ENH: improved binned_statistic_dd via Pythran</a>).</p> <p> While I was improving <code>binned_statistic_dd</code>, there happened to be an open issue about floating-point comparison. I looked into it and fixed it (<a href="https://github.com/scipy/scipy/pull/14338"> BUG: fix stats.binned_statistic_dd issue with values close to bin edge </a>). </p><p> Last but not least, I tried to speed up <code>_voronoi </code>, discussed last week, and the Pythran version is 3x faster than the Cython one!</p> <h2>What is coming up next?</h2> <ul> <li> Refer to the original Python version rather than the Cython one to make the Pythran version of <code>_voronoi </code> more readable. After that, make a PR.</li> <li> Add a test for the <code>binned_statistic_dd</code> bug</li> <li> Add benchmarks for somersd and _tau_b </li> <li> Prepare for the first evaluation </li> <li> In Pythran, import some scipy tests </li> </ul> <h2>Did you get stuck anywhere?</h2> The <code> bus error</code> mentioned above, and <code>build_docs</code> failed on my PR recently.
xingyuliu@g.harvard.edu (Xingyu-Liu)
Tue, 06 Jul 2021 13:19:10 +0000
https://blogs.python-gsoc.org/en/xingyu-lius-blog/week-4-improving-binned-statistic-dd-and-voronoi-and-fix-some-issues/
Week #3: Improving stats.binned_statistic_dd, somersd and _tau_b
https://blogs.python-gsoc.org/en/xingyu-lius-blog/week-3-improving-stats-binned-statistic-dd-somersd-and-tau-b/
<h2>What did you do this week?</h2> As this project progressed, I came to realize that its main difficulty is finding good candidate algorithms. 
This week I searched for candidate algorithms in <code>scipy.stats</code>. At first, as before, I used <code>pytest --durations=50</code> to find the 50 slowest tests. I looked into all of those functions but didn't find any suitable algorithms: some do not have obvious loops; some have loops but call another scipy method inside the loop... Therefore, I began checking all the functions under <code> scipy/stats </code> one by one and finally found that <code>somersd</code> and <code>_tau_b</code> are good candidates. I submitted a PR (<a href="https://github.com/scipy/scipy/pull/14308">gh-14308</a>) that speeds them up 4x~20x. <p>Besides, for the work mentioned last week:</p> <ol> <li><code>stats._moment</code>: submitted an issue (<a href="https://github.com/serge-sans-paille/pythran/issues/1820">pythran #1820</a>) because <code>keepdims</code> is not supported in np.mean()</li> <li><code>stats._calc_binned_statistic</code>: successfully improved this function and made the public function <code>stats.binned_statistic_dd</code> 3x-10x faster on <code>min, max, std, median</code>. I tried to improve the whole <code>if-elif</code> block but encountered some errors that I can't fix (see <a href="https://github.com/serge-sans-paille/pythran/issues/1819#issuecomment-869102923)">pythran #1819 </a>)</li> <li><code>stats._sum_abs_axis0</code>: thanks to Serge, the compilation error due to a variant type is fixed. I compiled it and it is ~2x faster on <code>_sum_abs_axis0 </code>, but there is not much gain on the public function <code>onenormest</code>. Moreover, there is actually no loop in <code>_sum_abs_axis0 </code> for input sizes smaller than 2**20 (my bad!) </li> <li><code>sparse.linalg.expm(_fragment_2_1)</code>: last week I said this one is slower than the pure Python version, but it is not. My bad: with the input I used at the time, the code never reached <code> _fragment_2_1</code>, so I didn't actually test it. The function is only reached when the input size is larger than 9000. 
Moreover, the input is a <code>csc matrix </code>, so it is not a suitable candidate for Pythran. </li> <li>The SciPy build error (see <a href="https://github.com/serge-sans-paille/pythran/issues/1815">pythran #1815</a>) mentioned last week: this is a really tough problem. Serge and Ralf tried to help me fix it, but it is still not working for now. </li> </ol> <h2>What is coming up next?</h2> <ol> <li> Add benchmarks for <code>somersd</code> and <code>_tau_b</code></li> <li> Consider merging the old benchmark PR? (<a href="https://github.com/scipy/scipy/pull/14228">gh-14228</a>)</li> <li> Keep searching for good candidate algorithms to improve. </li> </ol> <h2>Did you get stuck anywhere?</h2> The problem I encountered when improving the whole <code>if-elif</code> block in <code>stats.binned_statistic_dd</code> (see <a href="https://github.com/serge-sans-paille/pythran/issues/1819#issuecomment-869102923)">pythran #1819 </a>)
xingyuliu@g.harvard.edu (Xingyu-Liu)
Sun, 27 Jun 2021 15:30:01 +0000
https://blogs.python-gsoc.org/en/xingyu-lius-blog/week-3-improving-stats-binned-statistic-dd-somersd-and-tau-b/
Week #2: Improving stats.ks_2samp
https://blogs.python-gsoc.org/en/xingyu-lius-blog/week-2-improving-stats-ks-2samp/
<h2>What did you do this week?</h2> This week was quite a struggle. My mentor Serge quickly implemented support for <code>scipy.special.binom</code> in Pythran, and it shows a great improvement on the public function stats.ks_2samp (2.62ms vs 88.2ms). However, when I built scipy with the improved algorithm, we found that it would cause a loop problem. Serge made a <a href="https://github.com/serge-sans-paille/pythran/pull/1810">PR</a> to break the loop, but on my computer it is still not working. 
<p>Then I turned to the other algorithms that I mentioned last week, but I encountered more problems:</p> <ol> <li>stats._moment: <code>keepdims</code> is not supported in np.mean()</li> <li>stats._calc_binned_statistic: invalid Pythran spec, but I can't find anything wrong</li> <li>stats.rankdata: invalid Pythran spec</li> <li>stats._sum_abs_axis0: compilation error</li> <li>sparse.linalg.expm(_fragment_2_1): much slower than the pure Python one; I will keep investigating it.</li> </ol> <h2>What is coming up next?</h2> <ul> <li> Submit an issue for <code>keepdims</code> </li><li> Submit an issue for the error in <code>stats._sum_abs_axis0</code> </li><li> Continue improving <code>stats._calc_binned_statistic</code> </li><li> Find out why the Pythran version of <code>sparse.linalg.expm</code> is slower. </li></ul> <h2>Did you get stuck anywhere?</h2> I got stuck on many problems, as described in the <code>What did you do this week</code> section.
xingyuliu@g.harvard.edu (Xingyu-Liu)
Mon, 21 Jun 2021 08:48:45 +0000
https://blogs.python-gsoc.org/en/xingyu-lius-blog/week-2-improving-stats-ks-2samp/
Week #1: Writing Benchmarks
https://blogs.python-gsoc.org/en/xingyu-lius-blog/week-1-writing-benchmarks/
<h2>What did you do this week?</h2> This week, I mainly focused on writing benchmarks and investigating potentially slow algorithms. 
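<p>One way to go about investigating potentially slow algorithms is to profile a call and look for Python-level loops that dominate the runtime. The sketch below is only an illustration under stated assumptions: it uses the standard-library <code>cProfile</code> rather than the line profiler whose output appears in this post, and <code>rank</code>, <code>rank_rows</code>, and <code>profile_report</code> are hypothetical stand-ins, not SciPy functions.</p>

```python
import cProfile
import io
import pstats


def rank(values):
    """Toy stand-in for a per-row ranking routine (hypothetical, ignores ties)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def rank_rows(rows):
    # A Python-level loop that calls a helper once per row:
    # this is the kind of pattern worth looking for in a profile.
    return [rank(row) for row in rows]


def profile_report(func, *args, limit=5):
    """Run func under cProfile and return a report sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(limit)
    return buffer.getvalue()


# Deterministic toy data: 500 rows of 200 integers each.
rows = [[(i * 7919 + j * 104729) % 1000 for j in range(200)] for i in range(500)]
report = profile_report(rank_rows, rows)
print(report)
```

<p>Functions that show a high cumulative time and contain an explicit loop, as <code>rank_rows</code> does here, may be the kind of candidate worth trying with Pythran.</p>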
<ol> <li>Wrote more benchmarks for inferential stats: <a href="https://github.com/scipy/scipy/pull/14228">my PR </a></li> <ul> <li>KS test</li> <li>MannWhitneyU</li> <li>RankSums</li> <li>BrunnerMunzel</li> <li>chisquare</li> <li>friedmanchisquare</li> <li>epps_singleton_2samp</li> <li>kruskal</li> </ul> <li> Switched to the new random API, <code>rng = np.random.default_rng(12345678)</code>: <a href="https://github.com/scipy/scipy/pull/14224">my PR </a></li> <li> Documented why some functions can’t be sped up via Pythran: <a href="https://docs.google.com/document/d/1oByCzyTn9CDbNXBlE3V6Rv4yLz2Ltx5HpMg6YsooAO4/edit?usp=sharing">my doc </a></li> <li> Found more potential algorithms that can be sped up via Pythran</li> </ol> <h2>What is coming up next?</h2> Improve two of the following functions: <ul> <li>stats.friedmanchisquare: related to rankdata</li>
<pre>
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  7970       501        351.0      0.7      0.5      for i in range(len(data)):
  7971       500      51417.0    102.8     75.8          data[i] = rankdata(data[i])
</pre> <li>stats.binned_statistic_dd</li> <li>sparse.linalg.onenormest</li> <li>_fragment_2_1 in scipy/sparse/linalg/matfuncs.py</li> </ul> <h2>Did you get stuck anywhere?</h2> When benchmarking, I found that Mannwhitney is pretty slow. Profiling shows that <code>p = _mwu_state.sf(U.astype(int), n1, n2)</code> takes 100% of the time. Looking into that function, <code>pmf</code> is the slowest part. @mdhaber mentioned that he would be interested in looking into these things himself later this summer. 
<pre>
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    25                                           @profile
    26                                           def pmf(self, k, m, n):
    27                                               '''Probability mass function'''
    28         1      29486.0  29486.0      0.2      self._resize_fmnks(m, n, np.max(k))
    29                                               # could loop over just the unique elements, but probably not worth
    30                                               # the time to find them
    31      1384       1701.0      1.2      0.0      for i in np.ravel(k):
    32      1383   18401083.0  13305.2     99.8          self._f(m, n, i)
    33         1         71.0     71.0      0.0      return self._fmnks[m, n, k] / special.binom(m + n, m)
</pre>
xingyuliu@g.harvard.edu (Xingyu-Liu)
Mon, 14 Jun 2021 08:43:28 +0000
https://blogs.python-gsoc.org/en/xingyu-lius-blog/week-1-writing-benchmarks/
Week #0: Community Building and Getting Started
https://blogs.python-gsoc.org/en/xingyu-lius-blog/week-0-community-building-and-getting-started/
<h2>Introduction</h2> Hi everyone! I’m Xingyu Liu, a first-year data science master's student at Harvard University. I’m very excited to be accepted by SciPy, and I will work on using Pythran to improve the performance of algorithms in SciPy! There are currently many algorithms that would be too slow as pure Python, and Pythran can be a good tool to accelerate them. My goal is to investigate and improve the slow algorithms, as well as write benchmarks for them. <h2>What did you do this week?</h2> In the community bonding period, I met with my mentors, Ralf Gommers and Serge Guelton. They are very kind, responsive and helpful. We discussed my project and set up a chat and a weekly sync. 
In the last week, I've started working on my project: <h4>Issues:</h4> <ol> <li><a href="https://github.com/serge-sans-paille/pythran/issues/1793">Pythran makes np.searchsorted much slower</a></li> <li><a href="https://github.com/serge-sans-paille/pythran/issues/1792">u_values[u_sorter].searchsort would cause "Function path is chained attributes and name" but np.search would not</a></li> <li> <a href="https://github.com/serge-sans-paille/pythran/issues/1791">all_values.sort() would cause compilation error but np.sort(all_values) would not</a> </li> </ol> <h4> Pull Requests:</h4> <ol> <li> <a href="https://github.com/scipy/scipy/pull/14154">ENH: Pythran implementation of _cdf_distance</a></li> <li> <a href="https://github.com/scipy/scipy/pull/14163">BENCH: add benchmark for energy_distance and wasserstein_distance</a></li> </ol> <h4> Readings:</h4> <ol> <li><a href="https://pythran.readthedocs.io/en/latest/MANUAL.html">Pythran tutorial</a>. </li> <li><a href="https://cython.readthedocs.io/en/latest/src/tutorial/profiling_tutorial.html">Profiling Cython code </a> </li> </ol> <h2>What is coming up next?</h2> <ol> <li>Write benchmarks for inferential stats</li> <li>Switch to the new random API, <code>rng = np.random.default_rng(12345678)</code> (according to comments in <a href="https://github.com/scipy/scipy/pull/14018">BENCH: add benchmark for f_oneway </a>)</li> <li> Find more potential algorithms that can be sped up via Pythran</li> <li> Document why some functions can’t be sped up via Pythran</li> </ol> <h2>Did you get stuck anywhere?</h2> For my first pull request, we found that the Pythran version is not better than the original because of the indexing operations.
xingyuliu@g.harvard.edu (Xingyu-Liu)
Tue, 08 Jun 2021 15:11:14 +0000
https://blogs.python-gsoc.org/en/xingyu-lius-blog/week-0-community-building-and-getting-started/
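<p>The three Pythran issues listed in the Week #0 post involve sorting and <code>searchsorted</code> calls of the kind used when computing a distance between two empirical CDFs. As a rough illustration of that pattern, here is a pure-Python sketch of the unweighted first-order case. It is a stand-in under stated assumptions, not SciPy's <code>_cdf_distance</code>: the function name <code>cdf_distance_p1</code> is hypothetical, and <code>bisect_right</code> plays the role of <code>np.searchsorted(..., side='right')</code>.</p>

```python
from bisect import bisect_right


def cdf_distance_p1(u_values, v_values):
    """First-order distance between the empirical CDFs of two unweighted samples.

    Hypothetical helper for illustration only; assumes both samples are non-empty.
    """
    u_sorted = sorted(u_values)
    v_sorted = sorted(v_values)
    # Pool both samples and sort them; consecutive pooled values bound the
    # intervals on which both empirical CDFs are constant.
    all_values = sorted(u_sorted + v_sorted)

    total = 0.0
    for left, right in zip(all_values, all_values[1:]):
        delta = right - left
        # Number of sample points <= left, divided by the sample size,
        # is the empirical CDF evaluated on [left, right).
        u_cdf = bisect_right(u_sorted, left) / len(u_sorted)
        v_cdf = bisect_right(v_sorted, left) / len(v_sorted)
        total += abs(u_cdf - v_cdf) * delta
    return total


print(cdf_distance_p1([0, 1, 3], [5, 6, 8]))
```

<p>Each step of the loop performs two binary searches and several indexing operations, which gives a sense of why indexing and <code>searchsorted</code> performance matter so much for this kind of function.</p>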