McSinyx's Blog

Outro

McSinyx
Published: 08/31/2020

Note: This article's HTML source is exported from reST.  Without the necessary CSS, some parts might look hideous.  Please consider viewing it on my personal blog.

The Look

At the time of writing, implementation-wise parallel download is ready:

Does this mean I’ve finished everything just in time? This sounds too good to be true! And how does it perform? Welp…

The Benchmark

Here comes the bad news: under a decent connection to the package index, using fast-deps does not make pip faster. For a fair comparison, I will time pip download in the following cases:

Average Distribution

For convenience, let’s refer to the commands used as follows:

legacy-resolver

pip --no-cache-dir download {requirement}

2020-resolver

pip --use-feature=2020-resolver --no-cache-dir download {requirement}

fast-deps

pip --use-feature=2020-resolver --use-feature=fast-deps --no-cache-dir download {requirement}
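A minimal sketch of how such timings could be collected from Python, assuming wall-clock time is what matters. The command templates are copied from above; the helper names `build_argv` and `time_command` are made up for illustration and are not part of pip, and the figures in this post may well have been measured differently (e.g. with the shell's `time`).

```python
import subprocess
import time

# Command templates copied from the text; {requirement} is filled in later.
COMMANDS = {
    "legacy-resolver": "pip --no-cache-dir download {requirement}",
    "2020-resolver": (
        "pip --use-feature=2020-resolver --no-cache-dir download {requirement}"
    ),
    "fast-deps": (
        "pip --use-feature=2020-resolver --use-feature=fast-deps"
        " --no-cache-dir download {requirement}"
    ),
}


def build_argv(label, requirement):
    """Turn a command template into an argument list for subprocess."""
    return COMMANDS[label].format(requirement=requirement).split()


def time_command(argv, cwd=None):
    """Run argv once and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(argv, cwd=cwd, check=True, stdout=subprocess.DEVNULL)
    return time.perf_counter() - start
```

For instance, `time_command(build_argv("fast-deps", "axuy"), cwd=some_empty_dir)` would time one run, where `some_empty_dir` is a fresh directory so that earlier downloads do not interfere.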

In the first test, I used axuy and obtained the following results

legacy-resolver    2020-resolver    fast-deps
7.709s             7.888s           10.993s
7.068s             7.127s           11.103s
8.556s             6.972s           10.496s

Funny enough, running pip download with fast-deps in a directory that already contained the downloaded files still took around 7-8 seconds. This is because, to lazily download a wheel, pip has to make many requests, which are apparently more expensive than the actual data transmission on my network.
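To illustrate why reading a wheel lazily costs several round trips, here is a small simulation. It is not pip's implementation: `LazyRemoteFile`, `FakeServer` and `fetch` are hypothetical names, and the "server" is just bytes in memory, with each `fetch(start, end)` call standing in for one HTTP range request. The standard `zipfile` module then seeks around the archive to find the metadata file, and every read becomes a separate request.

```python
import io
import zipfile


class LazyRemoteFile:
    """Read-only, seekable file whose bytes are fetched on demand."""

    def __init__(self, fetch, length):
        self._fetch = fetch  # hypothetical: one call = one range request
        self._length = length
        self._pos = 0

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        base = {
            io.SEEK_SET: 0,
            io.SEEK_CUR: self._pos,
            io.SEEK_END: self._length,
        }[whence]
        self._pos = max(0, base + offset)
        return self._pos

    def read(self, size=-1):
        remaining = max(0, self._length - self._pos)
        if size is None or size < 0 or size > remaining:
            size = remaining
        if size == 0:
            return b""
        # The end of the requested range is inclusive.
        data = self._fetch(self._pos, self._pos + size - 1)
        self._pos += len(data)
        return data


class FakeServer:
    """In-memory stand-in for a package index that records each request."""

    def __init__(self, payload):
        self.payload = payload
        self.requests = []

    def fetch(self, start, end):
        self.requests.append((start, end))
        return self.payload[start:end + 1]


def make_wheel_bytes():
    """Build a tiny wheel-like zip archive for the simulation."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("demo/__init__.py", "x = 1\n" * 1000)
        archive.writestr(
            "demo-1.0.dist-info/METADATA",
            "Name: demo\nVersion: 1.0\nRequires-Dist: numpy\n",
        )
    return buffer.getvalue()


def read_metadata(server):
    """Read only the METADATA member through range requests."""
    remote = LazyRemoteFile(server.fetch, len(server.payload))
    with zipfile.ZipFile(remote) as archive:
        name = next(
            n for n in archive.namelist() if n.endswith(".dist-info/METADATA")
        )
        return archive.read(name).decode()
```

After `read_metadata(server)`, the list `server.requests` shows how many separate ranges were needed for one small file. Few bytes travel, but each range costs a round trip, which matches the observation that requests can outweigh data transmission.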

Note

With an unstable connection to PyPI (for reasons I am not confident enough to state), this is what I got:

2020-resolver    fast-deps
1m16.134s        0m54.894s
1m0.384s         0m40.753s
0m50.102s        0m41.988s

As the connection was unstable, and the majority of pip’s networking is performed in CI/CD with large and stable bandwidth, I am unsure what this result is supposed to tell (-;

Large Distribution

In this test, I used TensorFlow as the requirement and obtained the following figures:

legacy-resolver    2020-resolver    fast-deps
0m52.135s          0m58.809s        1m5.649s
0m50.641s          1m14.896s        1m28.168s
0m49.691s          1m5.633s         1m22.131s
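To compare the columns at a glance, the timings can be converted to seconds and averaged. The sketch below copies the TensorFlow figures from the table above, read row by row; `to_seconds` and `mean_seconds` are illustrative helper names, not part of any tool used in this post.

```python
import re
from statistics import mean


def to_seconds(text):
    """Convert a duration such as '1m5.649s' or '7.709s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(\d+(?:\.\d+)?)s", text)
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes, seconds = match.groups()
    return 60 * int(minutes or 0) + float(seconds)


# TensorFlow timings copied from the table above, one list per column.
TENSORFLOW = {
    "legacy-resolver": ["0m52.135s", "0m50.641s", "0m49.691s"],
    "2020-resolver": ["0m58.809s", "1m14.896s", "1m5.633s"],
    "fast-deps": ["1m5.649s", "1m28.168s", "1m22.131s"],
}


def mean_seconds(samples):
    """Average a list of duration strings, in seconds."""
    return mean(to_seconds(sample) for sample in samples)
```

Calling `mean_seconds(TENSORFLOW["fast-deps"])` and comparing it with the other two columns gives a rough ranking, though three runs per command is a very small sample.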

Distribution with Conflicting Dependencies

A requirement that triggers a decent amount of backtracking in the current implementation of the new resolver is oslo-utils==1.4.0:

2020-resolver    fast-deps
14.497s          24.010s
17.680s          28.884s
16.541s          26.333s

What Now?

I don’t know, to be honest. At this point I feel I’ve failed my own expectations (and those of pip’s other stakeholders) and wasted the time and effort of pip’s maintainers, who reviewed the dozens of PRs I’ve made in the last three months.

On the bright side, this has been an opportunity for me to explore the codebase of a package manager and discover various edge cases that the new resolver has yet to cover (e.g. I’ve just noticed that pip download would save to-be-discarded distributions; I’ll file an issue on that soon). Plus I got to know many new and cool people and ideas, which I hope will make me a more helpful contributor to Python packaging in the future.


Final Check-In

McSinyx
Published: 08/24/2020

Hello there!

What did I do last week?

Not much, but seemingly implementation-wise I have finished my GSoC project:

  • Finish the implementation of wheels' parallel download (GH-8771)
  • Help make pip's CI green again (GH-8790)
  • Reformat a few spots in user guide (GH-8795)

Did I get stuck anywhere?

I got sick, but I am recovering now!

What is coming up next?

I will try to spend the time I have left within the scope of GSoC improving the cache usage of the fast-deps feature.


Parallelizing Wheel Downloads

McSinyx
Published: 08/17/2020

And now it's clear as this promise
That we're making
Two progress bars into one

Hello there! It has been raining a lot lately and my wisdom tooth has decided to start growing today, giving me a mild fever. To whoever is reading this, I hope it won't happen to you.

Download Parallelization

I've been working on pip's download parallelization for quite a while now. As distribution download in pip is modeled as a lazily evaluated iterable of chunks, parallelizing such a procedure is as simple as submitting routines that write files to disk to a worker pool.
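The idea can be sketched with the standard library's thread pool. This is only an illustration under simplifying assumptions, not pip's code: `save`, `download_all` and `fake_chunks` are hypothetical names, and `fake_chunks` stands in for a streamed HTTP response body.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
import tempfile


def save(path, chunks):
    """Write a lazily evaluated iterable of byte chunks to path."""
    written = 0
    with open(path, "wb") as file:
        for chunk in chunks:
            written += file.write(chunk)
    return written


def download_all(jobs, max_workers=4):
    """Consume every (path, chunks) pair in a thread pool.

    Returns a mapping from path to the number of bytes written.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            path: pool.submit(save, path, chunks) for path, chunks in jobs
        }
        return {path: future.result() for path, future in futures.items()}


def fake_chunks(count, size=1024):
    """Hypothetical stand-in for the chunks of one download."""
    for _ in range(count):
        yield b"\0" * size
```

Because each generator is only consumed inside its worker, the "downloads" overlap. Given a directory `target` (a `Path`), a call could look like `download_all([(target / "a.whl", fake_chunks(3)), (target / "b.whl", fake_chunks(5))])`.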

Or at least that is what I thought.

Progress Reporting UI

pip currently uses custom progress reporting classes, which were not designed to work with multithreaded code. At first, I wanted to try using these instead of defining a separate UI for multithreaded progress. As they use system signals for termination, the progress bars have to run in the main thread. Or sort of.

Since the progress bars are designed as iterators, I realized that we can call next on them. So I quickly threw in some queues and locks, and prototyped the first working implementation of progress synchronization.
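One way such synchronization might look, as a rough sketch rather than the prototype described here: workers push chunk sizes onto a queue, and only the calling (main) thread touches the progress UI. The `report` callback is a hypothetical stand-in for advancing a progress bar, and writing to disk is left out.

```python
import queue
from concurrent.futures import ThreadPoolExecutor

_DONE = object()  # sentinel a worker sends when its download is finished


def worker(chunks, updates):
    """Consume chunks and report each chunk's size to the main thread."""
    try:
        for chunk in chunks:
            updates.put(len(chunk))
    finally:
        # Always signal completion so the main thread cannot wait forever.
        updates.put(_DONE)


def download_with_progress(all_chunks, report, max_workers=4):
    """Run workers in a pool while the calling thread drives the UI.

    report(total) is only ever called from the calling thread.
    """
    updates = queue.Queue()
    total = 0
    remaining = len(all_chunks)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for chunks in all_chunks:
            pool.submit(worker, chunks, updates)
        while remaining:
            item = updates.get()  # blocks until some worker reports
            if item is _DONE:
                remaining -= 1
            else:
                total += item
                report(total)
    return total
```

Note that every single chunk crosses the queue and wakes up the main thread, so the cost of synchronization grows with the number of chunks rather than with the number of files.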

Performance Issues

Welp, I only said that it works, but I didn't mention the performance, which is terrible. I am pretty sure that the slowdown is in the synchronization, since the map_multithread call doesn't seem to trigger anything that may introduce any sort of blocking.

This seems like a lot of fun, and I hope I'll get better tomorrow to continue playing with it!


Sixth Check-In

McSinyx
Published: 08/10/2020

Hello there!

What did I do last week?

It has been quite a fun week for me, given the current state of development and the newly discovered bugs thanks to the pip 20.2 release:

  • Initiate discussion with the maintainers of pip on isolating networking code for late download in parallel (GH-8697)
  • Discuss the UI of parallel download (GH-8698)
  • Log debug information relating to the lazy wheel decision (GH-8710)
  • Disable caching for range requests (GH-8716)
  • Dedent late download logs (GH-8722)
  • Add a hook for batch downloading (third attempt I think) (GH-8737)
  • Test hash checking for fast-deps (GH-8743)

Did I get stuck anywhere?

Not exactly, everything is going smoothly and I'm feeling awesome!

What is coming up next?

I'll try to solve GH-8697 and GH-8698 within the next few days. I am optimistic that the parallel download prototype will be done within this week.


Sorting Things Out

McSinyx
Published: 08/03/2020

Hi! I really hope that everyone reading this is still doing okay, and if that isn't the case, I wish you a good day!

pip 20.2 Released!

Last Wednesday, pip 20.2 was released, delivering the 2020-resolver as well as many other improvements! I was lucky to get the fast-deps feature included as part of the release. A brief description of this experimental feature as well as testing instructions can be found on Python Discuss.

The public exposure of the feature also reminds me of some further optimizations to make to the lazy wheel. Hopefully, even without download parallelization, it will not be so slow that it puts concerned users of pip off testing it.

Preparation for Download Parallelization

As of this moment, we already have:

  • Multithreading pool fallback working
  • An opt-in to use lazy wheel to obtain dependency information, and thus get a list of wheels at the end of resolution ready to be downloaded together

What's left is only to interject a parallel download somewhere after the dependency resolution step. Still, this troubled me way more than I had ever imagined. I got so stuck that I had to give myself a day off in the middle of the week (and study some Rust), then I came up with something that was agreed upon as difficult to maintain.

Indeed, a large part of this is my fault, for not communicating the design thoroughly with pip's maintainers and not carefully noting stuff down during (verbal) discussions with my mentor. Thankfully Chris Hunt came to the rescue and did a refactoring that will make my future work much easier and cleaner.
