McSinyx's Blog

The Wonderful Wizard of O'zip

McSinyx
Published: 06/22/2020

Never give up… No one knows what's going to happen next.

Preface

Greetings and best wishes! I had a lot of fun during the last week, although admittedly nothing was really finished. In summary, here is what I worked on during the last seven days:

The multiprocessing[.dummy] wrapper

Yes, you read it right, this is the same section as in last fortnight's blog. My mentor Pradyun Gedam gave me the green light to have GH-8411 merged without support for Python 2 and without the non-lazy map variant, which turns out to be troublesome for multithreading.

The tests still need to pass of course, and the flaky tests (see the failing tests over Azure Pipelines in the past) really gave me a panic attack earlier today. We probably need to mark them as xfail or investigate why they are nondeterministic specifically on Azure, but the real reason I was all caught up and confused was that the unit tests I added mess with the cached imports, and since pip's tests are run in parallel, who knows what that might affect. I was so relieved to not discover any new set of tests made flaky by the ones I'm trying to add!

The file-like object mapping ZIP over HTTP

This is where the fun starts. Before we dive in, let's recall some background information. As discovered by Danny McClanahan in GH-7819, it is possible to download only a portion of a wheel and still have it be valid for pip to get the distribution's metadata. In the same thread, Daniel Holth suggested that one may use HTTP range requests to specifically ask for the tail of the wheel, where the ZIP's central directory record, as well as (usually) dist-info (the directory containing METADATA), can be found.
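
For illustration, such a request could look roughly like the snippet below, using requests; the URL is made up and 8000 is just an arbitrary guess at how much of the tail is enough:

    import requests

    # Hypothetical wheel URL; any file host supporting range requests would do.
    url = 'https://example.com/packages/foobar-1.0-py3-none-any.whl'

    # Ask only for the last 8 kB, where the end-of-central-directory record
    # (and, for many wheels, the dist-info metadata) usually lives.
    response = requests.get(url, headers={'Range': 'bytes=-8000'})
    if response.status_code == 206:  # Partial Content: the server honored the range
        tail = response.content
    else:  # the server ignored the header and sent the whole file
        tail = response.content[-8000:]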

Well, usually. While PEP 427 does indeed recommend

Archivers are encouraged to place the .dist-info files physically at the end of the archive. This enables some potentially interesting ZIP tricks including the ability to amend the metadata without rewriting the entire archive.

one of the mentioned tricks is adding shared libraries to wheels of extension modules (using e.g. auditwheel or delocate). Thus, for non-pure Python wheels, it is unlikely that the metadata lies in the last few megabytes. Ignoring source distributions is bad enough; we can't afford to make an optimization that doesn't work for extension modules, which are still an integral part of the Python ecosystem )-:

But hey, the ZIP's directory record is guaranteed to be at the end of the file! Couldn't we do something about that? The short answer is yes. The long answer is, well, yessssssss! With that, plus some magic provided by most operating systems, this is what we figured out:

  1. We can download a relatively small chunk at the end of the wheel until it is recognizable as a valid ZIP file.
  2. In order for the end of the archive to actually appear as the end to zipfile, we feed it an object with seek and read defined. Since navigating to the rear of the file is performed by calling seek with a relative offset and whence=SEEK_END (see man 3 fseek for more details), we can make a wheel in the cloud behave as if it were available locally (see the sketch after this list).
  3. For large wheels, it is better to store them on disk instead of in memory. For smaller ones, it is also preferable to store the data as a file, to avoid error-prone and often not really efficient manual tracking and joining of downloaded segments. We only use a small portion of the wheel, but just in case anyone is wondering, we have very little control over when tempfile.SpooledTemporaryFile rolls over, so the memory-disk hybrid does not work exactly as expected.
  4. With all these in mind, all we have to do is define an intermediate object that checks for local availability and downloads the missing data on calls to read, to lazily provide the data over HTTP and reduce execution time.
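
To make items 2 and 4 more concrete, here is a minimal sketch of such a file-like object. The class name and the details (requests for networking, a named temporary file as the backing store, naively re-downloading every requested range) are mine for illustration rather than the exact code in the pull request, and it assumes the server honors range requests:

    from tempfile import NamedTemporaryFile

    from requests import Session


    class LazyRemoteFile:
        """Sketch of a seek/read-able view of a file served over HTTP."""

        def __init__(self, url, session=None):
            self._url = url
            self._session = session or Session()
            # The file's size is needed so that seeking relative to SEEK_END
            # can be resolved to an absolute offset locally.
            head = self._session.head(url, allow_redirects=True)
            self._length = int(head.headers['Content-Length'])
            self._file = NamedTemporaryFile()
            self._file.truncate(self._length)

        def seek(self, offset, whence=0):
            return self._file.seek(offset, whence)

        def tell(self):
            return self._file.tell()

        def read(self, size=-1):
            start = self._file.tell()
            stop = self._length if size < 0 else min(start + size, self._length)
            if stop <= start:
                return b''
            # Fetch just the requested interval and splice it into the local file.
            response = self._session.get(
                self._url, headers={'Range': 'bytes=%d-%d' % (start, stop - 1)})
            self._file.seek(start)
            self._file.write(response.content)
            self._file.seek(start)
            return self._file.read(stop - start)

In principle, zipfile.ZipFile can be handed such an object: it seeks to the end to locate the central directory and only ever touches the ranges it actually reads.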

The only theoretical challenge left was to keep track of the downloaded intervals, which I finally figured out after some trial and error. The code was submitted as a pull request to pip at GH-8467. A more modern (read: Python 3-only) variant was packaged and uploaded to PyPI under the name lazip. I am unaware of any use case for it outside of pip, but it's certainly fun to play with d-:
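
The bookkeeping itself boils down to merging overlapping (or adjacent) byte ranges; the following is a rough illustration rather than the exact code in GH-8467:

    def merge(intervals):
        """Merge overlapping or adjacent (start, stop) half-open ranges."""
        result = []
        for start, stop in sorted(intervals):
            if result and start <= result[-1][1]:
                result[-1] = (result[-1][0], max(result[-1][1], stop))
            else:
                result.append((start, stop))
        return result


    # Downloaded chunks 0-100, 150-200 and 90-160 collapse into one interval.
    assert merge([(0, 100), (150, 200), (90, 160)]) == [(0, 200)]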

What's next?

I have been falling short of getting the PRs mentioned above merged for quite a while. With pip's next beta coming really soon, I have to somehow get the patches up to a certain standard and gather enough attention for them to be part of the pre-release, since beta-testing would greatly help the success of the GSoC project. To other GSoC students and mentors reading this, I hope your projects turn out successful too!


Second Check-In

McSinyx
Published: 06/15/2020

Hi everyone, and may the odds be ever in your favor, especially during this tough time!

What did I do last week?

Not as much as I wished, apparently (-:

Did I get stuck anywhere?

Yes, of course! The parallel maps are still stalling, as are the other small PRs mentioned above. The failures related to logging are still making me pull my hair out, and the proof of concept for partial wheel downloading is too ugly even for a PoC. I imagine that I will have a lot of cleaning up to do this week (yay!).

What is coming up next?

I'm trying to get the multi-{threading,processing} facilities merged ASAP to start rolling them out in practice. The first thing popping out of my head is to bring back the multi-threaded pip list -o.

The other experimental improvement (this phrase does not sound right!) I would like to get done is the partial wheel download. It would be really nice if I could get both of these included as unstable-features in the upcoming beta release of pip 20.2.


Unexpected Things When You're Expecting

McSinyx
Published: 06/08/2020

Hi everyone, I hope that you are all doing well and I wish you all good health! The last week has not been really kind to me, with a decent amount of academic pressure (my school year lasts until early July). It would be bold to say that I have spent even 10 hours working on my GSoC project since the last check-in, let alone the required 30 hours per week. That being said, there were still some discoveries that I wish to share.

The multiprocessing[.dummy] wrapper

Most of my time was spent finalizing the multi{processing,threading} wrapper for the map functions that submit tasks to the worker pool. To my surprise, it is rather difficult to write something that is not only portable but also easy to read and test.

As of the latest commit, I have realized the following:

  1. The multiprocessing module was not designed for its implementation details to be abstracted away entirely. For example, the lazy maps could be really slow without specifying a suitable chunk size (used to cut the input iterable into chunks and distribute them to workers in the pool). By suitable, I mean only an order of magnitude smaller than the input length. This defeats half of the purpose of making it lazy: allowing the input to be evaluated lazily (see the first sketch after this list). Luckily, in the use case I'm aiming for, the length of the iterable argument is small and the laziness is only needed for the output (to pipeline download and installation).
  2. Mocking import for testing purposes can never be pretty. One reason is that we (Python users) have very little control over the calls of import statements and their lower-level implementation, __import__. In order to properly patch this built-in function, unlike for others in the same group, we have to monkeypatch the name from builtins (or __builtin__ under Python 2) instead of the module doing the importing. Furthermore, because of the special namespacing, to avoid infinite recursion we need to alias the original function to a different name to fall back to.
  3. To add to the problem, multiprocessing lazily imports the fragile module during pool creation. Since the failure is platform-specific (due to the lack of sem_open), it was decided to perform the check upon the import of pip's wrapper module. Although this behavior is easier to reason about in human language, testing it requires invalidating the cached import and re-importing the wrapper module, as illustrated in the second sketch after this list.
  4. Last but not least, I now understand the pain of keeping Python 2 compatibility that many package maintainers still need to deal with every day (although Python 2 has reached its end-of-life, pip, for example, will still support it for another year).
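
Here is a small sketch of the trade-off described in point 1, using the thread-based pool; the task and the input are made up:

    from multiprocessing.dummy import Pool  # thread-based, same API as multiprocessing.Pool


    def fetch(url):
        """Stand-in for an I/O-bound task such as downloading a file."""
        return len(url)


    urls = ['https://example.com/%d.whl' % i for i in range(100)]  # made-up input

    with Pool(processes=5) as pool:
        # A larger chunksize reduces scheduling overhead, but each chunk is
        # consumed from the input eagerly, so a chunksize close to the input
        # length defeats the laziness on the input side.
        for result in pool.imap(fetch, urls, chunksize=10):
            pass  # results arrive (in order) as soon as their chunk is done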
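
And a rough pytest-style illustration of points 2 and 3; pip_map is a made-up stand-in for the wrapper module, and in reality the blocked import would be narrowed down to the module that actually needs sem_open:

    import builtins
    import importlib
    import sys

    # Keep a reference under a different name, otherwise the fake
    # __import__ below would recurse into itself.
    original_import = builtins.__import__


    def fake_import(name, *args, **kwargs):
        """Pretend the semaphore-backed parts of multiprocessing are missing."""
        if name.startswith('multiprocessing'):
            raise ImportError(name)
        return original_import(name, *args, **kwargs)


    def test_fallback_without_sem_open(monkeypatch):
        monkeypatch.setattr(builtins, '__import__', fake_import)
        sys.modules.pop('pip_map', None)  # invalidate the cached import
        pip_map = importlib.import_module('pip_map')  # re-import the wrapper
        # The wrapper should now expose a sequential fallback.
        assert list(pip_map.map_multithread(str, range(3))) == ['0', '1', '2']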

The change in direction

Since last week, my mentor Pradyun Gedam and I have set up a weekly real-time meeting (a fancy term for a video/audio chat in the worldwide quarantine era) for the entire GSoC period. During the last session, we decided to put the parallelization of downloads during resolution on hold, in favor of a more beneficial goal: partially downloading wheels during dependency resolution.

Assuming I'll reach the goal eventually

As discussed by Danny McClanahan and the maintainers of pip, it is feasible to download only a few kB of a wheel to obtain enough metadata for dependency resolution. While this is only applicable to wheels (i.e. prebuilt packages), other packaging formats make up less than 20% of downloads (at least on PyPI), and the figure is much lower for the most popular packages. Therefore, this optimization alone could bring the upcoming backtracking resolver's performance on par with the legacy one.

During the last few years, a lot of effort has been poured into replacing pip's current resolver, which is unable to resolve conflicts. While its correctness will be ensured by some of the most talented and hard-working developers in the Python packaging community, from the users' point of view it would be better if its performance did not lag behind the old one's. Aside from the increase in CPU cycles for more rigorous resolution, more I/O, especially networking operations, is expected to be performed. This is due to the lack of a standard and efficient way to acquire the metadata: unlike most package managers we are familiar with, pip has to fetch (and possibly build) the packages solely for dependency information.

Fortunately, PEP 427 recommends that package builders place the metadata at the end of the archive. This allows the resolver to fetch only the last few kB, using HTTP range requests, for the relevant information. Simply appending Range: bytes=-8000 to the request headers in pip._internal.network.download makes the resolution process lightning fast. Of course this breaks the installation, but I am confident that it is not difficult to implement this optimization cleanly.

One drawback of this optimization is compatibility. Not every Python package warehouse supports range requests, and it is not possible to verify a partial wheel against its hash. While the first case is unavoidable, the second matters less in practice: hash checking is usually used for pinned/locked-version requirements, where no backtracking is done during dependency resolution anyway.

Either way, before installation, the packages selected by the resolver can be downloaded in parallel. This guarantees a larger batch of downloads, compared to parallelization during resolution, where the number of concurrent downloads can be as low as one while trialing different versions of the same package.

Unfortunately, I have not been able to do much other than a minor clean-up. I am looking forward to accomplishing more this week and seeing where this path will lead us! At the moment, I am happy that I am able to meet the blog deadline, at least in UTC!


First Check-in

McSinyx
Published: 06/01/2020

Hi everyone, I am McSinyx, a Vietnamese undergraduate student who loves free software. This summer I am working with the maintainers and the contributors of pip to make the package manager download in parallel.

What did I do during the community bonding period?

Aside from bonding with pip's maintainers and contributors as well as with my mentors, I have also been experimenting with the theoretical and technical obstacles blocking this GSoC project. Pradyun Gedam (a mentor of mine) suggested making a proof of concept to determine whether parallel downloading can play nicely with ResolveLib's abstraction, and we are reviewing it together. On the technical side, pip's committers and I are exploring the available options for parallelization, and I made an attempt to make use of Python's standard worker pool in a portable way.
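
That attempt is still being polished, but its rough shape is something like the sketch below; the naming is mine and the exact guard pip ends up using may differ:

    # Fall back to a sequential map on platforms that cannot create pools
    # (for instance when sem_open is unavailable).
    try:
        # multiprocessing.synchronize fails to import without sem_open,
        # which keeps the failure at the import time of this module.
        import multiprocessing.synchronize  # noqa: F401
        from multiprocessing.dummy import Pool
    except ImportError:
        def map_multithread(func, iterable, chunksize=1):
            """Sequential fallback with the same lazy interface."""
            return map(func, iterable)
    else:
        def map_multithread(func, iterable, chunksize=1):
            """Lazily map func over iterable using a pool of threads."""
            with Pool() as pool:
                yield from pool.imap(func, iterable, chunksize)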

Did I get stuck anywhere?

Yes, of course! Neither of the experiments above is finished at the moment. Still, I am optimistic that the issues will not be real blockers and that we will figure them out in the next few days.

What is coming up next?

As planned, this week I am going to refactor pip's package downloading code. The main purpose is to decouple the networking code from the package preparation operation and to make sure that it is thread-safe.

In addition, I am also continuing the experiments mentioned above to gain better confidence in the future of this GSoC project.

To other GSoC students, mentors and admins reading this, I am wishing you all good health and successful projects this summer!
