Articles on adityaa30's Blog

Weekly Check In - 12

k.aditya00@gmail.com (adityaa30) — Sun, 30 Aug 2020 12:59:23 +0000

What did I do till now?

Last week I worked on finishing the HTTPNegotiateDownloadHandler. The download handler currently uses ALPN or NPN (whichever is available) to negotiate a protocol (one of HTTP/1.1 or HTTP/2 for now) with the remote server and issues the request through the corresponding download handler. For now, all requests made via a proxy are issued directly using the HTTP11DownloadHandler.
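
The dispatch described above could look roughly like the sketch below. It is only an illustration: pick_handler is a hypothetical helper, and falling back to HTTP/1.1 when nothing was negotiated is an assumption rather than something confirmed by the actual handler.

```python
from typing import Optional

# The two protocol ids are the standard ALPN values; the mapping itself is illustrative.
HANDLERS = {b"h2": "H2DownloadHandler", b"http/1.1": "HTTP11DownloadHandler"}


def pick_handler(negotiated: Optional[bytes], uses_proxy: bool) -> str:
    """Return the name of the download handler that should issue the request."""
    if uses_proxy:
        # Proxied requests skip negotiation and go straight to HTTP/1.1.
        return "HTTP11DownloadHandler"
    # Assumption: fall back to HTTP/1.1 when nothing (or something unknown) was negotiated.
    return HANDLERS.get(negotiated, "HTTP11DownloadHandler")
```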

What's coming up next?

I plan to continue working on implementing the CONNECT method for HTTP/2.

Did I get stuck anywhere?

Yep. I was stuck for almost a week on the CONNECT protocol. Now, I have managed to fix the bug where the raw TCP connection instance could not be switched to HTTP/2. However, there are some issues during the TLS handshake with the final target resource 😥.

Weekly Check In - 11

k.aditya00@gmail.com (adityaa30) — Sat, 22 Aug 2020 07:38:04 +0000

What did I do till now?

Last week I finalized the PR for the basic implementation of the H2ClientProtocol. The protocol now works with all the request methods except the CONNECT method. The work on tunneling using the CONNECT method is still in progress. I also started creating another protocol for negotiation, which uses ALPN or NPN (whichever is available) to negotiate a protocol (presently one of HTTP/1.1 or HTTP/2) with the remote server, based on the priority given by the user via the Scrapy project's settings, and then uses the respective download handler to complete the request.
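
To make the priority idea concrete, here is a minimal sketch using only the standard library's ssl module instead of Twisted. The names alpn_offer and make_context and the user-facing labels "HTTP/2" and "HTTP/1.1" are assumptions made for illustration, not the project's real settings.

```python
import ssl
from typing import List

# ALPN protocol ids registered for HTTP/2 over TLS and for HTTP/1.1.
ALPN_IDS = {"HTTP/2": "h2", "HTTP/1.1": "http/1.1"}


def alpn_offer(priority: List[str]) -> List[str]:
    """Translate a user-facing priority list into ALPN ids, keeping the order."""
    return [ALPN_IDS[name] for name in priority if name in ALPN_IDS]


def make_context(priority: List[str]) -> ssl.SSLContext:
    """Build a client TLS context that advertises the protocols in priority order."""
    context = ssl.create_default_context()
    # The client only advertises its preference; the server makes the final choice.
    context.set_alpn_protocols(alpn_offer(priority))
    return context
```

After the handshake, the negotiated protocol decides which download handler completes the request.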

What's coming up next?

This week I am mainly working on finishing the negotiation protocol.

Did I get stuck anywhere?

Nope. I spent most of my time last week planning and settling on a clean architecture. Apart from that, there were no major blockers :)

Weekly Check In - 10

k.aditya00@gmail.com (adityaa30) — Thu, 13 Aug 2020 00:38:21 +0000

What did I do till now?

I started implementing the CONNECT method for tunneling via HTTP/2. After a lot of testing, I realized that the approach I was taking was not really feasible, so next I plan to work on an approach that first uses an HTTP/1.1 CONNECT to establish a connection with the proxy and then switches to HTTP/2 for all the requests made via the proxy.
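
As a rough illustration of the first step of that plan, the sketch below builds an HTTP/1.1 CONNECT request and checks the proxy's status line. Both helper names are hypothetical, and the TLS handshake plus the HTTP/2 traffic that would follow inside the tunnel are left out.

```python
def build_connect_request(host: str, port: int) -> bytes:
    """Build an HTTP/1.1 CONNECT request asking the proxy to open a tunnel."""
    authority = f"{host}:{port}"
    lines = [f"CONNECT {authority} HTTP/1.1", f"Host: {authority}", "", ""]
    return "\r\n".join(lines).encode("ascii")


def tunnel_established(response_head: bytes) -> bool:
    """Return True if the proxy answered the CONNECT with a 2xx status."""
    status_line = response_head.split(b"\r\n", 1)[0]
    parts = status_line.split(b" ", 2)
    return len(parts) >= 2 and parts[1].isdigit() and parts[1].startswith(b"2")
```

Once the proxy replies with a 2xx status, the same TCP connection carries the TLS handshake with the target, and only then can HTTP/2 be negotiated.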

What's coming up next?

Next week, I plan to

Make the PR for H2ClientProtocol ready to be merged with master: verify that all cases are covered by tests, that the other tests pass, and that no bugs are introduced
Implement the CONNECT method using a combination of HTTP/1.1 and HTTP/2

Did I get stuck anywhere?

Yes, this week I had many problems while adding tunneling support for proxies. I have planned a completely different approach for next week using HTTP/1.1 and HTTP/2. Let's see how it goes :)

Weekly Check In - 9

k.aditya00@gmail.com (adityaa30) — Thu, 06 Aug 2020 01:51:33 +0000

What did I do till now?

Last week I completed the ScrapyH2ProxyAgent implementation and added the required tests. I also went through the codebase of the hyper-h2 library to get insight into how it implements the CONNECT method for HTTP/2.

What's coming up next?

Next week I plan to finish working on ScrapyTunnelingH2Agent, which enables a user to create an SSL tunnel and proxy requests through it.

Did I get stuck anywhere?

Yeah, I am stuck on a weird problem where two test cases collide: they are not related to each other, but they fail when I run them together and pass when I run them separately. I'm still working on finding a fix!

Weekly Check In - 8

k.aditya00@gmail.com (adityaa30) — Thu, 30 Jul 2020 03:41:48 +0000

What did I do till now?

Last week I added tests for H2Agent and H2DownloaderHandler.

What's coming up next?

Next week I plan to continue working on ScrapyTunnelingH2Agent.

Did I get stuck anywhere?

Yes. I got stuck for a long time while setting up the testing environment for H2DownloaderHandler. The problem was a bit weird: until now Scrapy had been using Twisted's WrappingFactory class to wrap the Site instance, which (for reasons I don't know) allows only up to HTTP/1.1, and it took me a long time to realize this. After removing the WrappingFactory, the test environment was set up as required. Apart from this, another hurdle I'm still facing is the CONNECT method in HTTP/2: I couldn't find many blogs or articles on it to get a better idea. I plan to look at how some open-source libraries implement HTTP/2 CONNECT now.

Weekly Check In - 7

k.aditya00@gmail.com (adityaa30) — Thu, 23 Jul 2020 14:46:01 +0000

What did I do till now?

This week I implemented the ScrapyH2Agent, which H2DownloadHandler uses directly to issue requests. Internally the ScrapyH2Agent uses

H2Agent ✅
ScrapyProxyH2Agent ✅
ScrapyTunnelingH2Agent

The ScrapyTunnelingH2Agent is still a work in progress. Besides the coding part, I read articles on how the CONNECT method works for HTTP/2 in order to implement the tunneling agent.

What's coming up next?

This week I plan to

Complete ScrapyTunnelingH2Agent implementation
Add public documentation on how to use H2DownloaderHandler
Add unit tests for H2Agent

Did I get stuck anywhere?

This week I did not face any major blockers 🙂

Weekly Check In - 6

k.aditya00@gmail.com (adityaa30) — Wed, 15 Jul 2020 20:04:20 +0000

What did I do till now?

Last week I was implementing

H2Agent
H2ConnectionPool
H2DownloadHandler (Work In Progress)

The above classes add the following features

H2ConnectionPool maintains a pool of all HTTP/2 connections. It works by creating a map from (uri.scheme, uri.host, uri.port) to the H2ClientProtocol instance. Suppose we get N requests in total, each with its own remote URL, and there are M unique base URLs; then the pool maintains at most M connections, where M <= N always. For any request we simply check whether an H2 connection is already established: if so we use it, otherwise we create a new connection.
H2Agent is responsible for issuing the request, internally using the H2ConnectionPool to establish a new connection if required or to reuse a cached one. The H2Agent also wraps the context factory provided as a constructor argument using H2WrappedContextFactory, which updates the ClientTLSOptions context to use only h2 as an acceptable protocol during NPN or ALPN. The constructor signature of H2Agent is exactly the same as that of Twisted's Agent class, so that it is easy to integrate into Scrapy.
H2DownloadHandler is Scrapy's way of issuing requests. There are similar download handlers for HTTP/1.x and other protocols. I have completed a basic implementation which supports HTTPS requests. I'm still working on integrating this fully into Scrapy.
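
The pooling idea can be sketched in plain Python as follows. This is not the real H2ConnectionPool: ConnectionPoolSketch, its connect callback and the DEFAULT_PORTS table are hypothetical, and the real pool works with Twisted Deferreds rather than returning connections synchronously.

```python
from typing import Callable, Dict, Tuple
from urllib.parse import urlsplit

Key = Tuple[str, str, int]
DEFAULT_PORTS = {"http": 80, "https": 443}


class ConnectionPoolSketch:
    """Keep at most one connection per (scheme, host, port)."""

    def __init__(self, connect: Callable[[Key], object]) -> None:
        self._connect = connect  # hypothetical factory that opens a connection
        self._connections: Dict[Key, object] = {}

    @staticmethod
    def key_for(url: str) -> Key:
        parts = urlsplit(url)
        port = parts.port or DEFAULT_PORTS[parts.scheme]
        return (parts.scheme, parts.hostname, port)

    def get(self, url: str) -> object:
        key = self.key_for(url)
        if key not in self._connections:
            # First request for this origin: open and cache a new connection.
            self._connections[key] = self._connect(key)
        return self._connections[key]

    def __len__(self) -> int:
        return len(self._connections)
```

With N URLs spread over M origins, the pool ends up holding exactly M entries, which matches the M <= N bound above.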

Apart from the above classes, I added an idle timeout to H2ClientProtocol using Twisted's TimeoutMixin. If the connection is idle for too long (~240 seconds), it closes itself and fires a Deferred which is handled by H2ConnectionPool, so that upcoming requests do not use a closed connection and instead create a new one if required.
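
The interplay between the idle timeout and the pool can be illustrated with a plain-Python stand-in. It does not use Twisted's TimeoutMixin or a Deferred; IdleTimer, its injectable clock and the close callbacks are hypothetical names chosen to keep the example self-contained.

```python
from typing import Callable, List


class IdleTimer:
    """Track activity and notify listeners once a connection has idled too long."""

    def __init__(self, timeout: float, clock: Callable[[], float]) -> None:
        self._timeout = timeout
        self._clock = clock
        self._last_activity = clock()
        self._on_close: List[Callable[[], None]] = []
        self.closed = False

    def add_close_callback(self, callback: Callable[[], None]) -> None:
        self._on_close.append(callback)

    def touch(self) -> None:
        """Record activity, for example whenever a frame is sent or received."""
        self._last_activity = self._clock()

    def check(self) -> None:
        """Close and notify listeners if the idle period has been exceeded."""
        if not self.closed and self._clock() - self._last_activity >= self._timeout:
            self.closed = True
            for callback in self._on_close:
                callback()
```

A pool would register a callback that drops the cached entry, so the next request for that origin opens a fresh connection.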

What's coming up next?

This week I plan to

Write tests for the Idle timeout and H2ClientProtocol
Complete the implementation of H2DownloadHandler

Did I get stuck anywhere?

Yes. Most of last week I was working on a bug where the _StandardEndpointFactory wouldn't establish a proper HTTP/2 connection. The only error I had was "Connection was closed in an un-clean manner", which did not really help. The error stack was also not very helpful. I really had to dive deep for this, which gave me some amazing insights into how Twisted and the TLS handshake work internally. I found that the connection was actually established but the problem was in the TLS handshake. For some reason, specifying the acceptable protocols as h2 in SSL.Context before the connection even starts to be established works, but anything else, including updating the acceptable protocols list during the handshake, does not work! I still don't know what the exact problem is, but I do have a working fix now. I do wonder what the reason may be behind the connection failing when we specify the acceptable protocols list during the TLS handshake in Twisted 🤔; I'll probably look again if I find some time this week. To integrate the fix into my codebase I created a wrapper class which wraps any context factory that implements IPolicyForHTTPS and updates the acceptable protocols list to [b'h2'].

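from twisted.internet._sslverify import ClientTLSOptions, _setAcceptableProtocols
from twisted.web.iweb import IPolicyForHTTPS
from zope.interface import implementer
from zope.interface.verify import verifyObject


@implementer(IPolicyForHTTPS)
class H2WrappedContextFactory:
    def __init__(self, context_factory) -> None:
        # Fail early if the wrapped object is not a valid HTTPS policy.
        verifyObject(IPolicyForHTTPS, context_factory)
        self._wrapped_context_factory = context_factory

    def creatorForNetloc(self, hostname, port) -> ClientTLSOptions:
        options = self._wrapped_context_factory.creatorForNetloc(hostname, port)
        # Restrict ALPN/NPN to h2 before the connection is started.
        _setAcceptableProtocols(options._ctx, [b'h2'])
        return options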

Apart from the above bug I did have some minor issues, but those were quick to fix 🙂

Weekly Check In - 5

k.aditya00@gmail.com (adityaa30) — Mon, 06 Jul 2020 16:16:49 +0000

What did I do till now?

I was going through Twisted's implementation of HTTP/1.x and how it handles multiple requests. I was focusing on the implementation of HTTPConnectionPool, which is responsible for establishing a new connection whenever required and for reusing an existing (cached) connection.

Besides this, I made the requested changes to the HTTP/2 Client implementation.

What's coming up next?

Next week I plan to finish coding H2ConnectionPool and its integration with HTTP2ClientProtocol. Along with the integration I plan to write unit tests as well.

Did I get stuck anywhere?

No. I mostly read a lot of documentation and the Twisted codebase throughout this week, and fixed the bugs found in the HTTP/2 Client implementation.

Weekly Check In - 4

k.aditya00@gmail.com (adityaa30) — Tue, 30 Jun 2020 02:44:48 +0000

What did I do till now?

Last week I was working on

Writing tests for HTTP2ClientProtocol
Add support for large number of requests over a single connection

I finished both of the tasks above. I added inline docstrings for most of the methods. Still working on public documentation!

What's coming up next?

Next week I plan to

Start working on H2ConnectionPool and H2ClientFactory, which are responsible for handling multiple connections to different authorities. The present implementation is capable of handling a large number of requests over a single connection to only one authority.
Finish the public documentation of HTTP2ClientProtocol

Did I get stuck anywhere?

I am very new to writing tests using Twisted Trial, so I ran into minor bugs while setting up the testing environment and writing tests. Apart from this, there were no major blockers during the last week 😁

Weekly Check In - 3

k.aditya00@gmail.com (adityaa30) — Mon, 22 Jun 2020 19:37:21 +0000

What did I do till now?

I finished the HTTP2 Client Protocol implementation.

What's coming up next?

Next week I plan to

Write unit tests for HTTP2 Client Protocol
Add required documentation

I have kept the goals for the next week simple as I think there will be some errors uncovered while unit testing which can take time. As the HTTP2 Client Protocol is the core component of this project I have planned this whole week for it.

Did I get stuck anywhere?

Yes, I was stuck on a bug where the HTTP/2 client worked only for requests whose data fit in one DATA frame. When the request body became large, it had to be broken into a lot of data chunks and sent frame by frame, and along with this I had to manage flow control for the stream on which the request was initiated. This was not working at all. Generally, the remote peer should send a WINDOW_UPDATE frame to signal that it has received the sent data chunks and can receive more. I was getting a WINDOW_UPDATE for the whole HTTP/2 connection but not for the stream on which the request was initiated. Initially I didn't know what to do because this was something very new to me and unexpected at the same time 😟. After some discussions with my mentors and reading the HTTP/2 RFC, I realized that it is okay to receive a WINDOW_UPDATE frame for the whole connection rather than for a specific stream, and that in terms of flow control both are handled the same way. So I was finally able to fix this bug and finish a working implementation of the HTTP/2 client. Yaaay 🥳.
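
For readers new to HTTP/2 flow control, here is a small sketch of how a sender might decide what it can send at a given moment. HTTP/2 keeps one flow-control window per stream and one for the whole connection, and a sender has to stay within both. The function split_for_sending is a hypothetical helper, not part of the client, and 16384 is the protocol's default maximum frame size.

```python
from typing import List


def split_for_sending(body: bytes, stream_window: int, connection_window: int,
                      max_frame_size: int = 16384) -> List[bytes]:
    """Return the DATA payloads that may be sent right now."""
    # The sender may not exceed either window, so the smaller one is the budget.
    budget = min(stream_window, connection_window, len(body))
    sendable = body[:budget]
    # Cut the sendable part into frames no larger than the maximum frame size.
    return [sendable[i:i + max_frame_size]
            for i in range(0, len(sendable), max_frame_size)]
```

Whatever does not fit in the budget has to wait until a WINDOW_UPDATE frame enlarges the window that is currently the limiting one.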

Weekly Check In - 2

k.aditya00@gmail.com (adityaa30) — Mon, 15 Jun 2020 23:18:18 +0000

What did I do till now?

I added support for both GET and POST requests in the HTTP/2 Client. I also read up on setting up tests with Twisted.

What's coming up next?

Next week I plan to

Finish up with HTTP/2 Client Protocol implementation
Add tests & documentation

Did I get stuck anywhere?

In my first approach, while testing, I realized that the client worked for requests whose response size was less than the total flow control window. However, when a really large response was expected, the client waited indefinitely and eventually timed out. The fix for that was relatively simple: acknowledge each data frame received 😁. This week I also tried to set up a testing environment using the built-in MockServer in Scrapy, which I have not yet managed to do because of an issue with setting up the HTTP/2 connection between my client and the custom server. Still working on that!
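
The stall and its fix can be reproduced with a toy simulation. This is only a model of the behaviour, not the client's code: bytes_received is a hypothetical function, and 65535 in the usage note is HTTP/2's default initial window size.

```python
def bytes_received(total: int, window: int, acknowledge: bool) -> int:
    """Simulate a sender that stops once the receiver's window is used up."""
    received = 0
    available = window
    while received < total and available > 0:
        chunk = min(total - received, available)
        received += chunk
        available -= chunk
        if acknowledge:
            # Acknowledging consumed data reopens the window (a WINDOW_UPDATE).
            available += chunk
    return received
```

Without acknowledgements a 200000-byte response stops after 65535 bytes, which from the client's side looks like waiting forever; with acknowledgements the whole body arrives.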

Weekly Check In - 1

k.aditya00@gmail.com (adityaa30) — Thu, 11 Jun 2020 06:02:53 +0000

What did I do till now?

As the Community Bonding phase finished I started coding the HTTP/2 Client Protocol. I started simple with adding support for GET requests.

What's coming up next?

Next week I plan to

Add support for GET and POST requests for HTTP/2
Setup base classes used for testing the Client Protocol

Did I get stuck anywhere?

Initially I was intimidated by some of the libraries that I was using for my project. Now I am comfortable working with them. I was stuck on the issue of combining, in the proper order, the different chunks of data received from the server for multiple streams, but now it's fixed 😊
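
One simple way to think about the chunk-combining problem is to keep a separate buffer per stream id. The sketch below assumes the frames have already been decoded into (stream_id, chunk) pairs; reassemble is a hypothetical helper and not the code used in the client.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def reassemble(frames: Iterable[Tuple[int, bytes]]) -> Dict[int, bytes]:
    """Group interleaved (stream_id, chunk) pairs into one body per stream."""
    buffers: Dict[int, bytearray] = defaultdict(bytearray)
    for stream_id, chunk in frames:
        # Chunks of a single stream arrive in order, so appending preserves the body.
        buffers[stream_id] += chunk
    return {stream_id: bytes(buffer) for stream_id, buffer in buffers.items()}
```

Chunks from different streams can interleave freely on the wire, but each stream's own chunks keep their relative order, which is why appending per stream is enough.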

Weekly Check In - 0

k.aditya00@gmail.com (adityaa30) — Sun, 31 May 2020 21:48:17 +0000

Hello, I am Aditya Kumar. I will be contributing to Scrapy during GSoC'20. This is my first blog of the series.

What did I do till now?

I had two meetings with my mentors to discuss the project goals and deadlines
I was looking into the implementation of HTTP/2 clients in various libraries to get a better picture

What's coming up next?

Next week, I would work on implementing a simple HTTP/2 Client which can handle GET, POST & HEAD requests.

Did I get stuck anywhere?

Last week I was mainly working through tested example code that doubles as tutorials, so I didn't come across any bugs.

</article>