Weekly Check In - 7
adityaa30
Published: 07/23/2020
What did I do till now?
This week I implemented the ScrapyH2Agent, which is used directly by H2DownloadHandler to issue requests. Internally, the ScrapyH2Agent uses:
- H2Agent ✅
- ScrapyProxyH2Agent ✅
- ScrapyTunnelingH2Agent
The ScrapyTunnelingH2Agent is still a work in progress. Besides the coding part, I read articles on how the CONNECT method works in HTTP/2 in order to implement the tunneling agent.
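As a rough illustration of what that reading covered: per RFC 7540 (section 8.3), an HTTP/2 CONNECT request carries the `:method` and `:authority` pseudo-headers and omits `:scheme` and `:path`. The sketch below only builds such a header list; the function name `connect_request_headers` and the example proxy header are hypothetical and are not taken from ScrapyTunnelingH2Agent.

```python
from typing import Iterable, List, Tuple

Header = Tuple[str, str]


def connect_request_headers(
    host: str, port: int, extra: Iterable[Header] = ()
) -> List[Header]:
    """Build the header list for an HTTP/2 CONNECT request (hypothetical helper).

    RFC 7540 section 8.3: ":method" is CONNECT, ":authority" holds the
    host and port to tunnel to, and ":scheme" / ":path" must be omitted.
    """
    headers: List[Header] = [
        (":method", "CONNECT"),
        (":authority", f"{host}:{port}"),
    ]
    # Regular headers (e.g. proxy credentials) follow the pseudo-headers.
    headers.extend(extra)
    return headers


headers = connect_request_headers(
    "example.com", 443, extra=[("proxy-authorization", "Basic dXNlcjpwYXNz")]
)
```

Once the proxy answers with a 2xx status, the stream's DATA frames carry the tunnelled bytes.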
What's coming up next?
This week I plan to
- Complete ScrapyTunnelingH2Agent implementation
- Add public documentation on how to use H2DownloadHandler
- Add unit tests for H2Agent
Did I get stuck anywhere?
This week I did not face any major blockers 🙂
Weekly Check In - 6
adityaa30
Published: 07/15/2020
What did I do till now?
Last week I was implementing
- H2Agent
- H2ConnectionPool
- H2DownloadHandler (Work In Progress)
The above classes add the following features:
- H2ConnectionPool maintains a pool of all HTTP/2 connections. It maps each (uri.scheme, uri.host, uri.port) key to an H2ClientProtocol instance. Suppose we get N requests in total, each with its own remote URL, and there are M unique base URLs among them; the pool then maintains at most M connections, where M <= N always holds. For any request, we check whether an HTTP/2 connection to its base URL is already established: if so we reuse it, otherwise we create a new one.
- H2Agent is responsible for issuing requests. Internally it uses the H2ConnectionPool to establish a new connection if required, or to reuse a cached one. The H2Agent also wraps the context factory passed to its constructor in H2WrappedContextFactory, which updates the ClientTLSOptions context to accept only h2 as the protocol during NPN or ALPN negotiation. The constructor signature of H2Agent is exactly the same as that of Twisted's Agent class, so it is easy to integrate into Scrapy.
- H2DownloadHandler is Scrapy's way of issuing requests. There are similar download handlers for HTTP/1.x and other protocols. I have completed a basic implementation which supports HTTPS requests, and I'm still working on integrating it fully into Scrapy.
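The pooling idea can be sketched in a few lines of plain Python. This is a simplified illustration and not the actual H2ConnectionPool: the names `pool_key` and `ConnectionPoolSketch` are hypothetical, connections are stand-in objects, and the real pool works with Deferreds and H2ClientProtocol instances.

```python
from typing import Callable, Dict, Tuple
from urllib.parse import urlparse

Key = Tuple[str, str, int]
DEFAULT_PORTS = {"http": 80, "https": 443}


def pool_key(url: str) -> Key:
    """Reduce a URL to its (scheme, host, port) base."""
    parts = urlparse(url)
    port = parts.port or DEFAULT_PORTS[parts.scheme]
    return (parts.scheme, parts.hostname, port)


class ConnectionPoolSketch:
    """Keep at most one connection per (scheme, host, port) key."""

    def __init__(self, connect: Callable[[Key], object]) -> None:
        self._connect = connect  # assumed factory that opens a connection
        self._connections: Dict[Key, object] = {}

    def get_connection(self, url: str) -> object:
        key = pool_key(url)
        if key not in self._connections:
            # No cached connection for this base URL: create one.
            self._connections[key] = self._connect(key)
        return self._connections[key]

    def remove_connection(self, key: Key) -> None:
        # Called when a connection closes, so it is never reused.
        self._connections.pop(key, None)

    def __len__(self) -> int:
        return len(self._connections)


pool = ConnectionPoolSketch(connect=lambda key: object())
urls = [
    "https://example.com/a",
    "https://example.com/b",
    "https://example.org/",
]
connections = [pool.get_connection(url) for url in urls]
```

Here N = 3 requests share M = 2 connections, because the first two URLs reduce to the same key.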
Apart from the above classes, I added an idle timeout to H2ClientProtocol using Twisted's TimeoutMixin. If the connection stays idle for too long (~240 seconds), it closes itself and fires a Deferred which is handled by H2ConnectionPool, so that upcoming requests do not use the closed connection and instead create a new one if required.
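The idle-timeout behaviour can be illustrated without Twisted. The sketch below is a plain-Python stand-in, not TimeoutMixin itself: the class name `IdleTimer`, the injected clock, and the `on_idle` callback (which plays the role of the Deferred handled by the pool) are all assumptions made for illustration.

```python
from typing import Callable


class IdleTimer:
    """Close a connection once it has seen no activity for `timeout` seconds."""

    def __init__(
        self,
        timeout: float,
        now: Callable[[], float],
        on_idle: Callable[[], None],
    ) -> None:
        self._timeout = timeout
        self._now = now  # injected clock, so the sketch is testable
        self._on_idle = on_idle  # e.g. tells the pool to drop this connection
        self._last_activity = now()
        self.closed = False

    def touch(self) -> None:
        """Record activity, such as a frame being sent or received."""
        if not self.closed:
            self._last_activity = self._now()

    def check(self) -> bool:
        """Close the connection if it has been idle for too long."""
        idle_for = self._now() - self._last_activity
        if not self.closed and idle_for >= self._timeout:
            self.closed = True
            self._on_idle()
        return self.closed


clock = [0.0]
events = []
timer = IdleTimer(240, now=lambda: clock[0], on_idle=lambda: events.append("closed"))

clock[0] = 200.0
timer.touch()  # activity resets the idle period
clock[0] = 400.0
still_open = not timer.check()  # only 200 seconds idle
clock[0] = 440.0
closed_now = timer.check()  # 240 seconds idle: connection closes
```

The callback fires exactly once, which mirrors the pool being notified a single time when a connection goes away.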
What's coming up next?
This week I plan to
- Write tests for the idle timeout and H2ClientProtocol
- Complete the implementation of H2DownloadHandler
Did I get stuck anywhere?
Yes. I spent most of last week on a bug where the _StandardEndpointFactory would not establish a proper HTTP/2 connection. The only error I got was "Connection was closed in an un-clean manner", which did not really help, and the error stack was not very helpful either. I really had to dive deep for this, which gave me some amazing insights into how Twisted and the TLS handshake work internally. I found that the connection was actually established and that the problem was in the TLS handshake. For some reason, specifying h2 as the acceptable protocol in SSL.Context before the connection starts to be established works, but anything else, including updating the acceptable protocols list during the handshake, does not!
I still don't know what the exact problem is, but I do have a working fix now. I do wonder why the connection fails when we specify the acceptable protocols list during the TLS handshake in Twisted 🤔; I'll probably look into it again if I find some time this week. To integrate the fix into my codebase, I created a wrapper class which wraps any context factory that implements IPolicyForHTTPS and updates the acceptable protocols list to [b'h2'].
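from twisted.internet._sslverify import ClientTLSOptions, _setAcceptableProtocols
from twisted.web.iweb import IPolicyForHTTPS
from zope.interface import implementer
from zope.interface.verify import verifyObject


@implementer(IPolicyForHTTPS)
class H2WrappedContextFactory:
    def __init__(self, context_factory) -> None:
        verifyObject(IPolicyForHTTPS, context_factory)
        self._wrapped_context_factory = context_factory

    def creatorForNetloc(self, hostname, port) -> ClientTLSOptions:
        options = self._wrapped_context_factory.creatorForNetloc(hostname, port)
        # Restrict protocol negotiation to HTTP/2 before the handshake starts.
        _setAcceptableProtocols(options._ctx, [b'h2'])
        return options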
Apart from the above bug, I did have some minor issues, but those were quick to fix 🙂
Weekly Check In - 5
adityaa30
Published: 07/06/2020
What did I do till now?
I was going through Twisted's implementation of HTTP/1.x and how it handles multiple requests. I focused on its implementation of HTTPConnectionPool, which is responsible for establishing a new connection whenever required and for reusing an existing (cached) connection.
Besides this, I made the requested changes to the HTTP/2 Client implementation.
What's coming up next?
Next week I plan to finish coding H2ConnectionPool and its integration with HTTP2ClientProtocol. Along with the integration I plan to write unit tests as well.
Did I get stuck anywhere?
No. I mostly read a lot of documentation and the Twisted codebase throughout this week, and fixed the bugs found in the HTTP/2 Client implementation.
Weekly Check In - 4
adityaa30
Published: 06/30/2020
What did I do till now?
Last week I was working on
- Writing tests for HTTP2ClientProtocol
- Adding support for a large number of requests over a single connection
I finished both of the tasks above. I added inline docstrings for most of the methods. Still working on public documentation!
What's coming up next?
Next week I plan to
- Start working on H2ConnectionPool and H2ClientFactory, which are responsible for handling multiple connections to different authorities. The present implementation can handle a large number of requests over a single connection to only one authority.
- Finish the public documentation of HTTP2ClientProtocol
Did I get stuck anywhere?
I am very new to writing tests with Twisted Trial, so I ran into minor bugs while setting up the testing environment and writing tests. Apart from this, there were no major blockers during the last week 😁
Weekly Check In - 3
adityaa30
Published: 06/22/2020
What did I do till now?
I finished the HTTP2 Client Protocol implementation.
What's coming up next?
Next week I plan to
- Write unit tests for HTTP2 Client Protocol
- Add required documentation
I have kept the goals for next week simple, as I think unit testing will uncover some errors which can take time to fix. As the HTTP2 Client Protocol is the core component of this project, I have set aside the whole week for it.
Did I get stuck anywhere?
Yes. I was stuck on a bug where the HTTP2 Client worked only for requests whose body fits in one DATA frame. When the request body became large, it had to be broken into many data chunks and sent frame by frame, and along with this I had to manage flow control for the stream on which the request was initiated. This was not working at all. Generally, the remote peer should send a WINDOW_UPDATE frame to signal that it has received the data chunks sent so far and can receive more now. I was getting a WINDOW_UPDATE for the whole HTTP/2 connection but not for the stream on which the request was initiated. Initially I didn't know what to do, because this was very new to me and unexpected at the same time 😟. After some discussions with my mentors and reading the HTTP/2 RFC, I realized that it is okay to receive a WINDOW_UPDATE frame for the whole connection instead of for a specific stream, and that in terms of flow control both are handled the same way. So, finally, I was able to fix this bug and finish a working implementation of the HTTP/2 Client. Yaaay 🥳
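To make the flow-control bookkeeping concrete, here is a minimal sketch in plain Python. It is not the project's code: the names `FlowState` and `send_body` are hypothetical, and the constants are the RFC 7540 defaults (a 65,535-byte initial window and a 16,384-byte maximum frame size). A sender may only send what both the connection window and the stream window allow, so when the peer has advertised a large stream window, connection-level WINDOW_UPDATE frames alone are enough to keep the upload moving.

```python
from dataclasses import dataclass
from typing import List, Tuple

DEFAULT_WINDOW = 65_535  # RFC 7540 initial flow-control window
MAX_FRAME_SIZE = 16_384  # RFC 7540 default SETTINGS_MAX_FRAME_SIZE


@dataclass
class FlowState:
    connection_window: int = DEFAULT_WINDOW
    stream_window: int = DEFAULT_WINDOW

    def sendable(self) -> int:
        """Largest DATA payload that may be sent right now."""
        limit = min(self.connection_window, self.stream_window, MAX_FRAME_SIZE)
        return max(0, limit)

    def on_data_sent(self, size: int) -> None:
        # Sent DATA counts against both windows.
        self.connection_window -= size
        self.stream_window -= size

    def on_window_update(self, increment: int, stream_level: bool) -> None:
        if stream_level:
            self.stream_window += increment
        else:
            self.connection_window += increment


def send_body(body: bytes, flow: FlowState) -> Tuple[List[bytes], bytes]:
    """Send as many chunks as the windows allow; return (sent chunks, rest)."""
    chunks: List[bytes] = []
    while body:
        size = min(len(body), flow.sendable())
        if size == 0:
            break  # blocked until a WINDOW_UPDATE arrives
        chunks.append(body[:size])
        flow.on_data_sent(size)
        body = body[size:]
    return chunks, body


# Assumption: the peer advertised a large per-stream window in its SETTINGS.
flow = FlowState(stream_window=1_000_000)
first_chunks, rest = send_body(b"x" * 100_000, flow)

# Only a connection-level WINDOW_UPDATE arrives, yet sending can resume.
flow.on_window_update(65_535, stream_level=False)
more_chunks, rest_after = send_body(rest, flow)
```

The first call stops once the connection window is exhausted, and the second finishes the body after the connection-level update.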