GSoC Final Blog Post

Akshay_Sharma
Published: 08/21/2021

Hello y'all

This is my final blog post for this year's GSoC program, since this is the final week. I will try to summarize my work and my experience over the last 10 weeks.

First of all, I would like to thank my mentors Adrian Chaves and Eugenio Lacuesta for their consistent support throughout the program. I would not have made it this far in GSoC without their helpful guidance and valuable code reviews. The experience I have gained is incomparable, and I have learned a lot of amazing things about Python, unit testing, GitHub and much more.

Thanks to Google and the Python Software Foundation for letting me be a part of this great journey.

I have created a MIME (Multipurpose Internet Mail Extensions) sniffing library entirely from scratch, implementing the MIME Sniffing Standard.

Following are the links to my final work:

1. https://github.com/scrapy/xtractmime
2. https://github.com/scrapy/scrapy/pull/5204

xtractmime library

The first link points to the GitHub repository of the MIME sniffing library. The library is complete, with implementation, tests and documentation. Although it does not cover every MIME type listed in the IANA registry, it is able to sniff most of them. According to the GitHub stats, the work amounts to approximately 50 commits on the master branch and 6000 changed lines.

Future Work

First, I plan to automate the generation of the byte patterns and pattern masks used to detect specific MIME types. They are currently added manually in the file _patterns.py, which leaves room for unintentional human errors; generating the patterns automatically would eliminate this source of errors.
Second, the current implementation of xtractmime is missing Section 8 of the standard, i.e. Context-Specific Sniffing, which we can cover later.
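
To illustrate what a byte pattern and a pattern mask are, here is a minimal sketch of the matching step described in the MIME Sniffing Standard, together with a small consistency check of the kind that could catch hand-written mistakes. The function names are hypothetical and this is not the code from xtractmime or the contents of _patterns.py; the two signatures are the PDF and WebP entries as I understand them from the standard.

```python
from typing import FrozenSet


def entry_is_consistent(pattern: bytes, mask: bytes) -> bool:
    """Check a hand-written (pattern, mask) pair for obvious mistakes."""
    if len(pattern) != len(mask):
        return False
    # A pattern bit that lies outside the mask can never match masked input.
    return all((p & m) == p for p, m in zip(pattern, mask))


def pattern_matches(
    data: bytes,
    pattern: bytes,
    mask: bytes,
    ignored: FrozenSet[int] = frozenset(),
) -> bool:
    """Return True if data starts with pattern once the mask is applied."""
    # Skip leading bytes that the entry asks to ignore (e.g. whitespace).
    start = 0
    while start < len(data) and data[start] in ignored:
        start += 1
    window = data[start:start + len(pattern)]
    if len(window) < len(pattern):
        return False
    # Compare byte by byte, keeping only the bits selected by the mask.
    return all((b & m) == p for b, m, p in zip(window, mask, pattern))


# "%PDF-": every byte must match exactly.
PDF = (b"%PDF-", b"\xff" * 5)
# "RIFF" + 4 size bytes (masked out) + "WEBPVP".
WEBP = (
    b"RIFF\x00\x00\x00\x00WEBPVP",
    b"\xff\xff\xff\xff\x00\x00\x00\x00\xff\xff\xff\xff\xff\xff",
)
```

A mask byte of `0xDF` makes an ASCII letter case-insensitive, so the pattern byte must be the uppercase letter; writing the lowercase letter by mistake is exactly the kind of error `entry_is_consistent` would flag.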

Integration of xtractmime into Scrapy

The second link covers the integration part. The main issue that motivated this work is that the older MIME sniffing method used by Scrapy detects the wrong MIME type for PDF-based files. After integrating xtractmime into Scrapy, PDF files are detected correctly, and there are many more behavioral improvements, all of which are covered by unit tests in the pull request linked above. The pull request is not merged yet, as it requires xtractmime to be published to PyPI first so that all the CI checks in Scrapy pass.

Future Work

Currently, Scrapy does not consider the `X-Content-Type-Options` header when processing an HTTP response, but xtractmime has a parameter for this header. So, in the future, maybe we can rely on xtractmime to handle it.
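
As a rough illustration of how this header could be turned into a boolean flag, here is a minimal sketch. The helper name and the plain `dict` of headers are assumptions made for the example; it is not Scrapy code, and how the resulting flag is passed to xtractmime is left open.

```python
from typing import Mapping


def no_sniff_requested(headers: Mapping[str, str]) -> bool:
    """Return True if the response asks clients not to sniff its MIME type."""
    for name, value in headers.items():
        # Header names are case-insensitive.
        if name.lower() == "x-content-type-options":
            # Only the first comma-separated value is considered here.
            first = value.split(",", 1)[0].strip()
            return first.lower() == "nosniff"
    return False
```

With such a flag set, a sniffer is expected to trust the declared `Content-Type` instead of inspecting the body.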

All in all, it has been a wonderful summer full of exciting new coding things to learn, and if I get a chance, I will surely apply for GSoC next year.

Thank you for reading

GSoC Blog Post #5

Akshay_Sharma
Published: 08/11/2021

Hi all!

So, the final phase of GSoC is about to end, and I have tried to wrap up as much work as I can for the final evaluation.

What did you do this week?

Last week, I completed the documentation of the xtractmime library. The implementation needed some small changes, which I made along with adding some leftover unit tests. The integration part is almost done; it just needs some more testing and then we are good to go.

What is coming up next?

I will try my best to deliver the complete work as per my proposal for this year's GSoC program, which is mostly done.

Did you get stuck anywhere?

Some print statements in xtractmime were causing problems in the integration part, but this was resolved later.

GSoC Weekly Check-In #5

Akshay_Sharma
Published: 08/02/2021

Hello y'all!

What did you do this week?

The implementation of the MIME Sniffing Standard in the xtractmime library is complete and the changes have been merged. Also, the integration of xtractmime into the Scrapy framework is almost finished; only some changes to the unit tests are still required.

What is coming up next?

I will wrap up the integration part this week and then start on the documentation of xtractmime. I will try to make the documentation easy to understand for new users and to cover all of xtractmime's functionality through simple examples.

Did you get stuck anywhere?

Some test cases related to the integration part required a bit of discussion with the mentors, but this was resolved later. Otherwise, everything went smoothly.

GSoC Blog Post #4

Akshay_Sharma
Published: 07/29/2021

Hey All!

What did you do this week?

Last week I finalized the MIME groups functionality in the xtractmime library. I tried to cover all the MIME types mentioned in the MIME Sniffing Standard through unit tests, but many MIME types outside the standard are yet to be covered, for instance those in the lists published by Mozilla, Wikipedia, or the IANA registry. I also made some progress on the integration of xtractmime into Scrapy.

What is coming up next?

I will try to finalize the integration part in the coming weeks. Also, once we have finalized xtractmime, I will start refactoring the current method that determines response classes in Scrapy so that it uses xtractmime.

Did you get stuck anywhere?

Some test cases related to the integration part were failing. The current implementation of MIME sniffing in Scrapy considers various parameters such as the body, URL, HTTP headers, filename, etc., whereas xtractmime is reliable only when the body is non-empty or a Content-Type header is present. This requires further discussion with the mentors.
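
One possible way to bridge this gap, shown only as a sketch, is to use the body and the Content-Type header when either is available and to fall back to the URL's file extension otherwise. The function name `guess_mime_type` and the `sniff_body` callback are hypothetical stand-ins for the example; this is neither Scrapy's actual logic nor the approach that was eventually agreed on with the mentors.

```python
import mimetypes
from typing import Callable, Optional
from urllib.parse import urlparse


def guess_mime_type(
    body: bytes,
    content_type: Optional[str],
    url: str,
    sniff_body: Callable[[bytes, Optional[str]], Optional[str]],
) -> Optional[str]:
    """Prefer body/header based sniffing, else fall back to the URL path."""
    if body or content_type:
        # Enough information for a body/header based sniffer.
        return sniff_body(body, content_type)
    # Neither body nor header: guess from the file extension in the URL path.
    path = urlparse(url).path
    guessed, _encoding = mimetypes.guess_type(path)
    return guessed
```

The fallback relies on the standard library's `mimetypes` table, so it can only recognize URLs whose path ends in a known extension.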

GSoC Weekly Check-In #4

Akshay_Sharma
Published: 07/19/2021

Hello Everyone!

I am elated to share that I have successfully cleared the first phase of this year's GSoC program. I would like to thank my mentors for their consistent support and the PSF for giving me this opportunity to work as a GSoC developer. The knowledge and experience I gained in the first phase of the program are beyond what I imagined. The final phase has begun and I look forward to having a great experience in it too.

What did you do this week?

I started to integrate my MIME sniffing library, xtractmime, into the Scrapy framework. I tried to change Scrapy's MIME sniffing method so that it covers the scenarios where the earlier implementation was failing. I did not remove the older implementation, as Scrapy should remain backward compatible; I just added a new function alongside it. I also added new functionality to xtractmime to detect a MIME group based on the Content-Type passed to it.
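
To give an idea of what detecting a MIME group means, here is a minimal sketch that classifies a Content-Type value using some of the group definitions from the MIME Sniffing Standard. The function `mime_groups` and the group labels are made up for this example and are not xtractmime's API; only a subset of the standard's groups is shown.

```python
from typing import Set


def mime_groups(content_type: str) -> Set[str]:
    """Return the MIME groups that a Content-Type value belongs to."""
    # Drop parameters such as "; charset=utf-8" and normalize the case.
    essence = content_type.split(";", 1)[0].strip().lower()
    type_, _, subtype = essence.partition("/")

    groups: Set[str] = set()
    if type_ == "image":
        groups.add("image")
    if type_ in ("audio", "video") or essence == "application/ogg":
        groups.add("audio or video")
    if subtype.endswith("+xml") or essence in ("text/xml", "application/xml"):
        groups.add("xml")
    if essence == "text/html":
        groups.add("html")
    if subtype.endswith("+json") or essence in ("application/json", "text/json"):
        groups.add("json")
    if subtype.endswith("+zip") or essence == "application/zip":
        groups.add("zip-based")
    return groups
```

A type can belong to more than one group: `image/svg+xml`, for example, is both an image type and an XML type.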

What is coming up next?

I will continue integrating xtractmime into Scrapy and try to finalize the implementation so that I can then start working on the tests.

Did you get stuck anywhere?

There was a little confusion about the usage of the `X-Content-Type-Options` header, as xtractmime uses it as a parameter but Scrapy's current implementation doesn't require it. So, maybe it will be required in the future.
