GSoC Final Blog Post

Akshay_Sharma
Published: 08/21/2021

Hello y'all

This is my final blog post for this year's GSoC program, as this is the final week of the program. I will try to summarize my work and my experience over the last 10 weeks.

First of all, I would like to thank my mentors, Adrian Chaves and Eugenio Lacuesta, for their consistent support throughout the program. I would not have made it this far in GSoC without their helpful guidance and valuable code reviews. The experience I have gained is incomparable, and I have learned a lot of amazing things about Python, unit testing, GitHub and much more.

Thanks to Google and the Python Software Foundation for letting me be a part of this great journey.

I have created a MIME (Multipurpose Internet Mail Extensions) sniffing library entirely from scratch, based on the MIME Sniffing Standard.

Following are the links to my final work:

1. https://github.com/scrapy/xtractmime
2. https://github.com/scrapy/scrapy/pull/5204

xtractmime library

The first link points to the GitHub repo of the MIME sniffing library. The library is complete, with implementation, tests and documentation. Although it doesn't cover every MIME type listed in the IANA registry, it is able to sniff most of them. Going by the GitHub stats, the pull requests account for approximately 50 commits (master branch) and 6000 changed lines.

Future Work

Firstly, I plan to automate the generation of the byte patterns and pattern masks used to detect specific MIME types. They are currently added manually in the file _patterns.py, which leaves room for unintentional human errors. Automating the pattern generation should remove that source of errors.
Secondly, the current implementation of xtractmime is missing Section 8 of the standard, i.e. context-specific sniffing, which we can cover later.
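For readers unfamiliar with byte patterns and pattern masks, here is a minimal sketch of how the MIME Sniffing Standard matches one against the start of a resource. It is not the code from xtractmime: the function name `matches_pattern` and its parameters are made up for illustration, and the two patterns are simply the well-known PDF and HTML signatures.

```python
def matches_pattern(data: bytes, pattern: bytes, mask: bytes, ignored: bytes = b"") -> bool:
    """Return True if `data` starts with `pattern` once `mask` is applied.

    Hypothetical helper for illustration only, not the xtractmime API.
    """
    if len(pattern) != len(mask):
        raise ValueError("pattern and mask must have the same length")

    # Skip leading bytes that should be ignored (e.g. whitespace before HTML).
    start = 0
    while start < len(data) and data[start] in ignored:
        start += 1

    # Not enough bytes left to hold the whole pattern.
    if len(data) - start < len(pattern):
        return False

    # A mask byte of 0xFF demands an exact match; 0xDF clears the ASCII
    # case bit, so that letter matches in either case.
    for offset, (expected, mask_byte) in enumerate(zip(pattern, mask)):
        if data[start + offset] & mask_byte != expected:
            return False
    return True


WHITESPACE = b"\t\n\x0c\r "

# PDF signature: every byte must match exactly.
is_pdf = matches_pattern(b"%PDF-1.7 ...", b"%PDF-", b"\xff\xff\xff\xff\xff")

# HTML signature: the letters are matched case-insensitively.
is_html = matches_pattern(b"  <html>", b"<HTML", b"\xff\xdf\xdf\xdf\xdf", ignored=WHITESPACE)
```

A pattern and its mask must always have the same length and stay in sync byte for byte, which is exactly the kind of detail that is easy to get wrong when typing them by hand.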

Integration of xtractmime into Scrapy

The second link points to the integration part. This is where the main issue arose: the older MIME sniffing method used by Scrapy detects the wrong MIME type for PDF-based files. Now, after integrating xtractmime into Scrapy, PDF files are detected correctly, and there are many more behavioral improvements, all of which are covered by unit tests in the above link. The pull request is not merged yet, as it requires the xtractmime library to be published to PyPI first so that all the CI checks in Scrapy pass.

Future Work

Currently, Scrapy does not consider the `X-Content-Type-Options` header while processing the HTTP response, but xtractmime has a parameter for this header. So, in the future, maybe we can rely on xtractmime to honor it.
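To make the idea concrete, here is a rough sketch of how a caller could turn that header into a boolean before handing it to a sniffing library. The helper `is_no_sniff` and the plain `dict` of headers are assumptions for illustration; this is neither Scrapy nor xtractmime code, and it simplifies the header parsing to the first comma-separated value.

```python
def is_no_sniff(headers: dict) -> bool:
    """Return True if the response asks clients not to sniff its MIME type.

    Hypothetical helper: header names and the "nosniff" value are compared
    case-insensitively, and only the first comma-separated value counts.
    """
    for name, value in headers.items():
        if name.lower() == "x-content-type-options":
            first_value = value.split(",")[0].strip()
            return first_value.lower() == "nosniff"
    return False


response_headers = {
    "Content-Type": "text/plain",
    "X-Content-Type-Options": "nosniff",
}
no_sniff = is_no_sniff(response_headers)
```

The resulting boolean is what would be passed to the library's parameter, so that the declared `Content-Type` is trusted instead of being overridden by sniffing.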

All in all, it has been a wonderful summer in which I learned exciting new coding stuff, and if I get a chance, I will surely apply for GSoC next year.

Thank you for reading
