Articles on Akshay_Sharma's Blog

GSoC Final Blog Post

akshaysharmajs@gmail.com (Akshay_Sharma) — Sat, 21 Aug 2021 00:50:16 +0000

Hello y'all

This is my final blog post for this year's GSoC program, as this is its final week. I will try to summarize my work and my experience over the last 10 weeks.

First of all, I would like to thank my mentors Adrian Chaves and Eugenio Lacuesta for their consistent support throughout the program. I would not have made it this far in GSoC without their helpful guidelines and valuable code reviews. The experience I have gained is incomparable, and I have learned a lot of amazing things related to Python, unit testing, GitHub and much more.

Thanks to Google and the Python Software Foundation for letting me be a part of this great journey.

I have created a MIME (Multipurpose Internet Mail Extensions) sniffing library entirely from scratch, based on the MIME Sniffing Standard.

The following are the links to my final work:

1. https://github.com/scrapy/xtractmime
2. https://github.com/scrapy/scrapy/pull/5204

xtractmime library

The first link refers to the GitHub repo for the implementation of the MIME sniffing library. The library is complete, with proper development, testing and documentation. Although it doesn't cover every MIME type listed in the IANA registry, it is able to sniff most MIME types. Going over the GitHub stats, the pull requests amount to approximately 50 commits (master branch) and 6000 changed lines.

Future Work

Firstly, I have decided to automate the generation of the byte patterns and pattern masks used to detect specific MIME types, which are currently added manually in the file _patterns.py. Manual entry can introduce unintentional human errors in the patterns, so automating the pattern generation will reduce the scope for these errors.
Secondly, the current implementation of xtractmime is missing section 8 of the standard, i.e. "Context-specific sniffing", which we can cover later.
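
To illustrate the kind of mistake that hand-written tables invite, here is a minimal sketch of a consistency check for pattern/mask pairs. It is not the planned generator and does not reflect the actual layout of _patterns.py: `PatternEntry` and `find_table_errors` are hypothetical names. The check relies only on the matching rule that a data byte is ANDed with the mask byte before being compared with the pattern byte, so a pattern byte with bits outside its mask can never match.

```python
from typing import List, NamedTuple


class PatternEntry(NamedTuple):
    # Hypothetical record type; the real _patterns.py may be organised differently.
    pattern: bytes
    mask: bytes
    mime_type: str


def find_table_errors(table: List[PatternEntry]) -> List[str]:
    """Return a human-readable message for every inconsistent entry."""
    errors = []
    for entry in table:
        # A pattern and its mask must describe the same number of bytes.
        if len(entry.pattern) != len(entry.mask):
            errors.append(f"{entry.mime_type}: pattern and mask lengths differ")
            continue
        for index, (p, m) in enumerate(zip(entry.pattern, entry.mask)):
            # Data is compared as (data & mask) == pattern, so any pattern bit
            # that the mask clears makes the entry impossible to match.
            if (p & m) != p:
                errors.append(
                    f"{entry.mime_type}: pattern byte {index} has bits outside the mask"
                )
    return errors


if __name__ == "__main__":
    table = [
        PatternEntry(b"GIF87a", b"\xff" * 6, "image/gif"),
        PatternEntry(b"%PDF-", b"\xff" * 4, "application/pdf"),  # mask too short
    ]
    for message in find_table_errors(table):
        print(message)
```

A check like this could run as a unit test over the whole table, whether the entries are typed by hand or generated.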

Integration of xtractmime into Scrapy

The second link refers to the integration part. The main issue originated here: the older MIME sniffing method used by Scrapy detects the wrong MIME type for PDF-based files. Now, after integrating xtractmime into Scrapy, PDF files are detected correctly, and there are many more behavioral improvements, all of which are covered by unit tests in the pull request linked above. The pull request is not merged yet, as it requires xtractmime to be published to PyPI first so that all the CI checks pass in Scrapy.

Future Work

Currently, Scrapy does not consider the `X-Content-Type-Options` header when processing an HTTP response. However, xtractmime has a parameter for this header, so in the future we may be able to rely on xtractmime for it.
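
A minimal sketch of how such a header could be read from a response, assuming the headers are available as a plain mapping of strings. `is_nosniff` is a hypothetical helper, not part of Scrapy or xtractmime, and how the result would be handed to xtractmime is deliberately left out.

```python
from typing import Mapping


def is_nosniff(headers: Mapping[str, str]) -> bool:
    """Return True if the response asks clients not to sniff its MIME type."""
    for name, value in headers.items():
        # Header names are case-insensitive.
        if name.lower() == "x-content-type-options":
            # Only the first comma-separated value is considered, and it is
            # compared case-insensitively.
            first = value.split(",")[0].strip().lower()
            return first == "nosniff"
    return False


if __name__ == "__main__":
    print(is_nosniff({"X-Content-Type-Options": "nosniff"}))
    print(is_nosniff({"Content-Type": "text/html"}))
```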

All in all, it has been a wonderful summer with exciting new coding things to learn, and if I get a chance, I will surely apply for GSoC next year.

Thank you for reading

GSoC Blog Post #5

akshaysharmajs@gmail.com (Akshay_Sharma) — Wed, 11 Aug 2021 23:31:43 +0000

Hi all !

The final phase of GSoC is about to end, and I have tried to wrap up as much work as I could for the final evaluation.

What did you do this week?

Last week, I completed the documentation of the xtractmime library. Some small changes were required in the implementation of xtractmime, which I took care of along with adding some leftover unit tests. The integration part is almost done; just some more testing is required and we are good to go.

What is coming up next?

I will try my best to deliver the complete work as per my proposal for this year's GSoC program, which is mostly done.

Did you get stuck anywhere?

Some print statements in xtractmime were causing problems in the integration part, but this was resolved later.

GSoC Weekly Check-In #5

akshaysharmajs@gmail.com (Akshay_Sharma) — Mon, 02 Aug 2021 19:18:11 +0000

Hello y'all !

What did you do this week?

The implementation of the MIME Sniffing Standard in the xtractmime library is complete and the changes have been merged. Also, the integration of xtractmime into the Scrapy framework is almost finished; only some changes to the unit tests are still required.

What is coming up next?

I will wrap up the integration part this week and start on the documentation of xtractmime. I will try to make the documentation easy to understand for new users and to cover all of xtractmime's functionality through simple examples.

Did you get stuck anywhere?

Some test cases related to the integration part required a little discussion with the mentors, but this was resolved later. Otherwise, everything went smoothly.

GSoC Blog Post #4

akshaysharmajs@gmail.com (Akshay_Sharma) — Thu, 29 Jul 2021 08:08:34 +0000

Hey All!

What did you do this week?

Last week I finalized the functionality for MIME groups in the xtractmime library. I tried to cover all the MIME types mentioned in the MIME Sniffing Standard through unit testing, but a lot of MIME types that are not in the standard are yet to be covered, for instance those in the lists published by Mozilla, Wikipedia, or the IANA registry. I also made some progress on the integration of xtractmime into Scrapy.

What is coming up next?

I will try to finalize the integration part in the coming weeks. Also, once we finalize xtractmime, I will start refactoring the current method that determines response classes in Scrapy so that it uses xtractmime.

Did you get stuck anywhere?

Some test cases related to the integration part were failing. The current implementation of MIME sniffing in Scrapy considers various parameters such as the body, URL, HTTP headers and filename, whereas xtractmime is reliable only when a body is available or we have a Content-Type header. This requires further discussion with the mentors.

GSoC Weekly Check-In #4

akshaysharmajs@gmail.com (Akshay_Sharma) — Mon, 19 Jul 2021 17:44:10 +0000

Hello Everyone!

I am elated to share that I have successfully cleared the first phase of this year's GSoC program. I would like to thank my mentors for their consistent support and the PSF for giving me this opportunity to work as a GSoC developer. The knowledge and experience I gained in the first phase of the program are beyond what I had imagined. The final phase has begun and I look forward to having a great experience in that too.

What did you do this week?

I started to integrate my MIME sniffing library, xtractmime, into the Scrapy framework. I tried to change Scrapy's MIME sniffing method so that it covers the scenarios where the earlier implementation was failing. I didn't remove the older implementation, as Scrapy should stay backward compatible; I just added a new function alongside it. I also added new functionality to xtractmime to detect a MIME group based on the Content-Type passed to it.

What is coming up next?

I will continue integrating xtractmime into Scrapy and try to finalize the implementation so that I can start working on the testing part afterwards.

Did you get stuck anywhere?

There was a little confusion about the usage of the `X-Content-Type-Options` header, as xtractmime uses it as a parameter but Scrapy's current implementation doesn't require it. It may be required in the future.

GSoC Blog Post #3

akshaysharmajs@gmail.com (Akshay_Sharma) — Tue, 13 Jul 2021 23:13:36 +0000

Hello All!

The first evaluation week is here, and so far the GSoC program has been challenging as well as exciting for me. With the persistent support of my mentors, I have been able to complete the implementation of the MIME sniffing library, and I hope that I will pass the first evaluation towards the end of this week :)

What did you do this week?

Last week I finalized the implementation of the MIME Sniffing Standard up to section 7, with proper development as well as testing. I have been able to achieve 100% code coverage so far using "pytest" and "pytest-cov", and the code has been merged into the main branch of the repo on GitHub. Thanks to my mentors!

What is coming up next?

This week or in the coming weeks, I will try to integrate my library into the Scrapy framework so that it resolves the issue "Wrong type(response) for binary responses #4240", from which the main problem originated.

Did you get stuck anywhere?

Deciding on an input type for the Content-Type parameter in the main function of the library was a little confusing. There were two options: first, allow users to pass either a bytes string or a regular string; second, restrict users to passing only a bytes string. After discussion with the mentors, we chose the second option, which was much easier to implement as well as easier for users to understand. Otherwise, this week went quite smoothly except for some minor problems with the testing.
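
As a rough illustration of the bytes-only choice, here is a hypothetical helper (not xtractmime's actual code) that rejects anything other than bytes and reduces a Content-Type value to its lowercase type/subtype part. The name `content_type_essence` and the exact normalisation are assumptions made for this sketch.

```python
def content_type_essence(content_type: bytes) -> bytes:
    """Return the lowercase type/subtype part of a Content-Type value."""
    # Accepting a single input type keeps the function simple: callers decide
    # how to encode their header value, and no guessing happens here.
    if not isinstance(content_type, bytes):
        raise TypeError(
            f"content_type must be bytes, got {type(content_type).__name__}"
        )
    # Drop parameters such as "; charset=utf-8" and normalise case.
    return content_type.split(b";", 1)[0].strip().lower()


if __name__ == "__main__":
    print(content_type_essence(b"Text/HTML; charset=utf-8"))
```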

GSoC Weekly Check-In #3

akshaysharmajs@gmail.com (Akshay_Sharma) — Wed, 07 Jul 2021 08:44:25 +0000

Hello Everyone!

The first phase of this year's GSoC program is approaching its end with the first evaluation next week and I am trying my best to finalize the implementation of the MIME sniffing library with proper development as well as testing.

What did you do this week?

I have implemented section 7, "Determining the computed MIME type of a resource". This section covers the main sniffing functions of the library, including different sniffing rules such as "Identifying a resource with an unknown MIME type", "Sniffing a mislabeled binary resource" and "Sniffing a mislabeled RSS XML feed".
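
To give a feel for these rules, here is a minimal sketch of "Sniffing a mislabeled binary resource", based on a reading of the standard rather than on xtractmime's code. The function name is hypothetical, and the caller is assumed to pass only the leading bytes of the resource (the resource header).

```python
# Bytes the standard treats as "binary data bytes": control characters that
# are not expected in plain text (tab, LF, FF, CR and ESC are excluded).
BINARY_DATA_BYTES = frozenset(
    {*range(0x00, 0x09), 0x0B, *range(0x0E, 0x1B), *range(0x1C, 0x20)}
)


def sniff_mislabeled_binary(header: bytes) -> str:
    """Decide between text/plain and application/octet-stream."""
    # A UTF-16 (big or little endian) or UTF-8 byte order mark signals text.
    if header.startswith((b"\xfe\xff", b"\xff\xfe", b"\xef\xbb\xbf")):
        return "text/plain"
    # Without any binary data bytes, the content is treated as text.
    if not any(byte in BINARY_DATA_BYTES for byte in header):
        return "text/plain"
    return "application/octet-stream"


if __name__ == "__main__":
    print(sniff_mislabeled_binary(b"plain text\n"))
    print(sniff_mislabeled_binary(b"%PDF-1.4\n\x00\x01"))
```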

What is coming up next?

This week I will add tests for section 7, covering all possible test cases to get 100% coverage. Also, I will start to integrate my library as soon as it is finalized.

Did you get stuck anywhere?

Section 7.3, i.e. "Sniffing a mislabeled RSS XML feed", was a bit confusing and complicated because of the way the standard presents its pseudocode, but my mentors were there to help me with that. Other than this, there were no major problems last week.

GSoC Blog Post #2

akshaysharmajs@gmail.com (Akshay_Sharma) — Tue, 29 Jun 2021 22:57:43 +0000

Hi All,

The fourth week of this year's GSoC program has been completed, and I have implemented most major parts of the MIME Sniffing Standard in my Python library, including sections 4, 5 and 6 and some parts of section 7.

What did you do this week?

I spent the last week fixing some major issues in the library. The issue that took most of my time was the implementation of the algorithm for matching the MIME type pattern of an MP3 file without ID3 tags. ID3 tags carry metadata such as the artist name, album name, genre, and more. The algorithm given in the standard has various problems, which are documented in an issue. I was finally able to fix them by taking Mozilla's implementation for MP3 files as a reference. I also worked on my coding style, thanks to my mentor Adrian Chaves and his extremely helpful reviews and suggestions, and I learned a lot of interesting things too.

What is coming up next?

I already started on section 7 last week, but there is a lot left to cover in it, including the tests, which I will try to cover this week.

Did you get stuck anywhere?

Apart from fixing the implementation of the algorithm for matching the MIME type pattern of an MP3 file without ID3 tags, last week was interesting and went smoothly.

GSoC Weekly Check-In #2

akshaysharmajs@gmail.com (Akshay_Sharma) — Tue, 22 Jun 2021 18:44:38 +0000

Hey Everyone!

Last week was a bit tiring as well as exciting. I made quite some progress in creating my Python library for MIME sniffing and learned a lot of new things about clean-coding conventions.

What did you do this week?

I mainly focused on implementing section 6 of the MIME Sniffing Standard in the library. This section covers the MIME type pattern matching algorithm, which determines the type of a file by matching its initial bytes against predefined patterns. The standard lists numerous predefined patterns for image, audio or video, text, and archive files. Some audio and video formats require different rules for matching the patterns, for example matching the signature of MP4, WebM, and MP3 files. I also worked on adding unit tests for the above algorithms that cover every possible test case. One of my mentors also added support for continuous integration to the GitHub repository, which will help to keep an eye on the health of the library and make it much easier to debug any issues.
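
The matching idea described above can be sketched as follows. This is a simplified illustration, not xtractmime's implementation: `pattern_matches` is a hypothetical name, the PNG signature is the standard 8-byte one, and the `<HTML` pattern in the usage note is a shortened example rather than an entry copied from the standard.

```python
from typing import Iterable


def pattern_matches(
    data: bytes,
    pattern: bytes,
    mask: bytes,
    ignored: Iterable[int] = (),
) -> bool:
    """Return True if the start of `data` matches `pattern` under `mask`."""
    if len(pattern) != len(mask):
        raise ValueError("pattern and mask must have the same length")
    ignored_set = frozenset(ignored)
    # Skip leading bytes that this pattern allows to be ignored (e.g. whitespace).
    start = 0
    while start < len(data) and data[start] in ignored_set:
        start += 1
    # Not enough bytes left to hold the whole pattern.
    if len(data) - start < len(pattern):
        return False
    # The mask selects which bits of each data byte are significant.
    return all(
        (data[start + i] & mask[i]) == pattern[i] for i in range(len(pattern))
    )


PNG_PATTERN = b"\x89PNG\r\n\x1a\n"
PNG_MASK = b"\xff" * len(PNG_PATTERN)

if __name__ == "__main__":
    print(pattern_matches(b"\x89PNG\r\n\x1a\n\x00\x00", PNG_PATTERN, PNG_MASK))
```

A mask byte of 0xDF clears the bit that distinguishes ASCII upper and lower case, so `pattern_matches(b" \n<html>", b"<HTML", b"\xff\xdf\xdf\xdf\xdf", ignored=b"\t\n\x0c\r ")` matches regardless of letter case and leading whitespace.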

What is coming up next?

Coming up next is the main algorithm of the library, that is, "computing the final MIME type". The rules for this algorithm are given in section 7 of the standard, and I will try to fully implement it, including all the possible tests.

Did you get stuck anywhere?

Yes, the algorithm for matching the signature of WebM files described in the standard was a bit ambiguous. I tried many possible changes to the algorithm, some of them suggested by my mentors, and finally it worked, but I am not 100% sure whether the change I made was correct. I have left it for now as it is working perfectly fine, but if something goes wrong in the future I will try to fix it.

GSoC Blog Post #1

akshaysharmajs@gmail.com (Akshay_Sharma) — Tue, 15 Jun 2021 09:25:45 +0000

Hey All,

It has already been a week since the GSoC coding period began, and I have started working on my project.

What did you do this week?

As I mentioned in an earlier post, I designed a high-level API for the Python library this week and started to implement the rules described in the MIME Sniffing Standard. I worked on section 5 of the standard, i.e. "Handling the resource metadata and headers". One of my mentors suggested creating a template for the project before moving on to further coding. Therefore, I set up a template for the library with a setup.py file, added a BSD license file and configured tox environments for various checks such as flake8, typing, py, and black.

What is coming up next?

I will start with the implementation of section 6, i.e. "Matching a mime-type pattern", and will try to add some tests. Currently, I am using a simple hard-coded test for the library, but this week I will try to automate the tests using Python unit tests and add more tests as I build the library.

Did you get stuck anywhere?

No, last week went quite seamlessly, as I have done similar work before and the mentors were always there to suggest the best approach.

Weekly Check-In #1

akshaysharmajs@gmail.com (Akshay_Sharma) — Mon, 07 Jun 2021 21:50:55 +0000

Hello everyone!!

I am Akshay Sharma, a final-year undergrad at Jaypee Institute Of Information Technology, India and a senior certificate student at University Of Florida, USA, majoring in Computer Science. With immense pleasure, I would like to mention that this summer I will be contributing to the "Scrapy Community" under the Python Software Foundation as a GSoC student. I will be designing a Python library for MIME (Multipurpose Internet Mail Extensions) sniffing.

What did you do this week (community bonding period)?

The bonding period went well. I got to interact with my highly experienced mentors (Eugenio Lacuesta, Adrian Chaves) through video conferencing. I have known them for a year through GitHub, but meeting them face-to-face for the first time was great, and I found them friendly and helpful. We discussed the overall timeline, the implementation details and the other requirements of my project. Besides that, I spent most of the time reading documentation and understanding the codebases of other MIME sniffing libraries such as mimetypes and python-magic. This helped me get the gist of what I will be working on throughout the GSoC program. I also got a name for my library, which is xtractmime.

What is coming up next?

The coding period has started, and I will be designing the high-level API for my MIME sniffing library this week. The API will take an HTTP response as input and return a MIME type as output.

Did you get stuck anywhere?

I got a little stuck while deciding on a starting point for the library since, unlike in other projects, I will be creating a library from scratch. After a discussion with the mentors, we finalised the necessary inputs for the API so that there will be less trouble making changes later, which also helped me get started with the coding.