Hey! here is an update on what I have achieved so far.
What did you do this week?
- Protego now passes all tests borrowed from reppy, rep-cpp and robotexclusionrulesparser.
- Made few changes to Protego to make it compatible with google's parser.
- Worked on changes suggest on the interface pull request.
- Wrote code to fetch robots.txt files from top 1000 websites, and generate statistics we need. ( link )
- Looked at the code of Google's robots.txt parser for the purpose of creating a python interface on top of it. I might need to modify its code as currently it parses the robots.txt file for answering every query. (Working on anything in C++ that uses pointers or STL heavily makes me feel uncomfortable).
What is coming up next?
- Modify protego to make it behave similar to Google's parser (will need to add few more features like record group merging), and add more tests.
- Document Protego.
- Benchmarking Protego's performance.
- I would need to read how to call C/C++ code from python, for creating an interface on top Google's parser. I am currently thinking of using Cython.
- Would work on blog posts (planning to write 3 blog posts within this week).
Did you get stuck anywhere?
No, I got to work with some data science tools like jupyter notebook & pandas.