jaladh-singhal's Blog

Week-7: Coding the Caching

jaladh-singhal
Published: 07/15/2019

Hello folks,

This week I finished writing a module for caching the filter data that our package wsynphot needs. I wrote quite a lot of code, which obviously required tests and thereby deepened my understanding of unit testing. 🤓

What did I do this week?

  1.  I mainly worked on creating the cache_filters module that handles cached filter data. I did various tasks under this:
    • Decorated the download-data function with a progress bar using tqdm (see the sketch after this list)
    • Created functions to load filter data from cached VOTables into dataframes
    • Documented all functions in a notebook
    • Created tests for the entire module - used various new things like pytest fixtures and the pandas testing framework, and figured out how to make sure tests/data is available in the built package
    • Improved the error-reporting mechanism of the cache loader functions & created tests for failure cases
  2. I also set up RTD redirects to the docs on GitHub Pages (same as last week's work) on a repository of our sister project TARDIS. And amazingly, this time I figured out how to create exact redirects from the RTD index page to GitHub Pages - so now I know two methods of creating redirects: implicit & explicit.
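Below is a minimal sketch of how the tqdm progress bar around the download loop might look - the function name, filter IDs, and base URL are illustrative, not the actual wsynphot code:

    import requests
    from tqdm import tqdm

    def download_filter_votables(filter_ids, base_url):
        """Download one VOTable per filter ID, showing a live progress bar."""
        votables = {}
        # tqdm wraps any iterable and prints a progress bar as we loop over it
        for filter_id in tqdm(filter_ids, desc="Caching filter data"):
            response = requests.get(base_url, params={"ID": filter_id})
            response.raise_for_status()
            votables[filter_id] = response.content
        return votables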

 

What is coming up next?

Now I need to integrate this cache_filters module into our base module by dropping the functions that use HDF storage (the old way of accessing filter data). Then I'll create a mechanism for updating the filter data cache (as our data source, SVO FPS, keeps on updating).

 

Did I get stuck anywhere?

This week made me realize that creating tests for I/O functions is a really tricky task. I was confused about how to write a unit test for a download function that iteratively fetches over 4,500 files. I researched such situations extensively and found that the only solution was refactoring the code just to make the function testable - an approach on which opinions on the internet are quite divided. My mentor resolved this dilemma by telling me that I can leave lines out of the tests if they are either trivial or tested somewhere else, so I created tests only for the functions called by that massive download function.

 

What was something new I learned?

📇 Writing tests in Pytest that access data: I learned several things while creating unit tests for my module, like:

  • By using pytest fixture objects, we can easily set up & tear down data resources in our tests.
  • To make data stored in tests/data/ available to the tests, we also need to make sure that the data files are listed in the package data of the setup file - for this, the astropy setup helpers even provide a get_package_data() function to define the data we need to access from our built package.
  • For failure cases, we can create tests that check whether an expected exception is raised, using pytest.raises (see the example below).
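To give an idea, here is a minimal sketch of such tests - the loader function, its module path, and the exception type are hypothetical placeholders, not the actual cache_filters API:

    import pandas as pd
    import pandas.testing as pdt
    import pytest

    from wsynphot.io.cache_filters import load_filter_votable  # hypothetical loader


    @pytest.fixture
    def sample_filter_df():
        """Set up a small expected dataframe; code after yield would be teardown."""
        df = pd.DataFrame({"Wavelength": [5000.0, 5100.0],
                           "Transmission": [0.8, 0.9]})
        yield df


    def test_loaded_filter_matches_expected(sample_filter_df):
        loaded = load_filter_votable("Generic/Johnson.V")
        pdt.assert_frame_equal(loaded, sample_filter_df)


    def test_missing_filter_raises():
        # failure case: the loader should raise for a filter not present in the cache
        with pytest.raises(ValueError):
            load_filter_votable("No/Such.Filter")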

✏️ Logger instead of print statements: In a package we often need to show the user some message, maybe an error or some information about the action they are performing using our package. For these cases, print is the obvious choice, but we have a much better option: a Logger (an object from Python's logging module). A logger is preferred over print because logs are highly configurable - you can save them to files, locate a logging call by the line number it displays, hide/show them by level, etc.
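For instance, a minimal logging setup (the format string is just one possible choice) could look like this:

    import logging

    # configure logging once, e.g. at package import time; the format shows
    # the logger name and the line number of each logging call
    logging.basicConfig(
        format="%(levelname)s - %(name)s:%(lineno)d - %(message)s",
        level=logging.INFO,
    )
    logger = logging.getLogger(__name__)

    logger.info("Loading filter data from cache ...")
    logger.error("Filter 'No/Such.Filter' not found in the cache")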

 


Thank you for reading. Stay tuned to know about my upcoming experiences!


Week-6: Gotta Cache 'em All!

jaladh-singhal
Published: 07/08/2019

Hello folks,

So I am still implementing several methods of accessing filter data from SVO FPS - redesigning our old way of HDF storage. After implementing a module to access filter data using the API, I've entered the realm of caching for the first time! Now I'll create a module to cache the data on disk (for fast access, obviously), so let me tell you how I'm going to cache 'em all 🙃 and what I have completed so far...

 

What did I do this week?

  1. Following my previous work, I created a clean PR at Astroquery, as per their API specification, to contribute our query functions for accessing filter data from SVO FPS. They added it to a milestone, so I think they will include it in their package.
  2. There were several tasks hanging around that I finished this week:
    • As we have switched from readthedocs (RTD) to GitHub Pages, my mentor told me to create a redirect from the homepage of our RTD docs to GitHub Pages. He had already set it up at our sister project TARDIS, so I studied his brilliant idea and how readthedocs works, and successfully set up RTD redirects for our packages.
    • I removed Python 2.7 support from our packages because maintaining both versions was becoming time-consuming when we have a lot of important things to implement.
    • I also made some tweaks to the module I created last week (which fetches filter data using the SVO API in real time) to make it more package-specific, like removing the class structure (unlike what I did for Astroquery), & got it merged - my 1st entirely self-written module in our codebase, oh yeah! 😊
  3. Besides having a module to access data in real time, we decided to have another module that caches filter data (locally on disk) for fast access next time. So I opened a PR for it & created a function that can download the entire filter data and cache it on disk systematically (by storing data in a hierarchy of category-named folders) - a simplified sketch follows this list.
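Here is a simplified sketch of the idea - the URL, folder layout, and function name are illustrative, not the actual cache_filters code:

    import os
    import requests

    def cache_filter_votable(filter_id, cache_dir, base_url):
        """Download one filter's VOTable and store it under a category folder.

        A filter ID like 'Generic/Johnson.V' is split into its facility part
        ('Generic') and band part ('Johnson.V'), which define the folder hierarchy.
        """
        facility, band = filter_id.split("/", 1)
        target_dir = os.path.join(cache_dir, facility)
        os.makedirs(target_dir, exist_ok=True)

        response = requests.get(base_url, params={"ID": filter_id})
        response.raise_for_status()

        votable_path = os.path.join(target_dir, band + ".vot")
        with open(votable_path, "wb") as vot_file:
            vot_file.write(response.content)
        return votable_path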

 

What is coming up next?

Now I'll work further on this PR to finish up the cache_filters module by adding more functions to read & update the cache, plus tests & documentation, etc. I'll also try to make sure that both of our packages work robustly before we move on to the interface-development part.

 

Did I get stuck anywhere?

While creating the RTD redirects, each build was using a cached environment instead of an updated one. I didn't find anything relevant to this problem on the internet! After a lot of trials, I came up with a workaround of changing the default branch, which finally fixed it. I also got quite confused about how to make use of the functions I created in the previous module within this new cache_filters module. But after a thorough discussion, my mentor resolved my dilemma by clarifying what this module needs to do.

 

What was something new I did and what I learned?

📃 Documentation setup on readthedocs.io: Though I worked on redirecting our docs from RTD to GitHub Pages, this task required me to build the docs successfully on RTD using a dedicated branch for the redirect (hollowed-out docs). Hence I learned the entire process of setting up docs on RTD, from changing the default branch in the Admin settings (of the RTD project account) to creating a readthedocs.yml file for configuring the build.

🚧 The right way to open a work-in-progress PR: Earlier I used to believe that unless there is something presentable in your code, you shouldn't create a PR. But my mentor told me that I can create a PR even with a rough prototype, by using checklists to indicate work progress in the PR. It's a really great feature: each time you finish a task, check it off the list and, as a bonus, get that feeling of accomplishment! 😌

 


Thank you for reading. Stay tuned to know about my upcoming experiences!


Week-5: Evaluation Period-1

jaladh-singhal
Published: 06/30/2019

Hello folks,

So the evaluation week has passed, marking the completion of one month of GSoC! 😮 It has been an incredible journey so far, with lots of learning not only about coding but also about life. Thanks to the evaluation period, we get wonderful feedback from our awesome mentors! 😊

 

What did I do this week?

  1. After creating functions for fetching filter data using the API, I created a PR and worked on it, making the changes my mentor left in review. I did a lot of new things in this PR to make my code better, like storing downloaded data temporarily (as a buffer), adding docstrings for functions, using Astropy units, etc. I further improved my code by factoring redundant code out into another function. That was all about the code; my mentor pointed out two other things which should also be present in a good PR besides code:
    • Documentation: I created a notebook illustrating the use of all functions.
    • Tests: I used pytest to create tests for the functions. I also learned about the Astropy testing suite, which our package uses to run pytest. And I also came to know about the remote-data plugin used for running tests that access data from the internet (a tiny example follows this list).
  2. My mentor suggested it would be great if we could contribute this (SVO FPS) module to Astroquery, a popular Open Source Python package for accessing data from astronomical data resources. So I created an issue at astroquery to share our work from wsynphot; they welcomed it and told me to adapt my code to their standard API. So I created a class for all those functions, and I still need to do some work before I create a PR in their repository.
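For example, a remote-data test can be marked like this (the query function here is a hypothetical name, not the actual wsynphot one); such tests only run when pytest is invoked with the --remote-data option:

    import pytest

    from wsynphot.io import get_filter_data  # hypothetical query function


    @pytest.mark.remote_data  # skipped unless pytest is run with --remote-data
    def test_fetch_filter_data_from_svo():
        data = get_filter_data("Generic/Johnson.V")
        assert len(data) > 0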

 

What is coming up next?

Now I will create a PR in astroquery and see how things turn out. We will also finish up these filter data access methods for our package wsynphot. Possibly we will create some offline support (caching the fetched data on disk) - since the data is huge, fetching it takes time depending on your internet connection speed (a limitation of the real-time fetching approach).

 

Did I get stuck anywhere?

While wrapping my functions in a class, it took me time to figure out how to import it by creating a public object. I also got stuck running the tests, 1st because they used the Astropy test suite and then because the tests access data over the internet, so they failed to run. But by taking a break and trying again with a calm head, I figured out that the remote-data option needs to be specified for running such tests.

 

What about Evaluations?
As this was Evaluation Week, both students and mentors evaluate each other by filling out an evaluation form provided by GSoC. So yesterday, I received a mail from GSoC saying that I passed the evaluation and can continue in the program. My mentor's feedback (evaluation) was also included, and it was so wonderful that I read it 4 times in a row! 😍
He mentioned lots & lots of appreciation for what I do well, along with suggestions for where I could improve. I am really grateful that he let me know about my capabilities as well as how I could improve myself. This week really made me feel the importance of this program and of a mentor - that there is someone to guide you and help you become better not only as a developer but also as a person.

 

What was something new and exciting I learned?

📥 Saving files temporarily: While handling files, there are cases when we need to pass a file (or path) as an argument to a library function. For parsing the downloaded filter data, I needed to store it in a file before parsing, but I wanted it to be deleted after use. That's how I came to know about Python's tempfile module, which lets you store your data in a temporary file that gets deleted when you close it. My mentor suggested a better approach: use file-like objects instead. So I used a BytesIO object to store the data temporarily as an in-memory buffer.
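As a rough sketch of this idea (function and variable names are illustrative), the downloaded bytes can be wrapped in a BytesIO buffer and handed straight to the VOTable parser:

    from io import BytesIO

    from astropy.io.votable import parse_single_table

    def votable_bytes_to_dataframe(raw_bytes):
        """Parse downloaded VOTable content (e.g. response.content from requests)
        without ever writing a temporary file to disk."""
        buffer = BytesIO(raw_bytes)          # file-like, in-memory buffer
        table = parse_single_table(buffer).to_table()
        return table.to_pandas()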

📂 Handling Python packages & modules: My mentor told me to put my Python file (containing the functions) in an io subpackage within our repo. To be able to import it, I learned to create an __init__.py file for it. Also, when I put my functions in a class, I learned how to expose it by instantiating an object and then importing it publicly in the __init__.py file. And when I wrote tests for my module, I felt kind of satisfied that now I almost know how to create a package on my own!
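As an illustration of that pattern (module and class names here are made up, not the actual wsynphot layout), the subpackage's __init__.py can re-export a ready-made instance of the class:

    # wsynphot/io/__init__.py  (illustrative)
    #
    # The mere presence of this file makes `io` an importable subpackage.
    # Re-exporting an instance lets users write `from wsynphot.io import svo_fps`
    # without having to instantiate the class themselves.
    from wsynphot.io.svo_fps import SvoFps  # hypothetical module & class

    svo_fps = SvoFps()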

⛱️ Recognize when you need a break: We know that a break is necessary if we're stuck while solving a problem. But we overlook this fact and keep pushing ourselves to solve the problem, which takes us closer not to the solution but to frustration. My mentor told me the same in his feedback: he likes my enthusiasm, but to keep it alive I need to take breaks when my mind/body needs them. And it really works, like it did when I was stuck figuring out the problem with running pytest. All you need is to step back from the problem & relax; the solution will come with time (as a new possibility) because our subconscious mind keeps working on it, even when our conscious mind is relaxing!

 


Thank you for reading. Stay tuned to learn the cool stuff I am learning each week!


Week-4: Turning point in the Pipeline - jumped onto API

jaladh-singhal
Published: 06/25/2019

Hello folks,

So the 4 weeks flew by pretty fast, and the 1st phase of the GSoC coding period has come to an end! This week brought a lot of twists & turns in our plan, but that made me learn a lot of things. So this blog post will be quite a bit longer than usual, but more interesting!

 

What did I do this week?

  1. After setting up the auto-ingest pipeline last week, my mentor told me to try using Azure Artifacts for deploying our files (the executed ingestion notebook & the generated filter dataset). I read about it and successfully enabled the pipeline to deploy files as versioned universal packages (artifacts). But we realized that they can't be made publicly available and can only be downloaded using the Azure CLI. So we decided to drop it & continue with our previous idea of deploying the files to a web server.
    • I created a PR for this pipeline so that my mentor could review my last week's work. We also had some discussions & decided to use DokuWiki to create a simple web interface for displaying the deployed files in a presentable way on the web server.
    • I worked further on this PR to evolve the auto-ingest pipeline from a prototype to a finished state. I created a couple of scripts: (i) to execute the notebook until the 1st error & return an error status (a sketch of this script follows the list); (ii) to deploy the dataset conditionally (only when the notebook executed successfully and the dataset has changed from the previously deployed version).
  2. Since I needed to compare datasets (HDF files), which is not possible without an HDF comparison utility (an external deb package), I discussed with my mentor whether I could write a script to check if there were any changes at SVO (by scraping their changelog). My mentor asked why not use a programmatic interface to access data from SVO (he had heard about one). And it turned out to be a turning point for our ongoing plan. It would mean we won't need any pipeline (which I had been working on for a week), as we can fetch data directly from SVO. It was quite surprising for me to grasp, but since it is a better way, I understood & accepted this change.
    • I searched & found that besides a web interface, the SVO Filter Profile Service (FPS) also provides a VO interface (a sort of API) so that applications can send HTTP queries to the service and retrieve data as VOTables. My mentor told me to use the VOTable parser provided by Astropy to obtain the tables as dataframes and see if the data makes sense.
    • I tried it, and fortunately SVO VOTables get parsed accurately! Hence I created functions for fetching filter data using requests & the VOTable parser.
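As a sketch of the first of those scripts, here is one way it could be written with nbconvert's ExecutePreprocessor - the notebook name and paths are illustrative, and the real pipeline script may differ:

    import sys

    import nbformat
    from nbconvert.preprocessors import CellExecutionError, ExecutePreprocessor

    notebook_path = "ingest_filters.ipynb"   # illustrative name
    output_path = "executed_" + notebook_path

    nb = nbformat.read(notebook_path, as_version=4)
    executor = ExecutePreprocessor(timeout=600)

    try:
        # execution stops at the first cell that raises an error
        executor.preprocess(nb, {"metadata": {"path": "."}})
    except CellExecutionError:
        nbformat.write(nb, output_path)      # keep the partially executed notebook
        sys.exit(1)                          # non-zero status tells the pipeline it failed

    nbformat.write(nb, output_path)
    sys.exit(0)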

 

What is coming up next?

As my mentor told me during our discussions, we will possibly contribute these functions to astroquery (an Open Source Python package for accessing data from astronomical data resources). We will also decide on & work on redesigning the way our package, wsynphot, fetches filter data from SVO.

 

Did I get stuck anywhere?

Yes - while reading about the programmatic interface of SVO FPS, the HTTP queries I found in their documentation were not working. But my mentor suggested I take some time off, and it proved fruitful. The next day, when I read the documentation with a clear mind, I realized my mistake (I had been referring to the wrong section) and found the right queries.

 

What was something new I did and what I learned?

Code Review: I reviewed the PR of another GSoC student in our sister org, TARDIS, on a docs deployment pipeline. Since I had already set that up successfully in our org, my mentor told me to review his code. I suggested several changes in the review and explained a mistake he was making in deploying the docs to gh-pages - this helped him understand the concept of docs deployment better. I realized that code review is in turn helpful for the reviewer too, as it allowed me to assess how much I already know.

💡 The work I did in this week made me learn about a variety of tools & techniques:

  • Azure Artifacts allows us to create a package feed, where all versions of our deployed packages (collections of files) are listed. We can even connect the feed to package hosts like PyPI so that users can download from it.
  • DokuWiki is Open Source wiki software that allows us to create versatile wikis that don't need a database. Using it, we can even create a simple, presentable web interface (like the one we wanted for displaying the deployed files). There's also a Python module for DokuWiki to create & manage wikis, so within the pipeline we can run such a Python script to update our wiki with the deployed files.
  • Passing variables between shell & scripts - To access variables of the calling shell in a Python script, we can use the os.environ dictionary. To use variables defined in the Azure VM environment in a bash script, we need to make some changes to the variable naming style. Also, in the pipeline YAML we can declare variables within the tasks instead of globally. To return a variable from a script, we can simply print it and access it from the shell by capturing the output of the command that runs the script (see the sketch after this list).
  • VOTable (Virtual Observatory Table) is a way to represent & store data in XML syntax. A VOTable consists of Resource elements which contain Table elements. They can be parsed into Astropy tables simply by using astropy.io.votable.parse, and these can be further converted to Pandas dataframes.
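A tiny sketch of the shell/Python variable passing mentioned above (the variable and script names are illustrative):

    import os

    # read a variable exported by the calling shell (or set on the Azure agent);
    # pipeline variables typically arrive with upper-cased names
    data_dir = os.environ.get("FILTER_DATA_DIR", "/tmp/filter_data")

    # "return" a value to the shell simply by printing it; the calling script
    # can capture it with e.g.:  RESULT=$(python prepare_deploy.py)
    print(data_dir)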

🗃️ The simplicity an API brings: Earlier we were using an ingestion notebook that scrapes the data from the SVO web interface, downloads the data files, and ingests them into an HDF file. As the data at SVO updates with time, we began to create a pipeline that could execute this notebook, generate the HDF file, and then deploy it to a server for access. But by using the VO interface of SVO (the API), we can fetch data directly from SVO instead of from the HDF file. So there's no need to update the HDF file and hence no need for a pipeline. All we need is to call a function that fetches a VOTable from SVO using HTTP queries and returns it parsed as a dataframe - simple!
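A minimal sketch of such a function, assuming the SVO FPS query endpoint and the 'ID' parameter as described in their documentation (details may differ from the actual wsynphot code):

    from io import BytesIO

    import requests
    from astropy.io.votable import parse_single_table

    SVO_FPS_URL = "http://svo2.cab.inta-csic.es/theory/fps/fps.php"  # assumed endpoint

    def fetch_filter_dataframe(filter_id):
        """Query SVO FPS over HTTP and return the filter data as a dataframe -
        no local HDF file and no pipeline involved."""
        response = requests.get(SVO_FPS_URL, params={"ID": filter_id})
        response.raise_for_status()
        votable = parse_single_table(BytesIO(response.content))
        return votable.to_table().to_pandas()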

 


Thank you for reading. Stay tuned to know more exciting things with me!


Week-3: Travelling through Pipelines

jaladh-singhal
Published: 06/23/2019

Hello folks,

So pretty much happened this week - I got some stability in my schedule and succeeded in setting up 3 different kinds of pipelines for our packages. 😊 The last 2 weeks were so hectic for me that I am writing this blog post (for Week-3) much later. Let me share why...

 

What did I do this week?

  1. After setting up CI & docs CD Azure pipelines last week, I began setting them up for both of our main repos (starkit & wsynphot).
    • Since in this case the Azure project & repo belonged to my mentor, I didn't have many of the access rights required in the setup process. We discussed the process & the access rights I needed, and my mentors gave me the permissions.
    • Still, there were a couple more authorization problems with the Azure project, and it took time to figure them out. Finally, I successfully set up both pipelines (CI and docs CD) for both of our repos. Now we have officially switched from Travis & doctr to Azure completely.
  2. The next task was to create another, schedule-triggered pipeline for executing the notebook (which ingests the filter data fetched from SVO into an HDF file) and deploying it on a server, from where we can see whether any error happened while generating the HDF filter data. This way we want to automate the periodic generation of filter data using a pipeline, since the filter data fetched from SVO keeps on changing!
    • I figured out the basic workflow of how to implement such a pipeline. I also set up the SSH connection to the server my mentor provided me for deploying the files.
    • Since there was a lot to do here, my mentor suggested I first create a very simple pipeline for a test notebook (instead of the actual one) that is capable of deploying stuff to the server.
  3. There were several other tasks I did this week:
    • I helped the guy (Youssef) who had helped me set up the CD pipeline (last week) with documenting the process of setting up an Azure pipeline.
    • I improved my PR for integrating the filter list into the docs, as per the review given by my mentor. I also discussed it extensively with him to understand the problem with my approach of generating RST files (since he suggested finding a more elegant way to do it).
    • I also reported the errors in the SVO FPS web interface by mailing the SVO team; they were very quick to fix them.

 

What is coming up next?

Now that I've set up this auto-ingestion pipeline for the test notebook, the next plan is to evolve it from a prototype into everything we want it to do (like stopping at the 1st error in the notebook, conditionally deploying the HDF file, etc.), ultimately making it work for our actual ingestion notebook.

 

Did I get stuck anywhere?

As I shared above, there were various authorization problems while setting up a pipeline on a repo in which I don't have admin rights - so I did feel terribly stuck, as I could not proceed without my mentors' aid. But with some patience & discussions (when they got time), we fixed it.

 

What was something new and exciting I learned?

🤖 Automation using pipelines: Earlier I was doing just CI & docs CD using Azure Pipelines, but when I started working on the auto-ingest pipeline, I found out that we can automate any process using a pipeline - their real power! Whatever we do locally, we tell the Virtual Machine (VM) to do the same by means of scripts, and set the desired triggers for initiating it. This saves not only our time but also our resources, as the entire work is carried out on a VM without any effort on our part.

🖥️ The server my mentor gave me access to is actually a Linux machine that can serve as a web host using Apache. I learned how to communicate with a remote machine over SSH by setting up an SSH key pair. I also got to know the basics of Apache web hosting - we just put files in a public_html folder on the machine and they automatically become available on the web under the root URL.

🗒️ Importance of creating a prototype 1st: After planning, when we start to develop a complex project, we should not forget to break it down & identify the core functionality in it - because the other tasks won't yield any good result if that main task doesn't work. My mentors told me the same thing to clear up the confusion I was having over the overwhelming number of choices of what to develop. A prototype also means that we use a smaller test input instead of the big actual one, so that the output obtained is smaller and debugging problems becomes a lot easier!

 


Thank you for reading. Stay tuned to know about my upcoming experiences!
