Week 4: Turning Point in the Pipeline - Jumping onto an API

jaladh-singhal
Published: 06/25/2019

Hello folks,

So the first 4 weeks flew by pretty fast, and the 1st phase of the GSoC coding period has come to an end! This week brought a lot of twists & turns in our plan, but that made me learn a lot of things. So this blog post will be quite a bit longer than usual, but also more interesting!

 

What did I do this week?

  1. After setting up the auto-ingest pipeline last week, my mentor told me to try using Azure Artifacts for deploying our files (the executed ingestion notebook & the generated filter dataset). I read about it and successfully enabled the pipeline to deploy the files as versioned universal packages (artifacts). But we realized that they can't be made publicly available and can only be downloaded using the Azure CLI. So we decided to drop it & continue with our previous idea of deploying the files to a web server.
    • I created a PR for this pipeline so that my mentor could review my last week's work. We also had some discussions & decided to use DokuWiki to create a simple web interface for displaying the deployed files in a presentable way on the web server.
    • I worked further on this PR to evolve the auto-ingest pipeline from a prototype to a finished state. I created a couple of scripts: (i) to execute the notebook until the 1st error & return the error status; (ii) to deploy the dataset conditionally (only when the notebook executed successfully and the dataset has changed from the previously deployed version). A sketch of the notebook-execution step is included after this list.
  2. Since I needed to compare datasets (HDF files), which is not possible without an HDF comparison utility (an external deb package), I discussed with my mentor whether I could instead write a script to check if there were any changes at SVO (by scraping their changelog). My mentor asked why not use a programmatic interface to access the data from SVO (as he had heard about one). And it turned out to be a turning point in our ongoing plan: it would mean we won't need any pipeline (which I had been working on for a week), since we can fetch data directly from SVO. It was quite surprising for me to grasp, but since it is the better way, I understood & accepted this change.
    • I searched & found that besides a web interface, the SVO Filter Profile Service (FPS) also provides a VO interface (a sort of API), so that applications can send HTTP queries to the service and retrieve data as a VOTable. My mentor told me to use the VOTable parser provided by Astropy to obtain the tables as dataframes and see if the data makes sense.
    • I tried it, and fortunately the SVO VOTables get parsed accurately! Hence I created functions for fetching filter data using requests & the VOTable parser (see the fetch sketch after this list).
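
Here is a minimal sketch of how the first of those scripts (execute the notebook until the 1st error & return the error status) can work, assuming jupyter nbconvert is used to run the notebook; the notebook name here is just illustrative:

```python
# Execute the ingestion notebook and report its error status to the pipeline.
# (A minimal sketch; the actual script and notebook names may differ.)
import subprocess
import sys

NOTEBOOK = "filter_ingestion.ipynb"  # illustrative notebook name

def execute_notebook(notebook_path):
    """Run the notebook with nbconvert; it stops at the first cell error and
    exits non-zero, which we pass on to the pipeline as the error status."""
    result = subprocess.run(
        ["jupyter", "nbconvert", "--to", "notebook", "--execute",
         "--output", "executed_" + notebook_path, notebook_path]
    )
    return result.returncode

if __name__ == "__main__":
    # The pipeline can read this exit status to decide whether to deploy.
    sys.exit(execute_notebook(NOTEBOOK))
```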
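
And here is a rough sketch of the kind of fetch function I mean - the FPS endpoint URL and the ID query parameter are as I understood them from the FPS documentation, while the function name and example filter ID are just illustrative:

```python
# Fetch a filter's data from the SVO FPS VO interface and parse the returned
# VOTable into a pandas dataframe. (A sketch, not the final wsynphot code.)
from io import BytesIO

import requests
from astropy.io.votable import parse

SVO_FPS_URL = "http://svo2.cab.inta-csic.es/theory/fps/fps.php"

def fetch_transmission_data(filter_id):
    """Fetch the transmission curve of a filter (e.g. '2MASS/2MASS.J')."""
    response = requests.get(SVO_FPS_URL, params={"ID": filter_id})
    response.raise_for_status()
    votable = parse(BytesIO(response.content))
    # VOTable -> astropy Table -> pandas dataframe
    return votable.get_first_table().to_table().to_pandas()
```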

 

What is coming up next?

As my mentor mentioned during our discussions, we will possibly contribute these functions to astroquery (an open-source Python package for accessing data from astronomical data resources). We will also decide on & work on redesigning the way our package, wsynphot, fetches filter data from SVO.

 

Did I get stuck anywhere?

Yes - while reading about the programmatic interface of SVO FPS, the HTTP queries I found in their documentation were not working. My mentor suggested I take some time off, and it proved fruitful. The next day, when I read the documentation with a clear mind, I realized my mistake (I had been referring to the wrong section) and found the right queries.

 

What was something new I did and what did I learn?

Code Review: I reviewed the PR of another GSoC student in our sister org, TARDIS, on their docs deployment pipeline. Since I had already set that up successfully in our org, my mentor told me to review his code. I suggested several changes in the review and explained a mistake he was making in deploying docs to gh-pages - this helped him understand the concept of docs deployment better. I realized that code review is in turn helpful for the reviewer too, as it allowed me to assess how much I already know.

💡 The work I did this week made me learn about a variety of tools & techniques:

  • Azure Artifacts allows us to create a package feed, where all versions of our deployed packages (collections of files) are listed. We can even connect the feed with package hosts like PyPI so that users can download the packages from there.
  • DokuWiki is an open-source wiki software that allows us to create versatile wikis that don't need a database. Using it we can even create a simple, presentable web interface (like the one we wanted for displaying deployed files). There's also a Python module for DokuWiki to create & manage wikis, so within the pipeline we could call such a Python script to update our wiki for the deployed files (a sketch follows this list).
  • Passing variables between the shell & scripts - To access the calling shell's variables in a Python script, we can use the os.environ dictionary. To use variables defined in the Azure VM environment in a bash script, we need to make some changes to the variable's naming style. Also, in the pipeline YAML we can declare a variable within the tasks instead of globally. To return a variable from a script, we can simply print it and access it from the shell by capturing the output of the command that runs the script (also sketched after this list).
  • VOTable (Virtual Observatory Table) is a format for representing & storing data in XML syntax. A VOTable consists of Resource elements which contain Table elements. They can be parsed into Astropy tables simply by using astropy.io.votable.parse, and can be further converted to pandas dataframes.
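
Here's a rough sketch of what such a wiki-updating step could look like, assuming the python-dokuwiki package (which talks to DokuWiki over its XML-RPC API); the wiki URL, credentials and page name are placeholders:

```python
# Update a DokuWiki page listing the files deployed by the pipeline.
# (A sketch assuming the python-dokuwiki package; all values are placeholders.)
import dokuwiki

wiki = dokuwiki.DokuWiki(
    "https://example.org/wiki",        # placeholder wiki URL
    "pipeline-bot", "secret-password"  # placeholder credentials
)

# Overwrite (or create) the page that lists the deployed files
wiki.pages.set(
    "deployed_files",
    "====== Deployed files ======\n\n"
    "  * filter_data.h5 (updated by the auto-ingest pipeline)\n"
)
```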
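
And a minimal sketch of that variable passing between the shell and a Python script, with illustrative variable and file names:

```python
# Read a variable from the calling shell's environment and "return" a value
# back to the shell. (Names here are just examples.)
import os

# Azure Pipelines exposes a pipeline variable such as 'deploy.path' to scripts
# as the upper-cased environment variable DEPLOY_PATH.
deploy_path = os.environ["DEPLOY_PATH"]

# Print the result so the shell can capture it, e.g.:
#   dataset_file=$(python locate_dataset.py)
print(os.path.join(deploy_path, "filter_data.h5"))
```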

🗃️ The simplicity an API brings: Earlier we were using an ingestion notebook that scrapes data from the SVO web interface, downloads the data files, and ingests them into an HDF file. Since the data at SVO updates with time, we began creating a pipeline that could execute this notebook, generate the HDF file, and deploy it on a server so it could be accessed. But by using the VO interface of SVO (the API), we can fetch the data directly from SVO instead of from the HDF file. So there's no need to update an HDF file, and hence no need for a pipeline. All we need is a function that fetches a VOTable from SVO using an HTTP query and returns it parsed as a dataframe - simple!
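
To make it concrete, calling such a function (like the fetch sketch earlier in this post, with an illustrative filter ID) is a one-liner:

```python
# Illustrative call to the fetch sketch shown earlier in this post
filter_df = fetch_transmission_data("2MASS/2MASS.J")
print(filter_df.head())  # the filter's transmission curve as a dataframe
```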

 


Thank you for reading. Stay tuned to know more exciting things with me!
