Lessons learned: implementing `splot`

In my last blog post I explained the new API structure the splot team decided on. Over the last two weeks of GSoC I have been working on implementing this API for the giddy and esda sub-packages. The functionality will be available in the splot release, which will likely come at the end of this GSoC.

This is an example of how the new API can be used in the future:

Imports

import esda
import matplotlib.pyplot as plt
import libpysal.api as lp
from libpysal import examples
import geopandas as gpd

Data preparation and statistical analysis

link = examples.get_path('columbus.shp')
gdf = gpd.read_file(link)
y = gdf['HOVAL'].values
w = lp.Queen.from_dataframe(gdf)
w.transform = 'r'
mloc = esda.moran.Moran_Local(y, w)

Plotting methods

mloc.plot(gdf, 'HOVAL')
mloc.plot(gdf, 'HOVAL', p=0.05, region_column='POLYID', mask=['1', '2', '3'], quadrant=1)
plt.show()

A new addition to the existing plotting functionality is the use of **kwargs as a function parameter. This allows users to customise any splot plot, or any PySAL plotting method leveraging splot, by passing in keyword arguments that the underlying Matplotlib function accepts (e.g. color, c, marker size, linewidth). By providing both sensible defaults and the option to change the style of a visualisation to the user's needs, splot supports quick exploratory use as well as customisation for publication.
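
For illustration, a call along these lines should then be possible (the keyword arguments shown here are hypothetical and simply forwarded to Matplotlib):

mloc.plot(gdf, 'HOVAL', marker='.', alpha=0.7, linewidth=0.5)
plt.show()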

Lessons Learned: Writing Library Code

During the first half of GSoC I dove head first into the world of open source code and spatial statistics. I was not aware that writing code for a library involves much more than knowing how to write code and how to use git. In this section I would like to share my experiences and tips for writing open source library code with you:

My workflow for creating new functionality:

  1. It starts with the fun: write some code!
  2. Create a function from the code if that has not already happened. Usually I iterate on a function until I am happy with the visual output. I therefore use .py files for my source code and Jupyter Notebooks to check the generated visualisations. JupyterLab, for example, is a fantastic tool for working on both at once.
  3. After drafting the main functionality, I think about which elements should be configurable for the user, such as colours, design elements, and masking options, and implement those ideas.
  4. When the functionality is thought out and ready to be used, I write a docstring containing parameters, returns, and examples. Sphinx will later use this to create the complete documentation for the package (a minimal sketch follows below this list).
  5. Now, I write a unit test that exercises the function and ensures that it keeps working when functionality is added or changed in the future. These unit tests are run on Travis CI whenever someone makes changes to splot in a pull request.
  6. Now my functionality is ready to meet splot. I make a pull request and check that Travis CI gives me a green light.
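
To make steps 4 and 5 concrete, here is a minimal, made-up sketch of a docstring and a matching test. The function name plot_choropleth_sketch and its parameters are invented for this illustration and are not part of splot:

import geopandas as gpd
import matplotlib.pyplot as plt
from libpysal import examples

def plot_choropleth_sketch(gdf, column, cmap='viridis', ax=None, **kwargs):
    """Plot a simple choropleth map (illustrative sketch only).

    Parameters
    ----------
    gdf : geopandas.GeoDataFrame
        Geometries and attribute values to plot.
    column : str
        Name of the column in ``gdf`` that is mapped to colour.
    cmap : str, optional
        Name of a Matplotlib colormap, default 'viridis'.
    ax : matplotlib.axes.Axes, optional
        Axes to plot on; a new figure is created if None.
    **kwargs : keyword arguments
        Passed on to ``geopandas.GeoDataFrame.plot``.

    Returns
    -------
    fig, ax : Matplotlib figure and axes containing the plot.
    """
    if ax is None:
        fig, ax = plt.subplots()
    else:
        fig = ax.get_figure()
    gdf.plot(column=column, cmap=cmap, ax=ax, **kwargs)
    ax.set_aspect('equal')
    return fig, ax

def test_plot_choropleth_sketch():
    # Smoke test: the function runs and returns a usable figure and axes.
    gdf = gpd.read_file(examples.get_path('columbus.shp'))
    fig, ax = plot_choropleth_sketch(gdf, 'HOVAL')
    assert ax.get_aspect() in ('equal', 1.0)
    plt.close(fig)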

Some tips that optimised this workflow, saved me a lot of time, and helped me debug my library code:

  • While working in a Jupyter notebook or JupyterLab with code from a Python file, Jupyter does not pick up any changes once it has imported a function. This slows down the process of iteratively changing a function and checking the resulting visualisation output. Using reload allowed me to make edits in the Python file and quickly check the resulting changes in Jupyter without restarting the entire kernel:

from importlib import reload
reload(splot)

  • Running pyflakes on the project repository in the terminal helped me clean up my code when it got a little messy. pyflakes picks up on lots of common errors like undefined names (often caused by missing imports), unused imports, unused variables, etc., allowing for a quick clean-up before making a pull request.
  • Similarly, running pep8 makes sure my code and documentation conform to the Python style guide.
  • Running nosetests . in my local project folder runs all the unit tests in the same way they are run on Travis CI. This provides an additional check that changes to the code are correct before I commit or make a pull request.
  • Failing tests on Travis CI can have many reasons. Running nosetests locally helps to check whether my code is the problem or whether, for example, the .travis.yml file needs to be updated (e.g. missing dependencies for new functionality). This file basically tells Travis CI which packages to install and how to run the tests. If Travis CI is still failing, check the log, which shows the details of why the testing failed. If it gets really tricky, you can recreate the Travis CI run step by step by executing the commands in the .yml file locally, which will give you an indication of where things start to go wrong.

Designing the splot API

The last two weeks of GSoC with PySAL and `splot` were all about designing the splot package structure and creating a common API. The decisions made during this time provide a preliminary blueprint for all visualisations and functionality to come.

In this blog post, I will provide a summary of the decisions that were made in the process. If you are interested in having a closer look, the discussions between developers, the community, and me as the student are openly accessible in our GSoC 2018 project on the splot repository. A collection of the functionality that will be supported in the future can be found here, and the decisions on how the package structure and API will look are made here.

The splot package structure and API (as it currently stands):

In the future, splot's functionality can be accessed in two ways.

First of all, basic splot visualisations are exposed as .plot methods on PySAL objects. For example:

from giddy.directional import Rose

# ... data preparation ...
rose = Rose(Y, w)
rose.plot_heatmap()
plt.show()

Furthermore, all basic visualisations and more (to be defined) can be called with sub-package namespaces. For example:

from splot.giddy import dynamic_lisa_composite
from splot.esda import ...
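
A call could then look roughly like this (a hypothetical sketch, assuming a Rose object and a GeoDataFrame gdf have been prepared; the exact signature was still being discussed at this point):

dynamic_lisa_composite(rose, gdf)
plt.show()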

Lastly, the majority of PySAL developers prefer Matplotlib as the default backend. I therefore decided to focus on implementing most visualisations with Matplotlib first before continuing to develop the Bokeh backend.

Why we decided to prioritise matplotlib over bokeh

Over the last weeks I found that it took much more of my time to implement a visualisation in Bokeh than in Matplotlib. I can think of four main reasons for this:

  1. I am more familiar with Matplotlib and naturally quicker in creating visualisations with this backend.
  2. Matplotlib seems to have better documentation and more examples on Stack Overflow and many other places, which speeds up the implementation of specialised visualisations.
  3. I found a couple of things that were harder to implement in Bokeh, seemingly because of missing features and design choices.
  4. There are more packages that already build on Matplotlib, like GeoPandas' .plot functionality or Seaborn, that can be leveraged.

When visualising geographical data on maps, it is important, for example, to autoscale plots so that they keep their aspect ratio when the plotted figure changes size or format. This helps avoid distortion in spatially explicit representations.

To achieve this in Matplotlib, one can simply add the line:

ax.set_aspect('equal')
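
In context, a minimal example (reusing the gdf GeoDataFrame from the Columbus example above) looks like this:

fig, ax = plt.subplots(figsize=(6, 6))
gdf.plot(column='HOVAL', ax=ax)
ax.set_aspect('equal')  # one map unit in x equals one map unit in y
plt.show()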

In Bokeh, however, this did not seem to be implemented yet, so I had to create this utility function:

def calc_data_aspect(plot_height, plot_width, bounds):
  # Deal with data ranges in Bokeh:
  # make a meter in x and y the same in pixel lengths
  aspect_box = plot_height / plot_width # 2 / 1 = 2
  xmin, ymin, xmax, ymax = bounds
  x_range = xmax - xmin # 1 = 1 - 0
  y_range = ymax - ymin # 3 = 3 - 0
  aspect_data = y_range / x_range # 3 / 1 = 3

  if aspect_data > aspect_box:
    # we need to increase x_range,
    # such that aspect_data becomes equal to aspect_box
    halfrange = 0.5 * x_range * (aspect_data/aspect_box-1)
    # 0.5 * 1 * (3 / 2 - 1) = 0.25
    xmin -= halfrange # 0 - 0.25 = -0.25
    xmax += halfrange # 1 + 0.25 = 1.25
  else:
    # we need to increase y_range
    halfrange = 0.5 * y_range * (aspect_box/aspect_data-1)
    ymin -= halfrange
    ymax += halfrange

  # Add a bit of margin to both x and y
  # (compute the margins from the ranges first, so both sides grow equally)
  margin = 0.03
  x_margin = (xmax - xmin) / 2 * margin
  y_margin = (ymax - ymin) / 2 * margin
  xmin -= x_margin
  xmax += x_margin
  ymin -= y_margin
  ymax += y_margin
  return xmin, xmax, ymin, ymax
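
The returned bounds can then be handed to a Bokeh figure, roughly like this (a sketch of the idea rather than the exact splot code; it reuses the gdf GeoDataFrame from the earlier example):

from bokeh.plotting import figure

height, width = 400, 800
# GeoPandas' total_bounds returns (xmin, ymin, xmax, ymax),
# which matches the order calc_data_aspect expects.
xmin, xmax, ymin, ymax = calc_data_aspect(height, width, gdf.total_bounds)

fig = figure(plot_height=height, plot_width=width,
             x_range=(xmin, xmax), y_range=(ymin, ymax))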

It was also easier for me to use Jupyter notebook Interact widgets with Matplotlib than with Bokeh. With Matplotlib, Interact simply regenerates the whole figure and was easy to implement:

from ipywidgets import interact, fixed

interact(_dynamic_lisa_widget_update,
         timex=coldict, timey=coldict, rose=fixed(rose),
         gdf=fixed(gdf), p=fixed(p), figsize=fixed(figsize)
         )

With Bokeh as a backend one needs to update the datasource rather than the figure. This is much faster (the whole figure does not have to be re-drawn), but it is harder to implement because the datasource is hidden inside the previous plotting functions and therefore not directly accessible by Interact.
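
Schematically, the Bokeh pattern looks something like this (a simplified sketch with toy data, not splot's actual widget code):

from bokeh.models import ColumnDataSource

# The figure is built once around a ColumnDataSource ...
source = ColumnDataSource(data=dict(x=[0, 1, 2], y=[2, 1, 3]))

# ... and a widget callback later replaces the data in place, so only
# the affected glyphs are re-drawn, not the whole figure.
def update(new_x, new_y):
    source.data = dict(x=list(new_x), y=list(new_y))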

In conclusion, Bokeh is a great tool for making one-off customised visualisations, but it does not yet seem to fully support implementing library functions easily.

Another reason is that using Bokeh in JupyterLab requires loading a JupyterLab extension, which for some reason did not build for me. Nevertheless, Bokeh is still a relatively young package with great potential, and I will definitely explore it and other, more interactive visualisation libraries later in GSoC.

Example of API usage

Now that the preliminary API design is chosen, I have started to implement functions according to this pattern. This pull request contains the first five functions and their documentation for giddy.directional. This is an example of the end result for one of the five: