Blog post: Week 2

Published: 06/02/2019

Pandas + regex = ♥

I usually avoid working with regular expressions, but sometimes they are the right tool to use.

I had to write a parser for a variety of files that are almost identical in format. These files are the output of Fortran routines dating from 1995 to the present, and they contain atomic measurements made by physicists. The subtle differences between them make it impossible to use plain whitespace as the separator.

Fortunately, Pandas lets you pass a regular expression as the 'sep' argument of the pandas.read_csv function.
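
To give an idea of what that looks like, here is a minimal sketch. The sample rows, the column names and the pattern are all made up for illustration (they are not taken from the real files): fields are separated by two or more whitespace characters, while one field contains a single space of its own.

```python
import io

import pandas as pd

# Hypothetical sample: columns are separated by two or more spaces,
# but the "term" column itself contains a single space.
raw = (
    "1   3d6 5D   0.000\n"
    "2   3d6 3P   20688.400\n"
)

# A regular-expression separator is handled by the Python parsing engine,
# so it is requested explicitly here.
df = pd.read_csv(
    io.StringIO(raw),
    sep=r"\s{2,}",
    engine="python",
    header=None,
    names=["level", "term", "energy"],  # assumed column names
)

print(df)
```

Splitting on a single whitespace character would have cut the "term" field in two, which is the kind of problem a regex separator avoids.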

Also, one of my mentors is really good at regular expressions, so after a few tries we had our perfect parser.

See an example

Now we're able to extract data from more than 300 files in a simple and homogeneous way!

In Week 2 I had to move from these Jupyter Notebooks to the actual codebase. This was a challenge for me because I'm not so confident in my object-oriented programming skills, but it worked out! I successfully wrote new parser classes that can read the files and dump the data in HDF5 format.
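
As a rough idea of the shape such a class can take, here is a small sketch. The class name, column names and separator are hypothetical and not the ones in the actual codebase; the only real API used is pandas.read_csv and DataFrame.to_hdf (which needs the optional PyTables package).

```python
import io

import pandas as pd


class LevelsParser:
    """Hypothetical parser: reads one table and keeps it as a DataFrame."""

    # Assumed layout: fields separated by two or more whitespace characters.
    SEP = r"\s{2,}"
    COLUMNS = ["level", "term", "energy"]

    def __init__(self, source):
        # `source` can be a file path or any file-like object.
        self.source = source
        self.data = self._parse()

    def _parse(self):
        return pd.read_csv(
            self.source,
            sep=self.SEP,
            engine="python",
            header=None,
            names=self.COLUMNS,
        )

    def to_hdf(self, path, key="levels"):
        # DataFrame.to_hdf requires the optional PyTables package.
        self.data.to_hdf(path, key=key, mode="w")


# Parse an in-memory sample instead of a real file.
parser = LevelsParser(io.StringIO("1   3d6 5D   0.000\n2   3d6 3P   20688.400\n"))
print(parser.data)
```

With PyTables installed, calling parser.to_hdf("levels.h5") (an example file name) would write the parsed table to disk.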

See an example


Moving to Python 3, continuous integration and more:

When I decided to learn Python I went for 3.5, so I skipped Python 2. The only thing I knew about "legacy" Python was the use of the print statement without parentheses.

At the beginning of the coding period I was told to get Travis CI to work again. Unit testing and continuous integration were things I had heard about but never had the chance to use. So porting our codebase to Python 3 was absolutely necessary in order to move on.

A few things I've learned in the process:

  • In Python 3, range(), zip(), and map() return lazy objects instead of lists. Look for them and wrap the call in list() wherever the code expects an actual list.
  • Sometimes it is good to pin package versions close to the ones that worked when the package was built.
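  • The itertools module still exists in Python 3, but itertools.izip(), imap(), and ifilter() were removed (the built-in zip(), map(), and filter() are already lazy) and izip_longest() was renamed zip_longest(). Look for them!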
  • Of course, print is a function in Python 3, so its arguments need parentheses.
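
A few of the points above can be sketched in a short Python 3 snippet. The variable names and values are only illustrative:

```python
from itertools import zip_longest  # this was izip_longest in Python 2

# map() and range() are lazy in Python 3: wrap them in list()
# when the code needs a real list.
squares = list(map(lambda x: x * x, range(4)))

# zip() is lazy as well, so there is no itertools.izip() anymore.
pairs = list(zip("ab", [1, 2]))

# zip_longest() pads the shorter input with a fill value.
padded = list(zip_longest("abc", [1, 2], fillvalue=None))

# print is a function, so the parentheses are mandatory.
print("squares:", squares)
print("pairs:", pairs)
print("padded:", padded)
```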

Fortunately, Travis CI is "easy" to configure, especially if you have experience with bash.



This entry also can be found at