epassaro's Blog

Blog post: Week 8

epassaro
Published: 07/19/2019

This week I focused on two things:

  • Started writing new classes for the previously existing atomic sources to bypass the SQL database and store data directly in HDF5 format. This is a much simpler approach and will speed things up for TARDIS developers when they need to build new atomic files. For an example, see PR #144 (and the sketch right after this list).
  • Started setting up a pipeline to download, extract, and convert the entire CMFGEN database to HDF5. See PR #143.
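
Roughly, the idea looks like the sketch below. The class and method names here are hypothetical, just to illustrate the approach, and are not the final Carsus implementation: each source gets a small reader class that parses the raw files into pandas DataFrames and writes them straight to HDF5, with no SQL database in between.

# Hypothetical sketch of a "direct to HDF5" reader class; names are
# illustrative, not the final Carsus implementation.
import pandas as pd

class BaseReader:
    """Parse an atomic data source into DataFrames and dump them to HDF5."""

    def __init__(self, fname):
        self.fname = fname

    def parse(self):
        """Return a dict of DataFrames, one per dataset (levels, lines, ...).

        Each concrete source (GFALL, CMFGEN, ...) implements this method.
        """
        raise NotImplementedError

    def to_hdf(self, fname_hdf):
        """Write every parsed dataset to its own key in an HDF5 file."""
        with pd.HDFStore(fname_hdf, mode='w') as store:
            for key, df in self.parse().items():
                store.put(key, df, format='table')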

In the process I also learned a couple of things:

  • How to use the `logging` module from the Python standard library (and why it's a good idea to use it).
  • Why you should never use a bare `except` statement. Yep, I learned this in the worst possible way (see the snippet below).
  • I'm getting good at writing regular expressions.
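
As a tiny illustration of both points, a generic snippet along these lines (not actual Carsus code):

# Module-level logger plus a *specific* exception handler, instead of a bare
# `except:` that would also swallow KeyboardInterrupt and hide the real error.
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def to_float(value):
    try:
        return float(value)
    except ValueError:       # catch only what we expect...
        logger.warning("Could not parse %r, using NaN instead", value)
        return float('nan')  # ...and leave a trace instead of failing silently

to_float('1.5e3')  # returns 1500.0
to_float('***')    # logs a warning and returns nan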

I will continue working on these two items next week.

 

The second part of GSoC is almost over and we already have some good results, but a lot of work is still ahead!


Check in: Week 7

epassaro
Published: 07/12/2019

1. What did you do this week?

This week I successfully recreated the standard TARDIS atomic file and ran some simulations! :)

The transition to Python 3 is complete and the Carsus package is finally fully operational.

I've also updated some documentation, which can be accessed here: https://tardis-sn.github.io/carsus/notebooks/quickstart.html

 

2. What is coming up next?

Now we're going to add methods to bypass the SQL database and store the data directly in HDF5. This should give future Carsus users and developers a simpler workflow.


3. Did you get stuck anywhere?

Not really. I spotted a couple of bugs in Pandas and SQLAlchemy that gave me headaches, but everything worked out with a lot of effort and my mentors' support. I'm going to open tickets on GitHub for these issues!


Blog post: Week 6

epassaro
Published: 07/07/2019

This week I worked on how to make a TARDIS atomic file. This is a necessary intermediate step towards the work scheduled for the coming weeks.

Carsus is the subpackage in charge of parsing and ingesting atomic data from different sources into a SQL database. Once the data is ingested, we can dump it into the HDF5 file that TARDIS needs to run its simulations.
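
To make this two-stage flow concrete, here is a minimal, self-contained sketch using plain pandas and SQLite with toy data. This is not the actual Carsus API, just an illustration of the idea.

# Sketch of the "ingest into SQL, then dump to HDF5" flow with toy data.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite:///atomic.db')

# 1. Ingest: parse a source and store the result in the SQL database
atoms = pd.DataFrame({'atomic_number': [1, 2, 14], 'symbol': ['H', 'He', 'Si']})
atoms.to_sql('atoms', engine, if_exists='replace', index=False)

# 2. Dump: read the ingested data back and write an HDF5 file
atoms_db = pd.read_sql('atoms', engine)
atoms_db.to_hdf('atom_data.h5', key='atoms', mode='w', format='table')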

The Carsus data model includes classes such as Atom, Ion, Level, and more. It was a bit hard for me to understand at the beginning, but it worked out.
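
To give a flavour of what such a data model looks like, here is a heavily simplified, hypothetical version written with SQLAlchemy; the real Carsus schema has more tables and columns than this.

# Simplified, illustrative data model; not the actual Carsus schema.
from sqlalchemy import Column, Integer, Float, ForeignKey, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import relationship

Base = declarative_base()

class Atom(Base):
    __tablename__ = 'atom'
    atomic_number = Column(Integer, primary_key=True)  # e.g. 14 for Si
    ions = relationship('Ion', back_populates='atom')

class Ion(Base):
    __tablename__ = 'ion'
    id = Column(Integer, primary_key=True)
    atomic_number = Column(Integer, ForeignKey('atom.atomic_number'))
    ion_charge = Column(Integer)                        # e.g. 1 for Si II
    atom = relationship('Atom', back_populates='ions')
    levels = relationship('Level', back_populates='ion')

class Level(Base):
    __tablename__ = 'level'
    id = Column(Integer, primary_key=True)
    ion_id = Column(Integer, ForeignKey('ion.id'))
    energy = Column(Float)                              # level energy
    g = Column(Integer)                                 # statistical weight
    ion = relationship('Ion', back_populates='levels')

# Create all tables in an in-memory SQLite database
engine = create_engine('sqlite://')
Base.metadata.create_all(engine)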

So far, we can ingest atomic data from three different sources: the National Institute of Standards and Technology (NIST) database, the Kurucz line list (GFALL), and the Chianti atomic database. Our main goal for the next weeks is to write code to ingest the data produced by the CMFGEN parsers (the code we wrote during the first half of GSoC) into the SQL database.

I found an annoying bug which makes it impossible to ingest data from GFALL without adding the NIST data first. Debugging this error was very time consuming and we have not found a solution yet.


Check in: Week 5

epassaro
Published: 06/27/2019

1. What did you do this week?

I started writing docstrings and unit tests for the new classes and functions I've created.
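
For instance, a pytest-style test for a parser helper looks roughly like this (`parse_energy` is just a placeholder, not one of the real Carsus functions):

# Sketch of a pytest-style unit test for a (placeholder) parser helper.
import pytest

def parse_energy(token):
    """Convert an energy token read from a source file into a float."""
    return float(token.strip())

@pytest.mark.parametrize('token, expected', [
    ('  1.0 ', 1.0),
    ('2.5e3', 2500.0),
])
def test_parse_energy(token, expected):
    assert parse_energy(token) == expected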


2. What is coming up next?

I successfully set up the Travis continuous integration pipeline during the first weeks of GSoC, so now I'm going to work a bit more on the output methods.


3. Did you get stuck anywhere?

It's the first time I've worked with unit tests, so it was difficult at the beginning, but everything is going just fine :)


Blog post: Week 4

epassaro
Published: 06/22/2019

Storing Pandas objects into HDF5 files 💾

After successfully parsing more than 1,000 plain text files, now it's time to store the data in an appropriate way.

What is HDF5?

HDF stands for 'Hierarchical Data Format', and it was designed to store enormous amounts of data. It was originally developed at the National Center for Supercomputing Applications and is now supported by The HDF Group, a non-profit corporation.

Why use HDF5?

  • At its core, HDF5 is a binary file format specification.

  • It can store many datasets along with user-defined metadata, offers optimized I/O, and allows querying its contents.

  • Many programming languages have tools to work with HDF.

  • HDF allows datasets to live in a nested tree structure. In effect, HDF5 is a file system within a file. The 'folders' inside this filesystem are called groups, and sometimes nodes or keys (these terms are often used interchangeably).

Toolbox

There are at least three Python packages which can handle HDF5 files: PyTables, h5py, and pandas.HDFStore. There are also a few tools to visualize them: HDFView (Java), HDF Compass (Python) and ViTables (Python). They can be found in the Ubuntu repositories, but they often don't work as expected.

Fortunately, ViTables is available as a conda-forge package and works flawlessly.

Example #1: Dump a DataFrame

import pandas as pd

data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame.from_records(data)

# mode='w' creates a new file, overwriting test.h5 if it already exists
with pd.HDFStore('test.h5', mode='w') as f:
    f.append(key='/new_dataset', value=df, format='table',
             data_columns=list(df.columns))

 

Example #2: Include metadata

Perhaps one of the most interesting aspects of HDF is the ability to store metadata*. This was a bit hard to find in the Pandas documentation.

meta = {'date': '21/06/2019', 'comment': 'Watch Evangelion on Netflix'}

# Reopen in append mode so the dataset written in Example #1 is preserved
with pd.HDFStore('test.h5', mode='a') as f:
    f.get_storer('/new_dataset').attrs.metadata = meta

*FITS format can do this as well ;)
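
Reading everything back is just as simple. For completeness, here is a short sketch, assuming the file written by the two examples above:

# Read the dataset and its metadata back from test.h5
with pd.HDFStore('test.h5', mode='r') as f:
    print(f.keys())                                     # ['/new_dataset']
    df = f['/new_dataset']                              # the DataFrame itself
    meta = f.get_storer('/new_dataset').attrs.metadata  # the dict we attached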

 

What's next?

Next week I'll be working on unit testing.

 

This entry can also be found at dev.to/epassaro
