Blog post: Week 4

epassaro
Published: 06/22/2019

Storing Pandas objects into HDF5 files 💾

After successfully parsing more than 1,000 plain text files, it's time to store the data in an appropriate way.

What is HDF5?

HDF stands for 'Hierarchical Data Format', and it was designed to store enormous amounts of data. It was originally developed at the National Center for Supercomputing Applications and is now supported by The HDF Group, a non-profit corporation.

Why use HDF5?

  • At its core, HDF5 is a binary file format specification.

  • It can store many datasets and user-defined metadata, offers optimized I/O, and lets you query its contents.

  • Many programming languages have tools to work with HDF.

  • HDF allows datasets to live in a nested tree structure. In effect, HDF5 is a file system within a file. The 'folders' inside this file system are called groups, and sometimes nodes or keys (or at least these terms are used interchangeably).
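To make the 'file system within a file' idea concrete, here is a minimal sketch using pandas.HDFStore. The file name and the group names (`/raw/run1`, `/raw/run2`, `/summary`) are made up for illustration, and it assumes PyTables is installed, since pandas relies on it for HDF5 support.

```python
import os
import tempfile

import pandas as pd

# Hypothetical file and group names, chosen only for illustration.
path = os.path.join(tempfile.mkdtemp(), 'groups_demo.h5')

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

with pd.HDFStore(path, mode='w') as store:
    # Keys look like file system paths; the intermediate group '/raw'
    # is created automatically.
    store.put('/raw/run1', df)
    store.put('/raw/run2', df * 10)
    store.put('/summary', df.sum().to_frame('total'))

    # List every pandas object stored in the file, like listing files in folders.
    keys = sorted(store.keys())

print(keys)
```

Each key behaves like a path, so related datasets can be grouped under a common 'folder' such as `/raw`.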

Toolbox

There are at least three Python packages that can handle HDF5 files: pytables, h5py and pandas.HDFStore. There are also a few tools to visualize them: HDFViewer (Java), HDFCompass (Python) and Vitables (Python). They can be found in the Ubuntu repositories, but they often don't work as expected.

Fortunately, Vitables is available as a conda-forge package and works flawlessly.

Example #1: Dump a DataFrame

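import pandas as pd

data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data)

with pd.HDFStore('test.h5', mode='w') as f:
    # The key is passed positionally (a positional argument cannot follow a keyword one).
    # format='table' with data_columns=True makes every column queryable later.
    f.append('/new_dataset', df, format='table', data_columns=True)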
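Storing the frame in table format with data columns is what makes it possible to query the file without loading everything. The following is a minimal sketch of such a query, assuming PyTables is installed; the file name is hypothetical and the condition `A > 1` is just an example.

```python
import os
import tempfile

import pandas as pd

# Hypothetical file name, written to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'query_demo.h5')

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

with pd.HDFStore(path, mode='w') as store:
    # Table format plus data columns allows `where` conditions on those columns.
    store.append('/new_dataset', df, format='table', data_columns=True)

# Read back only the rows where column A is greater than 1.
subset = pd.read_hdf(path, key='/new_dataset', where='A > 1')

print(subset)
```

Only the matching rows are returned, which matters once the dataset no longer fits comfortably in memory.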

 

Example #2: Include metadata

One of the most interesting aspects of HDF is probably the ability to store metadata*. This was a bit hard to find in the pandas documentation.

meta = {'date': '21/06/2019', 'comment': 'Watch Evangelion on Netflix'}

# Open in append mode: mode='w' would truncate the file and remove '/new_dataset'.
with pd.HDFStore('test.h5', mode='a') as f:
    f.get_storer('/new_dataset').attrs.metadata = meta

*The FITS format can do this as well ;)
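For completeness, here is a small sketch of the full round trip: writing a dataset, attaching the metadata and reading it back. It assumes PyTables is installed. The file name is hypothetical, and `metadata` is just an attribute name picked for this example, not a reserved one.

```python
import os
import tempfile

import pandas as pd

# Hypothetical file name, written to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'meta_demo.h5')

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
meta = {'date': '21/06/2019', 'comment': 'Watch Evangelion on Netflix'}

with pd.HDFStore(path, mode='w') as store:
    store.append('/new_dataset', df, format='table')
    # Attach the dictionary as a user-defined attribute of the dataset node.
    store.get_storer('/new_dataset').attrs.metadata = meta

# Reopen the file read-only and fetch the attribute again.
with pd.HDFStore(path, mode='r') as store:
    restored = store.get_storer('/new_dataset').attrs.metadata

print(restored)
```

The metadata travels with the dataset inside the same file, so there is no need for a separate sidecar file.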

 

What's next?

Next week I'll be working on unit testing.

 

This entry can also be found at dev.to/epassaro
