Tenth week of GSoC: git-annex and datalad

sappelhoff
Published: 08/04/2019

Over the last few weeks, Alex, Mainak, and I have been working on making the mne-study-template compatible with the Brain Imaging Data Structure (BIDS). This process involved testing the study template on many different BIDS datasets to see where the template is not yet general enough, or where bugs are hidden.

To improve our testing, we wanted to set up a continuous integration service that automatically runs the template over different datasets for every git commit that we make. Understandably, however, all integration services (such as CircleCI) restrict how much data can be downloaded in a single test run. This meant we needed a lightweight solution that can pull in only small parts of the datasets we wanted to test.

And this is where git-annex and datalad enter the conversation.

git-annex

git-annex is software that allows managing large files with git. One could see git-annex as a competitor to git-lfs ("Large File Storage"), because both solve the same problem. They differ in their technical implementation and have different pros and cons. A good summary can be found in this Stack Overflow post: https://stackoverflow.com/a/39338319/5201771

Datalad

Datalad is a Python library that "builds on top of git-annex and extends it with an intuitive command-line interface". Datalad can also be seen as a "portal" to many git-annex datasets that are openly accessible on the Internet.

Recipe: How to turn any online dataset into a GitHub-hosted git-annex repository

Requirements: git-annex, datalad, unix-based system

Installing git-annex worked great using conda and the git-annex package from conda-forge:

conda install git-annex -c conda-forge

The installation of datalad is very simple via pip:

pip install datalad

Now find the dataset you want to turn into a git-annex repository. In this example, we'll use the Matching Pennies dataset hosted on OSF: https://osf.io/cj2dr/

We now need to create a CSV file with two columns. Each row of the file will reflect a single file we want to have in the git-annex repository. In the first column we will store the file path relative to the root of the dataset, and in the second column we will store the download URL of that file.

Usually, the creation of this CSV file should be automated using software. For OSF, we have the datalad-osf package which can do the job. However, that package is still in development so I wrote my own function, which involved picking out many download URLs and file names by hand :-(

On OSF, the URLs are given by https://osf.io/<key>/download where <key> depends on the file.

See two example rows of my CSV (note the headers, which are important later on):

fpath,url
sub-05/eeg/sub-05_task-matchingpennies_channels.tsv,https://osf.io/wdb42/download
sourcedata/sub-05/eeg/sub-05_task-matchingpennies_eeg.xdf,https://osf.io/agj2q/download
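
If you already know the OSF key of each file, writing the CSV can be scripted. The following is only a minimal sketch, not the function I actually used: the names OSF_KEYS, osf_download_url, and write_addurls_csv are made up for illustration, and the mapping contains just the two files from the example rows. It uses Python's standard csv module and writes the headers fpath and url without any spaces around the comma.

```python
import csv
import io

# Hypothetical mapping from dataset-relative file paths to OSF file keys.
# The two entries are the ones from the example rows; a real dataset
# would need one entry per file.
OSF_KEYS = {
    "sub-05/eeg/sub-05_task-matchingpennies_channels.tsv": "wdb42",
    "sourcedata/sub-05/eeg/sub-05_task-matchingpennies_eeg.xdf": "agj2q",
}


def osf_download_url(key):
    """Build the OSF download URL for a single file key."""
    return f"https://osf.io/{key}/download"


def write_addurls_csv(keys_by_path, stream):
    """Write one row per file under the headers 'fpath' and 'url'."""
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow(["fpath", "url"])
    for fpath, key in keys_by_path.items():
        writer.writerow([fpath, osf_download_url(key)])


# Write into an in-memory buffer first so the result can be inspected.
buffer = io.StringIO()
write_addurls_csv(OSF_KEYS, buffer)
csv_text = buffer.getvalue()
print(csv_text)

# To save the result to disk instead (file name is up to you):
# with open("mp.csv", "w", newline="") as f:
#     write_addurls_csv(OSF_KEYS, f)
```

Collecting the keys themselves is still the tedious part; the sketch only removes the risk of typos when assembling the rows.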

Once your CSV file is ready, and git-annex and datalad are installed, it is time to switch to the command line.

# create the git-annex repository
datalad create eeg_matchingpennies

# download the files in the CSV and commit them
datalad addurls mp.csv "{url}" "{fpath}" -d eeg_matchingpennies/  

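# print our files and the references where to find them
# will show a local address (the downloaded files) and a web address (OSF)
# (git annex must be run inside the repository, so step in and back out)
cd eeg_matchingpennies
git annex whereis
cd ..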

# Make a clone of your fresh repository
datalad install -s eeg_matchingpennies clone

# go to the clone
cd clone  

# disconnect the clone from the local data sources
git annex dead origin  

# disconnect the clone from its origin
git remote rm origin  

# print our files again: notice how all references to
# the local files are gone and only the web references persist
git annex whereis

Now make a new empty repository on GitHub: https://github.com/sappelhoff/eeg_matchingpennies

# add a new origin to the clone
git remote add origin https://github.com/sappelhoff/eeg_matchingpennies

# upload the git-annex repository to GitHub
datalad publish --to origin

Now your dataset is ready to go! Try it out as described below:

# clone the repository into your current folder
datalad install https://github.com/sappelhoff/eeg_matchingpennies

# go to your repository
cd eeg_matchingpennies

# get the data for sub-05 (not just the reference to it)
datalad get sub-05

# get only a single file
datalad get sub-05/eeg/sub-05_task-matchingpennies_eeg.vhdr  

# get all the data
datalad get .

Acknowledgments and further reading

I am very thankful to Kyle A. Meyer and Yaroslav Halchenko for their support in this GitHub issue thread. If you are running into issues with my recipe, I recommend that you fully read that GitHub issue thread.
