sappelhoff's Blog

Twelfth week of GSoC: Getting ready for the final week of GSoC

sappelhoff
Published: 08/19/2019

My GSoC is soon coming to an end so I took some time to write down what still needs to be done:

Making a release of MNE-BIDS

In the past months, substantial additions, fixes, and cosmetic changes were made to the codebase and documentation of MNE-BIDS. The last release happened in April (about 4 months ago), and we were quite happy to see new users raise issues and submit pull requests. With the next release, we can provide some new functionality for this growing user base.

Handling coordinates for EEG and iEEG in MNE-BIDS

In MNE-BIDS, the part of the code that handles the writing of sensor positions in 3D space (=coordinates) is so far restricted to MEG data. Extending this functionality to EEG and iEEG data has been on the to-do list for a long time now. Fortunately, I have been learning a bit more about this topic during my GSoC, and Mainak has provided some starting points in an unrelated PR that I can use to finish this issue. (This will happen after the release of MNE-BIDS though, to avoid cramming in too much last-minute content before the release.)

Writing a data fetcher for OpenNeuro to be used in MNE-Python

While working with BIDS and M/EEG data, the need for good testing data has come up time and time again. For the mne-study-template we solved this issue with a combination of DataLad and OpenNeuro. Meanwhile, MNE-BIDS has its own dataset.py module ... however, we all feel like this module is duplicating the datasets module of MNE-Python and not advancing MNE-BIDS. Rather, it is confusing the purpose of MNE-BIDS.

As a solution, we want to write a generalized data fetching function for MNE-Python that works with OpenNeuro ... without adding the DataLad (and hence git-annex) dependency. Once this fetching function is implemented, we can import it in MNE-BIDS and finally deprecate MNE-BIDS' dataset.py module.

Make a PR in MNE-Python that will support making Epochs for duplicate events (will fix ds001971 PR)

In MNE-Python, making data epochs is not possible if two events share the same time. This became apparent with the dataset ds001971 that we wanted to add to the mne-study-template pipeline: https://github.com/mne-tools/mne-study-template/pull/41. There was a suggestion to solve this issue by merging the event codes that occurred at the same time. Once this fix is implemented in MNE-Python, we can use it to finish the PR in the mne-study-template.
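The merging idea can be sketched in plain Python. This is only an illustration under assumptions: events are given as rows of (sample, previous value, event code), which is the layout of an MNE-Python events array, while the function name `merge_simultaneous_events` and the scheme for numbering the new codes are hypothetical and not necessarily what the fix in MNE-Python will look like.

```python
def merge_simultaneous_events(events):
    """Merge events that share a sample into one event with a new code.

    events : list of (sample, previous_value, code) rows, for example
        obtained from an MNE-Python events array via ``.tolist()``.
    Returns the merged rows (sorted by sample) and a dict that maps each
    tuple of merged codes to the newly assigned code.
    """
    if not events:
        return [], {}

    # Group the rows by their sample index
    by_sample = {}
    for sample, prev, code in events:
        by_sample.setdefault(sample, []).append((prev, code))

    # New codes start above the largest existing code to avoid clashes
    next_code = max(code for _, _, code in events) + 1
    new_codes = {}
    merged = []
    for sample in sorted(by_sample):
        rows = by_sample[sample]
        if len(rows) == 1:
            prev, code = rows[0]
            merged.append([sample, prev, code])
            continue
        # The same combination of codes always gets the same new code
        key = tuple(sorted(code for _, code in rows))
        if key not in new_codes:
            new_codes[key] = next_code
            next_code += 1
        merged.append([sample, rows[0][0], new_codes[key]])
    return merged, new_codes


# Hypothetical events: codes 1 and 2 coincide at samples 10 and 30
events = [[10, 0, 1], [10, 0, 2], [20, 0, 1], [30, 0, 2], [30, 0, 1]]
merged, new_codes = merge_simultaneous_events(events)
```

After merging, no two rows share a sample, so epochs could be created from the merged events, and `new_codes` documents which original codes each new code stands for.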

Salvage / close the PR on more "read_raw_bids" additions

Earlier in this GSoC, I made a PR intended to improve the reading functionality of MNE-BIDS (https://github.com/mne-tools/mne-bids/pull/244). However, the PR was controversial because it did not leverage BIDS and instead relied on a dictionary as a container for keyword arguments.

After lots of discussion, we agreed to solve the situation in a different way (by leveraging BIDS) and Mainak made some initial commits in that direction. However, the PR was later dropped because other issues had higher priority.

Before finishing my GSoC, I want to salvage what's possible from this PR and then close it ... and improve the original issue report so that the next attempt at this PR can rely on a more detailed objective.


Eleventh week of GSoC: Some more Datalad (complete and automatic flow now)

sappelhoff
Published: 08/11/2019

1. What did you do this week?

I have compiled a list for week 11 in my changelog here: https://github.com/sappelhoff/gsoc2019/blob/master/changelog.md#week-11

2. What is coming up next?

Next, I will continue to improve the mne-study-template and also work on a new release of MNE-BIDS.

3. Did you get stuck anywhere?

Like the week before, I got stuck a bit with Datalad. However, I finally fixed all problems and I want to report the flow of my pipeline below. Enjoy!


Pipeline to get any dataset as git-annex dataset

using the following tools:

  1. Step 1: Upload data to OSF
    1. install osfclient: `pip install osfclient` (see https://github.com/osfclient/osfclient)
    2. make a new OSF repository from the website (you need to be registered)
    3. copy the "key" from the new OSF repository, e.g., "3qmer" for the URL: "https://osf.io/3qmer/"
    4. navigate to the directory that contains the directory you want to upload to OSF
    5. make a `.osfcli.config` file: `osf init` ... this file gets written into the current working directory
    6. call `osf upload -r MY_DATA/ .` to upload your data, replacing MY_DATA with your upload directory name
    7. instead of being prompted to input your password, you can define an environment variable OSF_PASSWORD with your password. This has the advantage that you could start an independent process without having to wait and leave your command line prompt open: `nohup osf upload -r MY_DATA/ . &`
    8. NOTE: Recursive uploading using osfclient can be a bad experience. Check out this wrapper script for more control over the process: https://github.com/sappelhoff/gsoc2019/blob/master/misc_code/osfclient_wrapper.py
  2. Step 2: Make a git-annex dataset out of the OSF data
    1. install datalad-osf: git clone and use `pip install -e .` NOTE: You will need the patch submitted here: https://github.com/templateflow/datalad-osf/pull/2
    2. install datalad: `pip install datalad` and git-annex (e.g., via conda-forge)
    3. create your data repository: `datalad create MY_DATA`
    4. go there and download your OSF data using datalad-osf: `cd MY_DATA` ... then `python -c "import datalad_osf; datalad_osf.update_recursive(key='MY_KEY')"`, where MY_KEY is the "key" from step 1 above.
  3. Step 3: Publish the git-annex dataset on GitHub
    1. Make a fresh (empty) repository on GitHub: <repo_url>
    2. Clone your datalad repo: `datalad install -s <local_repo> clone`
    3. `cd clone`
    4. `git annex dead origin`
    5. `git remote rm origin`
    6. `git remote add origin <repo_url>`
    7. `datalad publish --to origin`
  4. Step 4: Get parts of your data (or everything) from the git-annex repository
    1. `datalad install <repo_url>`
    2. `cd <repo>`
    3. `datalad get <some_folder_or_file_path>` to get only parts of the data
    4. `datalad get .` to get everything
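Step 3 is the part that is easiest to get wrong by hand, so it may help to script it. Below is a minimal sketch that only wraps the commands listed in step 3. The function names `publish_commands` and `run_publish` and the `dry_run` switch are made up for illustration, and the GitHub URL is a placeholder. By default, nothing is executed and the commands are only printed.

```python
import shlex
import subprocess


def publish_commands(local_repo, repo_url, clone_name="clone"):
    """Return the step 3 commands as (argv, working_directory) pairs."""
    return [
        # Clone the local datalad repository
        (["datalad", "install", "-s", local_repo, clone_name], None),
        # The remaining commands run inside the clone (the "cd clone" step)
        (["git", "annex", "dead", "origin"], clone_name),
        (["git", "remote", "rm", "origin"], clone_name),
        (["git", "remote", "add", "origin", repo_url], clone_name),
        (["datalad", "publish", "--to", "origin"], clone_name),
    ]


def run_publish(local_repo, repo_url, dry_run=True):
    """Print the commands; only execute them when dry_run is False."""
    for argv, cwd in publish_commands(local_repo, repo_url):
        print(" ".join(shlex.quote(arg) for arg in argv))
        if not dry_run:
            # check=True stops the sequence at the first failing command
            subprocess.run(argv, cwd=cwd, check=True)


# Placeholder names: replace with your own repository and URL
run_publish("MY_DATA", "https://github.com/USER/MY_DATA")
```

Calling `run_publish(..., dry_run=False)` would run the commands for real, which requires datalad and git-annex to be installed and the empty GitHub repository to exist.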



Tenth week of GSoC: git-annex and datalad

sappelhoff
Published: 08/04/2019

In the last weeks, Alex, Mainak, and I worked on making the mne-study-template compatible with the Brain Imaging Data Structure (BIDS). This process involved testing the study template on many different BIDS datasets, to see where the template is not yet general enough, or where bugs are hidden.

To improve our testing, we wanted to set up a continuous integration service that automatically runs the template over different datasets for every git commit that we make. Understandably, however, all integration services (such as CircleCI) restrict how much data can be downloaded in a single test run. This meant we needed lightweight solutions that can pull in only small parts of the datasets we wanted to test.

And this is where git-annex and datalad enter the conversation.

git-annex

git-annex is software for managing large files with git. One could see git-annex as a competitor to git-lfs ("Large File Storage"), because both solve the same problem. They differ in their technical implementation and have different pros and cons. A good summary can be found in this Stack Overflow post: https://stackoverflow.com/a/39338319/5201771

Datalad

Datalad is a Python library that "builds on top of git-annex and extends it with an intuitive command-line interface". Datalad can also be seen as a "portal" to many git-annex datasets openly accessible on the Internet.

Recipe: How to turn any online dataset into a GitHub-hosted git-annex repository

Requirements: git-annex, datalad, unix-based system

Installing git-annex worked great using conda and the conda-forge package for git-annex:

conda install git-annex -c conda-forge

The installation of datalad is very simple via pip:

pip install datalad

Now find the dataset you want to turn into a git-annex repository. In this example, we'll use the Matching Pennies dataset hosted on OSF: https://osf.io/cj2dr/

We now need to create a CSV file with two columns. Each row of the file will reflect a single file we want to have in the git-annex repository. In the first column we will store the file path relative to the root of the dataset, and in the second column we will store the download URL of that file.

Usually, the creation of this CSV file should be automated using software. For OSF, we have the datalad-osf package which can do the job. However, that package is still in development so I wrote my own function, which involved picking out many download URLs and file names by hand :-(

On OSF, the URLs are given by `https://osf.io/<key>/download` where `<key>` depends on the file.

See two example rows of my CSV (note the headers, which are important later on):

fpath,url
sub-05/eeg/sub-05_task-matchingpennies_channels.tsv,https://osf.io/wdb42/download
sourcedata/sub-05/eeg/sub-05_task-matchingpennies_eeg.xdf,https://osf.io/agj2q/download
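Once the file paths and their OSF keys are known, writing the CSV itself is easy to automate. The following is a minimal sketch and not the function I used: the name `write_addurls_csv` and the idea of passing a dictionary from file path to OSF key are assumptions for illustration. The two keys are the ones from the example rows, and the header names `fpath` and `url` are the ones that matter later on.

```python
import csv


def write_addurls_csv(osf_keys, csv_path):
    """Write a two-column CSV (fpath, url) from a {fpath: OSF key} mapping."""
    with open(csv_path, "w", newline="") as fout:
        writer = csv.writer(fout)
        # Header names without surrounding spaces, so that they can be
        # referred to by name later on
        writer.writerow(["fpath", "url"])
        for fpath, key in sorted(osf_keys.items()):
            # OSF download URLs follow https://osf.io/<key>/download
            writer.writerow([fpath, f"https://osf.io/{key}/download"])


osf_keys = {
    "sub-05/eeg/sub-05_task-matchingpennies_channels.tsv": "wdb42",
    "sourcedata/sub-05/eeg/sub-05_task-matchingpennies_eeg.xdf": "agj2q",
}
write_addurls_csv(osf_keys, "mp.csv")
```

The hard part remains collecting the keys: this sketch does not query OSF, it only formats what you already looked up.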

Once your CSV file is ready, and git-annex and datalad are installed, it is time to switch to the command line.

# create the git-annex repository
datalad create eeg_matchingpennies

# download the files in the CSV and commit them
datalad addurls mp.csv "{url}" "{fpath}" -d eeg_matchingpennies/  

# print our files and the references where to find them
# will show a local address (the downloaded files) and a web address (OSF)
git annex whereis  

# Make a clone of your fresh repository
datalad install -s eeg_matchingpennies clone

# go to the clone
cd clone  

# disconnect the clone from the local data sources
git annex dead origin  

# disconnect the clone from its origin
git remote rm origin  

# print our files again, however: Notice how all references to
# the local files are gone. Only the web references persist
git annex whereis  

Now make a new empty repository on GitHub: https://github.com/sappelhoff/eeg_matchingpennies

# add a new origin to the clone
git remote add origin https://github.com/sappelhoff/eeg_matchingpennies

# upload the git-annex repository to GitHub
datalad publish --to origin  

Now your dataset is ready to go! Try it out as described below:
 

# clone the repository into your current folder
datalad install https://github.com/sappelhoff/eeg_matchingpennies

# go to your repository
cd eeg_matchingpennies

# get the data for sub-05 (not just the reference to it)
datalad get sub-05

# get only a single file
datalad get sub-05/eeg/sub-05_task-matchingpennies_eeg.vhdr  

# get all the data
datalad get .  

Acknowledgments and further reading

I am very thankful to Kyle A. Meyer and Yaroslav Halchenko for their support in this GitHub issue thread. If you are running into issues with my recipe, I recommend that you fully read that GitHub issue thread.

 


Ninth week of GSoC

sappelhoff
Published: 07/29/2019

1. What did you do this week?

I have compiled a list for week 9 in my changelog here: https://github.com/sappelhoff/gsoc2019/blob/master/changelog.md#week-9

2. What is coming up next?

Next, I will mostly work on the mne-study-template. With Mainak, I discussed that the next step would be to implement a CI test suite.

3. Did you get stuck anywhere?

In the MNE-Python codebase, there was a "magic" factor of 85 multiplied with a variable, and it was not documented where it came from. It took me a while to figure out (and verify!) that this is the average head radius in millimeters (assuming an unrealistically spherical head). The documentation is much better now, and the episode taught me once more that one has to either

  • write clean code
    • e.g., instead of having the factor 85 there, make it a variable with the name `realistic_head_radius_mm` (or something like it)
  • or write good documentation
    • e.g., make short but instructive comments, or more exhaustive documentation in the function or module docstrings

Probably a combination of both is best.
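To make the point concrete, here is a small sketch that combines both approaches. It is not the actual MNE-Python code: the constant name `HEAD_RADIUS_MM`, the function `unit_sphere_to_mm`, and the idea that the factor scales positions on a unit sphere are assumptions for illustration.

```python
# Average head radius in millimeters, assuming a spherical head
HEAD_RADIUS_MM = 85.0


def unit_sphere_to_mm(positions, head_radius_mm=HEAD_RADIUS_MM):
    """Scale (x, y, z) positions on a unit sphere to millimeters.

    The radius is a named, documented parameter instead of a bare 85,
    so a reader can see what the factor means and override it.
    """
    return [
        tuple(coord * head_radius_mm for coord in pos)
        for pos in positions
    ]


print(unit_sphere_to_mm([(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]))
```

The named constant answers "what is 85?" at the place where it is used, and the docstring answers "why is it there?".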


Eighth week of GSoC: Mixed tasks and progress

sappelhoff
Published: 07/21/2019

Two thirds of the GSoC program are already over - time is passing very quickly. This past week, we made some progress with the mne-study-template and making it usable with BIDS-formatted data.

Alex has improved the flow substantially, with Mainak serving as the "Continuous Integration service", regularly running different datasets through the pipeline and reporting where they get stuck. My own tasks were very diverse this week:

MNE-BIDS maintenance

I fixed several bugs with MNE-BIDS that we found while working on the study template. For example:

Reviewing and user support

Furthermore, I was very happy to see many issues raised on MNE-BIDS. The issues showed that more and more people are picking up MNE-BIDS and using it in their data analysis pipelines. However, that also meant that in the last week, I did more user support and reviewing of pull requests than usual.

For example, a nice pull request that I reviewed was done by Marijn (who is also an MNE-Python contributor). He improved MNE-BIDS' find_matching_sidecar function by introducing a "race for the best candidate" of the matching sidecar file.
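To illustrate the idea of such a race, here is a rough sketch. It is not Marijn's implementation and not the MNE-BIDS API: the function names `parse_entities` and `best_sidecar` and the scoring rule (one point per shared key-value entity, disqualification on a conflict) are my own assumptions. It works on bare file names without directories.

```python
def parse_entities(fname):
    """Return the key-value entities of a BIDS file name as a dict."""
    stem = fname.split(".")[0]
    parts = stem.split("_")
    # "sub-05" becomes ("sub", "05"); the suffix (e.g., "eeg") has no "-"
    return dict(part.split("-", 1) for part in parts if "-" in part)


def best_sidecar(data_fname, candidates):
    """Pick the candidate that shares the most entities with data_fname."""
    wanted = parse_entities(data_fname)
    scores = {}
    for cand in candidates:
        entities = parse_entities(cand)
        # A candidate with a conflicting entity cannot describe this file
        if any(wanted.get(key) != val for key, val in entities.items()):
            continue
        scores[cand] = len(entities)
    if not scores:
        raise RuntimeError("No matching sidecar found.")
    best = max(scores.values())
    winners = [cand for cand, score in scores.items() if score == best]
    if len(winners) > 1:
        raise RuntimeError(f"Ambiguous sidecar candidates: {winners}")
    return winners[0]


candidates = [
    "task-matchingpennies_channels.tsv",
    "sub-05_task-matchingpennies_channels.tsv",
    "sub-06_task-matchingpennies_channels.tsv",
]
winner = best_sidecar("sub-05_task-matchingpennies_eeg.vhdr", candidates)
```

In this example the file for sub-05 wins over the more general dataset-level file, while the file for sub-06 never enters the race.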

Work on mne-study-template

Finally, I also worked on the mne-study-template myself - however, my contributions were rather modest. I mostly cleaned up the configuration files, formatted testing data, and made workflows work where they got stuck.

See for example here.



Next week, I want to work more on the mne-study-template.
