Eleventh week of GSoC: Some more Datalad (complete and automatic flow now)

Published: 08/11/2019

1. What did you do this week?

I have compiled a list for week 11 in my changelog here: https://github.com/sappelhoff/gsoc2019/blob/master/changelog.md#week-11

2. What is coming up next?

Next, I will continue to improve the mne-study-template and also work on a new release of MNE-BIDS.

3. Did you get stuck anywhere?

As in the week before, I got stuck a bit with Datalad. However, I finally fixed all problems, and I want to report the flow of my pipeline below. Enjoy!

Pipeline to get any dataset as git-annex dataset

The pipeline uses osfclient, datalad-osf, datalad, and git-annex, in the following steps:

  1. Step 1: Upload data to OSF
    1. install osfclient: `pip install osfclient` (see https://github.com/osfclient/osfclient)
    2. make a new OSF repository from the website (you need to be registered)
    3. copy the "key" from the new OSF repository, e.g., "3qmer" for the URL: "https://osf.io/3qmer/"
    4. navigate to the directory that contains the directory you want to upload to OSF
    5. create a `.osfcli.config` file by running `osf init`; the file is written to the current working directory
    6. call `osf upload -r MY_DATA/ .` to upload your data, replacing MY_DATA with your upload directory name
    7. to avoid being prompted for your password, define an environment variable `OSF_PASSWORD` containing your password. You can then run the upload as a background process without waiting and without keeping your command line prompt open: `nohup osf upload -r MY_DATA/ . &`
    8. NOTE: Recursive uploading using osfclient can be a bad experience. Check out this wrapper script for more control over the process: https://github.com/sappelhoff/gsoc2019/blob/master/misc_code/osfclient_wrapper.py
  2. Step 2: Make a git-annex dataset out of the OSF data
    1. install datalad-osf: clone the repository with git and run `pip install -e .` inside it. NOTE: You will need the patch submitted here: https://github.com/templateflow/datalad-osf/pull/2
    2. install datalad (`pip install datalad`) and git-annex (e.g., via conda-forge)
    3. create your data repository: `datalad create MY_DATA`
    4. go there and download your OSF data using datalad-osf: `cd MY_DATA` ... then `python -c "import datalad_osf; datalad_osf.update_recursive(key='MY_KEY')"`, where MY_KEY is the "key" from step 1 above.
  3. Step 3: Publish the git-annex dataset on GitHub
    1. Make a fresh (empty) repository on GitHub: <repo_url>
    2. Clone your datalad repo: `datalad install -s <local_repo> clone`
    3. `cd clone`
    4. `git annex dead origin`
    5. `git remote rm origin`
    6. `git remote add origin <repo_url>`
    7. `datalad publish --to origin`
  4. Step 4: Get parts of your data (or everything) from the git-annex repository
    1. `datalad install <repo_url>`
    2. `cd <repo>`
    3. `datalad get <some_folder_or_file_path>` to fetch only parts of the data
    4. `datalad get .` to fetch everything
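
Steps 1 to 3 above could be collected into a small script, roughly as sketched below. This is only a sketch under a few assumptions: the function names `build_pipeline` and `run_pipeline` are made up for illustration, the GitHub URL is a placeholder, `osf init` has already been run, `OSF_PASSWORD` is set, and the clone in step 3 is created next to the data directory. By default it only prints the commands instead of running them.

```python
import shlex
import subprocess


def build_pipeline(data_dir, osf_key, repo_url, clone_dir="clone"):
    """Return steps 1 to 3 as a list of (working_directory, argv) pairs.

    All names here are hypothetical helpers for illustration; the commands
    themselves are the ones listed in the steps above.
    """
    update = f"import datalad_osf; datalad_osf.update_recursive(key={osf_key!r})"
    return [
        # Step 1: upload the data directory to OSF
        (".", ["osf", "upload", "-r", f"{data_dir}/", "."]),
        # Step 2: create the dataset and fill it from OSF via datalad-osf
        (".", ["datalad", "create", data_dir]),
        (data_dir, ["python", "-c", update]),
        # Step 3: clone, detach from the local origin, publish to GitHub
        (".", ["datalad", "install", "-s", data_dir, clone_dir]),
        (clone_dir, ["git", "annex", "dead", "origin"]),
        (clone_dir, ["git", "remote", "rm", "origin"]),
        (clone_dir, ["git", "remote", "add", "origin", repo_url]),
        (clone_dir, ["datalad", "publish", "--to", "origin"]),
    ]


def run_pipeline(steps, dry_run=True):
    """Print each command; execute it only if dry_run is False."""
    for cwd, argv in steps:
        print(f"[{cwd}] {shlex.join(argv)}")
        if not dry_run:
            # check=True stops the pipeline at the first failing command
            subprocess.run(argv, cwd=cwd, check=True)


if __name__ == "__main__":
    # Placeholder values: replace with your own directory, OSF key, and repo URL
    steps = build_pipeline("MY_DATA", "3qmer", "https://github.com/USER/REPO.git")
    run_pipeline(steps, dry_run=True)
```

Keeping the dry run as the default makes it easy to check the commands and working directories before anything is uploaded or published. Note that `shlex.join` requires Python 3.8 or newer.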

Important sources / references