distributed data management

free and open source

Join us at distribits 2024!

Technologies for distributed data management

The first distribits meeting will happen on April 4th to April 6th, 2024, at “Haus der Universität”, Düsseldorf, Germany, with the aim of bringing together enthusiasts of tools and workflows in the domain of distributed data. Join us!

What is DataLad?

DataLad is a free and open source distributed data management system that keeps track of your data, creates structure, ensures reproducibility, supports collaboration, and integrates with widely used data infrastructure.

Install DataLad

Install DataLad and its dependencies, Git and git-annex, on all major operating systems using Python and the datalad-installer:

$ pip install datalad-installer
$ datalad-installer git-annex -m datalad/packages
$ pip install datalad

Depending on your operating system, other installation options are also possible. For detailed instructions on all installation procedures and further configuration, please visit the DataLad Handbook.

DataLad is part of the Debian and Ubuntu operating systems and is available on CentOS, Red Hat, Fedora, and similar systems. On Linux, DataLad can be installed or upgraded via conda or apt:

Using conda:

$ conda install -c conda-forge datalad

Using apt:

$ sudo apt-get install datalad

Find out more about Linux installation in the DataLad Handbook.

On macOS, DataLad is available via the Homebrew package manager or, alternatively, via conda:

Using conda:

$ conda install -c conda-forge datalad

Using homebrew:

$ brew install datalad

Find out more about macOS installation in the DataLad Handbook.

On a Windows machine with Python, the best route for installing DataLad is to install its dependencies with the datalad-installer and then follow up with pip:

$ pip install datalad-installer
$ datalad-installer git-annex -m datalad/packages
$ pip install datalad

Find out more about Windows installation in the DataLad Handbook.

Keep Track

Building on top of Git and git-annex, DataLad allows you to version control arbitrarily large files in datasets, without the need for custom data structures, central infrastructure, or third party services.

  •   Track changes to your data
  •   Revert to previous versions
  •   Capture full provenance records
  •   Ensure complete reproducibility
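
As a rough illustration of the provenance and reproducibility points above, the following sketch records how a file was produced and then re-executes that record. The function name provenance_demo, the file greeting.txt, and the commit message are hypothetical. The sketch assumes that DataLad is installed, that a Git identity is configured, and that the function is given an existing, empty directory.

```shell
# Sketch only: names are hypothetical; expects an existing, empty directory.
provenance_demo() (
    set -e
    cd "$1"
    # Turn the current (empty) directory into a DataLad dataset.
    datalad create
    # Run a command and save its output together with a record of the command.
    datalad run -m "Generate greeting" --output greeting.txt \
        "echo hello > greeting.txt"
    # Re-execute the most recent run record to check that it can be reproduced.
    datalad rerun
    # A dataset is a Git repository, so its history can be inspected with Git.
    git log --oneline
)
```

Calling provenance_demo "$(mktemp -d)" would run these steps in a fresh scratch directory.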

Create Structure

A DataLad dataset is a directory with files, managed by DataLad. You can link other datasets, known as subdatasets, and perform commands recursively across an arbitrarily deep hierarchy of datasets. This helps you to create structure while maintaining advanced provenance capture abilities, versioning, and actionable file retrieval.
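
The nesting described above might look roughly like the following sketch. The function name nested_demo, the subdataset path inputs/raw, and the file notes.txt are hypothetical. It assumes that DataLad is installed, that a Git identity is configured, and that the function is given an existing, empty directory.

```shell
# Sketch only: names are hypothetical; expects an existing, empty directory.
nested_demo() (
    set -e
    cd "$1"
    # Create the top-level dataset (superdataset) in the current directory.
    datalad create
    # -d . registers the new dataset as a subdataset of the current one.
    datalad create -d . inputs/raw
    # List the subdatasets registered in the superdataset.
    datalad subdatasets
    # Add a file, then save; -r also descends into subdatasets with changes.
    echo "project notes" > notes.txt
    datalad save -r -m "Add project notes"
    # Report the state of the whole hierarchy.
    datalad status -r
)
```

The same -r flag is accepted by other commands such as datalad get, so content can be retrieved across the whole hierarchy in one call.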

Use DataLad

DataLad is a free and open source Python-based tool that is compatible with all major operating systems. It can be used via its Graphical User Interface or via the command line to:

  •   create new datasets locally
  •   clone other datasets
  •   get content on-demand
  •   save changes to datasets
  •   drop content as needed
  •   push changes to a remote location

... and much more!

  Try out DataLad

$ datalad create my_dataset
$ datalad save -m "hello world"
$ datalad push --to location
$ datalad clone location
$ datalad get example.txt
$ datalad drop example.txt

Collaborate

DataLad lets you consume datasets provided by others, and collaborate with them. You can install existing datasets and update them from their sources, or create sibling datasets that you can publish updates to and pull updates from. The collaborative power of Git, for your data.
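
A minimal sketch of that collaboration cycle follows. The function name collaboration_demo and its two arguments are hypothetical. It assumes that DataLad is installed, that the first argument is the URL or path of a dataset you can read, and, for the final step, that you have write access to a source that accepts pushes.

```shell
# Sketch only: the function name and its arguments are hypothetical.
collaboration_demo() (
    set -e
    source_location="$1"   # URL or path of an existing dataset
    clone_dir="$2"         # where the local clone should be placed
    # Obtain a local copy; the source is registered as the sibling "origin".
    datalad clone "$source_location" "$clone_dir"
    # Show the siblings known to the clone.
    datalad siblings -d "$clone_dir"
    # Fetch updates from the source and merge them into the clone.
    datalad update -d "$clone_dir" -s origin --how merge
    # Publish local changes back (requires write access to the sibling).
    datalad push -d "$clone_dir" --to origin
)
```

File content is not downloaded by the clone step itself; it can be retrieved afterwards with datalad get.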

DataLad in the Wild

DataLad is integrated with a variety of hosting services and data management platforms, and extended and used by a diverse community. Export datasets to third party services such as GitHub or Figshare with built-in commands. Extend DataLad to be compatible with your preferred data supplier or workflow. Or use a multitude of other DataLad-compatible services such as Dropbox or Amazon S3. Search through all integrations, extensions, and use cases to find the right fit for your data!

  Browse use cases

Learn More

DataLad is not solely a data management system, but also an open source community of users, developers, and researchers all contributing to its growth. To support this community, DataLad maintains several important resources:

Install DataLad

Install DataLad and its dependencies on Linux, macOS, or Windows

DataLad Handbook

Become an expert DataLad user with this rich educational resource

DataLad on GitHub

Contribute via GitHub by creating issues or sending a pull request

Developer Docs

Dive into the DataLad API with the developer documentation

DataLad Tutorials

Hands-on tutorials and videos to help you on your DataLad journey

DataLad Course

A course on Research Data Management with DataLad

Get Support

For tougher challenges during your data management journey, there are a number of ways that you can get in touch with the DataLad community, its experts, and core developers. Head over to Matrix to chat, join us in a weekly Office Hour call, or create an issue on GitHub!

Community Chat

Join the community on Matrix, say hi, and ask questions

Office Hour

Get real-time help from DataLad experts to solve your challenges

File an Issue

File an issue to let the developers know about a bug or a feature request

Supporting DataLad

DataLad development is funded as a US-German project on collaborative research, with primary funding from the US National Science Foundation (NSF 1912266, NSF 1429999) and the German Federal Ministry of Education and Research (BMBF 01GQ1905, BMBF 01GQ1411). Additional support has been provided by the US National Institute of Biomedical Imaging and Bioengineering (NIH 1P41EB019936-01A1) via ReproNim, the European Union’s Horizon 2020 research and innovation programme under grant agreements 945539 and 826421, the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation, SFB1451-INF), the German federal state of Saxony-Anhalt, and the European Regional Development Fund.

Citing DataLad

Please cite the following article when referring to DataLad in publications:

Yaroslav O. Halchenko, Kyle Meyer, Benjamin Poldrack, Debanjum Singh Solanky, Adina S. Wagner, Jason Gors, Dave MacFarlane, Dorian Pustina, Vanessa Sochat, Satrajit S. Ghosh, Christian Mönch, Christopher J. Markiewicz, Laura Waite, Ilya Shlyakhter, Alejandro de la Vega, Soichi Hayashi, Christian Olaf Häusler, Jean-Baptiste Poline, Tobias Kadelka, Kusti Skytén, Dorota Jarecka, David Kennedy, Ted Strauss, Matt Cieslak, Peter Vavra, Horea-Ioan Ioanas, Robin Schneider, Mika Pflüger, James V. Haxby, Simon B. Eickhoff, and Michael Hanke (2021). DataLad: distributed system for joint management of code, data, and their relationship. Journal of Open Source Software, 6(63), 3262, 10.21105/joss.03262.
