DataLad can create datasets from any data files published on the web. A one-time import is rarely enough, however, so DataLad can also be automated to monitor such data sources and incorporate any changes made to them over time, which makes it easy to publish and maintain entire distributions of datasets.

Using this automated process, the DataLad team maintains data trackers for a number of popular public data portals. These datasets, some automatically generated and others manually created and curated, are collated into a DataLad super-dataset that is published in its entirety. This super-dataset establishes the official DataLad data distribution, which is available via the DataLad resource identifier ///. Some of these datasets (e.g. ///crcns) require authentication credentials, but apart from supplying those credentials, access to all resources is uniform regardless of the data's origin. DataLad also aggregates all relevant metadata for these datasets, so they can be discovered using DataLad's search.
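As a rough illustration of how the resource identifiers above are used, the following minimal sketch assembles command lines for the datalad clone and datalad search commands. The helper names (clone_command, search_command, run) and the target directory name are hypothetical and chosen only for this example. The sketch assumes the datalad command-line tool is installed and that search is available in the installed version.

```python
import shutil
import subprocess


def clone_command(identifier, path=None):
    """Build an argv list for `datalad clone SOURCE [PATH]`.

    `identifier` is a DataLad resource identifier such as '///' (the whole
    distribution) or '///crcns' (one dataset within it).
    """
    if not identifier.startswith("///"):
        raise ValueError("expected a resource identifier starting with '///'")
    cmd = ["datalad", "clone", identifier]
    if path is not None:
        cmd.append(path)  # optional target directory (hypothetical name)
    return cmd


def search_command(*terms):
    """Build an argv list for `datalad search` with the given query terms."""
    if not terms:
        raise ValueError("at least one search term is required")
    return ["datalad", "search", *terms]


def run(cmd, cwd=None):
    """Run a command if datalad is on PATH; otherwise only print it."""
    if shutil.which(cmd[0]) is None:
        print("datalad not found on PATH; would run:", " ".join(cmd))
        return None
    return subprocess.run(cmd, cwd=cwd, check=True)
```

Calling run(clone_command("///", "datalad-distribution")) would clone the super-dataset into a local directory, and run(search_command("crcns"), cwd="datalad-distribution") would then query the aggregated metadata from inside it. Cloning a dataset that requires authentication, such as ///crcns, still depends on supplying the credentials mentioned above.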

At present, DataLad's super-dataset offers uniform access to over 10 TB of scientific data. This includes the following datasets, listed by their DataLad resource identifiers for use with the datalad clone command:

The OpenNeuro portal publishes hosted data as DataLad datasets on GitHub. The entire collection can be found at:

More datasets are provided in a collection on GitHub, such as the Human Connectome Project's open access dataset, the world's highest resolution brain scan, or podcast collections.