This community contains a collection of “AI Ready Datasets”. Each record contains a dataset which can be useful for AI tools and systems.

This collection is part of the “AI for Science” PSDI sub-project (“Provision of ‘AI ready’ data: prototyping data pipelines and repositories”, grant application APP84520, award UKRI2697, opportunity OPP1033:EPSRC AI for Science) and as such has been seeded with examples of “AI ready” data that is useful as training data for AI tools and systems.  

What makes a dataset "AI ready"?

In PSDI (Physical Sciences Data Infrastructure) there are many datasets which have been curated, well-formatted and described (in the PSDI Resource Catalogue and its underlying PSDI Metadata) with the intention that they can be reused for many applications, including AI (Artificial Intelligence) and ML (machine learning).

This "AI Ready Datasets" Community Data Collection provides a selection of datasets which have been prepared more specifically for ML tasks using community standards and, where possible, trialled and revised so that they load more easily for those tasks. In particular, developers performing machine learning tasks found that a croissant metadata description helped them to understand each dataset better and load it faster (croissant metadata helps with loading ML datasets into different ML frameworks). As a result, all dataset records in this "AI Ready Datasets" collection have been augmented with an accompanying croissant dataset description metadata JSON file.

Typical record format

A record in the AI ready datasets typically consists of:

  • a croissant metadata JSON file (to facilitate loading of the dataset into machine learning models)
  • data file(s) (unless these are hosted remotely and referenced in the croissant file)
  • visible metadata for the record:
    • domain-specific metadata fields corresponding to Responsible AI fields
    • a Description which includes top-level descriptions of the file objects and the fields that these contain
  • a Research Object Crate (RO-Crate) JSON file and corresponding human readable README text file (so that the downloaded zip file can act as a packaged lightweight research data object) 

Please note that this visible metadata is duplicated from the croissant file, but has been surfaced into the record to showcase the information which is in the croissant files in a human readable way. 
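As an illustration of how the same information can be read back out of a croissant file, the following is a minimal sketch using only the Python standard library. It assumes the file uses the usual Croissant JSON-LD keys (name, description, distribution, recordSet, field, dataType); the function names are hypothetical and are not part of any PSDI or mlcroissant tooling.

```python
import json
from pathlib import Path


def _as_list(value):
    """JSON-LD allows a single object where a list is expected; normalise to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def load_croissant(path):
    """Read a croissant JSON-LD file into a plain dictionary."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def summarise_croissant(metadata: dict) -> str:
    """Build a short human-readable summary of file objects, record sets and fields."""
    lines = [f"Dataset: {metadata.get('name', '(unnamed)')}"]
    if metadata.get("description"):
        lines.append(f"Description: {metadata['description']}")
    # "distribution" lists the file objects (or file sets) of the dataset.
    for resource in _as_list(metadata.get("distribution")):
        lines.append(f"File object: {resource.get('name', '(unnamed)')}")
    # "recordSet" lists the record sets, each with its own fields.
    for record_set in _as_list(metadata.get("recordSet")):
        lines.append(f"Record set: {record_set.get('name', '(unnamed)')}")
        for field in _as_list(record_set.get("field")):
            data_type = field.get("dataType", "unspecified")
            lines.append(f"  field {field.get('name', '(unnamed)')} [{data_type}]")
    return "\n".join(lines)
```

For a downloaded record, print(summarise_croissant(load_croissant(path))) would print the summary, where path points at the croissant JSON file of that record.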

Scope of "AI ready" example datasets

While there are some mlcommons croissant dataset examples and a hosted Croissant Editor, and many platforms and websites support croissant metadata file descriptions, many of these descriptions are only minimally populated. This initial project therefore explored and demonstrated the capabilities and range of croissant.

The datasets showcase a diverse range of capabilities of the croissant metadata description.

How to use "AI Ready Datasets" croissant metadata files

The whole dataset for a record should be downloaded using the "download all" link on the record page. This downloads a single zip file which also contains a Research Object Crate (RO-Crate) JSON file and a corresponding human-readable README text file, so that the downloaded zip file can act as a packaged lightweight research data object.

The files can be extracted from this zip file, and the croissant file it contains can be used to load the data into machine learning tools.
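The extraction step can be sketched with the Python standard library as follows. The name of the croissant file inside the zip is not stated here, so this sketch locates it by content instead: it assumes the croissant file is a JSON document with a top-level conformsTo value mentioning mlcommons.org/croissant, as the Croissant 1.0 format declares. The function names and that heuristic are illustrative assumptions, not part of the collection's tooling.

```python
import json
import zipfile
from pathlib import Path

# Assumption: a croissant file declares conformance to an mlcommons.org/croissant URL.
CROISSANT_MARKER = "mlcommons.org/croissant"


def extract_record(zip_path, out_dir):
    """Extract the downloaded record zip into out_dir and return that directory."""
    out = Path(out_dir)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out)
    return out


def find_croissant_files(root):
    """Return the JSON files under root that look like croissant descriptions."""
    matches = []
    for path in sorted(Path(root).rglob("*.json")):
        try:
            doc = json.loads(path.read_text(encoding="utf-8"))
        except (UnicodeDecodeError, json.JSONDecodeError):
            continue  # not parseable as JSON text, so not a croissant file
        if not isinstance(doc, dict):
            continue
        # The RO-Crate JSON file keeps its entities under "@graph",
        # so it has no top-level "conformsTo" and is skipped here.
        if CROISSANT_MARKER in str(doc.get("conformsTo", "")):
            matches.append(path)
    return matches
```

If the heuristic finds nothing for a given record, the README text file in the zip is the place to check which file is the croissant description.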

mlcroissant is a Python library, installable via pip, which loads a croissant file and its corresponding dataset as described in croissant/README.md. In particular, it provides an easy way to load records from the recordSets defined in the croissant file, which can be linked together.
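A minimal sketch of this loading step is given below. It assumes mlcroissant has been installed with pip install mlcroissant and uses its Dataset(jsonld=...) constructor and records(record_set=...) iterator. The helper names are hypothetical, and the choice of the record set "@id" (falling back to "name") as the identifier to pass is an assumption that may need adjusting for a particular file.

```python
import itertools
import json
from pathlib import Path


def record_set_ids(croissant_path):
    """List the record set identifiers declared in a croissant file."""
    doc = json.loads(Path(croissant_path).read_text(encoding="utf-8"))
    record_sets = doc.get("recordSet", [])
    if isinstance(record_sets, dict):  # a single record set given without a list
        record_sets = [record_sets]
    # Assumption: prefer "@id" and fall back to "name" when no "@id" is present.
    return [rs.get("@id") or rs.get("name") for rs in record_sets]


def first_records(croissant_path, record_set, limit=5):
    """Load at most `limit` records from one record set using mlcroissant."""
    # Imported here so the helper above still works without mlcroissant installed.
    import mlcroissant as mlc

    dataset = mlc.Dataset(jsonld=str(croissant_path))
    records = dataset.records(record_set=record_set)
    return list(itertools.islice(records, limit))
```

A typical use would be to call record_set_ids on the extracted croissant file, pick one of the returned identifiers, and pass it to first_records to inspect a few records before loading the full record set.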

The croissant files for all records in the "AI Ready Datasets" collection have been loaded using mlcroissant by way of validation. Some have been used in machine learning tasks using PyTorch.

Contact

Any feedback, suggestions or contributions can be sent to support@psdi.ac.uk. Feedback regarding the use of these datasets by AI tools and systems would be particularly welcome.