As described in more detail on the About page, this community contains a collection of “AI Ready Datasets”. Every dataset must include a Croissant dataset description metadata JSON file, which describes the data files and their fields in more detail, records their provenance, and includes additional fields of specific relevance to the dataset’s use in AI, along with a description of annotation processes. These files can (but don’t have to) indicate splits into training, test, and validation sets (either by column values or by separate files).
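To make the role of the Croissant file more concrete, the following is a minimal sketch of how the files and fields it describes could be listed using only the Python standard library. The embedded document is a hypothetical, heavily trimmed example rather than a complete or validated Croissant file, and the sketch assumes the usual `distribution` / `recordSet` / `field` layout with `cr:FileObject` entries; real files may use different prefixes or carry many more properties.

```python
import json

# Hypothetical, heavily trimmed Croissant-style document, for illustration only.
EXAMPLE = json.loads("""
{
  "@type": "sc:Dataset",
  "name": "example-dataset",
  "distribution": [
    {"@type": "cr:FileObject", "@id": "train.csv", "name": "train.csv",
     "encodingFormat": "text/csv"},
    {"@type": "cr:FileObject", "@id": "test.csv", "name": "test.csv",
     "encodingFormat": "text/csv"}
  ],
  "recordSet": [
    {"@type": "cr:RecordSet", "name": "images",
     "field": [
       {"@type": "cr:Field", "name": "image_id", "dataType": "sc:Text"},
       {"@type": "cr:Field", "name": "split", "dataType": "sc:Text"}
     ]}
  ]
}
""")


def as_list(value):
    """Return value as a list; JSON-LD allows a single object or a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def summarise(croissant):
    """Collect (name, format) per FileObject and (record set, field, type) per field."""
    files = [
        (d.get("name", d.get("@id", "")), d.get("encodingFormat", ""))
        for d in as_list(croissant.get("distribution"))
        # Assumption: file entries are typed "cr:FileObject"; other entries are skipped.
        if d.get("@type") == "cr:FileObject"
    ]
    fields = [
        (rs.get("name", rs.get("@id", "")),
         f.get("name", f.get("@id", "")),
         f.get("dataType", ""))
        for rs in as_list(croissant.get("recordSet"))
        for f in as_list(rs.get("field"))
    ]
    return files, fields


files, fields = summarise(EXAMPLE)
for name, fmt in files:
    print(f"file   {name}  ({fmt})")
for record_set, field, data_type in fields:
    print(f"field  {record_set}.{field}  [{data_type}]")
```

In this illustrative example the split is expressed both ways mentioned above: as separate files (`train.csv`, `test.csv`) and as a `split` column. A real dataset would normally use only one of the two.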
Deposit:
- Depositors are expected to comply with the PSDI Community Data Collections Policy.
- Depositors are required to accept and comply with the “AI Ready Datasets” deposit conditions.
- Acceptance into the community is determined by the “AI Ready Datasets” Community Administrators.
- Descriptive metadata that meets accepted standards for discovery and description must be assigned to each dataset.
  - Specifically, all records in this collection should contain a Croissant dataset description metadata JSON file.
  - Domain-specific metadata fields from the Croissant RAI Specification showcase the responsible AI (RAI) aspects of Croissant and duplicate the corresponding fields in the Croissant file.
  - Similarly, duplicating the Croissant descriptions of FileObjects and Fields in the dataset as formatted tables in the Description section allows data consumers to get an overview of the data format.
  - It is also recommended that all records in this collection contain a Research Object Crate (RO-Crate) JSON file and a corresponding human-readable README text file (so that the downloaded zip file can act as a packaged lightweight research data object).
  - An example dataset which can be used for reference is Testing, Training and Validation Synthetic Dataset of Transmission Electron Microscopy (TEM) Images of Gold Nano-particles for Segmentation.
- A Creative Commons Attribution 4.0 International licence is the default licence for deposits; however, depositors are expected to consider this carefully and must always ensure that an appropriate licence is selected.
- There is an option to set an embargo period for datasets.
- There are no charges to individual researchers for deposit or storage of datasets.
- Before publication, datasets and metadata will be reviewed for accuracy by appointed reviewers.
- Datasets should be no larger than 100 GB.
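
As an informal aid, some of the conditions above could be checked locally before a deposit is submitted. The following is a minimal sketch under stated assumptions, not an official validation tool: the file names `croissant.json` and `README.txt` are hypothetical, `ro-crate-metadata.json` follows the RO-Crate naming convention, and the 100 GB limit is read as decimal gigabytes. It only checks that the Croissant file exists and parses as JSON; it does not validate it against the Croissant specification.

```python
import json
import tempfile
from pathlib import Path

MAX_BYTES = 100 * 10**9                   # assumption: 100 GB as decimal gigabytes
CROISSANT_NAME = "croissant.json"         # hypothetical file name
RO_CRATE_NAME = "ro-crate-metadata.json"  # RO-Crate's conventional metadata file name
README_NAME = "README.txt"                # hypothetical file name


def check_deposit(folder):
    """Return (errors, warnings) for a local dataset folder."""
    folder = Path(folder)
    errors, warnings = [], []

    # The Croissant file is expected in every record: treat its absence as an error.
    croissant = folder / CROISSANT_NAME
    if not croissant.is_file():
        errors.append(f"missing {CROISSANT_NAME}")
    else:
        try:
            json.loads(croissant.read_text(encoding="utf-8"))
        except ValueError as exc:  # covers JSON and text-decoding errors
            errors.append(f"{CROISSANT_NAME} is not valid JSON: {exc}")

    # RO-Crate and README files are only recommended: report them as warnings.
    for name in (RO_CRATE_NAME, README_NAME):
        if not (folder / name).is_file():
            warnings.append(f"recommended file {name} not found")

    # Sum the sizes of all regular files below the folder.
    total = sum(p.stat().st_size for p in folder.rglob("*") if p.is_file())
    if total > MAX_BYTES:
        errors.append(f"total size {total} bytes exceeds {MAX_BYTES} bytes")

    return errors, warnings


# Usage example on a throwaway folder containing only a tiny Croissant stub.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / CROISSANT_NAME).write_text('{"@type": "sc:Dataset"}', encoding="utf-8")
    errors, warnings = check_deposit(tmp)

print("errors:", errors)
print("warnings:", warnings)
```

Reporting the recommended files as warnings rather than errors mirrors the wording of the policy, which requires the Croissant file but only recommends the RO-Crate and README files.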