Published April 27, 2026 | Version v2
Dataset Open

Project M Dataset Template for a dataset tailored to a specific ML Task

Description

This dataset is a demonstrator that acts as a template for datasets tailored to a specific Machine Learning Task based on the Project M data. Similar datasets, each tailored to a specific Machine Learning Task based on this data, will follow; this dataset illustrates what their format will be. Please note that in this template, the Annotations and Exclusions indicated in the task file MLTaskTemplate_additionalColumns.csv have no meaning and just illustrate possible values.

FileObjects:

contentUrl | description
./project_m_datafile.csv | Main Project M datafile exported from PSDI Community Data Collections, which contains a summary of the metadata for each experiment.
./MLTaskTemplate_additionalColumns.csv | Additional columns of data which specify annotations, exclusions and splits specific to this task.

Fields:

FileObject | Name | Extract | dataType | description
./project_m_datafile.csv, ./MLTaskTemplate_additionalColumns.csv | Filename | Filename | Text | Filename of the raw data .xye file.
./project_m_datafile.csv | FileURL | FileURL | Text | URL of the raw data .xye file.
./project_m_datafile.csv | Additive - label | Additive - label | Text | Short form of additive name.
./project_m_datafile.csv | Additive - concentration | Additive - concentration | Float | Ratio of additive to Ca ion concentration used.
./project_m_datafile.csv | Additive - ChEBI url | Additive - ChEBI url | Text | URL to ChEBI entry which matches the additive molecule.
./project_m_datafile.csv | Additive - ChEBI molecule class URLs | Additive - ChEBI molecule class URLs | Text | URLs of ChEBI classes which the additive molecule belongs to (via an 'is a' relationship).
./project_m_datafile.csv | Additive - ChEBI molecule class names | Additive - ChEBI molecule class names | Text | Names of ChEBI classes which the additive molecule belongs to (via an 'is a' relationship).
./project_m_datafile.csv | Additive - name | Additive - name | Text | Common name of additive molecule.
./project_m_datafile.csv | Additive - IUPAC name | Additive - IUPAC name | Text | Preferred IUPAC name of the additive.
./project_m_datafile.csv | Additive - formula | Additive - formula | Text | Molecular formula of additive.
./project_m_datafile.csv | Additive - mass | Additive - mass | Text | Formula mass of additive molecule.
./project_m_datafile.csv | Additive - canonical SMILES | Additive - canonical SMILES | Text | Canonical SMILES representation of the additive.
./project_m_datafile.csv | Additive - standard InChI | Additive - standard InChI | Text | Standard InChI identifier of the structure.
./project_m_datafile.csv | Additive - standard InChIKey | Additive - standard InChIKey | Text | Standard InChIKey identifier of the structure.
./project_m_datafile.csv | Additive - pKa (COOH group) | Additive - pKa (COOH group) | Float | Negative of the logarithm of the acid dissociation constants for the COOH groups (at 25 degrees C).
./project_m_datafile.csv | Additive - pKb (NH2 group) | Additive - pKb (NH2 group) | Float | Negative of the logarithm of the acid dissociation constants for the NH2 groups (at 25 degrees C).
./project_m_datafile.csv | Additive - pKc (other group) | Additive - pKc (other group) | Float | Negative of the logarithm of the acid dissociation constants for other groups (at 25 degrees C).
./project_m_datafile.csv | Additive - pI | Additive - pI | Float | pH at the isoelectric point.
./project_m_datafile.csv | Weighted pattern R-factor (R_wp) | Weighted pattern R-factor (R_wp) | Float | A statistical measure of the quality of fit to a diffraction pattern. It is calculated as the square root of the ratio of the weighted sum of squared differences between the calculated and observed diffraction patterns at each point to the weighted sum of squares of the observed pattern.
./project_m_datafile.csv | Goodness of Fit | Goodness of Fit | Float | A statistical measure of the quality of fit to a diffraction pattern. It is calculated by dividing the weighted pattern R-factor by an R-factor which gives a measure of the quality of the data (R_exp).
./project_m_datafile.csv | Calcite phase present | Calcite phase present | Boolean | Indicates whether the calcite crystalline phase is present.
./project_m_datafile.csv | Calcite unit-cell length a | Calcite unit-cell length a | Float | Unit-cell length a in angstroms.
./project_m_datafile.csv | Calcite unit-cell length b | Calcite unit-cell length b | Float | Unit-cell length b in angstroms.
./project_m_datafile.csv | Calcite unit-cell length c | Calcite unit-cell length c | Float | Unit-cell length c in angstroms.
./project_m_datafile.csv | Calcite unit-cell angle alpha | Calcite unit-cell angle alpha | Float | Unit-cell angle alpha in degrees.
./project_m_datafile.csv | Calcite unit-cell angle beta | Calcite unit-cell angle beta | Float | Unit-cell angle beta in degrees.
./project_m_datafile.csv | Calcite unit-cell angle gamma | Calcite unit-cell angle gamma | Float | Unit-cell angle gamma in degrees.
./project_m_datafile.csv | Calcite weight percentage | Calcite weight percentage | Float | Weight percentage of the Calcite phase.
./project_m_datafile.csv | Vaterite phase present | Vaterite phase present | Boolean | Indicates whether the vaterite crystalline phase is present.
./project_m_datafile.csv | Vaterite unit-cell length a | Vaterite unit-cell length a | Float | Unit-cell length a in angstroms.
./project_m_datafile.csv | Vaterite unit-cell length b | Vaterite unit-cell length b | Float | Unit-cell length b in angstroms.
./project_m_datafile.csv | Vaterite unit-cell length c | Vaterite unit-cell length c | Float | Unit-cell length c in angstroms.
./project_m_datafile.csv | Vaterite unit-cell angle alpha | Vaterite unit-cell angle alpha | Float | Unit-cell angle alpha in degrees.
./project_m_datafile.csv | Vaterite unit-cell angle beta | Vaterite unit-cell angle beta | Float | Unit-cell angle beta in degrees.
./project_m_datafile.csv | Vaterite unit-cell angle gamma | Vaterite unit-cell angle gamma | Float | Unit-cell angle gamma in degrees.
./project_m_datafile.csv | Vaterite weight percentage | Vaterite weight percentage | Float | Weight percentage of the Vaterite phase.
./project_m_datafile.csv | Maximum Intensity | Maximum Intensity | Float | Maximum intensity in the raw data .xye file.
./project_m_datafile.csv | Degree of Crystallinity | Degree of Crystallinity | Float | Percentage of area (after background subtraction) that comes from crystalline phases.
./MLTaskTemplate_additionalColumns.csv | ML Task Target | ML Task Target | Text | Annotations applied to dataset to act as a target for this machine learning task. Note that in the model, 'bad' is represented by a value of 1 and 'good' by a value of 0.
./MLTaskTemplate_additionalColumns.csv | Excluded (True/False) | Excluded (True/False) | Boolean | Indicates whether a row should be excluded from the machine learning model.
./MLTaskTemplate_additionalColumns.csv | Split (train/test/validation) | Split (train/test/validation) | Text | Indicates which split (train, test or validation) the row is assigned to for this machine learning task.
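
The two files share the Filename field, so one plausible way to assemble a task-specific table is to join them on Filename, drop excluded rows, encode the target and group by split. The following is a minimal sketch using only the Python standard library. The sample rows, the additive labels and the file names such as exp_001.xye are invented for illustration, and it assumes that the Boolean column is serialised as the strings "True"/"False" and that the target values are the lowercase strings 'good' and 'bad'; the real files may differ.

```python
import csv
import io

# Hypothetical miniature stand-ins for the two CSV files. Only the column
# names come from the Fields table; every row value here is invented.
DATAFILE_CSV = """Filename,Additive - label,Calcite weight percentage
exp_001.xye,Gly,98.5
exp_002.xye,Asp,71.2
exp_003.xye,Glu,64.0
"""

TASK_CSV = """Filename,ML Task Target,Excluded (True/False),Split (train/test/validation)
exp_001.xye,good,False,train
exp_002.xye,bad,False,test
exp_003.xye,bad,True,train
"""

# Encoding stated in the ML Task Target description.
TARGET_ENCODING = {"bad": 1, "good": 0}


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def build_task_rows(data_rows, task_rows):
    """Join on Filename, drop excluded rows, and encode the target."""
    task_by_file = {row["Filename"]: row for row in task_rows}
    merged = []
    for row in data_rows:
        task = task_by_file.get(row["Filename"])
        if task is None:
            continue  # no task annotation for this experiment
        # Assumption: Booleans are written as the strings "True"/"False".
        if task["Excluded (True/False)"].strip().lower() == "true":
            continue
        combined = dict(row)
        combined["split"] = task["Split (train/test/validation)"]
        combined["target"] = TARGET_ENCODING[task["ML Task Target"]]
        merged.append(combined)
    return merged


rows = build_task_rows(read_rows(DATAFILE_CSV), read_rows(TASK_CSV))
train = [r for r in rows if r["split"] == "train"]
test = [r for r in rows if r["split"] == "test"]
```

With the real files, the two in-memory strings would be replaced by the contents of ./project_m_datafile.csv and ./MLTaskTemplate_additionalColumns.csv.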

Files

Files (481.6 kB)

Checksum | Size
md5:a9538c7a44be39c1a3e1dc58f38704b0 | 24.4 kB
md5:de32a932cbb0e879cbaaf713643db5df | 40.8 kB
md5:3a286946816c0235b53e5a64af0a656b | 411.2 kB
md5:655bf1d34188848b6123617b3a8e0582 | 2.3 kB
md5:f704a2dd8064a7c705b01161b76e84ac | 2.8 kB
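
A downloaded file can be checked against the MD5 checksums listed above. The sketch below uses Python's standard hashlib module; the helper names md5_of_file and matches_listed_checksum are illustrative, and because the listing does not state which checksum belongs to which file, the pairing of path and checksum is left to the caller.

```python
import hashlib


def md5_of_file(path, chunk_size=65536):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path, listed):
    """Compare a file with a listed value such as 'md5:a9538c7a...'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, matches_listed_checksum("project_m_datafile.csv", "md5:...") returns True only if the local copy is byte-for-byte identical to the published file with that checksum.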

Additional details

Funding

UK Research and Innovation
Provision of ‘AI ready’ data: prototyping data pipelines and repositories UKRI2697

Domain Specific Metadata

 
Property Value
Data Collection Data were collected at the Beamline I11 Instrument at the Diamond Light Source synchrotron in the UK.
Data Collection Type Experiments
Data Collection Missing Data Not applicable
Data Collection Raw Data The raw data are diffraction patterns stored as .xye data, where x is the diffraction angle (2θ), y is the intensity and e is the error on the intensity.
Data Annotation Protocol The annotations/categorisation will vary according to the ML task being performed, and a description of the protocol will be documented here.
Data Annotation Platform Not applicable
Data Annotation Analysis Not applicable
Annotator Demographics Not applicable
Machine Annotation Tools Not applicable
Annotations Per Item 1 annotation (classification) per dataset item (experiment)
Data Preprocessing Protocol Not applicable
Data Manipulation Protocol Analysis of the raw diffraction data was performed using Topas Academic software v7. The input file used is batch_topas.inp; it was executed for all datasets both with and without the Vaterite phase included. After the runs were completed, the models with and without Vaterite were compared. The model containing Vaterite was chosen when the weighted pattern R-factor (R_wp) for the refinement was at least 0.1 lower than the value for the Calcite-only model. Statistics were then calculated for each additive individually, all additives combined, and all control samples. In both models, an Amorphous phase was modelled.
Data Imputation Protocol Not applicable
Data Use Cases training, testing, validation
Data Biases Some additive series are incomplete because some concentrations are missing.
Personal Sensitive Information No personal or sensitive information is included in the data.
Data Social Impact Not applicable
Data Limitations Not applicable
Data Release Maintenance Plan The data are being released as part of the AI4Science Project M PSDI launch in April 2026.
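
The fit statistics described in the Fields table and the model-selection rule in the Data Manipulation Protocol can be illustrated with a short Python sketch. This is not the Topas implementation: the function names are hypothetical, the R-factor is computed from the textual definition given for the R_wp field, and the threshold of 0.1 is compared in whatever units the two R-factor values share, since the units used in the datafile are not stated here.

```python
import math


def weighted_pattern_r_factor(observed, calculated, weights):
    """R_wp as a fraction: sqrt(sum w*(obs - calc)^2 / sum w*obs^2)."""
    numerator = sum(
        w * (o - c) ** 2 for o, c, w in zip(observed, calculated, weights)
    )
    denominator = sum(w * o ** 2 for o, w in zip(observed, weights))
    return math.sqrt(numerator / denominator)


def goodness_of_fit(r_wp, r_exp):
    """Goodness of Fit: R_wp divided by the expected R-factor R_exp."""
    return r_wp / r_exp


def prefer_vaterite_model(r_wp_calcite_only, r_wp_with_vaterite, threshold=0.1):
    """True when the Vaterite model lowers the R-factor by at least `threshold`.

    Both values must be in the same units. Differences that sit exactly on
    the threshold are subject to floating-point rounding.
    """
    return (r_wp_calcite_only - r_wp_with_vaterite) >= threshold
```

For instance, prefer_vaterite_model(5.4, 5.1) selects the model containing Vaterite, while prefer_vaterite_model(5.4, 5.35) keeps the Calcite-only model.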