Published April 27, 2026 | Version v2
Dataset
Open
Project M Dataset Template for a dataset tailored to a specific ML Task
Creators
Description
This dataset is a demonstrator that serves as a template for datasets tailored to a specific Machine Learning Task and based on the Project M data. Further datasets of this kind will follow; this one illustrates what their format will be. Please note that in this template, the Annotations and Exclusions given in the task file MLTaskTemplate_additionalColumns.csv have no meaning and only illustrate possible values.
FileObjects:
| contentUrl | description |
| ./project_m_datafile.csv | Main Project M datafile exported from PSDI Community Data Collections which contains summary of metadata for each experiment |
| ./MLTaskTemplate_additionalColumns.csv | Additional columns of data which specify annotations, exclusions and splits specific to this task. |
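The two files are meant to be used together. A minimal sketch of combining them is given here, assuming that the Filename column (listed for both files in the Fields table) uniquely identifies an experiment in each file. The sample rows, the file names such as sample_001.xye and the cell values are invented for illustration and are not taken from the dataset.

```python
import csv
import io

# Hypothetical excerpts: the real files have many more columns and rows.
MAIN_CSV = """Filename,Additive - label,Calcite phase present
sample_001.xye,Gly,True
sample_002.xye,Asp,True
"""
TASK_CSV = """Filename,ML Task Target,Excluded (True/False),Split (train/test/validation)
sample_001.xye,good,False,train
sample_002.xye,bad,False,test
"""


def read_rows(handle):
    """Read a CSV file object into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle))


def join_on_filename(main_rows, task_rows):
    """Attach the task-specific columns to each main row with the same Filename.

    Assumes Filename is unique within the task file; main rows without a
    matching task row are skipped.
    """
    task_by_name = {row["Filename"]: row for row in task_rows}
    joined = []
    for row in main_rows:
        extra = task_by_name.get(row["Filename"])
        if extra is not None:
            joined.append({**row, **extra})
    return joined


main_rows = read_rows(io.StringIO(MAIN_CSV))
task_rows = read_rows(io.StringIO(TASK_CSV))
joined = join_on_filename(main_rows, task_rows)
```

With the downloaded files, the in-memory strings would be replaced by file handles, for example `open("project_m_datafile.csv", newline="", encoding="utf-8")`; the encoding is an assumption.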
Fields:
| FileObject | Name | Extract | dataType | description |
| ./project_m_datafile.csv, ./MLTaskTemplate_additionalColumns.csv | Filename | Filename | Text | Filename of raw Data .xye file |
| ./project_m_datafile.csv | FileURL | FileURL | Text | URL of raw Data .xye file |
| ./project_m_datafile.csv | Additive - label | Additive - label | Text | Short form of additive name. |
| ./project_m_datafile.csv | Additive - concentration | Additive - concentration | Float | Ratio of additive: Ca ion concentration used. |
| ./project_m_datafile.csv | Additive - ChEBI url | Additive - ChEBI url | Text | URL to ChEBI entry which matches the additive molecule. |
| ./project_m_datafile.csv | Additive - ChEBI molecule class URLs | Additive - ChEBI molecule class URLs | Text | URLs of ChEBI classes which the additive molecule belongs to (via an 'is a' relationship). |
| ./project_m_datafile.csv | Additive - ChEBI molecule class names | Additive - ChEBI molecule class names | Text | Names of ChEBI classes which the additive molecule belongs to (via an 'is a' relationship). |
| ./project_m_datafile.csv | Additive - name | Additive - name | Text | Common name of additive molecule. |
| ./project_m_datafile.csv | Additive - IUPAC name | Additive - IUPAC name | Text | Preferred IUPAC name of the additive. |
| ./project_m_datafile.csv | Additive - formula | Additive - formula | Text | Molecular formula of additive. |
| ./project_m_datafile.csv | Additive - mass | Additive - mass | Text | Formula mass of additive molecule. |
| ./project_m_datafile.csv | Additive - canonical SMILES | Additive - canonical SMILES | Text | Canonical SMILES representation of the additive. |
| ./project_m_datafile.csv | Additive - standard InChI | Additive - standard InChI | Text | Standard InChI identifier of the structure. |
| ./project_m_datafile.csv | Additive - standard InChIKey | Additive - standard InChIKey | Text | Standard InChIKey identifier of the structure. |
| ./project_m_datafile.csv | Additive - pKa (COOH group) | Additive - pKa (COOH group) | Float | Negative of the logarithm of the acid dissociation constants for the COOH groups (at 25 degrees C). |
| ./project_m_datafile.csv | Additive - pKb (NH2 group) | Additive - pKb (NH2 group) | Float | Negative of the logarithm of the acid dissociation constants for the NH2 groups (at 25 degrees C). |
| ./project_m_datafile.csv | Additive - pKc (other group) | Additive - pKc (other group) | Float | Negative of the logarithm of the acid dissociation constants for other groups (at 25 degrees C). |
| ./project_m_datafile.csv | Additive - pI | Additive - pI | Float | pH at the isoelectric point. |
| ./project_m_datafile.csv | Weighted pattern R-factor (R_wp) | Weighted pattern R-factor (R_wp) | Float | A statistical measure of the quality of fit to a diffraction pattern. It is calculated as the square root of the quotient of the weighted sum of squared differences between the calculated and observed diffraction pattern at each point and the weighted sum of squares of the observed pattern. |
| ./project_m_datafile.csv | Goodness of Fit | Goodness of Fit | Float | A statistical measure of the quality of fit to a diffraction pattern. It is calculated by dividing the weighted pattern R-factor by an R-factor which gives a measure of the quality of data (R_exp) |
| ./project_m_datafile.csv | Calcite phase present | Calcite phase present | Boolean | Indicates whether the calcite crystalline phase is present. |
| ./project_m_datafile.csv | Calcite unit-cell length a | Calcite unit-cell length a | Float | Unit-cell length a in angstroms. |
| ./project_m_datafile.csv | Calcite unit-cell length b | Calcite unit-cell length b | Float | Unit-cell length b in angstroms. |
| ./project_m_datafile.csv | Calcite unit-cell length c | Calcite unit-cell length c | Float | Unit-cell length c in angstroms. |
| ./project_m_datafile.csv | Calcite unit-cell angle alpha | Calcite unit-cell angle alpha | Float | Unit-cell angle alpha in degrees. |
| ./project_m_datafile.csv | Calcite unit-cell angle beta | Calcite unit-cell angle beta | Float | Unit-cell angle beta in degrees. |
| ./project_m_datafile.csv | Calcite unit-cell angle gamma | Calcite unit-cell angle gamma | Float | Unit-cell angle gamma in degrees. |
| ./project_m_datafile.csv | Calcite weight percentage | Calcite weight percentage | Float | Weight percentage of the Calcite phase. |
| ./project_m_datafile.csv | Vaterite phase present | Vaterite phase present | Boolean | Indicates whether the vaterite crystalline phase is present. |
| ./project_m_datafile.csv | Vaterite unit-cell length a | Vaterite unit-cell length a | Float | Unit-cell length a in angstroms. |
| ./project_m_datafile.csv | Vaterite unit-cell length b | Vaterite unit-cell length b | Float | Unit-cell length b in angstroms. |
| ./project_m_datafile.csv | Vaterite unit-cell length c | Vaterite unit-cell length c | Float | Unit-cell length c in angstroms. |
| ./project_m_datafile.csv | Vaterite unit-cell angle alpha | Vaterite unit-cell angle alpha | Float | Unit-cell angle alpha in degrees. |
| ./project_m_datafile.csv | Vaterite unit-cell angle beta | Vaterite unit-cell angle beta | Float | Unit-cell angle beta in degrees. |
| ./project_m_datafile.csv | Vaterite unit-cell angle gamma | Vaterite unit-cell angle gamma | Float | Unit-cell angle gamma in degrees. |
| ./project_m_datafile.csv | Vaterite weight percentage | Vaterite weight percentage | Float | Weight percentage of the Vaterite phase. |
| ./project_m_datafile.csv | Maximum Intensity | Maximum Intensity | Float | Maximum Intensity in raw Data .xye file |
| ./project_m_datafile.csv | Degree of Crystallinity | Degree of Crystallinity | Float | Percentage of area (after background subtraction) that comes from crystalline phases. |
| ./MLTaskTemplate_additionalColumns.csv | ML Task Target | ML Task Target | Text | Annotations applied to dataset to act as a target for this machine learning task. Note that in the model, 'bad' is represented by a value of 1 and 'good' by a value of 0. |
| ./MLTaskTemplate_additionalColumns.csv | Excluded (True/False) | Excluded (True/False) | Boolean | Indicates whether a row should be excluded from the machine learning model |
| ./MLTaskTemplate_additionalColumns.csv | Split (train/test/validation) | Split (train/test/validation) | Text | Indicates which split (train, test or validation) a row is assigned to for this machine learning task. |
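The three task columns can be applied in a few lines. The following is a sketch only, assuming that booleans are stored as the text "True"/"False", that the target values are the strings 'good' and 'bad', and that the split values are spelled "train", "test" and "validation". The rows shown are invented for illustration.

```python
# Encoding stated in the ML Task Target description: 'bad' -> 1, 'good' -> 0.
TARGET_ENCODING = {"good": 0, "bad": 1}


def parse_bool(text):
    """Interpret a CSV cell as a boolean (assumes 'True'/'False' text)."""
    return text.strip().lower() == "true"


def build_splits(rows):
    """Drop excluded rows and group (Filename, encoded target) pairs by split.

    Raises KeyError if a row carries an unexpected target or split value.
    """
    splits = {"train": [], "test": [], "validation": []}
    for row in rows:
        if parse_bool(row["Excluded (True/False)"]):
            continue
        split = row["Split (train/test/validation)"].strip().lower()
        label = TARGET_ENCODING[row["ML Task Target"].strip().lower()]
        splits[split].append((row["Filename"], label))
    return splits


# Hypothetical rows in the shape of MLTaskTemplate_additionalColumns.csv.
rows = [
    {"Filename": "a.xye", "ML Task Target": "good",
     "Excluded (True/False)": "False", "Split (train/test/validation)": "train"},
    {"Filename": "b.xye", "ML Task Target": "bad",
     "Excluded (True/False)": "False", "Split (train/test/validation)": "test"},
    {"Filename": "c.xye", "ML Task Target": "bad",
     "Excluded (True/False)": "True", "Split (train/test/validation)": "train"},
    {"Filename": "d.xye", "ML Task Target": "good",
     "Excluded (True/False)": "False", "Split (train/test/validation)": "validation"},
]
splits = build_splits(rows)
```

Failing loudly on unknown values is a deliberate choice here, since a silent default could hide a typo in the task file.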
Files (481.6 kB)
| Checksum | Size |
|---|---|
| md5:a9538c7a44be39c1a3e1dc58f38704b0 | 24.4 kB |
| md5:de32a932cbb0e879cbaaf713643db5df | 40.8 kB |
| md5:3a286946816c0235b53e5a64af0a656b | 411.2 kB |
| md5:655bf1d34188848b6123617b3a8e0582 | 2.3 kB |
| md5:f704a2dd8064a7c705b01161b76e84ac | 2.8 kB |
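A downloaded file can be checked against the md5 values listed above. The helper names below are hypothetical, and which checksum belongs to which file has to be taken from the record itself; this is a sketch using only the Python standard library.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file against a listed value such as 'md5:<hex digest>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, `matches_listing("project_m_datafile.csv", "md5:...")` would return True only if the local copy is byte-identical to the published file, with the elided digest filled in from the record.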
Additional details
Funding
Domain Specific Metadata
| Property | Value |
|---|---|
| Data Collection | Data were collected at the Beamline I11 Instrument at the Diamond Light Source synchrotron in the UK. |
| Data Collection Type | Experiments |
| Data Collection Missing Data | Not applicable |
| Data Collection Raw Data | The raw data are diffraction patterns stored as .xye files, where x is the scattering angle (2θ), y is the intensity (counts) and e is the error on the intensity. |
| Data Annotation Protocol | The annotations/categorisation will vary according to the ML task being performed, and description of this protocol will be documented here. |
| Data Annotation Platform | Not applicable |
| Data Annotation Analysis | Not applicable |
| Annotator Demographics | Not applicable |
| Machine Annotation Tools | Not applicable |
| Annotations Per Item | 1 annotation (classification) per dataset item (experiment) |
| Data Preprocessing Protocol | Not applicable |
| Data Manipulation Protocol | The raw diffraction data were analysed using Topas Academic software v7. The input file used is batch_topas.inp; it was executed for all datasets both with and without the Vaterite phase included. In both models, an Amorphous phase was modelled. After the runs were completed, the models with and without Vaterite were compared. The model containing Vaterite was chosen when the weighted pattern R-factor (R_wp) for the refinement was at least 0.1 lower than the value for the Calcite-only model. Statistics were then calculated for each additive individually, all additives combined, and all control samples. |
| Data Imputation Protocol | Not applicable |
| Data Use Cases | training, testing, validation |
| Data Biases | Some additive series are incomplete due to some concentrations being missing. |
| Personal Sensitive Information | No personal or sensitive information is included in the data. |
| Data Social Impact | Not applicable |
| Data Limitations | Not applicable |
| Data Release Maintenance Plan | The data are being released as part of the AI4Science Project M PSDI launch in April 2026. |
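The fit statistics and the model-selection rule described in the Fields table and in the Data Manipulation Protocol can be sketched as follows. This is an illustration, not the Topas implementation: the function names are hypothetical, the weights are assumed to be 1/e² from the error column of the .xye data, and the 0.1 threshold is applied in whatever units the R-factor is stored in.

```python
import math


def weighted_pattern_r_factor(observed, calculated, errors):
    """R_wp = sqrt( sum w (y_obs - y_calc)^2 / sum w y_obs^2 ).

    Assumes weights w = 1 / error^2, a common convention in profile fitting.
    """
    numerator = 0.0
    denominator = 0.0
    for y_obs, y_calc, err in zip(observed, calculated, errors):
        weight = 1.0 / (err * err)
        numerator += weight * (y_obs - y_calc) ** 2
        denominator += weight * y_obs ** 2
    return math.sqrt(numerator / denominator)


def goodness_of_fit(r_wp, r_exp):
    """Goodness of Fit as described in the Fields table: R_wp divided by R_exp."""
    return r_wp / r_exp


def choose_model(r_wp_calcite_only, r_wp_with_vaterite, min_improvement=0.1):
    """Prefer the Vaterite model only if it lowers R_wp by at least min_improvement."""
    if r_wp_calcite_only - r_wp_with_vaterite >= min_improvement:
        return "calcite+vaterite"
    return "calcite only"
```

Values that differ by almost exactly the threshold may fall on either side because of floating-point rounding, so a small tolerance could be added if that case matters.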