chili100k_strat: Dataset to train or fine-tune CrystaLLM-pi for the targeted generation of experimental materials conditioned on XRD profiles
Creators
Description
The dataset contains experimentally determined crystal structures sourced from the CHILI dataset, described in "CHILI: Chemically-Informed Large-scale Inorganic Nanomaterials Dataset for Advancing Graph Machine Learning", a curated and filtered subset of the Crystallography Open Database (COD). The structural data underwent text augmentation following the pre-processing pipeline in CrystaLLM-pi, and each structure was labelled with its theoretical X-ray diffraction (XRD) pattern. The complete dataset, published on Hugging Face (https://huggingface.co/datasets/c-bone/chili100k_strat), can be used to train or fine-tune CrystaLLM-pi for the targeted generation of materials conditioned on XRD profiles.
Note that the data files themselves are hosted at https://huggingface.co/datasets/c-bone/chili100k_strat; they can be loaded using the Croissant file in this record.
FileObjects:
| contentUrl | description |
|---|---|
| train-00000-of-00001.parquet | Training subset of dataset (11091 records) |
| test-00000-of-00001.parquet | Test subset of dataset (1500 records) |
| validation-00000-of-00001.parquet | Validation subset of dataset (1500 records) |
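As a quick sanity check after downloading, the split sizes listed above can be compared against what is actually loaded. The sketch below is illustrative only: it assumes the Hugging Face `datasets` library is installed, that the repository loads with its default configuration, and that the splits are exposed under the names `train`, `test`, and `validation` (inferred from the file names, not confirmed by this record). The helper names `size_mismatches` and `load_split_sizes` are hypothetical.

```python
# Row counts per split, copied from the FileObjects table in this record.
EXPECTED_ROWS = {"train": 11091, "test": 1500, "validation": 1500}


def size_mismatches(actual_rows):
    """Return {split: (expected, actual)} for every split whose row count differs."""
    return {
        split: (expected, actual_rows.get(split))
        for split, expected in EXPECTED_ROWS.items()
        if actual_rows.get(split) != expected
    }


def load_split_sizes(repo_id="c-bone/chili100k_strat"):
    """Download the dataset and return its row count per split.

    Assumption: the repository loads with the default configuration.
    Requires network access and the `datasets` package.
    """
    from datasets import load_dataset  # imported lazily so the check above works offline

    dataset = load_dataset(repo_id)
    return {name: split.num_rows for name, split in dataset.items()}


if __name__ == "__main__":
    print(size_mismatches(load_split_sizes()))
```

An empty dictionary from `size_mismatches` means every split matches the counts documented here (14091 records in total).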
Fields:
| FileObject | Name | Extract | dataType | description |
|---|---|---|---|---|
| train-00000-of-00001.parquet, test-00000-of-00001.parquet, validation-00000-of-00001.parquet | splits | fullpath | Text | Split of the dataset into enumerated values test/train/validation |
| train-00000-of-00001.parquet, test-00000-of-00001.parquet, validation-00000-of-00001.parquet | Database | Database | Text | Source database that this crystal is from. This whole dataset is extracted from the database COD (Crystallography Open Database). |
| train-00000-of-00001.parquet, test-00000-of-00001.parquet, validation-00000-of-00001.parquet | Material ID | Material ID | Text | ID of crystal material in source database. |
| train-00000-of-00001.parquet, test-00000-of-00001.parquet, validation-00000-of-00001.parquet | Reduced Formula | Reduced Formula | Text | Reduced formula of crystal. |
| train-00000-of-00001.parquet, test-00000-of-00001.parquet, validation-00000-of-00001.parquet | CIF | CIF | Text | Contents of crystal's CIF (Crystallographic Information File, as defined in https://www.iucr.org/resources/cif/spec/version1.1) as text. |
| train-00000-of-00001.parquet, test-00000-of-00001.parquet, validation-00000-of-00001.parquet | Condition Vector | condition_vector | Text | Numerical representation of the 20 most intense XRD (X-ray diffraction) peaks of the crystal, used to identify or reconstruct its structure. It consists of 40 values: the positions (2theta angles) of the 20 most intense peaks and their 20 corresponding intensities. Both are min-max normalised: 2theta over the range 0 to 90 and intensity over the range 0 to 100. |
| train-00000-of-00001.parquet, test-00000-of-00001.parquet, validation-00000-of-00001.parquet | Is Novel? | is_novel | Boolean | Indicates whether the material's structure is novel, i.e. was not seen in training (is_novel == False means the structure was seen in training). |
| train-00000-of-00001.parquet, test-00000-of-00001.parquet, validation-00000-of-00001.parquet | Is Composition Novel? | is_comp_novel | Boolean | Indicates whether the material's atomic composition (reduced formula) is novel, i.e. was never seen in training. A polymorph, whose composition was seen in training but whose structure was not, is flagged is_novel == True but is_comp_novel == False. |
| train-00000-of-00001.parquet, test-00000-of-00001.parquet, validation-00000-of-00001.parquet | Token Count | token_count | Integer | Token count |
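The condition vector description above can be made concrete with a small decoding sketch. It is not taken from the CrystaLLM-pi code base and rests on several assumptions: that the Text field holds exactly 40 numbers in some delimited form, that the 20 positions come first and the 20 intensities second, and that values are stored scaled to [0, 1] by the stated min-max ranges. The function names are hypothetical; check the actual column contents before relying on this.

```python
import re

N_PEAKS = 20
TWO_THETA_RANGE = (0.0, 90.0)    # min-max bounds for 2theta stated in the field description
INTENSITY_RANGE = (0.0, 100.0)   # min-max bounds for intensity stated in the field description

# Matches plain and scientific-notation numbers, whatever delimiter surrounds them.
_NUMBER = re.compile(r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?")


def parse_condition_vector(text):
    """Extract the 40 numeric values from the text form of a condition vector."""
    values = [float(token) for token in _NUMBER.findall(text)]
    if len(values) != 2 * N_PEAKS:
        raise ValueError(f"expected {2 * N_PEAKS} values, found {len(values)}")
    return values


def denormalise(value, lower, upper):
    """Invert min-max scaling of a value in [0, 1] back to [lower, upper]."""
    return lower + value * (upper - lower)


def decode_condition_vector(text):
    """Return a list of (two_theta, intensity) pairs.

    Assumption: positions occupy the first 20 slots and intensities the last 20.
    """
    values = parse_condition_vector(text)
    positions, intensities = values[:N_PEAKS], values[N_PEAKS:]
    return [
        (denormalise(p, *TWO_THETA_RANGE), denormalise(i, *INTENSITY_RANGE))
        for p, i in zip(positions, intensities)
    ]
```

Under these assumptions, a normalised position of 0.5 maps back to a 2theta of 45 and a normalised intensity of 1.0 maps back to 100.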
Files
croissant_metadata.json
Additional details
Domain Specific Metadata
| Property | Value |
|---|---|
| Data Collection | See Data Collection Raw Data and Data Preprocessing Protocol. The structural data then underwent text augmentation as per the pre-processing pipeline in CrystaLLM-pi (https://github.com/C-Bone-UCL/CrystaLLM-pi). Each structure was labelled with its theoretical X-ray diffraction (XRD) pattern. |
| Data Collection Type | Synthetic, Experimental |
| Data Collection Missing Data | Not applicable |
| Data Collection Raw Data | Data originally from Crystallography Open Database (COD) |
| Data Annotation Protocol | Not applicable |
| Data Annotation Platform | Not applicable |
| Data Annotation Analysis | Not applicable |
| Annotator Demographics | Not applicable |
| Machine Annotation Tools | Not applicable |
| Annotations Per Item | Not applicable |
| Data Preprocessing Protocol | The dataset contains experimentally determined crystal structures sourced from the CHILI dataset (https://github.com/UlrikFriisJensen/CHILI), described in "CHILI: Chemically-Informed Large-scale Inorganic Nanomaterials Dataset for Advancing Graph Machine Learning" (https://dx.doi.org/10.1145/3637528.3671538), a curated and filtered subset of the Crystallography Open Database (COD, https://www.crystallography.net/cod/). |
| Data Manipulation Protocol | The test set was constructed to contain three groups of 500 materials each: (1) materials whose structures were seen in training (is_novel == False); (2) materials whose atomic composition (reduced formula) was seen in training but whose structure was not, i.e. polymorphs (is_novel == True, is_comp_novel == False); (3) materials whose atomic composition was never seen in any training phase (is_comp_novel == True). |
| Data Imputation Protocol | Not applicable |
| Data Use Cases | Dataset to train or fine-tune CrystaLLM-pi (https://github.com/C-Bone-UCL/CrystaLLM-pi) |
| Data Biases | Not applicable |
| Personal Sensitive Information | No personal or sensitive information is included in the data. |
| Data Social Impact | Not applicable |
| Data Limitations | The dataset contains about 14K materials; note that CrystaLLM model performance is best when training on datasets of more than ~40K materials. |
| Data Release Maintenance Plan | The data are being released as a one-off, with no immediate plans for revisions. |
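The three test-set groups described under Data Manipulation Protocol can be recovered from the two Boolean flags. The following is a minimal sketch, assuming each record is a mapping with keys `is_novel` and `is_comp_novel` (the Extract names from the Fields table; the actual column names in the parquet files may differ). The group labels and function names are hypothetical, and a record with is_comp_novel == True is assigned to the novel-composition group without checking is_novel.

```python
from collections import Counter


def novelty_group(is_novel, is_comp_novel):
    """Map the two novelty flags to one of the three test-set groups."""
    if is_comp_novel:
        return "novel_composition"   # composition never seen in training
    if is_novel:
        return "polymorph"           # composition seen, structure not seen
    return "seen_structure"          # structure seen in training


def count_groups(records):
    """Count how many records fall into each novelty group."""
    return Counter(
        novelty_group(record["is_novel"], record["is_comp_novel"])
        for record in records
    )


# Tiny illustrative input; real records would come from the test split.
example = [
    {"is_novel": False, "is_comp_novel": False},
    {"is_novel": True, "is_comp_novel": False},
    {"is_novel": True, "is_comp_novel": True},
]
print(count_groups(example))
```

Applied to the full test split, each group would be expected to contain 500 records if the flags follow the protocol described above.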