Published April 30, 2026 | Version v2
Dataset Open

chili100k_strat: Dataset to train or fine-tune CrystaLLM-pi for the targeted generation of experimental materials conditioned on XRD profiles

Description

The dataset contains experimentally determined crystal structures sourced from the CHILI dataset, described in CHILI: Chemically-Informed Large-scale Inorganic Nanomaterials Dataset for Advancing Graph Machine Learning, a curated and filtered subset of the Crystallography Open Database (COD). The structural data underwent text augmentation following the pre-processing pipeline in CrystaLLM-pi. Each structure is labelled with its theoretical X-ray diffraction (XRD) pattern. The complete dataset, published on Hugging Face (https://huggingface.co/datasets/c-bone/chili100k_strat), can be used to train or fine-tune CrystaLLM-pi for the targeted generation of materials conditioned on XRD profiles.

Note that the data files themselves are hosted at https://huggingface.co/datasets/c-bone/chili100k_strat; they can be loaded using the Croissant file in this record.

FileObjects:

contentUrl | description
train-00000-of-00001.parquet | Training subset of dataset (11091 records)
test-00000-of-00001.parquet | Test subset of dataset (1500 records)
validation-00000-of-00001.parquet | Validation subset of dataset (1500 records)
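
As a rough illustration of how the splits listed above might be fetched and sanity-checked, the sketch below loads the dataset from Hugging Face and compares split sizes with the record counts in this table. It assumes the third-party `datasets` package is installed and that the repository exposes splits named train, test and validation; `load_splits` and `size_mismatches` are hypothetical helper names, not part of CrystaLLM-pi.

```python
# Row counts as listed in the FileObjects table of this record.
EXPECTED_ROWS = {"train": 11091, "test": 1500, "validation": 1500}


def load_splits(repo_id="c-bone/chili100k_strat"):
    """Download the dataset from Hugging Face (needs network access).

    Assumption: the `datasets` package is installed and the repository
    exposes splits named train, test and validation.
    """
    from datasets import load_dataset  # imported lazily: optional dependency

    return load_dataset(repo_id)


def size_mismatches(actual_rows, expected_rows=EXPECTED_ROWS):
    """Return {split: (expected, actual)} for every split whose size differs."""
    return {
        split: (expected, actual_rows.get(split))
        for split, expected in expected_rows.items()
        if actual_rows.get(split) != expected
    }


# Usage (requires network access and the `datasets` package):
#   splits = load_splits()
#   actual = {name: split.num_rows for name, split in splits.items()}
#   print(size_mismatches(actual))  # an empty dict means all sizes match
```

An empty result from `size_mismatches` indicates that the downloaded splits agree with the counts stated in this record.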

Fields:

FileObject | Name | Extract | dataType | description
train-00000-of-00001.parquet, test-00000-of-00001.parquet, validation-00000-of-00001.parquet | splits | fullpath | Text | Dataset split, one of the enumerated values test, train or validation.
train-00000-of-00001.parquet, test-00000-of-00001.parquet, validation-00000-of-00001.parquet | Database | Database | Text | Source database that this crystal is from. The whole dataset is extracted from COD (Crystallography Open Database).
train-00000-of-00001.parquet, test-00000-of-00001.parquet, validation-00000-of-00001.parquet | Material ID | Material ID | Text | ID of the crystal material in the source database.
train-00000-of-00001.parquet, test-00000-of-00001.parquet, validation-00000-of-00001.parquet | Reduced Formula | Reduced Formula | Text | Reduced formula of the crystal.
train-00000-of-00001.parquet, test-00000-of-00001.parquet, validation-00000-of-00001.parquet | CIF | CIF | Text | Contents of the crystal's CIF (Crystallographic Information File, as defined in https://www.iucr.org/resources/cif/spec/version1.1) as text.
train-00000-of-00001.parquet, test-00000-of-00001.parquet, validation-00000-of-00001.parquet | Condition Vector | condition_vector | Text | Numerical representation of the crystal's 20 most intense XRD (X-ray diffraction) peaks, used to identify or reconstruct the crystal structure. It consists of 40 values: the 20 peak positions (2theta angles) and the 20 corresponding peak intensities (int). Both are min-max normalised: 2theta over the range 0 to 90 and intensities over the range 0 to 100.
train-00000-of-00001.parquet, test-00000-of-00001.parquet, validation-00000-of-00001.parquet | Is Novel? | is_novel | Boolean | True if the material's structure was not seen in training; False if it was.
train-00000-of-00001.parquet, test-00000-of-00001.parquet, validation-00000-of-00001.parquet | Is Composition Novel? | is_comp_novel | Boolean | True if the material's atomic composition (reduced formula) was not seen in training. A polymorph, whose composition was seen in training but whose structure was not, has is_novel == True and is_comp_novel == False.
train-00000-of-00001.parquet, test-00000-of-00001.parquet, validation-00000-of-00001.parquet | Token Count | token_count | Integer | Token count
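
To make the condition vector described in the Fields table more concrete, here is a minimal sketch of how such a 40-value vector could be assembled from a list of (2theta, intensity) peaks. Several details are assumptions rather than facts from this record: that min-max normalisation maps the stated ranges onto [0, 1], that peaks are ordered by descending intensity, that the 20 positions precede the 20 intensities, and that missing peaks are zero-padded. The function names are hypothetical; the actual CrystaLLM-pi pipeline may differ.

```python
def min_max(value, low, high):
    """Min-max normalise `value` from the range [low, high] onto [0, 1]."""
    return (value - low) / (high - low)


def build_condition_vector(peaks, n_peaks=20):
    """Build a vector of 2 * n_peaks values from (two_theta, intensity) pairs.

    Assumptions (not confirmed by the record): peaks are ranked by
    descending intensity, positions come first and intensities second,
    and slots for missing peaks are padded with zeros.
    """
    # Keep only the n_peaks most intense peaks.
    top = sorted(peaks, key=lambda p: p[1], reverse=True)[:n_peaks]
    # 2theta is normalised over [0, 90], intensity over [0, 100].
    positions = [min_max(two_theta, 0.0, 90.0) for two_theta, _ in top]
    intensities = [min_max(intensity, 0.0, 100.0) for _, intensity in top]
    pad = [0.0] * (n_peaks - len(top))
    return positions + pad + intensities + pad


# Three illustrative peaks (made-up values, not taken from the dataset).
example = build_condition_vector([(30.0, 50.0), (45.0, 100.0), (60.0, 25.0)])
```

With the three illustrative peaks, `example` has 40 entries: the first three hold the normalised positions of the peaks in order of intensity, entries 20 to 22 hold the normalised intensities, and the rest are zero padding.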

Files

croissant_metadata.json

Files (23.3 kB)

Checksum | Size
md5:320a2a18b9e632319edc320bab1f8675 | 16.3 kB
md5:c34ed163ff7d6aa1a876ad564779c7f4 | 3.1 kB
md5:5ca611b2911cc080e687a45d0ac02b09 | 4.0 kB

Additional details

Domain Specific Metadata

 
Property | Value
Data Collection | See Data Collection Raw Data (dataCollectionRawData), then Data Preprocessing Protocol (dataPreprocessingProtocol). The structural data then underwent text augmentation following the pre-processing pipeline in CrystaLLM-pi (https://github.com/C-Bone-UCL/CrystaLLM-pi). Each structure was labelled with its theoretical X-ray diffraction (XRD) pattern.
Data Collection Type | Synthetic, Experimental
Data Collection Missing Data | Not applicable
Data Collection Raw Data | Data originally from Crystallography Open Database (COD)
Data Annotation Protocol | Not applicable
Data Annotation Platform | Not applicable
Data Annotation Analysis | Not applicable
Annotator Demographics | Not applicable
Machine Annotation Tools | Not applicable
Annotations Per Item | Not applicable
Data Preprocessing Protocol | The dataset contains experimentally determined crystal structures sourced from the CHILI dataset (https://github.com/UlrikFriisJensen/CHILI), described in CHILI: Chemically-Informed Large-scale Inorganic Nanomaterials Dataset for Advancing Graph Machine Learning (https://dx.doi.org/10.1145/3637528.3671538), a curated and filtered subset of the Crystallography Open Database (COD, https://www.crystallography.net/cod/).
Data Manipulation Protocol | The test set was constructed to contain three groups of 500 materials each: (1) materials whose structures were seen in training (is_novel == False); (2) materials whose atomic composition (reduced formula) was seen in training but whose structure was never seen, i.e. polymorphs (is_novel == True and is_comp_novel == False); (3) materials whose atomic composition was never seen in any training phase (is_comp_novel == True).
Data Imputation Protocol | Not applicable
Data Use Cases | Dataset to train or fine-tune CrystaLLM-pi (https://github.com/C-Bone-UCL/CrystaLLM-pi)
Data Biases | Not applicable
Personal Sensitive Information | No personal or sensitive information is included in the data.
Data Social Impact | Not applicable
Data Limitations | This dataset contains about 14K materials (14,091 records across the three splits). Note that CrystaLLM model performance is best when training on datasets with over ~40K materials.
Data Release Maintenance Plan | The data are being released as a one-off, with no immediate plans for revisions.
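
The three test-set groups described under Data Manipulation Protocol can be recovered from the is_novel and is_comp_novel flags. The sketch below is one way to do that, assuming each record is a mapping with those two Boolean fields; the stratum labels and function names are hypothetical and chosen only for illustration.

```python
from collections import Counter


def novelty_stratum(is_novel, is_comp_novel):
    """Map the two Boolean flags onto one of the three test-set groups.

    The labels are illustrative. Following the record, a novel composition
    is identified by is_comp_novel alone, so it is checked first.
    """
    if is_comp_novel:
        return "novel_composition"  # composition never seen in training
    if is_novel:
        return "polymorph"  # composition seen, structure not seen
    return "seen_structure"  # structure seen in training


def count_strata(records):
    """Count records per stratum; each record needs both flag fields."""
    return Counter(
        novelty_stratum(r["is_novel"], r["is_comp_novel"]) for r in records
    )


# Illustrative records (made up, not taken from the dataset).
demo = count_strata([
    {"is_novel": False, "is_comp_novel": False},
    {"is_novel": True, "is_comp_novel": False},
    {"is_novel": True, "is_comp_novel": True},
    {"is_novel": True, "is_comp_novel": True},
])
```

Applied to the test split, `count_strata` would be expected to report 500 records per group if the split matches the description in this record.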