BenchmarkSet1500 aggregated AI Ready Dataset
Creators
Description
High-Accuracy Excited-State Reference Benchmark Dataset for Organic Semiconductors. The BenchmarkSet1500 resource theme provides a dataset of multireference excited states for 1500 small organic semiconductors, alongside a Python-based workflow used to generate the associated high-level excited-state calculations. It is designed for researchers in organic electronics and data-driven chemistry who require reliable and reproducible excited-state data, as well as those developing machine learning models or screening pipelines. By combining standardised computational workflows with multi-level electronic structure methods (TD-DFT, CASSCF, NEVPT2), the resource enables reproducible data generation and delivers an AI-ready dataset suitable for structure-property analysis, direct quantum chemistry method comparison, and molecular design.
This dataset is an aggregated form of individual records available at BenchmarkSet1500.
FileObjects:
| contentUrl | description |
|---|---|
| ./BenchmarkSet1500.csv | Multireference excited-state data file which contains aggregated data for organic semiconductors. Each row corresponds to a single molecule. Columns include molecular identifiers (CCDC ID, SMILES, InChI, formula, number of atoms, CCDC URL, DOI) and SA-CASSCF and NEVPT2 computed excited-state energies (S1, S2, T1, T2) and oscillator strengths (f1, f2). |
Fields:
| FileObject | Name | Extract | dataType | description |
|---|---|---|---|---|
| ./BenchmarkSet1500.csv | ID | ID | Text | CCDC Molecule ID. |
| ./BenchmarkSet1500.csv | SMILES | SMILES | Text | Canonical SMILES representation of the molecule. |
| ./BenchmarkSet1500.csv | InChI | InChI | Text | InChI representation of the molecule. |
| ./BenchmarkSet1500.csv | formula | formula | Text | Molecular formula of the molecule. |
| ./BenchmarkSet1500.csv | Number of atoms | #atoms | Float | Number of atoms in the molecule. |
| ./BenchmarkSet1500.csv | ccdc url | ccdc_url | URL | Link to CCDC entry. |
| ./BenchmarkSet1500.csv | doi | doi | Text | DOI of related publication. |
| ./BenchmarkSet1500.csv | SA-CASSCF E(S1) | SA_CASSCF_E(S1) | Float | First singlet excited-state energy calculated using SA-CASSCF (eV). |
| ./BenchmarkSet1500.csv | NEVPT2 E(S1) | NEVPT2_E(S1) | Float | First singlet excited-state energy calculated using NEVPT2 (eV). |
| ./BenchmarkSet1500.csv | SA-CASSCF E(S2) | SA_CASSCF_E(S2) | Float | Second singlet excited-state energy calculated using SA-CASSCF (eV). |
| ./BenchmarkSet1500.csv | NEVPT2 E(S2) | NEVPT2_E(S2) | Float | Second singlet excited-state energy calculated using NEVPT2 (eV). |
| ./BenchmarkSet1500.csv | SA-CASSCF E(T1) | SA_CASSCF_E(T1) | Float | First triplet excited-state energy calculated using SA-CASSCF (eV). |
| ./BenchmarkSet1500.csv | NEVPT2 E(T1) | NEVPT2_E(T1) | Float | First triplet excited-state energy calculated using NEVPT2 (eV). |
| ./BenchmarkSet1500.csv | SA-CASSCF E(T2) | SA_CASSCF_E(T2) | Float | Second triplet excited-state energy calculated using SA-CASSCF (eV). |
| ./BenchmarkSet1500.csv | NEVPT2 E(T2) | NEVPT2_E(T2) | Float | Second triplet excited-state energy calculated using NEVPT2 (eV). |
| ./BenchmarkSet1500.csv | SA-CASSCF f(S1) | SA_CASSCF_f(S1) | Float | S1 oscillator strength calculated using SA-CASSCF. |
| ./BenchmarkSet1500.csv | NEVPT2 f(S1) | NEVPT2_f(S1) | Float | S1 oscillator strength calculated using NEVPT2. |
| ./BenchmarkSet1500.csv | SA-CASSCF f(S2) | SA_CASSCF_f(S2) | Float | S2 oscillator strength calculated using SA-CASSCF. |
| ./BenchmarkSet1500.csv | NEVPT2 f(S2) | NEVPT2_f(S2) | Float | S2 oscillator strength calculated using NEVPT2. |
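As an illustration of how the fields above might be consumed, the following minimal sketch reads the CSV with the Python standard library and computes a singlet-triplet gap per molecule. It assumes that the CSV header row uses the names in the Extract column (for example `NEVPT2_E(S1)`), which should be checked against the actual file. The helper names `read_records` and `singlet_triplet_gap` are hypothetical, and the two example rows are made up to show the layout; they are not values from the dataset.

```python
import csv
import io

# Column headers assumed to match the "Extract" names in the Fields table.
ENERGY_COLUMNS = [
    f"{method}_E({state})"
    for method in ("SA_CASSCF", "NEVPT2")
    for state in ("S1", "S2", "T1", "T2")
]


def read_records(handle):
    """Read rows from an open CSV file, converting energy columns to float (eV)."""
    records = []
    for row in csv.DictReader(handle):
        for column in ENERGY_COLUMNS:
            value = row.get(column, "")
            # Empty or absent cells become None instead of raising an error.
            row[column] = float(value) if value not in ("", None) else None
        records.append(row)
    return records


def singlet_triplet_gap(record, method="NEVPT2"):
    """Return E(S1) - E(T1) in eV for one molecule, or None if a value is missing."""
    s1 = record[f"{method}_E(S1)"]
    t1 = record[f"{method}_E(T1)"]
    if s1 is None or t1 is None:
        return None
    return s1 - t1


# Two made-up rows purely to illustrate the layout; not values from the dataset.
EXAMPLE_CSV = """ID,SA_CASSCF_E(S1),NEVPT2_E(S1),SA_CASSCF_E(T1),NEVPT2_E(T1)
MOL_A,4.00,3.50,3.00,2.75
MOL_B,3.80,,2.90,2.60
"""

records = read_records(io.StringIO(EXAMPLE_CSV))
gaps = {r["ID"]: singlet_triplet_gap(r) for r in records}
```

For the real file, the same function could be called on `open("BenchmarkSet1500.csv", newline="")` in place of the in-memory example. Passing `method="SA_CASSCF"` gives the corresponding gap at the SA-CASSCF level, which allows a direct comparison of the two methods.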
Files
BenchmarkSet1500.csv
Additional details
Domain Specific Metadata
| Property | Value |
|---|---|
| Data Collection | The initial dataset was derived from the approximately 40,000 organic semiconductor molecules reported by Omar et al. (2022), which were originally curated from the Cambridge Structural Database (CSD). This parent set was assembled prior to this work and restricted to well-defined molecular crystals composed of elements commonly found in organic semiconductors, with polymeric systems, disordered solids, and co-crystals excluded. From this space, a subset was selected to enrich for electronically challenging systems relevant to excited-state screening, using TD-DFT-derived criteria targeting small singlet–triplet gaps (S1-T1 < 0.275 eV) and signatures of double-excitation character (S2-S1 < 0.250 eV and f2-5f1 > 0.350). An additional random sample of approximately 200 molecules was included to preserve chemical diversity, yielding approximately 1,500 molecules in total. |
| Data Collection Type | Calculations |
| Data Collection Missing Data | Not applicable |
| Data Collection Raw Data | Initial raw data were geometry files obtained from the CCDC. |
| Data Annotation Protocol | This data source was not annotated as such. |
| Data Annotation Platform | Not applicable |
| Data Annotation Analysis | Not applicable |
| Annotator Demographics | Not applicable |
| Machine Annotation Tools | Not applicable |
| Annotations Per Item | Not applicable |
| Data Preprocessing Protocol | Not applicable |
| Data Manipulation Protocol | The resource integrates standardised, fully automated workflows with multi-level electronic structure methods (CASSCF, NEVPT2) to generate reproducible excited-state data. It covers input generation for ground state optimisation using Gaussian 16, HPC job submission, error correction, input generation for excited state calculations using ORCA, and structured data extraction. The full workflow code is available at https://github.com/OrganicAI-Lab/PSDI_Benchmark_Set_1500. |
| Data Imputation Protocol | Not applicable |
| Data Use Cases | Designed for researchers in organic electronics and data-driven chemistry who require reliable and reproducible excited-state data, as well as those developing machine learning models or screening pipelines. Intended uses: structure-property analysis, direct quantum chemistry method comparison, and molecular design. |
| Data Biases | Not applicable |
| Personal Sensitive Information | No personal or sensitive information is included in the data. |
| Data Social Impact | Not applicable |
| Data Limitations | Not applicable |
| Data Release Maintenance Plan | The data are being released as a one-off, with no immediate plans for revisions. |
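The TD-DFT selection criteria described under Data Collection can be sketched as a simple filter. This is only an illustration under stated assumptions, not the authors' selection code: it reads `f2-5f1` as f2 minus five times f1, and it assumes a molecule qualifies if it meets either the small-gap criterion or the double-excitation criteria, neither of which is spelled out in the record. All function and constant names are hypothetical.

```python
# Thresholds quoted in the Data Collection entry.
S1_T1_GAP_MAX_EV = 0.275   # upper bound on the S1-T1 gap (eV)
S2_S1_GAP_MAX_EV = 0.250   # upper bound on the S2-S1 gap (eV)
F_DIFFERENCE_MIN = 0.350   # lower bound on f2 - 5*f1 (assumed reading of "f2-5f1")


def has_small_singlet_triplet_gap(s1, t1):
    """True if the TD-DFT S1-T1 gap is below the quoted threshold."""
    return (s1 - t1) < S1_T1_GAP_MAX_EV


def has_double_excitation_signature(s1, s2, f1, f2):
    """True if both quoted double-excitation conditions hold."""
    close_singlets = (s2 - s1) < S2_S1_GAP_MAX_EV
    # Assumption: "f2-5f1" means f2 minus five times f1.
    bright_second_state = (f2 - 5 * f1) > F_DIFFERENCE_MIN
    return close_singlets and bright_second_state


def is_selected(s1, s2, t1, f1, f2):
    """Assumption: meeting either criterion is enough for selection."""
    return has_small_singlet_triplet_gap(s1, t1) or has_double_excitation_signature(
        s1, s2, f1, f2
    )


# Made-up TD-DFT values for illustration only (energies in eV).
small_gap_case = is_selected(s1=3.0, s2=3.6, t1=2.8, f1=0.2, f2=0.1)
double_excitation_case = is_selected(s1=3.0, s2=3.1, t1=2.0, f1=0.01, f2=0.5)
neither_case = is_selected(s1=3.0, s2=3.6, t1=2.0, f1=0.2, f2=0.1)
```

Note that these criteria apply to TD-DFT energies and oscillator strengths, which are not among the columns listed for `BenchmarkSet1500.csv`, so the filter cannot be re-run from the aggregated file alone.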