Published April 27, 2026 | Version v2
Dataset Open

BenchmarkSet1500 aggregated AI Ready Dataset

Description

High-Accuracy Excited-State Reference Benchmark Dataset for Organic Semiconductors. The BenchmarkSet1500 resource theme provides a dataset of multireference excited states for 1500 small organic semiconductors, alongside a Python-based workflow used to generate the associated high-level excited-state calculations. It is designed for researchers in organic electronics and data-driven chemistry who require reliable and reproducible excited-state data, as well as those developing machine learning models or screening pipelines. By combining standardised computational workflows with multi-level electronic structure methods (TD-DFT, CASSCF, NEVPT2), the resource enables reproducible data generation and delivers an AI-ready dataset suitable for structure-property analysis, direct quantum chemistry method comparison, and molecular design.

This dataset is an aggregated form of the individual records available at BenchmarkSet1500.

FileObjects:

contentUrl description
./BenchmarkSet1500.csv Multireference excited-state data file which contains aggregated data for organic semiconductors. Each row corresponds to a single molecule. Columns include molecular identifiers (CCDC ID, SMILES, InChI, formula, number of atoms, CCDC URL, DOI) and SA-CASSCF and NEVPT2 computed excited-state energies (S1, S2, T1, T2) and oscillator strengths (f1, f2).

Fields:

FileObject | Name | Extract | dataType | description
./BenchmarkSet1500.csv | ID | ID | Text | CCDC Molecule ID.
./BenchmarkSet1500.csv | SMILES | SMILES | Text | Canonical SMILES representation of the molecule.
./BenchmarkSet1500.csv | InChI | InChI | Text | InChI representation of the molecule.
./BenchmarkSet1500.csv | formula | formula | Text | Molecular formula of the molecule.
./BenchmarkSet1500.csv | Number of atoms | #atoms | Float | Number of atoms of the molecule.
./BenchmarkSet1500.csv | ccdc url | ccdc_url | URL | Link to CCDC entry.
./BenchmarkSet1500.csv | doi | doi | Text | DOI of related publication.
./BenchmarkSet1500.csv | SA-CASSCF E(S1) | SA_CASSCF_E(S1) | Float | First singlet excited-state energy calculated using SA-CASSCF (eV).
./BenchmarkSet1500.csv | NEVPT2 E(S1) | NEVPT2_E(S1) | Float | First singlet excited-state energy calculated using NEVPT2 (eV).
./BenchmarkSet1500.csv | SA-CASSCF E(S2) | SA_CASSCF_E(S2) | Float | Second singlet excited-state energy calculated using SA-CASSCF (eV).
./BenchmarkSet1500.csv | NEVPT2 E(S2) | NEVPT2_E(S2) | Float | Second singlet excited-state energy calculated using NEVPT2 (eV).
./BenchmarkSet1500.csv | SA-CASSCF E(T1) | SA_CASSCF_E(T1) | Float | First triplet excited-state energy calculated using SA-CASSCF (eV).
./BenchmarkSet1500.csv | NEVPT2 E(T1) | NEVPT2_E(T1) | Float | First triplet excited-state energy calculated using NEVPT2 (eV).
./BenchmarkSet1500.csv | SA-CASSCF E(T2) | SA_CASSCF_E(T2) | Float | Second triplet excited-state energy calculated using SA-CASSCF (eV).
./BenchmarkSet1500.csv | NEVPT2 E(T2) | NEVPT2_E(T2) | Float | Second triplet excited-state energy calculated using NEVPT2 (eV).
./BenchmarkSet1500.csv | SA-CASSCF f(S1) | SA_CASSCF_f(S1) | Float | S1 oscillator strength calculated using SA-CASSCF.
./BenchmarkSet1500.csv | NEVPT2 f(S1) | NEVPT2_f(S1) | Float | S1 oscillator strength calculated using NEVPT2.
./BenchmarkSet1500.csv | SA-CASSCF f(S2) | SA_CASSCF_f(S2) | Float | S2 oscillator strength calculated using SA-CASSCF.
./BenchmarkSet1500.csv | NEVPT2 f(S2) | NEVPT2_f(S2) | Float | S2 oscillator strength calculated using NEVPT2.
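
As an illustration of the method-comparison use case, the following minimal sketch reads rows in the layout of the Fields table and computes the NEVPT2 minus SA-CASSCF energy shift per molecule. It assumes that the CSV header names match the Extract column (for example `SA_CASSCF_E(S1)`), which should be checked against the downloaded file. The molecule IDs and energies in `SAMPLE` are made-up placeholders, not values from the dataset, and the helper names are hypothetical.

```python
import csv
import io
from statistics import mean


def load_rows(handle):
    """Read a CSV file handle into a list of dicts keyed by column header."""
    return list(csv.DictReader(handle))


def to_float(value):
    """Return a float, or None for empty, missing or non-numeric cells."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def method_shift(rows, state="S1"):
    """NEVPT2 minus SA-CASSCF energy (eV) per molecule ID for one state.

    Column names are ASSUMED to follow the Extract column of the Fields table.
    Molecules with a missing value for either method are skipped.
    """
    shifts = {}
    for row in rows:
        cas = to_float(row.get(f"SA_CASSCF_E({state})"))
        pt2 = to_float(row.get(f"NEVPT2_E({state})"))
        if cas is not None and pt2 is not None:
            shifts[row["ID"]] = pt2 - cas
    return shifts


# Placeholder rows for demonstration only (NOT real dataset values).
SAMPLE = """ID,SA_CASSCF_E(S1),NEVPT2_E(S1),SA_CASSCF_E(T1),NEVPT2_E(T1)
MOL_A,4.00,3.50,3.00,2.75
MOL_B,5.00,4.25,,3.00
"""

rows = load_rows(io.StringIO(SAMPLE))
s1_shift = method_shift(rows, "S1")
t1_shift = method_shift(rows, "T1")  # MOL_B is skipped: SA-CASSCF T1 is empty
mean_s1_shift = mean(s1_shift.values())
print(s1_shift, t1_shift, mean_s1_shift)

# With the downloaded file, the same functions would be used as:
# with open("BenchmarkSet1500.csv", newline="") as fh:
#     rows = load_rows(fh)
```

The standard-library `csv` module is used here so that quoted fields containing commas, as can occur in InChI strings, are parsed correctly without extra dependencies.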

 

Files

BenchmarkSet1500.csv

Files (657.4 kB)

Checksum Size
md5:f825513472a07089849c431def3f611f 632.7 kB
md5:6f9f3dba2cd3b6b2628eaceb014154bd 19.0 kB
md5:6f73b56f803fa54e4ddbf067ff302890 2.9 kB
md5:7ec917b11b39336fd2fde00ea8bc6be0 2.8 kB
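
The md5 checksums listed above can be used to confirm that a download is intact. The sketch below shows one way to do this with Python's standard `hashlib` module. The function names are hypothetical, and because the listing here does not preserve which file name belongs to which checksum, the pairing should be taken from the record page itself. The demonstration runs on a throwaway temporary file rather than on the dataset.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal md5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path, listed):
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected


# Demonstration on a temporary file (not a dataset file).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
demo_path = tmp.name

good = "md5:" + hashlib.md5(b"hello").hexdigest()
demo_ok = matches_listed_checksum(demo_path, good)
# A checksum belonging to a different file must not match.
demo_bad = matches_listed_checksum(demo_path, "md5:f825513472a07089849c431def3f611f")
os.remove(demo_path)
print(demo_ok, demo_bad)
```

For a downloaded file, the call would take the local path and the checksum shown next to that file on the record page.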

Additional details

Domain Specific Metadata

 
Property Value
Data Collection The initial dataset was derived from the approximately 40,000 organic semiconductor molecules reported by Omar et al. (2022), which were originally curated from the Cambridge Structural Database (CSD). This parent set was assembled prior to this work and restricted to well-defined molecular crystals composed of elements commonly found in organic semiconductors, with polymeric systems, disordered solids, and co-crystals excluded. From this space, a subset was selected to enrich for electronically challenging systems relevant to excited-state screening, using TD-DFT-derived criteria targeting small singlet–triplet gaps (S1 - T1 < 0.275 eV) and signatures of double-excitation character (S2 - S1 < 0.250 eV and f2 - 5f1 > 0.350). An additional random sample of around 200 molecules was included to preserve chemical diversity, yielding around 1,500 molecules in total.
Data Collection Type Calculations
Data Collection Missing Data Not applicable
Data Collection Raw Data Initial raw data were geometry files obtained from the CCDC.
Data Annotation Protocol This data source was not annotated as such.
Data Annotation Platform Not applicable
Data Annotation Analysis Not applicable
Annotator Demographics Not applicable
Machine Annotation Tools Not applicable
Annotations Per Item Not applicable
Data Preprocessing Protocol Not applicable
Data Manipulation Protocol The resource integrates standardised, fully automated workflows with multi-level electronic structure methods (CASSCF, NEVPT2) to generate reproducible excited-state data. It covers input generation for ground-state optimisation using Gaussian 16, HPC job submission, error correction, input generation for excited-state calculations using ORCA, and structured data extraction. The full workflow code is available at https://github.com/OrganicAI-Lab/PSDI_Benchmark_Set_1500.
Data Imputation Protocol Not applicable
Data Use Cases Designed for researchers in organic electronics and data-driven chemistry who require reliable and reproducible excited-state data, as well as those developing machine learning models or screening pipelines. Suitable for structure-property analysis, direct quantum chemistry method comparison, and molecular design.
Data Biases Not applicable
Personal Sensitive Information No personal or sensitive information is included in the data.
Data Social Impact Not applicable
Data Limitations Not applicable
Data Release Maintenance Plan The data are being released as a one-off, with no immediate plans for revisions.
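
The TD-DFT selection thresholds quoted in the Data Collection entry can be written out as a small filter. This is a sketch of one reading of that description, not the authors' selection code: it assumes that a molecule qualifies if it meets either the small singlet-triplet gap criterion or the double-excitation criterion, and that "f2-5f1" means f2 minus five times f1. The class and function names are hypothetical, the TD-DFT inputs are not part of BenchmarkSet1500.csv, and the example values are made up.

```python
from dataclasses import dataclass

# Thresholds as quoted in the Data Collection entry.
ST_GAP_MAX_EV = 0.275      # S1 - T1 upper bound (eV)
S2_S1_GAP_MAX_EV = 0.250   # S2 - S1 upper bound (eV)
F_DIFF_MIN = 0.350         # lower bound on f2 - 5*f1 (dimensionless)


@dataclass
class TdDftStates:
    """TD-DFT quantities for one molecule (hypothetical container)."""
    s1: float  # S1 energy, eV
    s2: float  # S2 energy, eV
    t1: float  # T1 energy, eV
    f1: float  # S1 oscillator strength
    f2: float  # S2 oscillator strength


def small_singlet_triplet_gap(m):
    """True when S1 - T1 is below the quoted threshold."""
    return (m.s1 - m.t1) < ST_GAP_MAX_EV


def double_excitation_signature(m):
    """True when S2 - S1 is small and f2 - 5*f1 exceeds the quoted threshold."""
    return (m.s2 - m.s1) < S2_S1_GAP_MAX_EV and (m.f2 - 5 * m.f1) > F_DIFF_MIN


def is_selected(m):
    """ASSUMPTION: either criterion is sufficient for selection."""
    return small_singlet_triplet_gap(m) or double_excitation_signature(m)


# Made-up example molecules (NOT dataset values).
mol_small_gap = TdDftStates(s1=3.0, s2=4.0, t1=2.875, f1=0.5, f2=0.25)
mol_double_exc = TdDftStates(s1=3.0, s2=3.125, t1=2.0, f1=0.0, f2=0.5)
mol_neither = TdDftStates(s1=3.0, s2=4.0, t1=2.0, f1=0.5, f2=0.5)

selection = [is_selected(m) for m in (mol_small_gap, mol_double_exc, mol_neither)]
print(selection)
```

If the two criteria were instead meant to apply jointly, `is_selected` would use `and` rather than `or`; the record does not state which combination was used.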