A team of computational scientists at Oak Ridge National Laboratory (ORNL) has created and released massive datasets containing the ultraviolet-visible spectral properties of more than 10 million organic molecules. The effort aims to improve understanding of how molecules interact with light, which is key to uncovering their electronic and optical properties. These properties matter for applications such as solar cells and medical imaging systems.

Using high-performance computing resources available at the Oak Ridge Leadership Computing Facility, the researchers performed quantum chemistry calculations to generate these extensive datasets. Multiple atomistic material modeling calculations were conducted for each organic molecule to compute different excited-state properties of interest. The findings of this study were published in Scientific Data.

The main objective behind creating these open-source datasets is to train a deep learning model capable of identifying molecules with specific optoelectronic and photoreactivity properties. This approach offers a faster and more efficient alternative to current methods used for molecular design.

Lead author Massimiliano Lupo Pasini, a data scientist in ORNL’s Computational Sciences and Engineering Division, explained why deep learning (DL) models matter for molecular design. He stated, “The use of DL models for molecular design is essential because the chemical space that must be explored for the search of these molecules is extremely large.” At that scale, traditional experiments are too labor-intensive and first-principles calculations too computationally demanding to be practical. Deep learning models offer a promising way around these limitations.

To manage the large volumes of data involved, the researchers developed scalable workflow software in collaboration with ORNL computer scientist Kshitij Mehta. The software handles the files generated by the quantum mechanics code without overwhelming the file system.
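The article does not describe how the workflow software keeps the file system from being overwhelmed. One common approach on shared parallel file systems, shown here purely as an illustrative assumption and not as the ORNL implementation, is to bundle many small per-molecule output files into a small number of larger archives. In this minimal Python sketch the function name `pack_outputs`, the archive naming scheme, and the batch size are all hypothetical.

```python
import tarfile
from pathlib import Path


def pack_outputs(src_dir, archive_dir, files_per_archive=1000):
    """Bundle the files in src_dir into numbered tar archives.

    Hypothetical helper: it illustrates reducing the file count on a
    shared file system and is not the workflow described in the article.
    Returns the list of archive paths that were written.
    """
    if files_per_archive < 1:
        raise ValueError("files_per_archive must be at least 1")

    src = Path(src_dir)
    dest = Path(archive_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # Sort so the assignment of files to archives is reproducible.
    files = sorted(p for p in src.iterdir() if p.is_file())

    archives = []
    for start in range(0, len(files), files_per_archive):
        batch = files[start:start + files_per_archive]
        archive_path = dest / f"batch_{start // files_per_archive:05d}.tar"
        with tarfile.open(archive_path, "w") as tar:
            for path in batch:
                # Store only the file name, not the full source path.
                tar.add(path, arcname=path.name)
        archives.append(archive_path)
    return archives
```

With a batch size of 1,000, ten million small output files would end up in roughly ten thousand archives, which is far easier on file-system metadata than ten million individual entries.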

As a proof of concept, the team first generated the GDB-9-Ex dataset, comprising 96,766 molecules. They demonstrated that the workflow effectively predicts the position and intensity of peaks in the ultraviolet-visible spectrum. Encouraged by this result, the researchers expanded their efforts and created the ORNL_AISD-Ex dataset, which includes over 10.5 million molecules. This dataset provides information about each molecule’s excitation modes and its HOMO-LUMO gap, the energy difference between the highest occupied and lowest unoccupied molecular orbitals, which serves as an indicator of stability. With this data, a deep learning model like HydraGNN can efficiently identify promising molecules for different applications.
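To make these quantities concrete, here is a minimal Python sketch of how a HOMO-LUMO gap and the strongest absorption peak could be read off from per-molecule data. It is not taken from the ORNL workflow or from the datasets' actual file format: the function names, the input layout (a list of orbital energies in eV, and a list of excitation energy and oscillator strength pairs), and the sample numbers are all assumptions made for illustration. The gap calculation assumes a closed-shell molecule with two electrons per occupied orbital, and the wavelength conversion uses the standard relation λ (nm) ≈ 1239.84 / E (eV).

```python
HC_EV_NM = 1239.841984  # Planck constant times speed of light, in eV*nm


def homo_lumo_gap(orbital_energies_ev, n_electrons):
    """Return the LUMO minus HOMO energy (eV) for a closed-shell molecule."""
    if n_electrons <= 0 or n_electrons % 2 != 0:
        raise ValueError("expected a positive, even electron count")
    energies = sorted(orbital_energies_ev)
    n_occupied = n_electrons // 2  # two electrons per occupied orbital
    if n_occupied >= len(energies):
        raise ValueError("no unoccupied orbital available")
    homo = energies[n_occupied - 1]
    lumo = energies[n_occupied]
    return lumo - homo


def strongest_peak(excitations):
    """Return (wavelength_nm, oscillator_strength) of the most intense excitation.

    excitations is a list of (energy_ev, oscillator_strength) pairs.
    """
    if not excitations:
        raise ValueError("no excitations given")
    energy_ev, strength = max(excitations, key=lambda pair: pair[1])
    if energy_ev <= 0:
        raise ValueError("excitation energy must be positive")
    return HC_EV_NM / energy_ev, strength


if __name__ == "__main__":
    # Made-up numbers for illustration only.
    print(homo_lumo_gap([-10.5, -7.2, -1.1, 2.3], n_electrons=4))
    print(strongest_peak([(4.0, 0.1), (5.0, 0.8)]))
```

A screening model would predict values like these directly from molecular structure; the sketch only shows what the target quantities mean.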

An upcoming paper will detail the results of HydraGNN’s training on these datasets and the molecular discoveries made. This approach to molecular design could expedite the development of new materials with tailored properties.