
It was only in April 2024 that Roman Bushuiev and Anton Bushuiev, from the teams of Tomáš Pluskal at IOCB Prague and Josef Šivic at CIIRC CTU, initiated a collaboration between experts from 14 research institutes across the globe on benchmarking AI methods for the discovery of molecules from mass spectrometry data. The collaborative project, titled MassSpecGym, aims to spark the development of next-generation machine learning models for identifying new molecules from nature, with applications spanning drug development, environmental science, and space exploration. The first success didn’t take long to come. The results of this cross-disciplinary initiative were presented as a Spotlight poster at one of the world’s top machine learning conferences – NeurIPS 2024 in Vancouver – in December 2024.
The discovery of small molecules profoundly influences numerous scientific fields such as organic chemistry, molecular biology, drug development, and environmental analysis. Despite advancements, only a small fraction of life’s molecular diversity has been uncovered. Tandem mass spectrometry (MS/MS) is a cornerstone instrumental technique for identifying molecular structures from biological and environmental samples, enabling applications such as discovering bioactive compounds for drug development, optimizing drug dosages in clinical settings, and detecting environmental pollutants at trace levels. At its core, a tandem mass spectrometer fragments molecules and records the masses of these fragments in so-called MS/MS spectra. “A typical biological or environmental sample produces thousands of tandem mass spectra, each representing a distinct molecule. Yet, annotating these spectra with molecular structures remains a challenge, with fewer than 10% of spectra successfully annotated using state-of-the-art machine learning methods. This leaves much of the chemical space uncovered, limiting our ability to unlock new scientific and technological advancements,” says Tomáš Pluskal from IOCB Prague.
Currently, the development of new AI methods for mass spectrometry is limited by the absence of well-standardized training datasets and evaluation protocols. The project, entitled “MassSpecGym: A benchmark for the discovery and identification of molecules”, addresses this limitation. “Machine learning benchmarks such as ImageNet revolutionized the field of AI by standardizing development, evaluation, and assessment of progress. Similarly, we propose a benchmark for molecular discovery to tackle the critical challenge of annotating tandem mass spectra and aim to foster a new generation of AI models for uncovering the undiscovered space of chemical structures present in nature,” explains Roman Bushuiev, a doctoral student and the main author of the project.

MassSpecGym comprises three core components: (i) the largest publicly available dataset of tandem mass spectra labeled with molecular structures, including newly acquired measurements from Tomáš Pluskal’s laboratory at IOCB Prague, (ii) three machine learning challenges that cast the process of molecular discovery from mass spectra as well-defined computational problems, and (iii) carefully selected held-out pairs of mass spectra and molecules designed to evaluate the ability of AI models to generalize to new chemical space. Additionally, MassSpecGym provides a user-friendly platform for developing and evaluating new AI models.

The development of MassSpecGym was initiated at Dagstuhl Seminar 24181, titled “Computational Metabolomics: Towards Molecules, Models, and Their Meaning”, one of the renowned computer science research seminars held in Germany. The project builds on the collective expertise and contributions of 30 researchers from 14 institutions. A research paper on MassSpecGym was selected for a Spotlight poster presentation at one of the world’s premier machine learning conferences – NeurIPS 2024 in Vancouver.

NeurIPS (Conference on Neural Information Processing Systems) is one of the most prestigious conferences in machine learning (along with ICLR and ICML) and is ranked among the top ten publication venues in all areas of science by Google Scholar. In 2024, NeurIPS received a total of 15,671 submissions, of which 26% were accepted for presentation at the conference. MassSpecGym was selected for a Spotlight presentation, which corresponds to the top 2% of submitted papers. Nearly 16,000 participants attended the conference.
This research was co-funded by EU projects FRONTIER (No. 101097822) and ELIAS (No. 101120237).
Original article: R. Bushuiev, A. Bushuiev, N. F. de Jonge, A. Young, F. Kretschmer, R. Samusevich, J. Heirman, F. Wang, L. Zhang, K. Dührkop, M. Ludwig, N. A. Haupt, A. Kalia, C. Brungs, R. Schmid, R. Greiner, B. Wang, D. S. Wishart, L.-P. Liu, J. Rousu, W. Bittremieux, H. Rost, T. D. Mak, S. Hassoun, F. Huber, J. J. J. van der Hooft, M. A. Stravs, S. Böcker, J. Sivic, T. Pluskal, “MassSpecGym: A benchmark for the discovery and identification of molecules”, Advances in Neural Information Processing Systems (NeurIPS), 2024. https://doi.org/10.48550/arXiv.2410.23326.