Superior Protein Thermophilicity Prediction With Protein Language Model Embeddings

Protein thermostability is an essential property in many biotechnological fields, such as enzyme engineering and protein-hybrid optoelectronics. In this context, machine learning-based in silico predictions have the potential to reduce costs and development time by identifying the most promising candidates for subsequent experiments. The development of such prediction models is enabled by ever-growing protein databases and information on protein stability at different temperatures. In this study, we leverage protein language model embeddings for thermophilicity prediction with ProLaTherm, a Protein Language model-based Thermophilicity predictor. We assess ProLaTherm against several feature-, sequence-, and literature-based comparison partners on a new benchmark dataset derived from a significant update of published data. ProLaTherm outperforms all comparison partners with respect to multiple evaluation metrics, both in a nested cross-validation setup and on protein sequences from species not seen during training. In terms of the Matthews correlation coefficient, ProLaTherm surpasses the second-best competitor by 18.1% in the nested cross-validation setup. On proteins from species that do not overlap with the species in the training data, ProLaTherm outperforms all competitors by at least 9.7%. On these data, it misclassified only one non-thermophilic protein as thermophilic. Furthermore, it correctly identified 97.4% of all thermophilic proteins in our test set with an optimal growth temperature above 70°C.
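For readers unfamiliar with the headline metric, the following is a minimal sketch of how the Matthews correlation coefficient is computed for a binary thermophilic versus non-thermophilic classifier. The function name and the confusion-matrix counts are made up for illustration and are not taken from the ProLaTherm evaluation; returning 0.0 for an undefined denominator is a common convention assumed here.

```python
import math


def matthews_corrcoef_from_counts(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts.

    tp: thermophilic proteins predicted as thermophilic
    tn: non-thermophilic proteins predicted as non-thermophilic
    fp: non-thermophilic proteins predicted as thermophilic
    fn: thermophilic proteins predicted as non-thermophilic
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Assumed convention: the coefficient is undefined when any marginal
        # sum is zero, so report 0.0 (no better than chance).
        return 0.0
    return numerator / denominator


# Hypothetical counts, NOT results from the study.
example_mcc = matthews_corrcoef_from_counts(tp=45, tn=40, fp=5, fn=10)
print(f"MCC = {example_mcc:.3f}")
```

The coefficient ranges from -1 (every prediction wrong) through 0 (chance level) to +1 (every prediction correct). Because it uses all four cells of the confusion matrix, it remains informative when the two classes are imbalanced.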

Publication type
Scientific poster
Title
Superior Protein Thermophilicity Prediction With Protein Language Model Embeddings
Media
Biological Materials Science - A workshop on biogenic, bioinspired, biomimetic and biohybrid materials for innovative optical, photonics and optoelectronics applications
Volume
2024
Authors
Prof. Dr. Florian Haselbeck, Maura John, Yuqi Zhang, Jonathan Pirnay, Juan Pablo Fuenzalida-Werner, Ruben Costa, Prof. Dr. Dominik Grimm
Publisher
TUM Campus Straubing for Biotechnology and Sustainability
Publication date
06.06.2024
Citation
Haselbeck, Florian; John, Maura; Zhang, Yuqi; Pirnay, Jonathan; Fuenzalida-Werner, Juan Pablo; Costa, Ruben; Grimm, Dominik (2024): Superior Protein Thermophilicity Prediction With Protein Language Model Embeddings. Biological Materials Science - A workshop on biogenic, bioinspired, biomimetic and biohybrid materials for innovative optical, photonics and optoelectronics applications 2024.