Linear mixed models (LMMs) are a commonly used method for genome-wide association studies (GWAS), which aim to detect associations between genetic markers and phenotypic measurements in a population of individuals while accounting for population structure and cryptic relatedness. In a standard GWAS, hundreds of thousands to millions of statistical tests are performed, which requires correcting for multiple hypothesis testing. Typically, static corrections that penalize the number of tests performed are used to control the family-wise error rate, that is, the probability of making at least one false-positive finding. However, it has been shown that in practice this threshold is too conservative for normally distributed phenotypes and not stringent enough for non-normally distributed phenotypes. Therefore, permutation-based LMM approaches have recently been proposed to provide a more realistic threshold that takes phenotypic distributions into account. In this work, we discuss the advantages of permutation-based GWAS approaches, including new simulations and results from a re-analysis of all publicly available Arabidopsis thaliana phenotypes from the AraPheno database.
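The contrast between a static correction and a permutation-based threshold can be illustrated with a minimal Python sketch. It is not the method used in the work above: it assumes a plain marker-by-marker correlation test with a normal approximation instead of an LMM, toy random genotypes, and a skewed toy phenotype. All function names and parameter values are hypothetical and chosen only for illustration.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r, n):
    """Two-sided p-value via the Fisher z-transform (normal approximation)."""
    r = max(min(r, 0.999999), -0.999999)  # keep atanh finite
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))


def bonferroni_threshold(alpha, n_tests):
    """Static threshold: alpha divided by the number of tests."""
    return alpha / n_tests


def permutation_threshold(genotypes, phenotype, alpha, n_perm, rng):
    """Threshold from the null distribution of the minimal p-value.

    The phenotype is shuffled n_perm times; for each shuffle the smallest
    p-value over all markers is recorded. The threshold is the k-th smallest
    of these minima with k = alpha * n_perm (at least 1).
    """
    n = len(phenotype)
    shuffled = list(phenotype)
    min_p_values = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        min_p_values.append(
            min(approx_p_value(pearson_r(marker, shuffled), n) for marker in genotypes)
        )
    min_p_values.sort()
    k = max(1, int(alpha * n_perm))
    return min_p_values[k - 1]


# Toy data (assumption): 50 markers coded 0/1/2 for 60 individuals,
# and an exponentially distributed, i.e. skewed, phenotype.
rng = random.Random(0)
n_individuals, n_markers = 60, 50
genotypes = [
    [rng.choice((0, 1, 2)) for _ in range(n_individuals)] for _ in range(n_markers)
]
phenotype = [rng.expovariate(1.0) for _ in range(n_individuals)]

bonf = bonferroni_threshold(0.05, n_markers)
perm = permutation_threshold(genotypes, phenotype, 0.05, 100, rng)
print(f"Bonferroni threshold:  {bonf:.2e}")
print(f"Permutation threshold: {perm:.2e}")
```

Because the permutation threshold is derived from the observed phenotype values, it adapts to their distribution, whereas the Bonferroni threshold depends only on the number of tests.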
Prof. Dr. Florian Haselbeck,
Maura John,
Yuqi Zhang,
Jonathan Pirnay,
Juan Pablo Fuenzalida-Werner,
Ruben Costa,
Prof. Dr. Dominik Grimm
Protein thermostability is important in many areas of biotechnology, including enzyme engineering and protein-hybrid optoelectronics. Ever-growing protein databases and information on stability at different temperatures allow the training of machine learning models to predict whether proteins are thermophilic. In silico predictions could reduce costs and accelerate the development process by guiding researchers to more promising candidates. Existing models for predicting protein thermophilicity rely mainly on features derived from physicochemical properties. Recently, modern protein language models that directly use sequence information have demonstrated superior performance in several tasks. In this study, we evaluate the usefulness of protein language model embeddings for thermophilicity prediction with ProLaTherm, a Protein Language model-based Thermophilicity predictor. ProLaTherm significantly outperforms all feature-, sequence- and literature-based comparison partners on multiple evaluation metrics. In terms of the Matthews correlation coefficient, ProLaTherm outperforms the second-best competitor by 18.1% in a nested cross-validation setup. Using proteins from species not overlapping with species from the training data, ProLaTherm outperforms all competitors by at least 9.7%. On these data, it misclassified only one non-thermophilic protein as thermophilic. Furthermore, it correctly identified 97.4% of all thermophilic proteins in our test set with an optimal growth temperature above 70°C.
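For readers unfamiliar with the Matthews correlation coefficient (MCC) mentioned above, the following minimal Python sketch computes it for binary labels from the confusion-matrix counts. The labels are made-up toy values (1 = thermophilic, 0 = non-thermophilic), not data from the study, and the function name is chosen for illustration.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """MCC for binary labels (1 = positive class, 0 = negative class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0.0 when a row or column of the confusion matrix is empty.
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Toy example: 3 true positives, 3 true negatives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
mcc = matthews_corrcoef(y_true, y_pred)
print(mcc)  # (3*3 - 1*1) / sqrt(4*4*4*4) = 0.5
```

Unlike accuracy, the MCC uses all four confusion-matrix counts, which makes it informative when the two classes are imbalanced.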
Prof. Dr. Florian Haselbeck,
Maura John,
Prof. Dr. Dominik Grimm
Summary: Predicting complex traits from genotypic information is a major challenge in various biological domains. With easyPheno, we present a comprehensive Python framework enabling the rigorous training, comparison, and analysis of phenotype predictions for a variety of different models, ranging from common genomic selection approaches and classical machine learning to modern deep learning-based techniques. Our framework is easy to use, even for non-programmers, and includes an automatic hyperparameter search using state-of-the-art Bayesian optimization. Moreover, easyPheno provides various benefits for bioinformaticians developing new prediction models. easyPheno makes it possible to quickly integrate novel models and functionalities into a reliable framework and to benchmark against various integrated prediction models in a comparable setup. In addition, the framework allows the assessment of newly developed prediction models under pre-defined settings using simulated data. We provide detailed documentation with various hands-on tutorials and videos explaining the usage of easyPheno to novice users. Availability and Implementation: easyPheno is publicly available at https://github.com/grimmlab/easyPheno and can be easily installed as a Python package via https://pypi.org/project/easypheno/ or using Docker. Supplementary information: Comprehensive documentation, including various tutorials complemented with videos, can be found at https://easypheno.readthedocs.io/. In addition, we provide examples of how to use easyPheno with real and simulated data in the Supplementary.
Maura John,
Markus J Ankenbrand,
Carolin Artmann,
Jan A Freudenthal,
Arthur Korte,
Prof. Dr. Dominik Grimm
Motivation: Genome-wide Association Studies (GWAS) are an integral tool for studying the architecture of complex genotype and phenotype relationships. Linear Mixed Models (LMMs) are commonly used to detect associations between genetic markers and a trait of interest, while at the same time accounting for population structure and cryptic relatedness. Assumptions of LMMs include a normal distribution of the residuals and that the genetic markers are independent and identically distributed; both assumptions are often violated in real data. Permutation-based methods can help to overcome some of these limitations and provide more realistic thresholds for the discovery of true associations. Still, in practice they are rarely implemented due to their high computational complexity. Results: We propose permGWAS, an efficient linear mixed model reformulation based on 4D-tensors that can provide permutation-based significance thresholds. We show that our method outperforms current state-of-the-art LMMs with respect to runtime and that permutation-based thresholds have lower false discovery rates for skewed phenotypes compared to the commonly used Bonferroni threshold. Furthermore, using permGWAS we re-analyzed more than 500 Arabidopsis thaliana phenotypes with 100 permutations each in less than eight days on a single GPU. Our re-analyses suggest that applying a permutation-based threshold can improve and refine the interpretation of GWAS results. Availability: permGWAS is open-source and publicly available on GitHub for download: https://github.com/grimmlab/permGWAS
Maura John,
Prof. Dr. Florian Haselbeck,
Rupashree Dass,
Christoph Malisi,
Christian Dreischer,
Sebastian J Schultheiss,
Prof. Dr. Dominik Grimm
Genomic selection is an integral tool for breeders to accurately select plants directly from genotype data, leading to faster and more resource-efficient breeding programs. Several prediction methods have been established in the last few years. These range from classical linear mixed models to complex non-linear machine learning approaches, such as Support Vector Regression, and modern deep learning-based architectures. Many of these methods have been extensively evaluated on different crop species with varying outcomes. In this work, our aim is to systematically compare twelve different phenotype prediction models, ranging from basic genomic selection methods to more advanced deep learning-based techniques. More importantly, we assess the performance of these models on simulated phenotype data as well as on real-world data from Arabidopsis thaliana and two breeding datasets from soy and corn. The synthetic phenotypic data allow us to analyze all prediction models, and especially the selected markers, under controlled and predefined settings. We show that Bayes B and linear regression models with sparsity constraints perform best under different simulation settings with respect to explained variance. Further, we can confirm results from other studies that more complex neural network-based architectures are not superior to well-established methods for phenotype prediction. However, on real-world data, for which several prediction models yield comparable results with slight advantages for Elastic Net, this picture is less clear, suggesting that there is a lot of room for future research.
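The abstract above compares models by explained variance without stating the formula. As a point of reference, the sketch below computes the commonly used explained variance score, 1 - Var(y - y_pred) / Var(y), in plain Python. Whether the study uses exactly this definition is an assumption, and the numbers are toy values rather than results from the study.

```python
def variance(values):
    """Population variance (divides by the number of values)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def explained_variance(y_true, y_pred):
    """Explained variance score: 1 - Var(residuals) / Var(y_true)."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    total = variance(y_true)
    if total == 0:
        raise ValueError("y_true is constant; explained variance is undefined")
    return 1.0 - variance(residuals) / total


# Toy phenotype values and predictions (illustrative only).
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
score = explained_variance(y_true, y_pred)
print(round(score, 4))  # 1 - 0.3125 / 7.296875, approximately 0.9572
```

A score of 1.0 means the predictions reproduce the phenotype up to a constant offset; values near 0 mean the model explains little of the phenotypic variance.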
Scientific Posters
Prof. Dr. Florian Haselbeck,
Maura John,
Yuqi Zhang,
Jonathan Pirnay,
Juan Pablo Fuenzalida-Werner,
Ruben Costa,
Prof. Dr. Dominik Grimm
Superior Protein Thermophilicity Prediction With Protein Language Model Embeddings (2024). Biological Materials Science - A workshop on biogenic, bioinspired, biomimetic and biohybrid materials for innovative optical, photonics and optoelectronics applications 2024.
Protein thermostability is an essential property for many biotechnological fields, such as enzyme engineering and protein-hybrid optoelectronics. In this context, machine learning-based in silico predictions have the potential to reduce costs and development time by identifying the most promising candidates for subsequent experiments. The development of such prediction models is enabled by ever-growing protein databases and information on protein stability at different temperatures. In this study, we leverage protein language model embeddings for thermophilicity prediction with ProLaTherm, a Protein Language model-based Thermophilicity predictor. We assess ProLaTherm against several feature-, sequence-, and literature-based comparison partners on a new benchmark dataset derived from a significant update of published data. ProLaTherm outperforms all comparison partners both in a nested cross-validation setup and on protein sequences from species not seen during training with respect to multiple evaluation metrics. In terms of the Matthews correlation coefficient, ProLaTherm surpasses the second-best competitor by 18.1% in the nested cross-validation setup. Using proteins from species that do not overlap with species from the training data, ProLaTherm outperforms all competitors by at least 9.7%. On these data, it misclassified only one non-thermophilic protein as thermophilic. Furthermore, it correctly identified 97.4% of all thermophilic proteins in our test set with an optimal growth temperature above 70°C.
Contributions to Monographs, Edited Volumes, and Book Series
Genome-wide association studies (GWAS) are a powerful tool to elucidate the genotype–phenotype map. Although GWAS are usually used to assess simple univariate associations between genetic markers and traits of interest, it is also possible to infer the underlying genetic architecture and to predict gene regulatory interactions. In this chapter, we describe the latest methods and tools to perform GWAS by calculating permutation-based significance thresholds. For this purpose, we first provide guidelines on univariate GWAS analyses that are extended in the second part of this chapter to more complex models that enable the inference of gene regulatory networks and how these networks vary.
Suitable data sources for environmental descriptions are identified and prepared so that they are compatible with genetic data for ML models. With the new ML methods, heterogeneous data from …