Machine learning methods for extracting patterns from high-dimensional data are very important in the biological sciences. However, in certain cases, real-world applications cannot confirm the reported prediction performance. One of the main reasons for this is data leakage, which can be seen as the illicit sharing of information between the training data and the test data, resulting in performance estimates that are far better than the performance observed in the intended application scenario. Data leakage can be difficult to detect in biological datasets due to their complex dependencies. With this in mind, we present seven questions that should be asked to prevent data leakage when constructing machine learning models in biological domains. We illustrate the usefulness of our questions by applying them to nontrivial examples. Our goal is to raise awareness of potential data leakage problems and to promote robust and reproducible machine learning-based research in biology.
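One common safeguard against the kind of leakage described above is to split data by group, so that related samples (for example, proteins from the same species or plants from the same field) never end up on both sides of the split. The following is a minimal sketch of that idea only; it is not taken from the paper and does not represent its seven questions. The function name `group_split` and the toy species labels are hypothetical.

```python
import random
from collections import defaultdict


def group_split(sample_groups, test_fraction=0.2, seed=0):
    """Split sample indices so that no group occurs in both train and test."""
    # Collect the sample indices that belong to each group.
    groups = defaultdict(list)
    for idx, group in enumerate(sample_groups):
        groups[group].append(idx)

    # Shuffle the group identifiers (not the samples) reproducibly.
    group_ids = sorted(groups)
    random.Random(seed).shuffle(group_ids)

    # Assign whole groups to the test set; always keep at least one.
    n_test = max(1, round(test_fraction * len(group_ids)))
    test_groups = group_ids[:n_test]
    train_groups = group_ids[n_test:]

    train = sorted(i for g in train_groups for i in groups[g])
    test = sorted(i for g in test_groups for i in groups[g])
    return train, test


# Hypothetical toy data: six samples from three species.
species = ["A", "A", "B", "B", "C", "C"]
train_idx, test_idx = group_split(species, test_fraction=0.34)
```

Splitting at the group level usually gives a more pessimistic, and more realistic, estimate than a purely random split when samples within a group are dependent.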
Linear mixed models (LMMs) are a commonly used method for genome-wide association studies (GWAS) that aim to detect associations between genetic markers and phenotypic measurements in a population of individuals while accounting for population structure and cryptic relatedness. In a standard GWAS, hundreds of thousands to millions of statistical tests are performed, requiring control for multiple hypothesis testing. Typically, static corrections that penalize the number of tests performed are used to control for the family-wise error rate, which is the probability of making at least one false positive. However, it has been shown that in practice this threshold is too conservative for normally distributed phenotypes and not stringent enough for non-normally distributed phenotypes. Therefore, permutation-based LMM approaches have recently been proposed to provide a more realistic threshold that takes phenotypic distributions into account. In this work, we will discuss the advantages of permutation-based GWAS approaches, including new simulations and results from a re-analysis of all publicly available Arabidopsis thaliana phenotypes from the AraPheno database.
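The general idea of a permutation-based significance threshold can be sketched as follows: the phenotype is shuffled many times, the strongest association across all markers is recorded for each shuffle, and a high quantile of these maxima serves as the threshold. This sketch swaps the linear mixed model for a plain absolute Pearson correlation to stay short, so it does not account for population structure. All function names and the simulated data are hypothetical.

```python
import random


def abs_corr(x, y):
    """Absolute Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / (sxx * syy) ** 0.5


def permutation_threshold(genotypes, phenotype, n_perm=100, alpha=0.05, seed=0):
    """Estimate a family-wise threshold from the null distribution of the maximum statistic."""
    rng = random.Random(seed)
    max_stats = []
    for _ in range(n_perm):
        permuted = phenotype[:]
        rng.shuffle(permuted)  # breaks any genotype-phenotype link
        max_stats.append(max(abs_corr(marker, permuted) for marker in genotypes))
    max_stats.sort()
    # Take the (1 - alpha) quantile of the permutation maxima.
    k = min(len(max_stats) - 1, int((1 - alpha) * len(max_stats)))
    return max_stats[k]


# Simulated toy data (assumption): 20 markers, 50 individuals,
# with the phenotype driven by marker 0 plus a little noise.
rng = random.Random(1)
n_individuals, n_markers = 50, 20
genotypes = [[rng.choice([0, 1, 2]) for _ in range(n_individuals)]
             for _ in range(n_markers)]
phenotype = [g + rng.gauss(0, 0.1) for g in genotypes[0]]

threshold = permutation_threshold(genotypes, phenotype)
significant = [j for j, marker in enumerate(genotypes)
               if abs_corr(marker, phenotype) > threshold]
```

Because the threshold is derived from the permuted phenotype itself, it adapts to the phenotypic distribution instead of depending only on the number of tests.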
Current methods for end-to-end constructive neural combinatorial optimization usually train a policy using behavior cloning from expert solutions or policy gradient methods from reinforcement learning. While behavior cloning is straightforward, it requires expensive expert solutions, and policy gradient methods are often computationally demanding and complex to fine-tune. In this work, we bridge the two and simplify the training process by sampling multiple solutions for random instances using the current model in each epoch and then selecting the best solution as an expert trajectory for supervised imitation learning. To achieve progressively improving solutions with minimal sampling, we introduce a method that combines round-wise Stochastic Beam Search with an update strategy derived from a provable policy improvement. This strategy refines the policy between rounds by utilizing the advantage of the sampled sequences with almost no computational overhead. We evaluate our approach on the Traveling Salesman Problem and the Capacitated Vehicle Routing Problem. The models trained with our method achieve comparable performance and generalization to those trained with expert data. Additionally, we apply our method to the Job Shop Scheduling Problem using a transformer-based architecture and outperform existing state-of-the-art methods by a wide margin.
Josef Eiglsperger,
Prof. Dr. Florian Haselbeck,
Viola Stiele,
Claudia Guadarrama Serrano,
Kelly Lim-Trinh,
Prof. Dr. Klaus Menrad,
Prof. Dr. Thomas Hannus,
Prof. Dr. Dominik Grimm
Accurately forecasting demand is a potential competitive advantage, especially when dealing with perishable products. The multi-billion dollar horticultural industry is highly affected by perishability, but has received limited attention in forecasting research. In this paper, we analyze the applicability of general compared to dataset-specific predictors, as well as the influence of external information and online model update schemes. We employ a heterogeneous set of horticultural data, three classical, and twelve machine learning-based forecasting approaches. Our results show a superiority of multivariate machine learning methods, in particular the ensemble learner XGBoost. These advantages highlight the importance of external factors, with the feature set containing statistical, calendrical, and weather-related features leading to the most robust performance. We further observe that a general model is unable to capture the heterogeneity of the data and is outperformed by dataset-specific predictors. Moreover, frequent model updates have a negligible impact on forecasting quality, allowing long-term forecasting without significant performance degradation.
Nikita Genze,
Wouter K Vahl,
Jennifer Groth,
Maximilian Wirth,
Michael Grieb,
Prof. Dr. Dominik Grimm
Sustainable weed management strategies are critical to feeding the world’s population while preserving ecosystems and biodiversity. Therefore, site-specific weed control strategies based on automation are needed to reduce the additional time and effort required for weeding. Machine vision-based methods appear to be a promising approach for weed detection, but require high quality data on the species in a specific agricultural area. Here we present a dataset, the Moving Fields Weed Dataset (MFWD), which captures the growth of 28 weed species commonly found in sorghum and maize fields in Germany. A total of 94,321 images were acquired in a fully automated, high-throughput phenotyping facility to track over 5,000 individual plants at high spatial and temporal resolution. A rich set of manually curated ground truth information is also provided, which can be used not only for plant species classification, object detection and instance segmentation tasks, but also for multiple object tracking.
Fabian Schäfer,
Manuel Walther,
Prof. Dr. Dominik Grimm,
Alexander Hübner
Assigning inpatients to hospital beds impacts patient satisfaction and the workload of nurses and doctors. The assignment is subject to unknown inpatient arrivals, in particular for emergency patients. Hospitals, therefore, need to deal with uncertainty about actual bed requirements and potential shortage situations, as bed capacities are limited. This paper develops a model and solution approach for solving the patient bed-assignment problem that is based on a machine learning (ML) approach to forecasting emergency patients. First, it contributes by improving the anticipation of emergency patients using ML approaches, incorporating weather data, time and dates, important local and regional events, as well as current and historical occupancy levels. Drawing on real-life data from a large case hospital, we were able to improve forecasting accuracy for emergency inpatient arrivals. We achieved up to 17% better root mean square error (RMSE) when using ML methods compared to a baseline approach relying on averages of historical arrival rates. We further show that the ML methods outperform time series forecasts. Second, we develop a new hyper-heuristic for solving real-life problem instances based on the pilot method and a specialized greedy look-ahead (GLA) heuristic. When applying the hyper-heuristic to test sets, we were able to increase the objective function value by up to 5.3% in comparison to the benchmark approach in [40]. A benchmark against a genetic algorithm also shows the superiority of the hyper-heuristic. Third, the combination of ML for emergency patient admission forecasting with advanced optimization through the hyper-heuristic allowed us to obtain an improvement of up to 3.3% on a real-life problem.
Prof. Dr. Florian Haselbeck,
Maura John,
Yuqi Zhang,
Jonathan Pirnay,
Juan Pablo Fuenzalida-Werner,
Ruben Costa,
Prof. Dr. Dominik Grimm
Protein thermostability is important in many areas of biotechnology, including enzyme engineering and protein-hybrid optoelectronics. Ever-growing protein databases and information on stability at different temperatures allow the training of machine learning models to predict whether proteins are thermophilic. In silico predictions could reduce costs and accelerate the development process by guiding researchers to more promising candidates. Existing models for predicting protein thermophilicity rely mainly on features derived from physicochemical properties. Recently, modern protein language models that directly use sequence information have demonstrated superior performance in several tasks. In this study, we evaluate the usefulness of protein language model embeddings for thermophilicity prediction with ProLaTherm, a Protein Language model-based Thermophilicity predictor. ProLaTherm significantly outperforms all feature-, sequence- and literature-based comparison partners on multiple evaluation metrics. In terms of the Matthews correlation coefficient, ProLaTherm outperforms the second-best competitor by 18.1% in a nested cross-validation setup. Using proteins from species not overlapping with species from the training data, ProLaTherm outperforms all competitors by at least 9.7%. On these data, it misclassified only one nonthermophilic protein as thermophilic. Furthermore, it correctly identified 97.4% of all thermophilic proteins in our test set with an optimal growth temperature above 70°C.
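For readers unfamiliar with the metric reported above, the Matthews correlation coefficient for a binary classifier is computed from the four confusion-matrix counts as (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The sketch below is a generic illustration with made-up labels, not ProLaTherm's evaluation code; returning 0.0 when the denominator vanishes is a common convention assumed here.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for binary labels encoded as 0 and 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention (assumption): undefined cases are reported as 0.0.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Made-up example: 1 = thermophilic, 0 = non-thermophilic.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
mcc = matthews_corrcoef(y_true, y_pred)  # TP=2, TN=2, FP=1, FN=1
```

Unlike plain accuracy, this coefficient uses all four confusion-matrix entries, which makes it informative for imbalanced classes.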
Nikita Genze,
Maximilian Wirth,
Christian Schreiner,
Raymond Ajekwe,
Michael Grieb,
Prof. Dr. Dominik Grimm
Background: Efficient and site-specific weed management is a critical step in many agricultural tasks. Image captures from drones and modern machine learning-based computer vision methods can be used to assess weed infestation in agricultural fields more efficiently. However, the image quality of the captures can be affected by several factors, including motion blur. Image captures can be blurred because the drone moves during the image capturing process, e.g. due to wind pressure or camera settings. These influences complicate the annotation of training and test samples and can also lead to reduced predictive power in segmentation and classification tasks. Results: In this study, we propose DeBlurWeedSeg, a combined deblurring and segmentation model for weed and crop segmentation in motion-blurred images. For this purpose, we first collected a new dataset of matching sharp and naturally blurred image pairs of real sorghum and weed plants from drone images of the same agricultural field. The data was used to train and evaluate the performance of DeBlurWeedSeg on both sharp and blurred images of a hold-out test set. We show that DeBlurWeedSeg outperforms a standard segmentation model that does not include an integrated deblurring step, with a relative improvement of 13.4% in terms of the Sørensen-Dice coefficient. Conclusion: Our combined deblurring and segmentation model DeBlurWeedSeg is able to accurately segment weeds from sorghum and background in both sharp and motion-blurred drone captures. This has high practical implications, as lower error rates in weed and crop segmentation could lead to better weed control, e.g. when using robots for mechanical weed removal.
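The Sørensen-Dice coefficient used as the evaluation metric above compares a predicted mask A with a ground-truth mask B as 2|A ∩ B| / (|A| + |B|). The following is a minimal sketch for binary masks given as nested lists; it is a generic illustration, not the DeBlurWeedSeg evaluation code, and returning 1.0 for two empty masks is an assumed convention.

```python
def dice_coefficient(pred_mask, true_mask):
    """Sørensen-Dice coefficient of two binary masks given as nested lists."""
    pred = {(r, c) for r, row in enumerate(pred_mask)
            for c, value in enumerate(row) if value}
    true = {(r, c) for r, row in enumerate(true_mask)
            for c, value in enumerate(row) if value}
    if not pred and not true:
        # Convention (assumption): two empty masks agree perfectly.
        return 1.0
    return 2 * len(pred & true) / (len(pred) + len(true))


# Made-up 2x3 masks: 1 marks a weed pixel, 0 marks everything else.
predicted = [[1, 1, 0],
             [0, 1, 0]]
ground_truth = [[1, 0, 0],
                [0, 1, 1]]
dice = dice_coefficient(predicted, ground_truth)  # 2 * 2 / (3 + 3)
```

For multi-class segmentation (for example weed, sorghum, and background), the coefficient is typically computed per class and then averaged.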
Quirin Göttl,
Jonathan Pirnay,
Prof. Dr. Dominik Grimm,
Prof. Dr.-Ing. Jakob Burger
The determination of liquid phase equilibria plays an important role in chemical process simulation. This work presents a generalization of an approach called the convex envelope method (CEM), which constructs all liquid phase equilibria over the whole composition space for a given system with an arbitrary number of components. For this matter, the composition space is discretized and the convex envelope of the Gibbs energy graph is computed. Employing the tangent plane criterion, all liquid phase equilibria can be determined in a robust way. The generalized CEM is described within a mathematical framework and it is shown to work numerically with various examples of up to six components from the literature.
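The core steps named above (discretize the composition space, compute the convex envelope of the Gibbs energy, read off the phase equilibria) can be illustrated for the simplest case of a binary mixture. The sketch below is not the generalized CEM from the paper: it handles two components only, and the regular-solution Gibbs energy model with interaction parameter `A = 3.0` is an assumption chosen because it produces a miscibility gap. All function names are hypothetical.

```python
import math


def gibbs_energy(x, A=3.0):
    """Dimensionless Gibbs energy of mixing of a regular solution (assumed model)."""
    def xlogx(v):
        return 0.0 if v <= 0.0 else v * math.log(v)
    return xlogx(x) + xlogx(1.0 - x) + A * x * (1.0 - x)


def cross(o, a, b):
    """2D cross product of the vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def lower_convex_hull(points):
    """Lower hull of points sorted by x (Andrew's monotone chain)."""
    hull = []
    for p in points:
        # Drop the last hull point while it lies on or above the new chord.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) <= 0:
            hull.pop()
        hull.append(p)
    return hull


def phase_splits(n_points=201, A=3.0):
    """Return (x_left, x_right) pairs where the envelope bridges the Gibbs curve."""
    h = 1.0 / (n_points - 1)
    points = [(i * h, gibbs_energy(i * h, A)) for i in range(n_points)]
    hull = lower_convex_hull(points)
    # A hull edge longer than one grid step spans an unstable region,
    # so its two end points approximate the coexisting liquid phases.
    return [(p[0], q[0]) for p, q in zip(hull, hull[1:]) if q[0] - p[0] > 1.5 * h]


gaps = phase_splits()
```

With these assumed settings, the sketch finds a single pair of coexisting compositions, symmetric around x = 0.5. For more components, the same idea requires a convex hull in higher dimensions.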
Josef Eiglsperger,
Prof. Dr. Florian Haselbeck,
Prof. Dr. Dominik Grimm
Summary: Time series forecasting is a research area with applications in various domains, yet no predominant method has emerged so far. We present ForeTiS, a comprehensive and open source Python framework that allows rigorous training, comparison, and analysis of state-of-the-art time series forecasting approaches. Our framework includes fully automated yet configurable data preprocessing and feature engineering. In addition, we use advanced Bayesian optimization for automatic hyperparameter search. ForeTiS is easy to use, even for non-programmers, requiring only a single line of code to apply state-of-the-art time series forecasting. Various prediction models, ranging from classical forecasting approaches to machine learning techniques and deep learning architectures, are already integrated. More importantly, as a key benefit for researchers aiming to develop new forecasting models, ForeTiS is designed to allow for rapid integration and fair benchmarking in a reliable framework. Thus, we provide a powerful framework for both end users and forecasting experts. Availability: ForeTiS is available at https://github.com/grimmlab/ForeTiS. We provide a setup using Docker, as well as a Python package at https://pypi.org/project/ForeTiS/. Extensive online documentation with hands-on tutorials and videos can be found at https://foretis.readthedocs.io.
Prof. Dr. Florian Haselbeck,
Maura John,
Prof. Dr. Dominik Grimm
Summary: Predicting complex traits from genotypic information is a major challenge in various biological domains. With easyPheno, we present a comprehensive Python framework enabling the rigorous training, comparison, and analysis of phenotype predictions for a variety of different models, ranging from common genomic selection approaches through classical machine learning to modern deep learning-based techniques. Our framework is easy to use, even for non-programmers, and includes an automatic hyperparameter search using state-of-the-art Bayesian optimization. Moreover, easyPheno provides various benefits for bioinformaticians developing new prediction models. It allows novel models and functionalities to be integrated quickly in a reliable framework and benchmarked against various integrated prediction models in a comparable setup. In addition, the framework allows the assessment of newly developed prediction models under pre-defined settings using simulated data. We provide detailed documentation with various hands-on tutorials and videos explaining the usage of easyPheno to novice users. Availability and Implementation: easyPheno is publicly available at https://github.com/grimmlab/easyPheno and can be easily installed as a Python package via https://pypi.org/project/easypheno/ or using Docker. Supplementary information: Comprehensive documentation including various tutorials complemented with videos can be found at https://easypheno.readthedocs.io/. In addition, we provide examples of how to use easyPheno with real and simulated data in the Supplementary.
Natalia Bercovich,
Nikita Genze,
Marco Todesco,
Gregory L. Owens,
Sébastien Légaré,
Kaichi Huang,
Loren H. Rieseberg,
Prof. Dr. Dominik Grimm
Genomic studies often attempt to link natural genetic variation with important phenotypic variation. To succeed, robust and reliable phenotypic data, as well as curated genomic assemblies, are required. Wild sunflowers, originally from North America, are adapted to diverse and often extreme environments and have historically been a widely used model plant system for the study of population genomics, adaptation, and speciation. Moreover, cultivated sunflower, domesticated from a wild relative (Helianthus annuus), is a global oil crop, ranking fourth in production of vegetable oils worldwide. Public availability of data resources, both for the plant research community and for the associated agricultural sector, is extremely valuable. We have created HeliantHOME (http://www.helianthome.org), a curated, public, and interactive database of phenotypes including developmental, structural and environmental ones, obtained from a large collection of both wild and cultivated sunflower individuals. Additionally, the database is enriched with external genomic data and results of genome-wide association studies. Finally, being a community open-source platform, HeliantHOME is expected to expand as new knowledge and resources become available.
Maura John,
Markus J Ankenbrand,
Carolin Artmann,
Jan A Freudenthal,
Arthur Korte,
Prof. Dr. Dominik Grimm
Motivation: Genome-wide Association Studies (GWAS) are an integral tool for studying the architecture of complex genotype and phenotype relationships. Linear Mixed Models (LMMs) are commonly used to detect associations between genetic markers and a trait of interest, while at the same time accounting for population structure and cryptic relatedness. Assumptions of LMMs include a normal distribution of the residuals and that the genetic markers are independent and identically distributed; both assumptions are often violated in real data. Permutation-based methods can help to overcome some of these limitations and provide more realistic thresholds for the discovery of true associations. Still, in practice they are rarely implemented due to the high computational complexity. Results: We propose permGWAS, an efficient linear mixed model reformulation based on 4D-tensors that can provide permutation-based significance thresholds. We show that our method outperforms current state-of-the-art LMMs with respect to runtime and that permutation-based thresholds have lower false discovery rates for skewed phenotypes compared to the commonly used Bonferroni threshold. Furthermore, using permGWAS we re-analyzed more than 500 Arabidopsis thaliana phenotypes with 100 permutations each in less than eight days on a single GPU. Our re-analyses suggest that applying a permutation-based threshold can improve and refine the interpretation of GWAS results. Availability: permGWAS is open-source and publicly available on GitHub for download: https://github.com/grimmlab/permGWAS
Scientific posters
Ashima Khanna,
Prof. Dr. Florian Haselbeck,
Prof. Dr. Dominik Grimm
Predicting Protein Thermostability through Deep Learning Leveraging 3D Structural Information (2024) Biological Materials Science - A workshop on biogenic, bioinspired, biomimetic and biohybrid materials for innovative optical, photonics and optoelectronics applications 2024.
In protein engineering, improving thermostability is essential for many industrial and pharmaceutical applications. However, the experimental process of identifying stabilizing mutations is time-consuming due to the enormous search space. With the increasing availability of protein structural and thermostability data, computational approaches using deep learning to identify thermostable candidates are gaining popularity. In this work, we present and benchmark a novel graph neural network, ProtGCN, that incorporates geometric and energetic details of proteins to predict changes in Gibbs free energy (ΔG), a key indicator of thermostability, upon single point mutations. Unlike conventional methods that rely on sequence or structural features, our model uses protein graphs with rich node features, carefully preprocessed from a comprehensive dataset of approximately 4149 mutated sequences across 117 protein families. In addition, ProtGCN is enhanced by incorporating embeddings from the Evolutionary Scale Modeling (ESM) protein language model into the protein graphs. This integration allows ProtGCN (ESM) to outperform comparison models, achieving competitive performance with XGBoost and a protein language model-based multi-layer perceptron on all evaluation metrics, and outperforming all models on further analyses. A strength of ProtGCN (ESM) is its ability to correctly identify and predict stabilizing and destabilizing mutations with extreme effects, which are typically underrepresented in thermostability datasets. These results suggest a promising direction for future computational protein engineering research.
Prof. Dr. Florian Haselbeck,
Maura John,
Yuqi Zhang,
Jonathan Pirnay,
Juan Pablo Fuenzalida-Werner,
Ruben Costa,
Prof. Dr. Dominik Grimm
Superior Protein Thermophilicity Prediction With Protein Language Model Embeddings (2024) Biological Materials Science - A workshop on biogenic, bioinspired, biomimetic and biohybrid materials for innovative optical, photonics and optoelectronics applications 2024.
Protein thermostability is an essential property for many biotechnological fields, such as enzyme engineering and protein-hybrid optoelectronics. In this context, machine learning-based in silico predictions have the potential to reduce costs and development time by identifying the most promising candidates for subsequent experiments. The development of such prediction models is enabled by ever-growing protein databases and information on protein stability at different temperatures. In this study, we leverage protein language model embeddings for thermophilicity prediction with ProLaTherm, a Protein Language model-based Thermophilicity predictor. We assess ProLaTherm against several feature-, sequence-, and literature-based comparison partners on a new benchmark dataset derived from a significant update of published data. ProLaTherm outperforms all comparison partners both in a nested cross-validation setup and on protein sequences from species not seen during training with respect to multiple evaluation metrics. In terms of the Matthews correlation coefficient, ProLaTherm surpasses the second-best competitor by 18.1% in the nested cross-validation setup. Using proteins from species that do not overlap with species from the training data, ProLaTherm outperforms all competitors by at least 9.7%. On these data, it misclassified only one non-thermophilic protein as thermophilic. Furthermore, it correctly identified 97.4% of all thermophilic proteins in our test set with an optimal growth temperature above 70°C.
Contributions to monographs, edited volumes, and series
Genome-wide association studies (GWAS) are a powerful tool to elucidate the genotype–phenotype map. Although GWAS are usually used to assess simple univariate associations between genetic markers and traits of interest, it is also possible to infer the underlying genetic architecture and to predict gene regulatory interactions. In this chapter, we describe the latest methods and tools to perform GWAS by calculating permutation-based significance thresholds. For this purpose, we first provide guidelines on univariate GWAS analyses that are extended in the second part of this chapter to more complex models that enable the inference of gene regulatory networks and how these networks vary.
Journal articles
Michael Grieb,
Nikita Genze,
Prof. Dr. Dominik Grimm
Sorghum is grown in Bavaria as an energy crop, primarily for biogas production. Its high biomass yield and wide range of varieties, combined with its drought tolerance and nutrient efficiency, make sorghum a promising raw-material crop. Novel technologies, coupled with intelligent software, open up great potential for increasing efficiency in agriculture. With the help of state-of-the-art machine learning methods, such as artificial neural networks or deep learning, drone-based images of cultivated fields can be analyzed and weeds detected.
AlphaZero-type algorithms may stop improving on single-player tasks if the value network guiding the tree search is unable to approximate the outcome of an episode sufficiently well. One technique to address this problem is transforming the single-player task through self-competition. The main idea is to compute a scalar baseline from the agent’s historical performances and to reshape an episode’s reward into a binary output, indicating whether the baseline has been exceeded or not. However, this baseline carries only limited information for the agent about how to improve its strategies. We leverage the idea of self-competition and directly incorporate a historical policy into the planning process instead of its scalar performance. Based on the recently introduced Gumbel AlphaZero (GAZ), we propose our algorithm GAZ ‘Play-to-Plan’ (GAZ PTP), in which the agent learns to find strong trajectories by planning against possible strategies of its past self. We show the effectiveness of our approach in two well-known combinatorial optimization problems, the Traveling Salesman Problem and the Job-Shop Scheduling Problem. With only half of the simulation budget for search, GAZ PTP consistently outperforms all selected single-player variants of GAZ.
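The scalar-baseline form of self-competition described above (the technique that GAZ PTP moves beyond) can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: the baseline is taken as the mean of a sliding window of past returns, the binary output is encoded as +1 or -1, and the class name `SelfCompetitionReward` is hypothetical.

```python
from collections import deque


class SelfCompetitionReward:
    """Reshape episode returns into a binary signal relative to past performance."""

    def __init__(self, window=10):
        # Sliding window of the agent's most recent episode returns.
        self.history = deque(maxlen=window)

    def reshape(self, episode_return):
        """Return +1 if the episode beat the historical baseline, otherwise -1."""
        if not self.history:
            reward = 1  # no baseline yet; treated as a win (assumption)
        else:
            baseline = sum(self.history) / len(self.history)
            reward = 1 if episode_return > baseline else -1
        self.history.append(episode_return)
        return reward


# Made-up episode returns (higher is better).
shaper = SelfCompetitionReward(window=10)
signals = [shaper.reshape(r) for r in [10, 12, 9, 15]]
```

The binary signal tells the agent only whether it beat its past self, not how, which is the limitation that motivates planning against a historical policy instead.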
The goal is intelligent, adaptable, freely available, and easy-to-use methods for AI- and image-based phenotyping of various leaf diseases for an improved assessment …
The goal is to use drone photos of sorghum and maize to develop and validate maps of the spatial distribution pattern of weed infestation, as a basis for site-specific mechanical …