Bioinformatics and downstream computational analysis
The Bioinformatics team, located both in Munich and at the NNF-CPR (Novo Nordisk Foundation Center for Protein Research in Copenhagen), uses cutting-edge technology to tackle bioinformatic challenges in mass spectrometry (MS)-based proteomics. We cover a wide range of topics across the entire MS-based proteomics pipeline, from processing raw data to downstream data analysis in clinical applications. Technologically, we build on Python: a modern, dynamically typed language with a clear syntax that is ideal for working at the intersection of different disciplines, and one with a rich ecosystem of packages that gives us access to the latest advances in machine and deep learning. For writing highly performant code for both CPU and GPU, Numba (https://numba.pydata.org/) is one of our favorites, as it can massively speed up computations (often 100x) while letting us focus on our domain knowledge in MS-based proteomics. We have found Jupyter Notebooks to be excellent for developing and sharing ideas; in particular, nbdev (https://github.com/fastai/nbdev) lets us develop entire Python libraries directly from notebooks.
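To illustrate the kind of inner-loop computation Numba accelerates, here is a minimal sketch: an explicit loop over a spectrum, JIT-compiled with `@njit`. The function name and data are illustrative, not from any of our libraries, and the snippet falls back to plain Python if Numba is not installed.

```python
import numpy as np

try:
    from numba import njit  # JIT-compiles the decorated function to machine code
except ImportError:          # fall back to plain Python if Numba is unavailable
    def njit(func):
        return func

@njit
def centroid_mz(mz, intensity):
    """Intensity-weighted mean m/z -- a typical tight-loop computation."""
    total = 0.0
    weighted = 0.0
    for i in range(mz.shape[0]):
        total += intensity[i]
        weighted += mz[i] * intensity[i]
    return weighted / total

mz = np.array([500.1, 500.2, 500.3])
inten = np.array([10.0, 80.0, 10.0])
print(round(centroid_mz(mz, inten), 3))  # 500.2
```

Written as an explicit loop, this style would be slow in pure Python, but Numba compiles it to machine code on first call, which is where the large speedups come from.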
The Clinical Knowledge Graph (CKG) enables users to analyze clinical proteomics data and to integrate and mine knowledge from multiple biomedical databases. The two main objectives of the CKG are (1) to build a graph database combining experimental data with data imported from diverse biomedical databases and (2) to subsequently automate knowledge discovery using all the information contained in the graph. The CKG is powered by the Neo4j graph database management system and Python. For more details on the CKG, see our preprint (Santos et al., bioRxiv, 2020) and the CKG documentation page (https://ckg.readthedocs.io/en/latest/INTRO.html).
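The idea behind mining such a graph can be sketched in a few lines of plain Python: entities are nodes, typed relationships are edges, and a "knowledge discovery" question becomes a traversal. The schema and data below are purely illustrative and are not the CKG's actual model or content; in the CKG itself, such queries run against Neo4j.

```python
# A toy knowledge graph as a list of (source, relationship, target) triples.
edges = [
    ("EGFR", "ASSOCIATED_WITH", "Lung cancer"),
    ("EGFR", "TARGETED_BY", "Gefitinib"),
    ("TP53", "ASSOCIATED_WITH", "Lung cancer"),
]

def neighbors(node, rel):
    """All nodes connected to `node` via relationship type `rel`."""
    return [dst for src, r, dst in edges if src == node and r == rel]

# "Which drugs target proteins associated with lung cancer?"
proteins = [src for src, r, dst in edges
            if r == "ASSOCIATED_WITH" and dst == "Lung cancer"]
drugs = [d for p in proteins for d in neighbors(p, "TARGETED_BY")]
print(drugs)  # ['Gefitinib']
```

A graph database generalizes exactly this pattern: multi-hop questions that would require many joins in a relational schema become short path queries.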
AlphaPept provides a novel, ultra-fast toolbox for MS/MS data analysis. This Python library is designed to serve both programmers and end users by providing an intuitive graphical user interface as well as the opportunity to create highly customized and scalable workflows. The code is highly modular and performance-optimized through just-in-time (JIT) compilation with Numba, efficient parallelization, and GPU computing. We use AlphaPept to build the next generation of MS analysis workflows, to rapidly test and implement new ideas, and to adapt groundbreaking algorithms. Importantly, AlphaPept builds on Python and its scientific stack, but it is also meant as a collaborative development environment with low barriers to entry for community contributions.
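The modular-workflow idea can be sketched as follows: each processing step is an ordinary function, so any step can be swapped out or extended without touching the rest of the pipeline. All names and data here are hypothetical stand-ins, not AlphaPept's actual API.

```python
# Each stage of the toy pipeline is a plain, replaceable function.
def load_spectra(path):
    # Stand-in for reading a raw file; returns fake spectra.
    return [{"mz": [100.0, 200.0], "intensity": [1.0, 2.0]}]

def pick_peaks(spectra):
    # Keep spectra with at least one sufficiently intense peak.
    return [s for s in spectra if max(s["intensity"]) > 0.5]

def search(spectra, database):
    # Stand-in for database search: match every spectrum to the first entry.
    return [{"spectrum": i, "match": database[0]} for i, _ in enumerate(spectra)]

def run_pipeline(path, database, steps):
    spectra = steps["load"](path)
    spectra = steps["pick"](spectra)
    return steps["search"](spectra, database)

steps = {"load": load_spectra, "pick": pick_peaks, "search": search}
results = run_pipeline("run01.raw", ["PEPTIDE"], steps)
print(results)  # [{'spectrum': 0, 'match': 'PEPTIDE'}]
```

Because the pipeline only depends on the step interfaces, a contributor can prototype a new peak picker or search strategy by registering a different function, which is the low-barrier collaboration model described above.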
Machine learning and deep learning in proteomics
We employ state-of-the-art machine learning and deep learning techniques for tasks in raw MS data analysis and spectrum prediction, as well as for the analysis of clinical proteomics data. The data richness of proteomic profiles is ideally suited for machine learning, and we have already shown how it enables the classification of patients with various diseases based on different body fluids, such as the detection of Alzheimer's disease from cerebrospinal fluid (CSF) (Bader et al., Mol Syst Biol., 2020) or the classification of liver disease by plasma proteome profiling (Niu et al., bioRxiv, 2020).
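The principle behind such classification can be sketched with a minimal example: given protein-intensity profiles from two groups, a classifier assigns a new profile to the more similar group. The data below are synthetic and the nearest-centroid rule is only a stand-in for the actual models used in the cited studies.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic "proteome profiles": 10 samples x 5 proteins per group, with the
# two groups differing in mean intensity (illustrative only).
healthy = rng.normal(0.0, 1.0, size=(10, 5))
disease = rng.normal(2.0, 1.0, size=(10, 5))

# Nearest-centroid classifier: assign a profile to the closer group mean.
c_healthy = healthy.mean(axis=0)
c_disease = disease.mean(axis=0)

def classify(profile):
    d_h = np.linalg.norm(profile - c_healthy)
    d_d = np.linalg.norm(profile - c_disease)
    return "disease" if d_d < d_h else "healthy"

print(classify(np.full(5, 2.0)))  # disease
print(classify(np.zeros(5)))      # healthy
```

Real clinical studies of course involve far more proteins, rigorous cross-validation, and stronger models, but the core task, separating groups in a high-dimensional intensity space, is the same.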
One important aspect of research, especially in the 'omics' fields, is providing tools to holistically inspect, exploit, and validate the large datasets being generated. To this end, we develop visual exploration tools for inspecting raw proteomics data as well as algorithmic output and processed results. One powerful approach for rapidly visualizing millions of data points is the HoloViz toolbox (https://holoviz.org/), which also makes it easy to create web applications. An example of such a web application, exploring the data of a recently published study in which we investigated the role of ubiquitination in the circadian cycle, can be found here: http://cyclingubi.biochem.mpg.de/
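The trick that makes millions of points fast to display is rasterization: rather than drawing every point, the data are aggregated into a fixed-size grid of counts that can be rendered as an image. The following is a plain-NumPy sketch of that idea (the HoloViz tools, e.g. Datashader, implement it far more generally); the axis labels are hypothetical examples from MS data.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=1_000_000)   # e.g. retention time of 1M features
y = rng.normal(size=1_000_000)   # e.g. m/z of the same features

# Aggregate 1,000,000 points into a 200x200 grid of counts; the grid, not the
# raw points, is what gets drawn on screen.
counts, _, _ = np.histogram2d(x, y, bins=(200, 200))
print(counts.shape)       # (200, 200)
print(int(counts.sum()))  # 1000000
```

Because the rendering cost now depends on the grid size rather than the number of points, interactive pan-and-zoom over very large datasets stays responsive.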