Network Analysis Tools From Biological Networks To Clusters And Pathways Pdf

Published: 25.04.2021

Pathway Network Analysis of Complex Diseases Based on Multiple Biological Networks

Gene and protein interaction experiments provide unique opportunities to study the molecular wiring of a cell. Integrating high-throughput functional genomics data with this information can help identify networks associated with complex diseases and phenotypes.

Here we introduce an integrated statistical framework to test network properties of single and multiple genesets under different interaction models. Our software is designed for easy integration into existing analysis pipelines and to generate high quality figures and reports.

We also developed PyGNA to take advantage of multi-core systems to generate calibrated null distributions on large datasets. We then present the results of extensive benchmarking of the tests implemented in PyGNA and a use case inspired by RNA sequencing data analysis, showing how PyGNA can be easily integrated to study biological networks. We present a tool for network-aware geneset analysis. PyGNA can be used either as a ready-made command-line tool that integrates easily into existing high-performance data analysis pipelines, or as a Python package to implement new tests and analyses.

With the increasing availability of population-scale omic data, PyGNA provides a viable approach for large scale geneset network analysis. The availability of high-throughput technologies enables the characterization of cells with unprecedented resolution, ranging from the identification of single nucleotide mutations to the quantification of protein abundance [ 1 ].

However, these experiments provide information about genes and proteins in isolation, whereas most biological functions and phenotypes are the result of interactions between them. Protein and gene interaction information is rapidly becoming available thanks to high-throughput screens [ 2 ], such as the yeast two-hybrid system, and to downstream annotation and sharing in public databases [ 3 , 4 ].

Thus, it is becoming natural to use interaction data to map single-gene information to biological pathways. However, integrating interaction information with high-throughput experiments has proven challenging. The vast majority of existing analytical methods are based on the concept of over-representation of a candidate set of genes in expert-curated pathways or networks [ 5 , 6 ]; however, this approach is strongly biased by the richer-get-richer effect, where intensively studied genes are more likely to be associated with a pathway [ 7 ], ultimately limiting the power of new discoveries.

Many methods have now been proposed to directly integrate network information for function prediction [ 8 , 9 , 10 ], module detection [ 11 ], gene prioritization [ 12 ] and structure recognition [ 13 ]. However, results are usually sensitive to the underlying network interaction model used and test statistics [ 14 ], and performing analyses across different tools is not feasible, as the vast majority of this software comes either as a web application or visualization plugins.

While web applications are simple to use for targeted analyses, they are also difficult to integrate into high-throughput data analysis pipelines. With the increasing availability of biological interaction resources and the development of standardized high-throughput analysis pipelines, a unified and easy-to-use framework for network characterization of genes and proteins could generate useful information for downstream experimental validation.

The PyGNA analysis workflow. We recognize three main use cases where PyGNA can be used: (i) network analysis of high-throughput experiments, (ii) network analysis of curated genesets and (iii) simulations of networks and genesets for algorithm benchmarking.

Here we build on recent advances in network theory to provide an integrated statistical framework to assess whether a set of candidate genes, or geneset, forms a pathway, that is, a group of genes strongly interacting with each other.

We then extended this framework to perform comparisons between two genesets to find similarities with other annotated networks, as a way to infer function and comorbidities. It is important to note that the tests implemented in our software are not an exhaustive list of all the approaches presented in the literature; here we favoured well-established models whose test statistics are easy to interpret [ 14 ].

Nonetheless, PyGNA provides a flexible API to implement and benchmark new network-based statistical tests, while taking advantage of our data processing and statistical testing framework. Our software is designed with modularity in mind and to take advantage of multi-core processing available in most high-performance computing facilities. PyGNA facilitates the integration with workflow systems, such as Snakemake [ 16 ], thus lowering the barrier to introduce network analysis in existing pipelines.

We conclude by discussing how PyGNA compares to other existing tools and why it represents an advancement for geneset network analysis. We hereby introduce basic notation and properties for network analysis, describing interaction models, test statistics and hypothesis testing methods implemented in PyGNA.

Moreover, unless otherwise stated, we consider only the largest connected component (LCC) of the network; while this is not strictly necessary, distance measures are often not informative when computed over disconnected graphs. We denote as the degree of a node i, deg(i), the number of edges associated with it.

In this context, nodes represent genes or proteins, whereas edges represent the interactions between them. We denote as an interaction model a function that quantifies the strength of interaction between any two nodes in a network. Here we introduce three interaction models with different properties and complexity.

A direct interaction model assumes that two nodes interact only if there is an edge between them; this is the most efficient model to evaluate as it requires only the inspection of the adjacency matrix.
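As a minimal sketch of the direct interaction model described above, the snippet below stores a toy undirected network as an adjacency map and reports whether two nodes share an edge. The helper names (`build_adjacency`, `direct_interaction`) and the four-gene network are hypothetical and are not part of the PyGNA API.

```python
def build_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbours)}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def direct_interaction(adj, u, v):
    """Direct model: 1 if an edge links u and v, 0 otherwise."""
    return 1 if v in adj.get(u, set()) else 0


# Hypothetical four-gene path network: A - B - C - D
adj = build_adjacency([("A", "B"), ("B", "C"), ("C", "D")])
print(direct_interaction(adj, "A", "B"))  # neighbours
print(direct_interaction(adj, "A", "C"))  # two steps apart, no direct edge
```

Looking up a pair only requires one membership test, which is why this model is the cheapest of the three to evaluate.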

Under a shortest path interaction model, instead, we assume that the strength of interaction between two genes is a function of their distance on a network G; that is, closer genes are more likely to interact. However, for sufficiently large k, the probability of interaction between nodes converges to a quantity proportional to the degree of the nodes, thus neglecting local structure information. We can then estimate analytically the probability of interaction at steady state as follows:

In this case, the matrix H can be interpreted as the heat exchanged between each node of the network [ 11 ]. It is also worth noting that the above formulation is agnostic to direction and weights of the edges.
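The propagation idea can be sketched as follows, assuming the standard random walk with restart formulation (the equation itself is missing from the text above, so this is an assumption): at each step the walker returns to its seed node with restart probability `beta` and otherwise moves to a uniformly chosen neighbour. Iterating p ← (1 − beta)·W·p + beta·p0 to convergence gives the same result as the closed form beta·(I − (1 − beta)·W)^(-1)·p0, where W is the degree-normalized adjacency matrix. The function name, the value of `beta` and the toy network are illustrative only.

```python
def rwr_matrix(adj, beta=0.2, tol=1e-10, max_iter=1000):
    """Return H as a nested dict: H[seed][node] is the steady-state
    probability of finding a walker restarted at `seed` on `node`.

    Assumes an unweighted, undirected network in which every node has
    at least one neighbour (for example, the largest connected component).
    """
    nodes = sorted(adj)
    H = {}
    for seed in nodes:
        p = {n: 1.0 if n == seed else 0.0 for n in nodes}
        for _ in range(max_iter):
            # Restart mass goes back to the seed node.
            nxt = {n: (beta if n == seed else 0.0) for n in nodes}
            # Remaining mass is split evenly among neighbours.
            for u in nodes:
                share = (1.0 - beta) * p[u] / len(adj[u])
                for v in adj[u]:
                    nxt[v] += share
            delta = max(abs(nxt[n] - p[n]) for n in nodes)
            p = nxt
            if delta < tol:
                break
        H[seed] = p
    return H


# Hypothetical path network: A - B - C - D
adj = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
H = rwr_matrix(adj)
print(H["A"])
```

Because the update ignores edge direction and weights, the sketch is agnostic to both, in line with the remark above.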

These three interaction models capture different topological properties. Direct models provide information about the neighborhood of a gene and its observed links. However, they might not be sufficiently powered to detect mid- and long-range interactions, thus statistics defined under these models are usually sensitive to missing links.

Conversely, modelling gene interactions using shortest paths provides a simple analytical framework to include local and global awareness of the connectivity. However, this approach is also sensitive to missing links and small-world effects, which are common in biological networks and could lead to false positives [ 19 ].
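The shortest path model can be illustrated with a breadth-first search on an unweighted network. The sketch below computes pairwise distances and averages them over a geneset; the averaging is an illustrative choice and not necessarily the statistic implemented in PyGNA, and all names are hypothetical. It assumes a connected network and a geneset with at least two genes.

```python
from collections import deque


def shortest_path_lengths(adj, source):
    """Breadth-first search: hop distance from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adj[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def average_geneset_distance(adj, geneset):
    """Mean shortest path length over all unordered pairs of genes in `geneset`."""
    genes = sorted(geneset)
    total, pairs = 0, 0
    for i, u in enumerate(genes):
        dist = shortest_path_lengths(adj, u)
        for v in genes[i + 1:]:
            total += dist[v]
            pairs += 1
    return total / pairs


# Hypothetical path network: A - B - C - D
adj = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
print(average_geneset_distance(adj, {"A", "B", "D"}))
```

Removing a single edge from such a sparse network can change many pairwise distances at once, which is one way to see the sensitivity to missing links mentioned above.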

Propagation models provide an analytical model to overcome these limitations, and have been shown to be robust for biological network analysis [ 20 ]. While its interpretation is not necessarily straightforward, the random walk with restart (RWR) model is more robust than the shortest path model, because it effectively adjusts interaction effects for network structure; it rewards nodes connected by many shortest paths, and penalizes those that are connected only by paths going through high-degree nodes.

Based on the above interaction models, we have implemented and tested different statistics, which are described in detail below. We are interested in testing whether the strength of interaction between nodes of the geneset is higher than expected by chance for a geneset of the same size. Under a direct interaction model, the importance of a geneset S can be quantified as the number of edges connecting each node in S to any other node in the network; we refer to this quantity as the total degree of the node.

Thus, we define the total degree statistic for a geneset S as:

Conversely, with the direct interaction model, the strength of interaction for a geneset S can be quantified as the number of edges connecting each node in S to any other node in the geneset; we refer to this quantity as the internal degree of the node. Thus, we define the internal degree statistic for a geneset S as:

However, the main limitation of this model lies in the fact that it only captures direct interactions, whereas biological networks are usually characterized by medium- and long-range interactions.
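The two degree-based quantities can be sketched as below. The exact normalisation of the statistics is not recoverable from the text, so averaging over the geneset is an assumption made here for illustration; the function names and the toy network are hypothetical.

```python
def total_degree(adj, geneset):
    """Average number of edges from each geneset node to ANY node in the network.
    Averaging over the geneset is an assumption of this sketch."""
    return sum(len(adj[g]) for g in geneset) / len(geneset)


def internal_degree(adj, geneset):
    """Average number of edges from each geneset node to OTHER geneset nodes.
    Averaging over the geneset is an assumption of this sketch."""
    members = set(geneset)
    return sum(len(adj[g] & members) for g in members) / len(members)


# Hypothetical path network: A - B - C - D
adj = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
geneset = {"A", "B", "C"}
print(total_degree(adj, geneset))
print(internal_degree(adj, geneset))
```

In this toy case node C has one neighbour outside the geneset, so its total degree is larger than its internal degree.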

A shortest path interaction model overcomes this limitation by explicitly taking into account the distance between nodes. The topological and association statistics are ultimately used for hypothesis testing. To do that, we need a calibrated null distribution to estimate whether the observed statistics are more extreme than expected by chance. A closed-form definition of the null distribution is possible only for very simple network models, which are often unrealistic.

Thus, null distributions are estimated empirically through a bootstrap procedure. An empirical p-value can then be derived as follows:

It is straightforward to adapt this formula to the case of testing whether a test statistic is smaller than expected by chance.
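Since the formula is missing here, the snippet below uses a commonly used empirical p-value estimator as a stand-in: the number of null statistics at least as large as the observed one, with one added to both numerator and denominator so that the estimate is never zero. Whether PyGNA uses exactly this correction is an assumption; the function name and `alternative` parameter are hypothetical.

```python
def empirical_pvalue(observed, null_values, alternative="greater"):
    """Empirical p-value from a resampled null distribution.

    alternative="greater": is the observed statistic larger than expected?
    alternative="less":    is the observed statistic smaller than expected?
    The +1 correction is a common convention, assumed here.
    """
    if alternative == "greater":
        n_extreme = sum(1 for q in null_values if q >= observed)
    elif alternative == "less":
        n_extreme = sum(1 for q in null_values if q <= observed)
    else:
        raise ValueError("alternative must be 'greater' or 'less'")
    return (n_extreme + 1) / (len(null_values) + 1)


null = [1.0, 2.0, 3.0, 4.0]  # hypothetical null statistics
print(empirical_pvalue(3.0, null))
print(empirical_pvalue(3.0, null, alternative="less"))
```

The `alternative="less"` branch is the adaptation mentioned above for statistics that are expected to be smaller than chance.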

The default sampler generates null distributions by sampling nodes uniformly at random. However, certain metrics might be particularly sensitive to local network structure, especially when they solely rely on degree-related statistics to characterize a geneset.

To overcome this problem, we also implemented an additional sampler that generates null distributions matching the degree distribution of the tested dataset.
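One possible way to build such a sampler is sketched below: each gene in the tested geneset is replaced by a random node with the same degree, so every random geneset reproduces the degree profile of the original. Matching on exact degree (rather than on degree bins) is an assumption of this sketch, and the function name and toy network are hypothetical.

```python
import random
from collections import defaultdict


def degree_matched_sample(adj, geneset, rng):
    """Draw a random geneset whose nodes have the same degrees as `geneset`."""
    by_degree = defaultdict(list)
    for node in sorted(adj):
        by_degree[len(adj[node])].append(node)
    chosen = set()
    for gene in sorted(geneset):
        # Candidates: unused nodes with the same degree as the original gene.
        pool = [n for n in by_degree[len(adj[gene])] if n not in chosen]
        chosen.add(rng.choice(pool))
    return chosen


# Hypothetical network: hub H linked to A, B, C, D, plus an edge A - B
adj = {
    "H": {"A", "B", "C", "D"},
    "A": {"H", "B"},
    "B": {"H", "A"},
    "C": {"H"},
    "D": {"H"},
}
rng = random.Random(0)
sample = degree_matched_sample(adj, {"A", "C"}, rng)
print(sorted(sample))
```

On real networks exact matching can leave very few candidates for high-degree genes, which is why binning degrees is a frequent alternative.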

For the geneset network association (GNA) tests, it is important to note that we are now dealing with two genesets. Hence, a null distribution can be computed either by sampling two random genesets or by sampling only one of the two; we recognize that the latter is more conservative, and is recommended when checking for association with known pathways (see Additional file 1).

Rigorous benchmarking of network analysis tools is challenging, because there is no ground truth for geneset network analysis [ 14 ]. Stochastic block models (SBM) have been shown to be a reasonable model for analyzing biological networks [ 22 ]; importantly, since SBMs define a generative process over networks, they can be used to create networks with controllable features, including modules (often also referred to as clusters). A new network with n nodes can be generated by assigning each node to a block and adding edges probabilistically using the block model matrix.

Hence, by modulating the values on the diagonal of the block model matrix, we can assess the performance of geneset network topology (GNT) tests by analyzing the genesets made of the genes in a block. By parametrizing the off-diagonal terms of the block model matrix, it is possible to assess the performance of GNA tests (see Additional file 1 for a graphical representation of the SBMs). While SBMs are useful to simulate networks with controllable structures, they are difficult to adapt to modelling networks with highly connected nodes (hubs), which are common in biological networks.
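The generative process described above can be sketched in a few lines: nodes are assigned to blocks, and each pair of nodes is connected with the probability given by the corresponding entry of the block model matrix. The function name, block sizes and probabilities below are hypothetical and chosen only for illustration.

```python
import random


def sample_sbm(block_sizes, block_matrix, seed=0):
    """Sample an undirected network from a stochastic block model.

    block_sizes  : number of nodes in each block
    block_matrix : block_matrix[a][b] is the edge probability between
                   a node in block a and a node in block b
    Returns (membership, edges), with nodes labelled 0..n-1.
    """
    rng = random.Random(seed)
    membership = []
    for block, size in enumerate(block_sizes):
        membership.extend([block] * size)
    n = len(membership)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < block_matrix[membership[i]][membership[j]]:
                edges.append((i, j))
    return membership, edges


# Diagonal terms control within-block density (GNT-style benchmarks);
# off-diagonal terms control between-block density (GNA-style benchmarks).
membership, edges = sample_sbm([30, 30], [[0.5, 0.02], [0.02, 0.5]], seed=1)
print(len(edges))
```

Raising a diagonal entry makes the corresponding block a denser module, while raising an off-diagonal entry makes two blocks more strongly associated.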

Thus, here we introduce a stochastic generation procedure to build networks with hubs, which can then be used for assessing the performance of GNT tests. We hereby describe each model in detail. Similar to the approach outlined for GNT benchmarking, we used the SBM framework to generate networks with multiple gene clusters to assess the performance of GNA tests.

With this parametrization, we can directly simulate 3 different scenarios:

By varying the size of highly connected blocks and their interaction probability, along with the geneset composition, it is possible to assess the true positive rate (TPR) and false positive rate (FPR) of GNA tests.

With the high degree nodes (HDN) model, we can replicate a common scenario where the tested geneset is made of a few master regulators and many, possibly unrelated, genes. Here, the idea is that a robust GNT test should have a low false positive rate, even when observed statistics might be skewed by a few highly connected nodes.

PyGNA is implemented as a Python package and can be used as a standalone command-line application or as a library to develop custom analyses. In particular, our framework is implemented following the object-oriented programming (OOP) paradigm, and provides classes to perform data pre-processing, statistical testing, reporting and visualization.

Our basic workflows are summarized in Fig. It is important to note that parsers for new data can be easily implemented by extending the ReadData abstract class. To facilitate integration into bioinformatics pipelines, PyGNA stores results as CSV files, for downstream manipulation and sharing, although new formats can be supported by extending the Output class.

It is important to note that performing tests on large networks using either shortest path or random walk models is computationally taxing. However, since the pairwise node metrics depend only on the network structure, they can be computed upfront as part of a pre-processing step. Here, we save matrices in Hierarchical Data Format (HDF5), using the PyTables framework [ 24 ], for efficient matrix storage.

On this point, we designed PyGNA to perform efficiently both on low-memory machines, using memory-mapped input/output, and in high-performance computing environments, by loading matrices directly into memory. It is important to note that PyGNA can be easily extended to use different test statistics by defining new Python functions; in our online documentation, we provide a complete example of how to build GNT tests based on the closeness centrality of the nodes.

A bottleneck of our network analysis framework is the bootstrap procedure used to obtain a null distribution for hypothesis testing.

However, the resampling procedure is trivially parallelizable, since each randomly sampled set of nodes is independent of the others; thus, we implemented a parallel sampler using the multiprocessing Python library, allowing the user to set the number of cores to use. If only one core is requested, the multiprocessing architecture is not set up, sparing the overhead incurred by setting up a scheduler for running only one thread (see Additional file 1). It is important to note that, currently, Python 3.
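The parallel resampling strategy described above can be sketched with `multiprocessing.Pool`. This is not PyGNA's actual sampler: the toy degree table, the statistic and all function names are hypothetical. Each task carries its own seed so that samples are independent and reproducible, and the pool is bypassed when a single core is requested.

```python
import random
from multiprocessing import Pool

# Hypothetical toy network summary: node -> degree
NODES = list(range(100))
DEGREE = {n: (n % 5) + 1 for n in NODES}


def null_statistic(task):
    """Compute one null statistic: the mean degree of a random geneset."""
    seed, size = task
    rng = random.Random(seed)
    sample = rng.sample(NODES, size)
    return sum(DEGREE[n] for n in sample) / size


def null_distribution(size, n_samples, cores=1):
    """Generate `n_samples` null statistics, optionally in parallel."""
    tasks = [(seed, size) for seed in range(n_samples)]
    if cores == 1:
        # Skip the pool entirely to avoid its start-up overhead.
        return [null_statistic(task) for task in tasks]
    with Pool(processes=cores) as pool:
        return pool.map(null_statistic, tasks)


if __name__ == "__main__":
    serial = null_distribution(size=10, n_samples=200, cores=1)
    parallel = null_distribution(size=10, n_samples=200, cores=2)
    print(serial == parallel)
```

Because `pool.map` preserves task order and each task is seeded explicitly, the serial and parallel runs produce the same list of statistics.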

PyGNA has been developed to generate high quality figures for each analysis and to export networks and genesets in standard formats compatible with graph visualization software, such as Cytoscape [ 25 ].

The visualization functions are implemented as part of the PygnaFigure class, which comes with sensible default parameters to maximize figure readability.

Barplots are used to plot the GNT results for a single statistic. For each geneset a red bar represents the observed statistic, whereas a blue one represents the average of the empirical null distribution.

BioNetStat: A Tool for Biological Networks Differential Analysis

The Bader Lab is involved in a number of collaborative open-source bioinformatics projects designed to make biological pathway data easy to visualize and analyze. Additional features are available as plugins. Plugins are available for network and molecular profiling analyses, new layouts, additional file format support and connection with databases. Plugins may be developed using the Cytoscape open Java software architecture by anyone and plugin community development is encouraged. Node and edge attributes of any type and paths of unknown length can be specified in the search. Clusters mean different things in different types of networks.

Protein-protein interaction networks (PPIs) collect information on physical and, in some cases, functional interactions between proteins. Most PPIs are annotated with confidence scores, which reflect the probability that a reported interaction is a true interaction. These scores, however, do not allow users to isolate interactions relevant in a particular biological context. Context filtering.


Network Analysis Tools: from biological networks to clusters and pathways. September ; Nature Protocols 3(10)


PyGNA: a unified framework for geneset network analysis

Received 27 January Published 4 June Volume Pages 11–

Tools for visualization and analysis of molecular networks, pathways, and -omics data


The study of interactions among biological components can be carried out by using methods grounded in network theory. Most of these methods focus on the comparison of two biological networks. However, biological systems often present more than two biological states. To compare two or more networks simultaneously, we developed BioNetStat, a Bioconductor package with a user-friendly graphical interface. BioNetStat compares correlation networks based on the probability distribution of a feature of the graph. The analysis of the structural alterations in the network reveals significant modifications in the system.


Mapping biological data to a network

Networks in biology can appear complex and difficult to decipher. We illustrate how to interpret biological networks with the help of frequently used visualization and analysis patterns. Networks represent relationships. In a biological context, many different types of relationships can be measured, such as physical interactions between proteins or genetic interactions revealed by combinations of mutations. When large collections of diverse relationships are generated from several different high-throughput experimental analyses of a single biological system, network visualization and analysis can prove particularly useful 1 — 3. To illustrate how data visualized as a network can be easier to interpret than long lists of proteins, interactions and correlations, we analyze an example network representing the yeast chromosome maintenance and duplication machinery Fig. We apply these patterns to our example network and provide references for further reading to tutorials that describe specialized network analysis software.


2 Response
  1. Adelaida S.

    Biological pathways play important roles in the development of complex diseases, such as cancers, which are multifactorial complex diseases that are usually caused by multiple disorders gene mutations or pathway.
