Introduction

The objectives of this introductory class in Network Science are threefold. Our principle aims are summarized below:

  • To provide a general overview of the study of complex systems through networks with examples from biology.
  • To give a background on the mathematical theory of graphs that forms the formalistic basis of network science.
  • To provide hands-on examples of network analysis with the use of R.

What are networks?

Network: A theoretical, mathematical structure that describes the relations or associations between similar elements or groups therof representing them schematically.

Examples from Biology

  1. Transcriptional regulatory networks: Relationships between genes represent regulatory potential.
  2. Reaction networks (metabolic, signaling, etc.): Proteins or/and metabolites are relevant if they participate in the same reaction.
  3. Physical Interaction networks: Two elements are associated when proven to be in physical interaction (contact, attachment, adsorption).
A human protein-protein interaction network

A human protein-protein interaction network

Why networks?

  • Realization of associations. By studying networks we realize the associations/links/causal connections between components of a system.
  • Recognition of motifs We can observe recurring motifs in a network, that carry increased informational value and may underlie hidden mechanisms (e.g. regulatory network motifs that correspond to common mechanisms of gene regulation)
  • Modeling of complex processes We can build predictive models to explain complex behaviour (e.g. in epidemiology)
  • Discovery of new emergent properties We are often led to discover previously unthought of properties at systems level (e.g. the scale-free property)

Types of Networks

From the formalistic point of view, networks are mathematical representations (graphs), in which specific elements are depicted in the form of nodes (or vertices). Associations between nodes are depicted in the form of links (or edges).

Depending on the different types of associations The networks may be: a) directed or undirected b) weighted or unweighted c) simple or bi/multi-partite

Examples of Biological Networks

Examples of Biological Networks

Undirected, Directed and Weighted Networks

Different types of networks may describe the same system depending on the level of available information. If only associations are known we describe the network as undirected. If there are known types of associations that may be positive or negative (e.g. in a regulatory network) we describe a network as directed. Links now have a direction and edges are described as “incoming” or “outgoing”. If besides the simple association there are also recorded quantitative aspects (e.g. how much a link is corresponding to a certain function) then the links become weighted and the edges are assigned a numerical value or score.

Different levels of information in a regulatory network

Different levels of information in a regulatory network

Bipartite Networks

Bipartite networks are networks formed by nodes of two different types, with edges not allowed between elements of the same type. Bipartite networks may be “collapsed” to simple networks through joining of nodes, originally connected via nodes of the alternative type. Examples of bipartite networks in biology are metabolic networks where substrates and enzymes form two distinct -non interacting- categories of components.

Schematic of a bipartite network and its collapsed simple network

Schematic of a bipartite network and its collapsed simple network

Network Motifs

As in the general case, network motifs are modules of the network that tend to occur more often than expected by chance. In this sense they correspond to types of interactions/associations that may be reflecting common mechanistic relationships between components of the network. Regulatory Network Motifs in S. cerevisiae In the Figure above we see different motifs of regulation. These are a) auto-activation, b) auto-repression, c) self-activating loop, d) feed forward stimuation, e) central input and f) cascade with negative feedback. All of them are common regulatory modules in the studied organism and may help us understand some basic mechanisms of gene regulation.

Graph Theory

Network Concepts

  • Nodes. They are the components of the graph/network
  • Edges. They are the associative links between the network’s/graph’s components

Network Representations

  • Adjacency matrices:
    When representing a network we can think of a square matrix where each row and column correspond to a node. In this setting \(M[i,j]\) may be equal to 0 or 1 (for undirected, unweighted networks), -1, 0, 1 (for unweighted, directed networks) or it can hold any real value (for weighted, directed networks).
  • Edge lists:
    Even for modestly sized networks of \(N\) nodes, adjacency matrices of \(N^2\) are too big to hold a small number of values, as networks are usually sparse. For this reason we usually prefer edge list representations, where each edge in the network is represented as a row of three elements (node1, node2, edge-weight).
Network view, edge list view and adjacency matrix of a network

Network view, edge list view and adjacency matrix of a network

Graph Theory Basics

  • Node Degree is the number of links associated with a given node.

We denote with k(i) the degree of the i-th node in the network. This is the number of links associated to i. In an undirected network the total number of links, L, can be expressed as the sum of the node degrees:

\[L=\frac{1}{2} \sum_{i=1}^{N}\frac{1}{k_i}\]

Here the 1/2 factor corrects for the fact that in the sum each link is counted twice.

In a directed network the 1/2 factor is ommitted as the degree for each node is counted separately for incoming and outgoing nodes.

  • Mean Degree

An important measure of centrality for a given network is the mean degree, which corresponds to the average number of links for the total of the networks node’s. In an undirected network this is equal to:

\[<k>=\frac{1}{N}\sum_{i=1}^{N}{k_i}=\frac{2L}{N}\]

  • Density:
    The mean degree is often associated with the density of a network. We define as density the ratio of the number of edges over the maximum possible number of edges. In this regard it is different from the mean degree since it can take values from 0(no associations) to 1 (fully connected graph).

Exercise: Calculate the maximum number of edges for a directed and an undirected network with N nodes.

  • Shortest Path The shortest path is defined as the minimum number of edges one must traverse to get from a given node to another given node. Calculation of the shortest path requires the implementation of processes such as Breadth First Search (see below).
Paths and shortest path between nodes

Paths and shortest path between nodes

  • Average (shortest) path length Is another centrality measure that represents the overall “connectedness” of the network.

\[d=\frac{1}{N(N-1)}\sum_{i,j=1}^{N}d_{i,j}\]

In order to calculate it one needs to calculate all shortest paths in the network which are \(N(N-1)\). This is not a trivial task but can be solved with the application of a Breadth First Search (BFS) approach. Breadth First Search (BFS) is a graph algorithm that enumerates all nodes in the graph. One can use it to calculate the shortest path of each pair of nodes by starting from a given node and traversing all edges and marking all visited nodes with the length of the path, as shown in the figure below.

Paths and shortest path between nodes

Paths and shortest path between nodes

  • Diameter
    We define as diameter (d) the longest among all possible shortest paths in a given network.
Node degree, density and shortest path

Node degree, density and shortest path

  • Clustering Coefficient:
    Clustering coefficient is a very important measure of centrality. It relates the overall connectivity of the network with the tendency of neighbouring nodes to create links between them. It is given by the number of links that are shared between the first neighbours of a given node. More formally, it may be calculated through the formula:

\[C_i=\frac{2L_i}{k_i(k_i-1)}\] In geometric terms, the clustering coefficient may be seen as the number of “triangular” relationships, formed by the links between the first neighbours of a given node. In social terms, it measures how likely it is that a person’s “friends” are friends with each other.

Clustering Coefficient The clustering coefficient is assigned to particular nodes but an overall average may be easily calculated through the arithmetic mean of all nodes:

\[<C>=\frac{1}{N}\sum_{i=1}^{N}C_i\]

Node degree and Clustering Coefficient

Node degree and Clustering Coefficient

  • Connected Components:
    Network connectivity may vary across the network. A common problem is to define isolated connected components in a network. As connected components we define parts of the graph that are connected with each other but not with the rest of the network. Bridge-elements are defined as nodes that have links with network subsets that would otherwise be isolated connected components.
Discovery of Connected Components and Bridges through the Adjacency Matrix

Discovery of Connected Components and Bridges through the Adjacency Matrix

Exercise: Describe an algorithm that will scan a network for connected components

  • Assortativity:
    We define as assortativity the mean correlation of any property between nodes that are connected with each other. Assortativity may be calculated on any given property of the network but a very common one is the node degree. Degree assortativity is calculated as the mean correlation of node degree between connected nodes.

Exercise: Propose a way to calculate degree assortativity from an adjacency matrix (or an edge list). What can we say of networks that are a) highly assortative in terms of degree or b) of low degree assortativity?

Random Networks

Random networks are formed under the assumption that each node is associated at random with any other node. Formation of a Random Network

Random networks, also called Erdos-Renyi networks from the duo of Hungarian mathematicians who studied them in detail, can be modeled on the basis of two parameters (N, p). N is the number of nodes in the network and p is the probability of two nodes being connected.

Starting from this simple assumption we can deduce the number of edges in a random network as:

\[L=p\frac{N(N-1)}{2}\]

from which it stems that the average degree is:

\[<k>=\frac{2L}{N}=p(N-1)\] The clustering coefficient in a random network is:

\[C_i=\frac{2<L_i>}{k_i(k_i-1)}\] where \(k_i\) is the number of neighbours of node \(i\). If we combine this equation with the one above on the number of links for the whole network, we get that: \[<C>=p=\frac{<k>}{N}\] and thus that the mean clustering coefficient of a random network is equal to the network’s scaled mean degree, which, in turn, is equal to the probability of two nodes being linked.

  • Degree distribution The degree distribution of a random network would formally be modelled by a binomial distribution as each pair of nodes has a probability of \(p\) to be linked and \(1-p\) not to. It may be shown that for a large number of nodes \(>=100\) this can be simulated with the Poisson distribution.

Similarity of Poisson and Binomial Distributions for large N We thus prefer the latter as it may describe Random Networks with one parameter only, namely \(p\).

  • Mean degree and Structure of the network

How may the mean degree \(<k>\) be affecting the overall structure of a random network? Depending on the value of \(<k>\) a random network may be formed by isolated components, may form one giant component, or even be totally connected. A critical value of \(<k>\) is 1, below which the network is formed in isolated components and above which the emergence of a giant component is observed. In brief for \(<k>>1\) random networks form a component that links all nodes.

Random Network Structure depending on <k>

Random Network Structure depending on \(<k>\)

Scale-free Networks

Most of the naturally occurring networks are not random. Their degree distribution are not Poisson-like as a considerable number of nodes have higher degrees than expected under randomness. (These are called hubs).

Node degree distributions of random vs scale-free networks

Node degree distributions of random vs scale-free networks

Naturally occurring networks show a particularly consistent node degree distribution according to which:

\[p_k \sim k^γ\] by a log-transformation this becomes:

\[log(p) \sim γlog(k)\] This means that in naturally occurring networks the probability of a node p to have k links is scaling with the logarithm of k. γ is usually between 2 and 3. We say that the degree distribution follows a power-law distribution which is linear in double-logarithmic scale as may be seen in the Figure below for three networks from very different fields.

Node degree distributions of random vs scale-free networks

Node degree distributions of random vs scale-free networks

The power-law property

But what is so interesting in this power-law property? In a number of ways it reflects the dynamics underlying the formation, the evolution and the stability of these networks. First of all it describes networks were the majority of the links are distributed among the minority of nodes. This is reminiscent of a Pareto distribution or the 80-20 rule according to which 20% of the nodes collect 80% of the links and vice versa. This disparity actually means that the range of degree values is much greater than in a random network and so there is significant probability to find nodes with both very high and very low degrees.

The importance of the power-law property

The importance of the power-law property

Scale-free property

It seems that the same structure may be repeated in different scales of magnitude from the very small to the very large, in contrary to random networks where most of the nodes may be contained in a tight range around the mean. This property is what gives these networks their name as “scale-free networks”. The scale-free property emerges in very different systems from linguistics to population dynamics to biology.

Scale free properties in various systems

Scale free properties in various systems

It also gives somes additional properties to the networks that carry it. Compared to random networks, scale-free networks are:
1. Μore connected. Average shortest paths are shorter than in random networks. These are even shorter when they involve hubs and, due to the 80-20 principle, most of them do.
2. More robust. Since most of the links are associated to a few nodes it is more difficult to destroy the network’s structure by random attacks to nodes. This will require targeting the hubs.
3. Stably evolving. Attachment of new nodes further enforces the scale-free property since it is done “preferentially”, meaning that the likelihood of linking a new node with a hub.

Preferential attachment

Preferential attachment

References

  • Network Science, Albert-Laszlo Barabasi (Chapters 2,3 and 4)
  • Υπολογιστική Βιολογία, Χριστόφορος Νικολάου (Κεφάλαιο 9)