Chapter 7. Bioinformatics, Genetics, and Network Models

Scientific applications have multiple black boxes, and what goes inside these boxes is complex and often thought of as magical. However, they all follow a systematic set of protocols. These protocols are well known in the research community. For instance, network models are widely used to represent complex structured data, such as protein networks, molecular genetics, and chemical structures. Another interesting field in the research community is bioinformatics. This is a growing field that has lately generated a considerable amount of breakthrough in research.

In the field of biology, there are many different complex structures, such as DNA sequences, protein structures, and so on. In order to compare, let's take a look at some of the unknown elements within these structures. It is helpful to have a model that will visually display them. Similarly, in any application of the graph theory or networks, it is essentially beneficial to be able to visualize the complex graph structure.

Later in this chapter, we will discuss some interesting examples, such as social networks, directed graph examples in real life, data structures appropriate for these problems, and network analysis. For the purposes of demonstrating examples, here we will use specific libraries, such as metaseq, NetworkX, matplotlib, Biopython, and ETE toolkit, covering the following topics:

  • Directed graphs and multigraphs
  • The clustering coefficient of graphs
  • Analysis of social networks
  • The planar graph test and the directed acyclic graph test
  • Maximum flow and minimum cut
  • A genetic programming example
  • Stochastic block models and random graphs

Directed graphs and multigraphs

First, we will review directed graphs and multigraphs. Later, we will figure out the options in Python to generate them. Also, we will take a look at an example where you may require directed graphs. Before we conceptually describe graphs and directed graphs, let's take a look at the different ways to understand when you can use graphs and directed graphs.

Computers that are connected to each other within a university campus area can be considered a connected graph, where each computer in this connection is viewed as a node or a vertex. The connected path is an edge, and in some cases, if there is only a one-way connection, then it is a directed graph. For instance, a very restricted federal network will not allow any connection from outside to go in, but will probably not restrict the other way around. The following are simple graphs showing distances between places:

Directed graphs and multigraphs

In the preceding examples, the graph with city labels A through F is a directed graph, and the other one on the right-hand side is an undirected graph. In the directed graph, if the arrow points both ways, there is a way to go both ways, whereas in the undirected graph, both ways are assumed. If we were to represent these graphs using some data structure, what would that be? Also, if we were to plot these kinds of graphs, which libraries do we use and how do we accomplish it?

Storing graph data

Graph data is usually represented as an adjacency matrix, unless it is sparse. An adjacency matrix is a matrix that has V2 rows, assuming that the graph has a V vertex or a node. For example, for the two graphs shown in the preceding figure, the adjacency matrix looks similar to the following tables:

 

A

B

C

D

E

F

A

0

25

26

   

B

 

0

85

5

10

 

C

26

85

0

  

10

D

   

0

 

11

E

   

9

0

88

F

   

11

88

0

 

Chicago

Boston

New York

Wash DC

Miami

Dallas

Chicago

0

1613

 

1145

  

Boston

1613

0

338

725

  

New York

 

338

0

383

2145

 

Wash DC

1145

725

383

0

1709

2113

Miami

  

2145

1709

0

2161

Dallas

   

2113

2161

0

For undirected graphs, by symmetry, it is enough to use half the storage (no need to store all the information from A to B and B to A). The blank entries show that there is not enough data about the distance. If the matrix is sparse, where most of the entries are not filled, then you can store it as a list of lists. Fortunately, there are convenient methods in scipy to deal with sparse matrices. The following code is only for the first graph shown in the preceding figure:

import scipy.sparse as sparse

matrixA = sparse.lil_matrix((6,6))

matrixA = sparse.lil_matrix( [[0,25,26,0,0,0], [0,0,85,5,10,0],
   [26,85,0,0,0,10], [0,0,0,0,0,11],[0,0,0,9,0,88],[0,0,0,11,88,0]])
print matrixA   
(0, 1)  25   
(0, 2)  26   
(1, 2)  85   
(1, 3)  5   
(1, 4)  10   
(2, 0)  26   
(2, 1)  85   
(2, 5)  10
(3, 5)  11   
(4, 3)  9   
(4, 5)  88   
(5, 3)  11   
(5, 4)  88

Displaying graphs

The preceding example only shows how to represent the graph using the scipy library (the scipy.sparse package in particular). However, in the following section, we will see how to display these graphs. Although there are numerous Python packages that you can choose from to display graphs, the top three popular choices among these are NetworkX, igraph (from igraph.org), and graph-tool. Let's take a look at an example of graph display using these three packages.

igraph

Originally, igraph was intended for R users, but later, the Python version was added. For smaller graphs, you can add the vertices and edges and display them very easily, but in most cases, graphs are not small; therefore, igraph offers functions that reads the data of a graph from files conveniently and displays it.

Currently, igraph offers several formats, such as dimacs, dl, edgelist, graml, graphdb, gml, lgl, ncol, and pajek. GraphML is an XML-based file format and can be used for large graphs, and the NCOL graph format is suited for large graphs with a weighted edge list. The LGL graph format can also be used for a large graph layout with weighted edges. Most others use a simple textual format. Only the DL file format is fully supported by igraph, and for all others, igraph only supports partial file formats.

Similar to many other Python packages, the good part about igraph is that it offers very convenient ways to configure and display graphs and stores them in the SVG format so that they can be embedded in an HTML file.

Let's take a look at one example that involves the pajek format (for more details on pajek, you can refer to http://vlado.fmf.uni-lj.si/pub/networks/pajek/). There are many other parameters. A few among these are labelcolor, vertexsize, and radius for some vertex shapes. We will see two examples here. The first example has assigned labels and edges for a small graph, whereas the second example reads the data of a graph from a file and displays it. The following example shows a labeled graph using the igraph package:

from igraph import *

vertices = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"]

edges = [(0,1),(1,2),(2,3),(3,4),(4,5),(5,6),(6,7),(7,1),
         (1,8),  (8,2),(2,4),(4,9),(9,5),(5,7),(7,0)]

graphStyle = { 'vertex_size': 20}
g = Graph(vertex_attrs={"label": vertices}, edges=edges, directed=True)
g.write_svg("simple_star.svg", width=500, height=300, **graphStyle)

There are 10 vertices in the star graph that forms five triangles and a pentagon. Also, there are 15 edges because the five triangles complete the set of edges. It is a very simple graph, where each edge is defined by the associated vertex numbers that starts from zero. The following labeled graph plot is the result of the preceding Python example:

igraph

This second example illustrates not only how to read the graph data from a file, but also how to save the plot in the SVG format so that you can embed the SVG data in HTML:

from igraph import read  

g=read("ragusa.net",format="pajek")  

g.vs["color"]="#3d679d" 
g.es["color"]="red" 

graphStyle={ 'vertex_size': 12, 'margin': 6} 
#graphStyle["layout"]=g.layout("fr")  # optional

g.write_svg("ragusa_graph.svg", width=600, height=600,**graphStyle)

The pajek format file is read using the read function from igraph. When you set up the edge and the vertex color, you can generate the SVG format of the graph. There are several different layouts that igraph offers that you can experiment with. The following plot shows a graph that was created using the igraph package by reading the graph data from a file:

igraph

The graph data in the pajek format was obtained from the pajek networks website (http://vlado.fmf.uni-lj.si/pub/networks/pajek/data/gphs.htm) from a file named Rgausa16.net. Once a data file from here is downloaded, you can use it in a similar way and display the graph, as shown in the preceding image. If we use the tinamatr.net data and set the circular layout, then the graph would appear in a circular layout, as shown in the following code:

graphStyle["layout"]=g.layout("circle")

NetworkX

One of the reasons this Python package is called NetworkX is because it is a library for network and graph analysis. From finding the shortest path from a source node or vertex to the destination node or vertex, finding the degree distribution to figure the nodes that are similar to the junction, and finding the clustering coefficient of a graph, there are several ways to perform a graph analysis.

The study of graphs has been around for a while and is applicable in neurobiology, chemistry, social network analysis, page ranks, and many more such interesting areas today. Social networks are assortative truly in the sense of joining similar affiliated members, and biological networks are the opposite. In other words, the friendship between Facebook users or the academicians (who are coauthors) can be visualized easily via graphs. Python packages offer users many options. Often, users choose several of these to combine the best of their individual functionalities.

NetworkX offers graph building and analysis capabilities. You can read and write network data in standard and nonstandard data formats, generate graph networks, analyze their structure, and build several models. The following Python code shows how one can create a directed graph by just using matplotlib:

import matplotlib.pyplot as plt
import pylab
from pylab import rcParams

import networkx as nx
import numpy as np

# set the graph display size as 10 by 10 inches
rcParams['figure.figsize'] = 10, 10

G = nx.DiGraph()

# Add the edges and weights
G.add_edges_from([('K', 'I'),('R','T'),('V','T')], weight=3)
G.add_edges_from([('T','K'),('T','H'),('I','T'),('T','H')], weight=4)
G.add_edges_from([('I','R'),('H','N')], weight=5)
G.add_edges_from([('R','N')], weight=6)

# these values to determine node colors
val_map = {'K': 1.5, 'I': 0.9, 'R': 0.6, 'T': 0.2}
values = [val_map.get(node, 1.0) for node in G.nodes()]

edge_labels=dict([((u,v,),d['weight'])
                 for u,v,d in G.edges(data=True)])

#set edge colors
red_edges = [('R','T'),('T','K')]
edge_colors = ['green' if not edge in red_edges else 'red' for edge in G.edges()]

pos=nx.spring_layout(G)

nx.draw_networkx_edges(G,pos,width=2.0,alpha=0.65)
nx.draw_networkx_edge_labels(G,pos,edge_labels=edge_labels)

nx.draw(G,pos, node_color = values, node_size=1500,
 edge_color=edge_colors, edge_cmap=plt.cm.Reds)

pylab.show()

The following diagram illustrates how you can use NetworkX to configure the edge weights and the visual aesthetics of a graph. Among several approaches of displaying a directed graph, NetworkX took a different approach by showing a thick bar at the end, rather than using an arrow symbol that determines the direction of a graph.

NetworkX

When there is a scientific study that involves a collection of elements that represent things or people, the association between them is better represented in the form of graphs, where these elements are vertices or nodes. In most of these cases, the centrality visually identifies nodes that are significantly important. Python packages (such as NetworkX) have many useful functions for graph analysis that includes finding cliques in the graph. For smaller graphs, it is easier to visually inspect intricate details, but for larger graphs, one would want to recognize a pattern of behavior, such as isolated cluster groups.

Typically, the labels for nodes and edges depend on what you are trying to display as a graph. For instance, protein interaction can be displayed as a graph. A more complex example will be a sequence space graph, where a graph node represents a protein sequence, whereas an edge represents a single DNA mutation. It would be easier for scientists to zoom into these images to see patterns, as shown in the following image. This example does not use Python and uses interactive programming to zoom and view the intricate details.

NetworkX

Note

The preceding image has been taken from http://publications.csail.mit.edu/.

Sometimes, you would want to highlight different routes on a map. For instance, if a road map is being displayed and you have to display the routes that the Olympic cycling team is going to follow this year on this map, you can do something similar to the following code:

import networkx as nx
from pylab import rcParams

# set the graph display size as 10 by 10 inches
rcParams['figure.figsize'] = 10, 10

def genRouteEdges(r):
    return [(r[n],r[n+1]) for n in range(len(r)-1)]

G=nx.Graph(name="python")
graph_routes = [[11,3,4,1,2], [5,6,3,0,1], [2,0,1,3,11,5]]
edges = []
for r in graph_routes:
    route_edges = genRouteEdges(r)
    G.add_nodes_from(r)
    G.add_edges_from(route_edges)
    edges.append(route_edges)

print("Graph has %d nodes with %d edges" %(G.number_of_nodes(),    
G.number_of_edges()))

pos = nx.spring_layout(G)
nx.draw_networkx_nodes(G,pos=pos)
nx.draw_networkx_labels(G,pos=pos)

colors = ['#00bb00', '#4e86cc', 'y']
linewidths = [22,14,10]

for ctr, edgelist in enumerate(edges):
    nx.draw_networkx_edges(G,pos=pos,edgelist=edgelist,
      edge_color = colors[ctr], width=linewidths[ctr])

Using convenient methods from NetworkX for a specific route, you can easily highlight the routes with different colors and line widths, as shown in the following image:

NetworkX

As shown in the preceding image, by controlling the highlights of routes, you can recognize different routes on a map.

Additionally, from the shortest path to degree distribution to clustering coefficients, NetworkX offers a variety of ways to perform a graph analysis. One simple way to see the shortest path is shown in the following code:

import networkx as nx

g = nx.Graph()
g.add_edge('m','i',weight=0.1)
g.add_edge('i','a',weight=1.5)
g.add_edge('m','a',weight=1.0)
g.add_edge('a','e',weight=0.75)
g.add_edge('e','h',weight=1.5) 
g.add_edge('a','h',weight=2.2)

print nx.shortest_path(g,'i','h')
nx.draw(g)

#printed shortest path as result 
['i', 'a', 'h']

One more example using NetworkX (particularly reading the data in the GML format) is the "coappearance of characters in the Les Miserables novel", which we downloaded from the datasets available from gephi.org at https://gephi.org/datasets/lesmiserables.gml.zip.

NetworkX

The preceding plot is the result of the program that reads the association of characters from Les Miserables and creates a network diagram, as shown in the following code:

import networkx as nx
from pylab import rcParams
rcParams['figure.figsize'] = 12, 12

G = nx.read_gml('/Users/kvenkatr/Downloads/lesmiserables.gml', relabel=True)
G8= G.copy()
dn = nx.degree(G8)
for n in G8.nodes():
  if dn[n] <= 8:
    G8.remove_node(n)
pos= nx.spring_layout(G8)
nx.draw(G8, node_size=10, edge_color='b', alpha=0.45, font_size=9, pos=pos)
labels = nx.draw_networkx_labels(G8, pos=pos)

Graph-tool

Among the three packages, igraph, networkx, and graph-tool, the graph-tool package is the hardest to install, especially on a Mac OS. Graph-tool has many convenient functions and is also considered very efficient in terms of centrality-related algorithms. This includes k-core, PageRank, minimum spanning tree, and the single source shortest path. The comparison table is available at https://graph-tool.skewed.de/performance. The module that includes centrality-related algorithms as mentioned earlier is graph_tool.centrality.

import graph_tool.all as gtool

gr = gtool.collection.data["polblogs"]
gr = gtool.GraphView(gr, vfilt=gtool.label_largest_component(gr))

cness = gtool.closeness(gr)

gtool.graph_draw(gr, pos=gr.vp["pos"], vertex_fill_color=cness,
               vertex_size=gtool.prop_to_size(cness, mi=5, ma=15),
               vorder=cness, vcmap=matplotlib.cm.gist_heat,
               output="political_closeness.pdf")
Graph-tool

The prefix centra in the "centrality" word truly means that some entity (in this context, this would be a node or vertex) is central. Also, many other entities are connected to the central entity. So, we can ask a reasonable question, that is, what are the characteristics that makes a vertex important? In the graph_tool centrality module, there are nine centrality-related algorithms that are offered, and PageRank is one among these, in addition to closeness.

PageRank

The graph_tool.centrality .pagerank() function generates the PageRank of the v vertex. Most people who know Google's PageRank understand how the measure works. In a nutshell, it is a way to measure how important web page A is (in terms of how many outside web sites B are depending on web page A and also on how many web pages A depends on – in graph theory they are called in-degree and out-degree). In addition to these, Google applies many other external factors to rank a web page. In the preceding example, if we replace the line that finds closeness by PageRank as follows:

pagerank = gtool.pagerank(gr)

This should generate a graph with the emphasis on PageRank. In addition to the centrality measure, there is one other factor called the clustering coefficient of a graph.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset