DNA sequencing using dynamic programming

We have just seen how to find the longest common subsequence. Using the same principle, we can implement DNA or protein sequencing, which can be very helpful for us in solving bioinformatic problems. For alignment purpose, we will use the most popular algorithm known as the Needleman-Wunsch algorithm. It is similar to our LCS algorithm, but the scoring system is different. Here, we score a match, mismatch, and gap in a different scoring system. There are two parts of the algorithm: one to calculate the matrix with possible sequence and the second part is tracking back the actual sequence with the best possible one. The Needleman-Wunsch algorithm provides the best global alignment solution for any given sequence. Since the algorithm itself is little bigger along with the scoring system explanation, which we can find in many websites or books, we want to keep our focus on the implementation part of the algorithm. We will divide the problem into two parts. First, we will generate the computational table using dynamic programming, and then, we will track it backwards to generate the actual sequence alignment. For our implementation, we will use 1 for matching, and -1 for gap penalty and mismatch score. Here is the first part of our implementation:

define("GC", "-"); 
define("SP", 1);
define("GP", -1);
define("MS", -1);

function NWSquencing(string $s1, string $s2) {
$grid = [];
$M = strlen($s1);
$N = strlen($s2);

for ($i = 0; $i <= $N; $i++) {
$grid[$i] = [];
for ($j = 0; $j <= $M; $j++) {
$grid[$i][$j] = null;
}
}
$grid[0][0] = 0;

for ($i = 1; $i <= $M; $i++) {
$grid[0][$i] = -1 * $i;
}

for ($i = 1; $i <= $N; $i++) {
$grid[$i][0] = -1 * $i;
}

for ($i = 1; $i <= $N; $i++) {
for ($j = 1; $j <= $M; $j++) {
$grid[$i][$j] = max(
$grid[$i - 1][$j - 1] + ($s2[$i - 1] === $s1[$j - 1] ? SP :
MS), $grid[$i - 1][$j] + GP, $grid[$i][$j - 1] + GP
);
}
}

printSequence($grid, $s1, $s2, $M, $N);
}

Here, we have created a two-dimensional array of size M,N, where M is the size of string #1 and N is the size of string #2. We initialized the first row and column of the grid to a negative value in the decreasing order. We multiplied the index with a gap penalty to achieve this behavior. Here, our constant SP indicates the matching score point, MS for mismatch score, GP for gap penalty, and GC indicates the Gap Character, which we will use during sequence printing. At the end of dynamic programming, the matrix will be generated. Let's consider the following two strings:

$X = "GAATTCAGTTA"; 
$Y = "GGATCGA";

Then, our table will look like this after running the Needleman algorithm:

G

A

A

T

T

C

A

G

T

T

A

0

-1

-2

-3

-4

-5

-6

-7

-8

-9

-10

-11

G

-1

1

0

-1

-2

-3

-4

-5

-6

-7

-8

-9

G

-2

0

0

-1

-2

-3

-4

-5

-4

-5

-6

-7

A

-3

-1

1

1

0

-1

-2

-3

-4

-5

-6

-5

T

-4

-2

0

0

2

1

0

-1

-2

-3

-4

-5

C

-5

-3

-1

-1

1

1

2

1

0

-1

-2

-3

G

-6

-4

-2

-2

0

0

1

1

2

1

0

-1

A

-7

-5

-3

-1

-1

-1

0

2

1

1

0

1

Now, using this scoring table, we can find out the actual sequence. Here, we will start from the bottom-right cell in the table and consider the top cell, left cell, and the diagonal cell values. If the max value among the three cells is the top one, then the top string requires an insertion of gap character (-). If the maximum value is the diagonal one, then there is a better chance of matching. So, we can compare the two characters of the two strings, and if they match, then we can put a bar or pipe character to show the alignment. Here is what the sequencing function will look like:

function printSequence($grid, $s1, $s2, $j, $i) { 
$sq1 = [];
$sq2 = [];
$sq3 = [];

do {
$t = $grid[$i - 1][$j];
$d = $grid[$i - 1][$j - 1];
$l = $grid[$i][$j - 1];
$max = max($t, $d, $l);

switch ($max) {
case $d:
$j--;
$i--;
array_push($sq1, $s1[$j]);
array_push($sq2, $s2[$i]);
if ($s1[$j] == $s2[$i])
array_push($sq3, "|");
else
array_push($sq3, " ");
break;

case $t:
$i--;
array_push($sq1, GC);
array_push($sq2, $s2[$i]);
array_push($sq3, " ");
break;

case $l:
$j--;
array_push($sq1, $s1[$j]);
array_push($sq2, GC);
array_push($sq3, " ");
break;
}
} while ($i > 0 && $j > 0);

echo implode("", array_reverse($sq1)) . " ";
echo implode("", array_reverse($sq3)) . " ";
echo implode("", array_reverse($sq2)) . " ";
}

Since we are starting from back and slowly moving to the front, we are using array push to keep the alignment in order. Then, we are printing the array by reversing it. The complexity of the algorithm is O (M * N). Here is the output if we call NWSquencing for our two strings $X and $Y:

G-AATTCAGTTA
| | | | | |
GGA-T-C-G--A
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset