Chapter 8. Find a Motif in DNA: Exploring Sequence Similarity

In the Rosalind SUBS challenge, I’ll be searching for any occurrences of one sequence inside another. A shared subsequence might represent a conserved element such as a marker, gene, or regulatory sequence. Conserved sequences between two organisms might suggest some inherited or convergent trait. I’ll explore how to write a solution using the str (string) class in Python and will compare strings to lists. Then I’ll explore how to express these ideas using higher-order functions and will continue the discussion of k-mers I started in Chapter 7. Finally, I’ll show how regular expressions can find patterns and will point out problems with overlapping matches.

In this chapter, I’ll demonstrate:

  • How to use str.find(), str.index(), and string slices

  • How to use sets to create unique collections of elements

  • How to combine higher-order functions

  • How to find subsequences using k-mers

  • How to find possibly overlapping sequences using regular expressions

Getting Started

The code and tests for this chapter are in 08_subs. I suggest you start by copying the first solution to the program subs.py and requesting help:

$ cd 08_subs/
$ cp solution1_str_find.py subs.py
$ ./subs.py -h
usage: subs.py [-h] seq subseq

Find subsequences

positional arguments:
  seq         Sequence
  subseq      subsequence

optional arguments:
  -h, --help  show this help message and exit

The program should report the starting locations where the subsequence can be found in the sequence. As shown in Figure 8-1, the subsequence ATAT can be found at positions 2, 4, and 10 in the sequence GATATATGCATATACTT:

$ ./subs.py GATATATGCATATACTT ATAT
2 4 10
mpfb 0801
Figure 8-1. The subsequence ATAT can be found at positions 2, 4, and 10

Run the tests to see if you understand what will be expected, then start your program from scratch:

$ new.py -fp 'Find subsequences' subs.py
Done, see new script "subs.py".

Here is how I define the program’s parameters:

class Args(NamedTuple): 1
    """ Command-line arguments """
    seq: str
    subseq: str


def get_args() -> Args: 2
    """ Get command-line arguments """

    parser = argparse.ArgumentParser(
        description='Find subsequences',
        formatter_class=argparse.ArgumentDefaultsHelpFormatter)

    parser.add_argument('seq', metavar='seq', help='Sequence')

    parser.add_argument('subseq', metavar='subseq', help='subsequence')

    args = parser.parse_args()

    return Args(args.seq, args.subseq) 3
1

The Args class will have two string fields, seq and subseq.

2

The function returns an Args object.

3

Package and return the arguments using Args.

Have your main() print the sequence and subsequence:

def main() -> None:
    args = get_args()
    print(f'sequence = {args.seq}')
    print(f'subsequence = {args.subseq}')

Run the program with the expected inputs and verify that it prints the arguments correctly:

$ ./subs.py GATATATGCATATACTT ATAT
sequence = GATATATGCATATACTT
subsequence = ATAT

Now you have a program that should pass the first two tests. If you think you can finish this on your own, please proceed; otherwise, I’ll show you one way to find the location of one string inside another.

Finding Subsequences

To demonstrate how to find the subsequence, I’ll start by defining the following sequence and subsequence in the REPL:

>>> seq = 'GATATATGCATATACTT'
>>> subseq = 'ATAT'

I can use in to determine if one sequence is a subset of another. This also works for membership in lists, sets, or keys of a dictionary:

>>> subseq in seq
True

That’s good information, but it doesn’t tell me where the string can be found. Luckily there’s the str.find() function, which says subseq can be found starting at index 1 (which is the second character):

>>> seq.find(subseq)
1

I know from the Rosalind description that the answer should be 2, 4, and 10. I just found 2, so how can I find the next? I can’t just call the same function again because I’ll get the same answer. I need to look further into the sequence. Maybe help(str.find) could be of some use?

>>> help(str.find)
find(...)
    S.find(sub[, start[, end]]) -> int

    Return the lowest index in S where substring sub is found,
    such that sub is contained within S[start:end].  Optional
    arguments start and end are interpreted as in slice notation.

    Return -1 on failure.

It appears I can specify a start position. I’ll use 1 greater than the position where the first subsequence was found, which was 1, so starting at 2:

>>> seq.find(subseq, 2)
3

Great. That’s the next answer—well, 4 is the next answer, but you know what I mean. I’ll try that again, this time starting at 4:

>>> seq.find(subseq, 4)
9

That was the last value I expected. What happens if I try using a start of 10? As the documentation shows, this will return -1 to indicate the subsequence cannot be found:

>>> seq.find(subseq, 10)
-1

Can you think of a way to iterate through the sequence, remembering the last position where the subsequence was found until it cannot be found?

Another option would be to use str.index(), but only if the subsequence is present:

>>> if subseq in seq:
...     seq.index(subseq)
...
1

To find the next occurrence, you could slice the sequence using the last known position. You’ll have to add this position to the starting position, but you’re essentially doing the same operation of moving further into the sequence to find if the subsequence is present and where:

>>> if subseq in seq[2:]:
...     seq.index(subseq[2:])
...
1

If you read help(str.index), you’ll see that, like str.find(), the function takes a second optional start position of the index at which to start looking:

>>> if subseq in seq[2:]:
...     seq.index(subseq, 2)
...
3

A third approach would be to use k-mers. If the subsequence is present, then it is by definition a k-mer, where k is the length of the subsequence. Use your code from Chapter 7 to extract all the k-mers and their positions from the sequence, and note the positions of the k-mers that match the subsequence.

Finally, since I’m looking for a pattern of text, I could use a regular expression. In Chapter 5, I used the re.findall() function to find all the Gs and Cs in DNA. I can similarly use this method to find all the subsequences in the sequence:

>>> import re
>>> re.findall(subseq, seq)
['ATAT', 'ATAT']

That seems to have a couple of problems. One is that it only returned two of the subsequences when I know there are three. The other problem is that this provides no information about where the matches are found. Fear not, the re.finditer() function solves this second problem:

>>> list(re.finditer(subseq, seq))
[<re.Match object; span=(1, 5), match='ATAT'>,
 <re.Match object; span=(9, 13), match='ATAT'>]

Now it’s apparent that it finds the first and last subsequences. Why doesn’t it find the second instance? It turns out regular expressions don’t handle overlapping patterns very well, but some additions to the search pattern can fix this. I’ll leave it to you and some internet searching to see if you can figure out a solution.

I’ve presented four different options for how to solve this problem. See if you can write solutions using each approach. The point is to explore the corners of Python, storing away tasty bits and tricks that might prove decisive in some future program you write. It’s OK to spend hours or days working this out. Keep at it until you have solutions that pass both pytest and make test.

Solutions

All of the solutions share the same get_args() shown previously.

Solution 1: Using the str.find() Method

Here is my first solution using the str.find() method:

def main() -> None:
    args = get_args()
    last = 0 1
    found = [] 2
    while True: 3
        pos = args.seq.find(args.subseq, last) 4
        if pos == -1: 5
            break
        found.append(pos + 1) 6
        last = pos + 1 7

    print(*found) 8
1

Initialize the last position to 0, the start of the sequence.

2

Initialize a list to hold all the positions where the subsequence is found.

3

Create an infinite loop using while.

4

Use str.find() to look for the subsequence using the last known position.

5

If the return is -1, the subsequence is not found, so exit the loop.

6

Append one greater than the index to the list of found positions.

7

Update the last known position with one greater than the found index.

8

Print the found positions using * to expand the list into the elements. The function will use a space to separate multiple values.

This solution turns on keeping track of the last place the subsequence was found. I initialize this to 0:

>>> last = 0

I use this with str.find() to look for the subsequence starting at the last known position:

>>> seq = 'GATATATGCATATACTT'
>>> subseq = 'ATAT'
>>> pos = seq.find(subseq, last)
>>> pos
1

As long as seq.find() returns a value other than -1, I update the last position to one greater to search starting at the next character:

>>> last = pos + 1
>>> pos = seq.find(subseq, last)
>>> pos
3

Another call to the function finds the last instance:

>>> last = pos + 1
>>> pos = seq.find(subseq, last)
>>> pos
9

Finally, seq.find() returns -1 to indicate that the pattern can no longer be found:

>>> last = pos + 1
>>> pos = seq.find(subseq, last)
>>> pos
-1

This solution would be immediately understandable to someone with a background in the C programming language. It’s a very imperative approach with lots of detailed logic for updating the state of the algorithm. State is how data in a program changes over time. For instance, properly updating and using the last known position is key to making this approach work. Later approaches use far less explicit coding.

Solution 2: Using the str.index() Method

This next solution is a variation that slices the sequence using the last known position:

def main() -> None:
    args = get_args()
    seq, subseq = args.seq, args.subseq 1
    found = []
    last = 0
    while subseq in seq[last:]: 2
        last = seq.index(subseq, last) + 1 3
        found.append(last) 4

    print(' '.join(map(str, found))) 5
1

Unpack the sequence and subsequence.

2

Ask if the subsequence appears in a slice of the sequence starting at the last found position. The while loop will execute as long as this condition is true.

3

Use str.index() to get the starting position of the subsequence. The last variable gets updated by adding 1 to the subsequence index to create the next starting position.

4

Append this position to the list of found positions.

5

Use map() to coerce all the found integer positions to strings, then join them on spaces to print.

Here again, I rely on tracking the last place a subsequence was found. I start at position 0, or the beginning of the string:

>>> last = 0
>>> if subseq in seq[last:]:
...     last = seq.index(subseq, last) + 1
...
>>> last
2

The while True loop in the first solution is a common way to start an infinite loop. Here, the while loop will only execute as long as the subsequence is found in the slice of the sequence, meaning I don’t have to manually decide when to break out of the loop:

>>> last = 0
>>> found = []
>>> while subseq in seq[last:]:
...     last = seq.index(subseq, last) + 1
...     found.append(last)
...
>>> found
[2, 4, 10]

The found positions, in this case, are a list of integer values. In the first solution, I used *found to splat the list and relied on print() to coerce the values to strings and join them on spaces. If instead I were to try to create a new string from found using str.join(), I would run into problems. The str.join() function joins many strings into a single string and so raises an exception when you give it nonstring values:

>>> ' '.join(found)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: sequence item 0: expected str instance, int found

I could use a list comprehension to turn each number n into a string using the str() function:

>>> ' '.join([str(n) for n in found])
'2 4 10'

This can also be written using a map():

>>> ' '.join(map(lambda n: str(n), found))
'2 4 10'

I can leave out the lambda entirely because the str() function expects a single argument, and map() will naturally supply each value from found as the argument to str(). This is my preferred way to turn a list of integers into a list of strings:

>>> ' '.join(map(str, found))
'2 4 10'

Solution 3: A Purely Functional Approach

This next solution combines many of the preceding ideas using a purely functional approach. To start, consider the while loops in the first two solutions used to append nonnegative values to the found list. Does that sound like something a list comprehension could do? The range of values to iterate includes all the positions n from 0 to the end of the sequence minus the length of the subsequence:

>>> r = range(len(seq) - len(subseq))
>>> [n for n in r]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

A list comprehension can use these values with str.find() to search for the subsequence in the sequence starting at each position n. Starting at positions 0 and 1, the subsequence can be found at index 1. Starting at positions 2 and 3, the subsequence can be found at index 3. This continues until -1 indicates the subsequence is not present for those positions n:

>>> [seq.find(subseq, n) for n in r]
[1, 1, 3, 3, 9, 9, 9, 9, 9, 9, -1, -1, -1]

I only want the nonnegative values, so I use filter() to remove them:

>>> list(filter(lambda n: n >= 0, [seq.find(subseq, n) for n in r]))
[1, 1, 3, 3, 9, 9, 9, 9, 9, 9]

Which could also be written by reversing the comparison in the lambda:

>>> list(filter(lambda n: 0 <= n, [seq.find(subseq, n) for n in r]))
[1, 1, 3, 3, 9, 9, 9, 9, 9, 9]

I show you this because I’d like to use partial() with the operator.le() (less than or equal) function because I don’t like lambda expressions:

>>> from functools import partial
>>> import operator
>>> ok = partial(operator.le, 0)
>>> list(filter(ok, [seq.find(subseq, n) for n in r]))
[1, 1, 3, 3, 9, 9, 9, 9, 9, 9]

I’d like to change the list comprehension to a map():

>>> list(filter(ok, map(lambda n: seq.find(subseq, n), r)))
[1, 1, 3, 3, 9, 9, 9, 9, 9, 9]

but again I want to get rid of the lambda by using partial():

>>> find = partial(seq.find, subseq)
>>> list(filter(ok, map(find, r)))
[1, 1, 3, 3, 9, 9, 9, 9, 9, 9]

I can use set() to get a distinct list:

>>> set(filter(ok, map(find, r)))
{1, 3, 9}

These are almost the correct values, but they are the index positions, which are zero-based. I need the values one greater, so I can make a function to add 1 and apply this using a map():

>>> add1 = partial(operator.add, 1)
>>> list(map(add1, set(filter(ok, map(find, r)))))
[2, 4, 10]

In these limited examples, the results are properly sorted; however, one can never rely on the order of values from a set. I must use the sorted() function to ensure they are properly sorted numerically:

>>> sorted(map(add1, set(filter(ok, map(find, r)))))
[2, 4, 10]

Finally, I need to print these values, which still exist as a list of integers:

>>> print(sorted(map(add1, set(filter(ok, map(find, r))))))
[2, 4, 10]

That’s almost right. As in the first solution, I need to splat the results to get print() to see the individual elements:

>>> print(*sorted(map(add1, set(filter(ok, map(find, r))))))
2 4 10

That’s a lot of closing parentheses. This code is starting to look a little like Lisp. If you combine all these ideas, you wind up with the same answer as the imperative solution but now by combining only functions:

def main() -> None:
    args = get_args()
    seq, subseq = args.seq, args.subseq 1
    r = list(range(len(seq) - len(subseq))) 2
    ok = partial(operator.le, 0) 3
    find = partial(seq.find, subseq) 4
    add1 = partial(operator.add, 1) 5
    print(*sorted(map(add1, set(filter(ok, map(find, r)))))) 6
1

Unpack the sequence and subsequence.

2

Generate a range of numbers up to the length of the sequence less the length of the subsequence.

3

Create a partial ok() function that will return True if a given number is greater than or equal to 0.

4

Create a partial find() function that will look for the subsequence in the sequence when provided with a start parameter.

5

Create a partial add1() function that will return one greater than the argument.

6

Apply all the numbers from the range to the find() function, filter out negative values, make the result unique by using the set() function, add one to the values, and sort the numbers before printing.

This solution uses only pure functions and would be fairly easy to understand for a person with a background in the Haskell programming language. If it seems like a jumble of confusion to you, I’d encourage you to spend some time working in the REPL with each piece until you understand how all these functions fit together perfectly.

Solution 4: Using K-mers

I mentioned that you might try finding the answer using k-mers, which I showed in Chapter 7. If the subsequence exists in the sequence, then it must be a k-mer, where k equals the length of the subsequence:

>>> seq = 'GATATATGCATATACTT'
>>> subseq = 'ATAT'
>>> k = len(subseq)
>>> k
4

Here are all the 4-mers in the sequence:

>>> kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
>>> kmers
['GATA', 'ATAT', 'TATA', 'ATAT', 'TATG', 'ATGC', 'TGCA', 'GCAT', 'CATA',
 'ATAT', 'TATA', 'ATAC', 'TACT', 'ACTT']

Here are the 4-mers that are the same as the subsequence I’m looking for:

>>> list(filter(lambda s: s == subseq, kmers))
['ATAT', 'ATAT', 'ATAT']

I need to know the positions as well as the k-mers. The enumerate() function will return both the index and value of all the elements in a sequence. Here are the first four:

>>> kmers = list(enumerate([seq[i:i + k] for i in range(len(seq) - k + 1)]))
>>> kmers[:4]
[(0, 'GATA'), (1, 'ATAT'), (2, 'TATA'), (3, 'ATAT')]

I can use this with filter(), but now the lambda is receiving a tuple of the index and value so I will need to look at the second field (which is at index 1):

>>> list(filter(lambda t: t[1] == subseq, kmers))
[(1, 'ATAT'), (3, 'ATAT'), (9, 'ATAT')]

I really only care about getting the index for the matching k-mers. I could rewrite this using a map() with an if expression to return the index position when it’s a match, and None otherwise:

>>> list(map(lambda t: t[0] if t[1] == subseq else None, kmers))
[None, 1, None, 3, None, None, None, None, None, 9, None, None, None, None]

I’m frustrated that the standard map() function can only pass a single value to the lambda. What I need is some way to splat the tuple, like *t, to turn it into two values. Luckily, I’ve studied the itertools module documentation and found the starmap() function, so named because it will add a star to the lambda argument to splat it. This allows me to unpack a tuple value like (0, 'GATA') into the variables i with the index value of 0 and kmer with the value 'GATA'. With these, I can compare the kmer to the subsequence and also add 1 to the index (i):

>>> from itertools import starmap
>>> list(starmap(lambda i, kmer: i + 1 if kmer == subseq else None, kmers))
[None, 2, None, 4, None, None, None, None, None, 10, None, None, None, None]

This probably seems like an odd choice until I show you that filter(), if passed None for the lambda, will use the truthiness of each value, so that None values will be excluded. Because this line of code is getting rather long, I’ll write the function f() for map() on a separate line:

>>> f = lambda i, kmer: i + 1 if kmer == subseq else None
>>> list(filter(None, starmap(f, kmers)))
[2, 4, 10]

I can express a k-mer solution using imperative techniques:

def main() -> None:
    args = get_args()
    seq, subseq = args.seq, args.subseq
    k = len(subseq) 1
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)] 2
    found = [i + 1 for i, kmer in enumerate(kmers) if kmer == subseq] 3
    print(*found) 4
1

When looking for k-mers, k is the length of the subsequence.

2

Use a list comprehension to generate all the k-mers from the sequence.

3

Iterate through the index and value of all the k-mers, where the k-mer is equal to the subsequence. Return one greater than the index position.

4

Print the found positions.

I can also express these ideas using purely functional techniques. Note that mypy insists on a type annotation for the found variable:

def main() -> None:
    args = get_args()
    seq, subseq = args.seq, args.subseq
    k = len(subseq)
    kmers = enumerate(seq[i:i + k] for i in range(len(seq) - k + 1)) 1
    found: Iterator[int] = filter( 2
        None, starmap(lambda i, kmer: i + 1 if kmer == subseq else None, kmers))
    print(*found) 3
1

Generate an enumerated list of the k-mers.

2

Select the positions of those k-mers equal to the subsequence.

3

Print the results.

I find the imperative version easier to read, but would recommend you use whichever you find most intuitive. Whichever solution you prefer, the interesting point is that k-mers can prove extremely useful in many situations, such as for partial sequence comparison.

Solution 5: Finding Overlapping Patterns Using Regular Expressions

To this point, I’ve been writing fairly complex solutions to find a pattern of characters inside a string. This is precisely the domain of regular expressions, and so it’s a bit silly to write manual solutions. I showed earlier in this chapter that the re.finditer() function does not find overlapping matches and so returns just two hits when I know there are three:

>>> import re
>>> list(re.finditer(subseq, seq))
[<re.Match object; span=(1, 5), match='ATAT'>,
 <re.Match object; span=(9, 13), match='ATAT'>]

I’m going to show you that the solution is quite simple, but I want to stress that I did not know the solution until I searched the internet. The key to finding the answer was knowing what search terms to use—something like regex overlapping patterns turns up several useful results. The point of this aside is that no one knows all the answers, and you will constantly be searching for solutions to problems you never knew even existed. It’s not what you know that’s important, but what you can learn.

The problem turns out to be that the regex engine consumes strings as they match. That is, once the engine matches the first ATAT, it starts searching again at the end of the match. The solution is to wrap the search pattern in a look-ahead assertion using the syntax ?=(<pattern>) so that the engine won’t consume the matched string. Note that this is a positive look-ahead; there are also negative look-ahead assertions as well as both positive and negative look-behind assertions.

So if the subsequence is ATAT, then I want the pattern to be ?=(ATAT). The problem now is that the regex engine won’t save the match—I’ve just told it to look for this pattern but haven’t told it to do anything with the text that is found. I need to further wrap the assertion in parentheses to create a capture group:

>>> list(re.finditer('(?=(ATAT))', 'GATATATGCATATACTT'))
[<re.Match object; span=(1, 1), match=''>,
 <re.Match object; span=(3, 3), match=''>,
 <re.Match object; span=(9, 9), match=''>]

I can use a list comprehension over this iterator to call the match.start() function on each of the re.Match objects, adding 1 to correct the position:

>>> [match.start() + 1 for match in re.finditer(f'(?=({subseq}))', seq)]
[2, 4, 10]

Here is the final solution that I would suggest as the best way to solve this problem:

def main() -> None:
    args = get_args()
    seq, subseq = args.seq, args.subseq
    print(*[m.start() + 1 for m in re.finditer(f'(?=({subseq}))', seq)])

Benchmarking

It’s always interesting for me to see which solution runs the fastest. I’ll use hyperfine again to run each version 1,000 times:

$ hyperfine -m 1000 -L prg ./solution1_str_find.py,./solution2_str_index.py,
./solution3_functional.py,./solution4_kmers_functional.py,
./solution4_kmers_imperative.py,./solution5_re.py 
'{prg} GATATATGCATATACTT ATAT' --prepare 'rm -rf __pycache__'
...
Summary
  './solution2_str_index.py GATATATGCATATACTT ATAT' ran
    1.01 ± 0.11 times faster than
        './solution4_kmers_imperative.py GATATATGCATATACTT ATAT'
    1.02 ± 0.14 times faster than
        './solution5_re.py GATATATGCATATACTT ATAT'
    1.02 ± 0.14 times faster than
        './solution3_functional.py GATATATGCATATACTT ATAT'
    1.03 ± 0.13 times faster than
        './solution4_kmers_functional.py GATATATGCATATACTT ATAT'
    1.09 ± 0.18 times faster than
        './solution1_str_find.py GATATATGCATATACTT ATAT'

The differences aren’t significant enough, in my opinion, to sway me to choose based solely on performance. My preference would be to use regular expressions given that they are specifically designed to find patterns of text.

Going Further

Expand the program to look for a subsequence pattern. For example, you might search for simple sequence repeats (also known as SSRs or microsatellites) such as GA(26), which would mean “GA repeated 26 times.” Or a repeat such as (GA)15GT(GA)2, which means “GA repeated 15 times, followed by GT, followed by GA, repeated 2 times.” Also, consider how you might find subsequences expressed using the IUPAC codes mentioned in Chapter 1. For instance, R represents either A or G, so ARC can match the sequences AAC and AGC.

Review

Key points from this chapter:

  • The str.find() and str.index() methods can determine if a subsequence is present in a given string.

  • Sets can be used to create unique collections of elements.

  • By definition, k-mers are subsequences and are relatively quick to extract and compare.

  • Regular expressions can find overlapping sequences by using a look-ahead assertion combined with a capture group.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset