4 String Matching

String matching is the problem of finding strings that refer to the same real-world entity. For example, the string David Smith in one database may refer to the same person as David R. Smith in another database. Similarly, the strings 1210 W. Dayton St, Madison WI and 1210 West Dayton, Madison WI 53706 refer to the same physical address.

String matching plays a critical role in many data integration tasks, including schema matching, data matching, and information extraction. Consequently, in this chapter we examine this problem in depth. Section 4.1 defines the string matching problem. Section 4.2 describes popular similarity measures that can be used to compute a similarity score between any two given strings. Finally, Section 4.3 discusses how to efficiently apply such a measure to match a large number of strings.

4.1 Problem Description

The problem we address in this chapter is the following. Given two sets of strings X and Y, we want to find all pairs of strings (x, y), where x ∈ X and y ∈ Y, such that x and y refer to the same real-world entity. We refer to such pairs as matches. Figure 4.1(a-b) shows two example databases representing sets of persons, and Figure 4.1(c) shows the matches between them. For example, the first match, (Dave Smith, David D. Smith), states that the strings Dave Smith and David D. Smith refer to the same real-world person.


Figure 4.1 An example of matching person names. Column (c) shows the matches between the databases in (a) and (b).

Solving the matching problem raises two major challenges: accuracy and scalability. Matching strings accurately is difficult because strings that refer to the same real-world entity can appear quite different. Reasons include typing and OCR errors (e.g., David Smith misspelled as Davod Smith), different formatting conventions (10/8/2009 vs. Oct 8, 2009), custom abbreviations, shortening or omission of parts of strings (Daniel Walker Herbert Smith vs. Dan W. Smith), different names or nicknames (William Smith vs. Bill Smith), and shuffling parts of the string (Dept. of Computer Science, UW-Madison vs. Computer Science Dept., UW-Madison). Additionally, in some cases the data sources do not contain enough information to determine whether two strings refer to the same entity (e.g., trying to decide whether two authors mentioned in two different publications are the same person).

To address the accuracy challenge, a common solution is to define a similarity measure s(x, y) that takes two strings x and y and returns a score in the range [0,1]. The intention is that the higher the score, the more likely it is that x and y match. We then say that x and y match if s(x, y) ≥ t, where t is a prespecified threshold. Many similarity measures have been proposed, and we discuss the main ones in Section 4.2.

The second challenge is to apply the similarity measure s(x, y) to a large number of strings. Applying s(x, y) to all pairs of strings in the Cartesian product of the sets X and Y is quadratic in the size of the data and therefore impractical. To address this challenge, several solutions have been proposed that apply s(x, y) only to the most promising pairs. We discuss the key ideas of these solutions in Section 4.3.

4.2 Similarity Measures

A broad range of measures have been proposed to match strings. A similarity measure maps a pair of strings (x,y ) into a number in the range [0,1] such that a higher value indicates greater similarity between x and y. The terms distance and cost measures have also been used to describe the same concept, except that smaller values indicate higher similarity.

Broadly speaking, current similarity measures fall into four groups: sequence-based, set-based, hybrid, and phonetic measures. We now describe each one in turn.

4.2.1 Sequence-Based Similarity Measures

The first set of similarity measures we consider views strings as sequences of characters and computes a cost of transforming one string into the other. We begin with a basic measure called edit distance, and then consider several more elaborate versions.

Edit Distance

The edit distance measure, also known as Levenshtein distance, d(x, y), computes the minimal cost of transforming string x into string y. Transforming a string is carried out using a sequence of the following operations: delete a character, insert a character, and substitute one character for another. For example, the cost of transforming the string x = David Smiths into the string y = Davidd Simth is 4. The required operations are: inserting a d (after David), substituting m with i, substituting i with m, and deleting the last character of x, which is s.

It is not hard to see that the minimal cost of transforming x into y is the same as the minimal cost of transforming y into x (using in effect the same transformation, with each operation reversed). Thus, d(x, y) is well defined and symmetric in x and y.

Intuitively, edit distance tries to capture the various ways people make editing mistakes, such as inserting an extra character (e.g., Davidd instead of David) or swapping two characters (e.g., Simth instead of Smith). Hence, the smaller the edit distance, the more similar the two strings are.

The edit distance function d(x, y) is converted into a similarity function s(x, y) as follows:

s(x, y) = 1 − d(x, y) / max(|x|, |y|)

For example, the edit distance between David Smiths and Davidd Smith is 2 (insert a d after David and delete the trailing s), and the longer string has 12 characters, so the similarity score between the two strings is

s(David Smiths, Davidd Smith) = 1 − 2/12 ≈ 0.83

The value of d(x, y) can be computed using dynamic programming. Let x = x1x2⋯xn and y = y1y2⋯ym, where the xi and yj are characters. Let d(i, j) denote the edit distance between x1x2⋯xi and y1y2⋯yj (which are the ith and jth prefixes of x and y).

The key to designing a dynamic-programming algorithm is to establish a recurrence equation that enables computing d(i, j) from previously computed values of d. Figure 4.2(a) shows the recurrence equation in our case. To understand the equation, observe that we can transform string x1x2⋯xi into string y1y2⋯yj in one of four ways: (a) transforming x1⋯xi−1 into y1⋯yj−1, then copying xi into yj if xi = yj, (b) transforming x1⋯xi−1 into y1⋯yj−1, then substituting xi with yj if xi ≠ yj, (c) deleting xi, then transforming x1⋯xi−1 into y1⋯yj, and (d) transforming x1⋯xi into y1⋯yj−1, then inserting yj. The value d(i, j) is the minimum of the costs of the above transformations. Figure 4.2(b) simplifies the equation in Figure 4.2(a) by merging the first two lines.


Figure 4.2 (a) The recurrence equation for computing edit distance between strings x and y using dynamic programming and (b) a simplified form of the equation.
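Although the figure itself is not reproduced here, the simplified recurrence of Figure 4.2(b) takes the following standard form, with base cases d(i, 0) = i and d(0, j) = j:

d(i, j) = min[ d(i−1, j−1) + c(xi, yj),  d(i−1, j) + 1,  d(i, j−1) + 1 ]

where c(xi, yj) = 0 if xi = yj (a copy) and 1 otherwise (a substitution); the second and third terms correspond to deleting xi and inserting yj, respectively.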

Figure 4.3 shows an example computation using the simplified equation; it illustrates computing the distance between the strings x = dva and y = dave. We start with the matrix in Figure 4.3(a), where the xi are listed on the left and the yj at the top. Note that we have added x0 and y0, two null characters at the start of x and y, to simplify the implementation. Specifically, this allows us to quickly fill in the first row and column by setting d(i, 0) = i and d(0, j) = j.


Figure 4.3 Computing the edit distance between dva and dave using the dynamic programming equation in Figure 4.2(b). (a) The dynamic programming matrix after filling in several cells, (b) the filled-in matrix, and (c) the sequence of edit operations that transforms dva into dave, found by following the bold arrows in part (b).

Now we can use the equation in Figure 4.2(b) to fill in the rest of the matrix. For example, we have d(1, 1) = 0. Since this value is obtained by adding 0 to d(0, 0), we add an arrow pointing from cell (1,1) to cell (0,0). Similarly, d(1, 2) = 1 (see Figure 4.3(a)). Figure 4.3(b) shows the entire filled-in matrix. The edit distance between x and y can then be read off cell (3,4), the bottom rightmost cell, and is 2.

In addition to computing the edit distance, the algorithm also produces the sequence of edit operations, which we recover by following the arrows. In our example, since the arrow goes “diagonal” from (3,4) to (2,3), we know that x3 (character a) has been copied into or substituted with y4 (character e). The arrow then goes “diagonal” again, from (2,3) to (1,2), so x2 (character v) has been copied into or substituted with y3 (character v). Next, the arrow goes “horizontal” from (1,2) to (1,1). This means a gap - has been inserted into x and aligned with a in y (which denotes an insertion operation). This process stops when we have reached cell (0,0). The transformation is depicted in Figure 4.3(c).

The cost of computing the edit distance is O(|x| · |y|). In practice, the lengths of x and y are often roughly the same, and hence we often refer to the above cost as quadratic.
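To make the dynamic program concrete, here is a minimal Python sketch of the algorithm just described (the function name is ours):

def edit_distance(x, y):
    """Levenshtein distance between strings x and y, via dynamic programming."""
    n, m = len(x), len(y)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                                 # delete all i characters
    for j in range(m + 1):
        d[0][j] = j                                 # insert all j characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = 0 if x[i - 1] == y[j - 1] else 1    # copy vs. substitute
            d[i][j] = min(d[i - 1][j - 1] + c,      # copy or substitute
                          d[i - 1][j] + 1,          # delete xi
                          d[i][j - 1] + 1)          # insert yj
    return d[n][m]

assert edit_distance("dva", "dave") == 2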

The Needleman-Wunsch Measure

The Needleman-Wunsch measure generalizes the Levenshtein distance. Specifically, it assigns a score to each alignment between the two input strings and returns the score of the best alignment, that is, the maximal score. An alignment between two strings x and y is a set of correspondences between the characters of x and y, allowing for gaps. For example, Figure 4.4(a) shows an alignment between the strings x = dva and y = deeve, where d corresponds to d, v to v, and a to e. Note that a gap of length 2 has been inserted into x, so the two characters ee of y have no corresponding characters in x.


Figure 4.4 An example for the Needleman-Wunsch measure: (a) an alignment between two strings x = dva and y = deeve, with a gap in x, and (b) a score matrix that assigns a score to each pair of characters, together with a gap penalty cg.

The score of an alignment is computed using a score matrix and a gap penalty. The matrix assigns a score to a correspondence between every pair of characters and therefore allows for penalizing transformations on a case-by-case basis. Figure 4.4(b) shows a sample score matrix where a correspondence between two identical characters scores 2 and any other correspondence scores −1. A gap of length 1 carries a penalty cg (set to 1 in Figure 4.4), and a gap of length k carries a linear penalty of k · cg.

The score of an alignment is then the sum of the scores of all correspondences in the alignment, minus the penalties for the gaps. For example, the score of the alignment in Figure 4.4(a) is 2 (for correspondence d-d) + 2 (for v-v) − 1 (for a-e) − 2 (penalty for the gap of length 2) = 1. This is the best alignment between x and y (that is, the one with the highest score), so 1 is the Needleman-Wunsch score between the two strings.

As described, the Needleman-Wunsch measure generalizes the Levenshtein distance in three ways. First, it computes similarity scores instead of distance values. Second, it generalizes edit costs into a score matrix, thus allowing for more fine-grained score modeling. For example, the letter o and the number 0 are often confused in practice (e.g., by OCR systems), so a correspondence between these two should have a higher score than, say, one between a and 0. As another example, when matching bioinformatic sequences, different pairs of amino acids may have different semantic distances. Finally, the Needleman-Wunsch measure generalizes insertion and deletion into gaps and generalizes the cost of such operations from 1 to an arbitrary penalty cg.

The Needleman-Wunsch score s(x, y) is computed with dynamic programming, using the recurrence equation shown in Figure 4.5(a). We note three differences between the algorithm used here and the one used for computing the edit distance (Figure 4.2(b)). First, here we compute a max instead of a min. Second, we use the score-matrix entry c(xi, yj) in the recurrence equation instead of the unit costs of the edit operations. Third, we use the gap cost cg instead of the unit gap cost. Note that when initializing the matrix, we must set s(i, 0) = −i · cg and s(0, j) = −j · cg (instead of i and j as in the edit distance case).
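In the notation above, the recurrence of Figure 4.5(a) takes the following standard form:

s(i, j) = max[ s(i−1, j−1) + c(xi, yj),  s(i−1, j) − cg,  s(i, j−1) − cg ]

where c(xi, yj) is the score-matrix entry for the pair (xi, yj).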


Figure 4.5 An example for the Needleman-Wunsch measure: (a) the recurrence equation for computing the similarity score using dynamic programming, (b) the dynamic-programming matrix between dva and deeve using the equation in part (a), and (c) the optimal alignment between the above two strings, found by following the arrows in part (b).

Figure 4.5(b) shows the fully filled-in matrix for x = dva and y = deeve, using the score matrix and gap penalty in Figure 4.4(b). Figure 4.5(c) shows the best alignment, found by following the arrows in the matrix of Figure 4.5(b).

The Affine Gap Measure

The affine gap measure is an extension of the Needleman-Wunsch measure that handles longer gaps more gracefully. Consider matching x = David Smith with y = David R. Smith. The Needleman-Wunsch measure can perform this match very well, by opening a gap of length 2 in x, right after David, then aligning the gap with R. in y. However, consider matching the same x = David Smith with y′ = David Richardson Smith, as shown in Figure 4.6(a). Here the gap between the two strings is 10 characters long, and Needleman-Wunsch does not match the strings very well because the cost of the gap is too high. For example, assume that each character correspondence has a score of 2 and that cg is 1; then the score of the above alignment under Needleman-Wunsch is 6 · 2 − 10 = 2.


Figure 4.6 (a) An example of two strings where there is a long gap, (b) the recurrence equations for the affine gap measure.

In practice, gaps tend to be longer than one character. Assigning a uniform penalty to each character in a gap thus unfairly punishes long gaps. The affine gap measure addresses this problem by distinguishing between the cost of opening a gap and the cost of continuing it. Formally, the measure assigns to each gap of length k a cost co + (k − 1) · cr, where co is the cost of opening the gap (i.e., the cost of the very first character of the gap) and cr is the cost of continuing it (i.e., the cost of each of the remaining k − 1 characters of the gap). Cost cr is set lower than co, thereby lessening the penalty for a long gap. Continuing with the example in Figure 4.6(a), if co = 1 and cr = 0.5, the score of the alignment is 6 · 2 − 1 − 9 · 0.5 = 6.5, much higher than the score of 2 obtained under Needleman-Wunsch.

Figure 4.6(b) shows the recurrence equations for the affine gap measure. Deriving these equations is rather involved. Since we now penalize opening a gap and continuing a gap differently, for each cell (i, j) of the dynamic-programming matrix we keep track of three values:

M(i, j): the best score of aligning x1⋯xi with y1⋯yj, given that xi is aligned to yj

Ix(i, j): the best score, given that xi is aligned to a gap

Iy(i, j): the best score, given that yj is aligned to a gap

The best score for the cell, s(i, j), is then the maximum of these three values.
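Although Figure 4.6(b) is not reproduced here, the recurrence equations it shows take the following standard form (with c(xi, yj) denoting the score-matrix entry):

M(i, j) = max[ M(i−1, j−1),  Ix(i−1, j−1),  Iy(i−1, j−1) ] + c(xi, yj)

Ix(i, j) = max[ M(i−1, j) − co,  Ix(i−1, j) − cr ]

Iy(i, j) = max[ M(i, j−1) − co,  Iy(i, j−1) − cr ]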

In deriving the above recurrence equations, we make the following assumption about the cost function: in an optimal alignment, we never have an insertion followed directly by a deletion, or vice versa. This means we will never have the situations depicted in Figure 4.7(a) or Figure 4.7(b). We can guarantee this property by setting the gap penalties and the score matrix so that the lowest mismatch score in the matrix is still higher than the cost of aligning each of the two characters with its own gap. Under such conditions, the alignment in Figure 4.7(c) will have a higher score than those to its left.


Figure 4.7 By setting the gap penalties and the score matrix appropriately, the alignment in (c) will always score higher than those in (a) and (b).

We now explain how to derive the equations for M(i, j), Ix(i, j), and Iy(i, j) in Figure 4.6(b). Figure 4.8 explains how to derive the equation for M(i, j). This equation considers the case where xi is aligned with yj (Figure 4.8(a)). Thus, x1⋯xi−1 is aligned with y1⋯yj−1. This can happen in only one of three ways, as shown in Figures 4.8(b)-(d): xi−1 is aligned with yj−1, xi−1 is aligned with a gap, or yj−1 is aligned with a gap. These three ways give rise to the three cases in the equation for M(i, j) in Figure 4.6(b).


Figure 4.8 The derivation of the equation for M(i, j) in Figure 4.6(b).

Figure 4.9 shows how to derive the equation for Ix(i, j) in Figure 4.6(b). This equation considers the case where xi is aligned with a gap (Figure 4.9(a)). This can happen in only one of three ways, as shown in Figures 4.9(b)-(d): xi−1 is aligned with yj, xi−1 is aligned with a gap, or yj is aligned with a gap. The first two ways give rise to the two cases shown in Figure 4.6(b). The third case cannot happen, because it would mean an insertion followed directly by a deletion, violating the assumption we described earlier. The equation for Iy(i, j) in Figure 4.6(b) is derived in a similar fashion. The complexity of computing the affine gap measure remains O(|x| · |y|).


Figure 4.9 The derivation of the equation for Ix(i, j) in Figure 4.6(b).

The Smith-Waterman Measure

The previous measures consider global alignments between the input strings. That is, they attempt to match all characters of x with all characters of y.

Global alignments may not be well suited for some cases. For example, consider the two strings Prof. John R. Smith, Univ of Wisconsin and John R. Smith, Professor. A similarity score based on global alignment will be relatively low. In such a case, what we really want is to find the two substrings of x and y that are most similar, and then return the score between those substrings as the score for x and y. For example, here we would want to identify John R. Smith as the most similar substrings of the above two strings. This means matching the two strings while ignoring certain prefixes (e.g., Prof.) and suffixes (e.g., Univ of Wisconsin in x and Professor in y). We call this local alignment.

The Smith-Waterman measure is designed to find matching substrings by introducing two key changes to the Needleman-Wunsch measure. First, the measure allows a match to restart at any position in the strings (no longer limited to just the first position). The restart is captured by the first line of the recurrence equation in Figure 4.10(a): intuitively, if the global match dips below zero, this line has the effect of ignoring the prefix and restarting the match. Similarly, note that all values in the first row and column of the dynamic-programming matrix are zeros, instead of −i · cg and −j · cg as in the Needleman-Wunsch case. Applying this recurrence equation to the strings avd and dave produces the matrix in Figure 4.10(b).


Figure 4.10 An example for the Smith-Waterman measure: (a) the recurrence equation for computing the similarity score using dynamic programming and (b) the dynamic-programming matrix between avd and dave using the equation in part (a).
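In the notation used earlier, the recurrence of Figure 4.10(a) takes the following standard form:

s(i, j) = max[ 0,  s(i−1, j−1) + c(xi, yj),  s(i−1, j) − cg,  s(i, j−1) − cg ]

The leading 0 is exactly the restart discussed above: whenever every alignment ending at (i, j) has negative score, the match is abandoned and restarted.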

The second key change is that after computing the matrix using the recurrence equation, the algorithm retraces the arrows starting from the cell with the largest value in the matrix (4 in our example) rather than from the lower-right corner (which holds 3 in the matrix). This change effectively ignores suffixes if the match they produce is not optimal. Retracing ends when we meet a cell with value 0, which corresponds to the start of the alignment. In our example we can read out the best local alignment between avd and dave, which is av.

The Jaro Measure

The Jaro measure was developed mainly to compare short strings, such as first and last names. Given two strings x and y, we compute their Jaro score as follows.

1. Find the common characters: the pairs xi and yj such that xi = yj and |i − j| ≤ min(|x|, |y|)/2. Intuitively, common characters are those that are identical and positionally “close to one another.” It is not hard to see that the number of common characters xi in x is the same as the number of common characters yj in y. Let this number be c.

2. Compare the ith common character of x with the ith common character of y. If the two do not match, then we have a transposition. Let the number of transpositions be t.

3. Compute the Jaro score as

jaro(x, y) = (1/3) · [ c/|x| + c/|y| + (c − t/2)/c ]

As an example, consider x = jon and y = john. We have c = 3. The common character sequence in x is jon, and so is the sequence in y. Hence, there is no transposition, and t = 0. Thus, jaro(jon, john) = (1/3) · (3/3 + 3/4 + 3/3) ≈ 0.92. In contrast, the similarity score according to edit distance would be 1 − 1/4 = 0.75.

Now suppose x = jon and y = ojhn. Here the common character sequence in x is jon and the common character sequence in y is ojn. Thus t = 2, and jaro(jon, ojhn) = (1/3) · (3/3 + 3/4 + (3 − 1)/3) ≈ 0.81.

The cost of computing the Jaro measure is O(|x| · |y|), due to the cost of finding the common characters.
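The following Python sketch follows the definition given above, including the |i − j| ≤ min(|x|, |y|)/2 window for common characters (a greedy left-to-right pairing of common characters is assumed):

def jaro(x, y):
    """Jaro similarity, following the definition in this section."""
    window = min(len(x), len(y)) / 2
    used = [False] * len(y)                   # y-characters already paired
    common_x = []                             # common characters, in x-order
    for i, cx in enumerate(x):
        for j, cy in enumerate(y):
            if not used[j] and cx == cy and abs(i - j) <= window:
                used[j] = True
                common_x.append(cx)
                break
    common_y = [cy for j, cy in enumerate(y) if used[j]]   # in y-order
    c = len(common_x)
    if c == 0:
        return 0.0
    t = sum(1 for a, b in zip(common_x, common_y) if a != b)  # transpositions
    return (c / len(x) + c / len(y) + (c - t / 2) / c) / 3

assert round(jaro("jon", "john"), 2) == 0.92
assert round(jaro("jon", "ojhn"), 2) == 0.81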

The Jaro-Winkler Measure

The Jaro-Winkler measure is designed to capture cases where two strings have a low Jaro score but share a prefix, and thus may still be likely to match. Specifically, the measure introduces two parameters: PL, the length of the longest common prefix between the two strings, and PW, the weight to give the prefix. The measure is computed using the following formula:

jaro-winkler(x, y) = (1 − PL · PW) · jaro(x, y) + PL · PW

4.2.2 Set-Based Similarity Measures

The previous class of measures considers strings as sequences of characters. We now describe similarity measures that view strings as sets or multi-sets of tokens, and use set-related properties to compute similarity scores.

There are many ways to generate tokens from the input strings. A common method is to consider the words in the string (as delimited by the space character) and possibly stem the words. Common stop words (e.g., the, and, of) are typically excluded. For example, given the string David Smith, we may generate the set of tokens {david, smith}.

Another common type of token is q-grams, which are substrings of length q that are present in the string. For example, the set of all 3-grams of David Smith is {##d, #da, dav, avi, …, ith, h##, th#}. Note that we have appended the special character # to the start and the end of the string, to handle 3-grams in these positions.

We discuss several set-based similarity measures. The bibliographic notes contain pointers to others that have been proposed in the literature. In our discussion we often assume sets of tokens, but we note that these measures have also been considered for the multi-set case, and what we discuss below can be generalized to that case.

The Overlap Measure

Let Bx and By be the sets of tokens generated for strings x and y, respectively. The overlap measure returns the number of common tokens, O(x, y) = |Bx ∩ By|.

Consider x = dave and y = dav; then the set of all 2-grams of x is Bx = {#d, da, av, ve, e#} and the set of all 2-grams of y is By = {#d, da, av, v#}. So O (x,y ) = 3.

The Jaccard Measure

Continuing with the above notation, the Jaccard similarity score between two strings x and y is J(x, y) = |Bx ∩ By| / |Bx ∪ By|.

Again, consider x = dave with Bx = {#d, da, av, ve, e#}, and y = dav with By = {#d, da, av, v#}. Then J (x,y ) = 3/6.
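A small Python sketch ties these definitions together (the padding convention and function names are ours):

def qgrams(s, q=2, pad="#"):
    """The set of q-grams of s, with q-1 padding characters on each side."""
    padded = pad * (q - 1) + s + pad * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}

def overlap(x, y, q=2):
    return len(qgrams(x, q) & qgrams(y, q))

def jaccard(x, y, q=2):
    bx, by = qgrams(x, q), qgrams(y, q)
    return len(bx & by) / len(bx | by)

assert overlap("dave", "dav") == 3      # {#d, da, av}
assert jaccard("dave", "dav") == 0.5    # 3 common out of 6 total 2-grams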

The TF/IDF Measure

This measure employs the notion of TF/IDF score commonly used in information retrieval (IR) to find documents that are relevant to keyword queries. The intuition underlying the TF/IDF measure is that two strings are similar if they share distinguishing terms. For example, consider the three strings x = Apple Corporation, CA, y = IBM Corporation, CA, and z = Apple Corp. Edit distance and the Jaccard measure would match x with y, since s(x, y) is higher than s(x, z) under those measures. However, the TF/IDF measure is able to recognize that Apple is a distinguishing term, whereas Corporation and CA are far more common, and thus it would correctly match x with z.

When discussing the TF/IDF measure, we assume that the pair of strings being matched is taken from a collection of strings. Figure 4.11(a) shows a tiny such collection of three strings, x = aab, y = ac, and z = a. We convert each string into a bag of terms; using IR terminology, we refer to such a bag of terms as a document. For example, we convert the string x = aab into the document Bx = {a, a, b}.


Figure 4.11 In the TF/IDF measure (a) strings are converted into bags of terms, (b) TF and IDF scores of the terms are computed, then (c) these scores are used to compute feature vectors.

We now compute term frequency (TF) scores and inverse document frequency (IDF) scores as follows:

For each term t and document d, we compute the term frequency tf(t, d) to be the number of times t occurs in d. For example, since a occurs twice in Bx, we have tf(a, Bx) = 2.

For each term t, we compute the inverse document frequency idf(t) to be the total number of documents in the collection divided by the number of documents that contain t (variations of this definition are also commonly used; see below). For example, since a appears in all three documents in Figure 4.11(a), we have idf(a) = 3/3 = 1. A higher IDF value means that an occurrence of the term is more distinguishing.

Figure 4.11(b) shows the TF/IDF scores for the tiny example in Figure 4.11(a).

Next, we convert each document d into a feature vector vd. The intuition is that two documents will be similar if their corresponding vectors are close to each other. The vector vd has a feature vd(t) for each term t, and the value of vd(t) is a function of the TF and IDF scores. Vector vd thus has as many features as there are terms in the collection. Figure 4.11(c) shows the three vectors vx, vy, and vz for the three documents Bx, By, and Bz, respectively. In the figure, we use a relatively simple score: vd(t) = tf(t, d) · idf(t). Thus, the score for feature a of vx is vx(a) = tf(a, Bx) · idf(a) = 2 · 1 = 2, and so on.

Now we are ready to compute the TF/IDF similarity score between any two strings p and q. Let T be the set of all terms in the collection. Then conceptually the vectors vp and vq (of the strings p and q) can be viewed as vectors in the |T|-dimensional space where each dimension corresponds to a term. The TF/IDF score between p and q can be computed as the cosine of the angle between these two vectors:

s(p, q) = Σ_{t ∈ T} vp(t) · vq(t) / ( √(Σ_{t ∈ T} vp(t)²) · √(Σ_{t ∈ T} vq(t)²) )   (4.1)

For example, the TF/IDF score between the two strings x and y in Figure 4.11(a), using the vectors vx and vy in Figure 4.11(c), is s(x, y) = (2 · 1) / (√(2² + 3²) · √(1² + 3²)) ≈ 0.18.

It is not difficult to see from Equation 4.1 that the TF/IDF similarity score between two strings p and q is high if they share many frequent terms (that is, terms with high TF scores), unless these terms also commonly appear in other strings in the collection (in which case the terms have low IDF scores). Through the IDF component, the TF/IDF similarity score effectively discounts the importance of such common terms.

In the above example, we assumed vd(t) = tf(t, d) · idf(t). This means that if we double the number of occurrences of t in document d, then vd(t) will also double. In practice this is found to be excessive: doubling the number of occurrences of t should increase vd(t), but not double it. One way of addressing this is to “dampen” the TF and IDF components by a logarithmic factor. Specifically, we can take

vd(t) = log(tf(t, d) + 1) · log(idf(t)).

(In fact, log(idf(t)) itself is also commonly referred to as the inverse document frequency.) In addition, the vector vd is often normalized to length 1, by setting

vd(t) ← vd(t) / √( Σ_{t′ ∈ T} vd(t′)² ).

This way, computing the TF/IDF similarity score s (p,q ) as in Equation 4.1 reduces to computing the dot product between the two normalized vectors vp and vq.
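A minimal Python sketch of this dampened, length-normalized variant (function names and the toy corpus are ours; the corpus mirrors Figure 4.11):

import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each document (a list of terms) to a normalized TF/IDF vector."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))     # document frequencies
    vecs = []
    for d in docs:
        tf = Counter(d)
        v = {t: math.log(tf[t] + 1) * math.log(n / df[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in v.values()))
        vecs.append({t: w / norm for t, w in v.items()} if norm > 0 else v)
    return vecs

def cosine(v1, v2):
    """Dot product of two normalized sparse vectors."""
    return sum(w * v2.get(t, 0.0) for t, w in v1.items())

docs = [["a", "a", "b"], ["a", "c"], ["a"]]           # Bx, By, Bz
vx, vy, vz = tfidf_vectors(docs)
# In this tiny corpus the shared term a appears in every document, so its
# dampened IDF weight log(3/3) is 0 and cosine(vx, vy) comes out 0.
print(cosine(vx, vy))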

4.2.3 Hybrid Similarity Measures

We now describe several similarity measures that combine the benefits of sequence-based and set-based methods.

The Generalized Jaccard Measure

Recall that the Jaccard measure considers the number of overlapping tokens in the input strings x and y. However, a token from x and a token from y have to be identical in order to be considered in the overlap set, which may be restrictive in some cases.

For example, consider matching the names of the nodes of two taxonomies describing divisions of companies. Each node is described by a string, such as Energy and Transportation or Transportation, Energy, and Gas. The Jaccard measure is a promising candidate for matching such strings because intuitively two nodes are similar if their names share many tokens (e.g., energy and transportation). However, in practice tokens are often misspelled, such as energy vs. eneryg. The generalized Jaccard measure enables matching in such cases.

As with the Jaccard measure, we begin by converting string x into a set of tokens Bx and string y into a set of tokens By. Figure 4.12(a) shows two such strings x and y, with Bx = {x1, x2, x3} and By = {y1, y2}.


Figure 4.12 An example of computing the generalized Jaccard measure.

The next three steps determine the set of token pairs that are included in the “softened” overlap set. First, let s be a similarity measure that returns values in the range [0,1]. We apply s to compute a similarity score for each pair (xi, yj) ∈ Bx × By. Continuing with the above example, Figure 4.12(a) shows all six such scores.

Second, we keep only those scores that equal or exceed a given threshold α. Figure 4.12(b) shows the remaining scores in our example, given α = 0.5. The sets Bx and By, together with the edges that denote the remaining scores, form a bipartite graph G. In the third step we find the maximum-weight matching M in the bipartite graph G. Figure 4.12(c) shows this matching in our example; its total weight is 0.7 + 0.9 = 1.6.

Finally, we return the normalized weight of M as the generalized Jaccard score between x and y. To normalize, we divide the weight by the sum of the number of edges in M and the number of “unmatched” tokens in Bx and By. This sum is |M| + (|Bx| − |M|) + (|By| − |M|) = |Bx| + |By| − |M|. Formally,

GJ(x, y) = Σ_{(xi, yj) ∈ M} s(xi, yj) / (|Bx| + |By| − |M|).

Continuing with the above example, the generalized Jaccard score is GJ(x, y) = 1.6 / (3 + 2 − 2) ≈ 0.53.

The measure GJ(x, y) is guaranteed to be between 0 and 1. It is a natural generalization of the Jaccard measure J(x, y): if we constrain the tokens of Bx and By to match only when they are identical, GJ(x, y) reduces to J(x, y). We discuss how to compute GJ(x, y) efficiently in Section 4.3.
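The maximum-weight matching step can be delegated to an off-the-shelf assignment solver. The following is a sketch using SciPy (tokens are given as lists; sim is any token-level similarity measure in [0,1], e.g., the jaro sketch above):

import numpy as np
from scipy.optimize import linear_sum_assignment

def generalized_jaccard(bx, by, sim, alpha=0.5):
    """GJ(x, y): max-weight matching over token pairs with sim >= alpha,
    normalized by |Bx| + |By| - |M|."""
    w = np.array([[sim(tx, ty) for ty in by] for tx in bx])
    w[w < alpha] = 0.0                       # drop edges below the threshold
    rows, cols = linear_sum_assignment(w, maximize=True)
    matched = [(i, j) for i, j in zip(rows, cols) if w[i, j] > 0]
    total = sum(w[i, j] for i, j in matched)
    return total / (len(bx) + len(by) - len(matched))

print(generalized_jaccard(["energy", "transportation"],
                          ["transportation", "energy", "gas"], jaro))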

The Soft TF/IDF Similarity Measure

This measure is similar in spirit to the generalized Jaccard measure, except that it uses the TF/IDF measure instead of the Jaccard measure as the “higher-level” similarity measure.

Consider an example with three strings x = Apple Corporation, CA, y = IBM Corporation, CA, and z = Aple Corp. To match these strings, we would like to use the TF/IDF measure so that we can discount common terms such as Corporation and CA. Unfortunately, in this case the TF/IDF measure does not help us match x with z, because the term Apple in x does not match the misspelled term Aple in z. Thus, x and z do not share any term. As with the generalized Jaccard measure, we would like to “soften” the requirement that Apple and Aple match exactly, and instead require that they be similar to each other.

We compute the soft TF/IDF measure as follows. As with the TF/IDF measure, given two strings x and y, we create the two documents Bx and By. Figure 4.13(b) shows the documents created for the strings in Figure 4.13(a).


Figure 4.13 An example of computing soft TF/IDF similarity score for two strings x and y.

Next, we compute close(x, y, k), the set of all terms in Bx that have at least one close term in By. Specifically, close(x, y, k) is the set of terms tx ∈ Bx such that there exists a term ty ∈ By satisfying s′(tx, ty) ≥ k, where s′ is a basic similarity measure (e.g., Jaro-Winkler) and k is a prespecified threshold. Continuing with our example in Figure 4.13(b), suppose k = 0.8, with the term-level scores as shown in the figure. Then the term d of Bx is excluded from close(x, y, k), because the closest term to d in By is d′, and d′ is still too far from d (at a similarity score of 0.6).

In the final step, we compute s(x, y) as in the traditional TF/IDF score, but weighting each component of the TF/IDF formula by the similarity score produced by s′. Specifically, let vx and vy be the feature vectors for x and y, as explained in the discussion of the TF/IDF similarity measure. The vectors vx and vy are normalized to length 1, so that the traditional TF/IDF score can be computed as the dot product of the two vectors. Then we have

s(x, y) = Σ_{t ∈ close(x, y, k)} vx(t) · vy(t*) · s′(t, t*),

where t* ∈ By is the term that maximizes s′(t, u) over all u ∈ By. Figure 4.13(c) shows how to compute s(x, y) for the example in Figures 4.13(a-b).

The Monge-Elkan Similarity Measure

The Monge-Elkan similarity measure can be effective for domains in which more control is needed over the similarity measure. To apply this measure to two strings x and y, we first break them into multiple substrings, say x = ⟨A1, …, Am⟩ and y = ⟨B1, …, Bn⟩, where the Ai and Bj are substrings. Next, we compute

s(x, y) = (1/m) · Σ_{i=1..m} max_{j=1..n} s′(Ai, Bj),

where s′ is a secondary similarity measure, such as Jaro-Winkler. To illustrate the above formula, suppose x = ⟨A1, A2⟩ and y = ⟨B1, B2⟩. Then

s(x, y) = (1/2) · [ max(s′(A1, B1), s′(A1, B2)) + max(s′(A2, B1), s′(A2, B2)) ].

Note that we ignore the order of the matching of the substrings and only consider the best match for the substrings of x in y. Furthermore, we can customize the secondary similarity measure s′ to a particular application.

For example, consider matching strings x = Comput. Sci. and Eng. Dept., University of California, San Diego and y = Department of Computer Science, Univ. Calif., San Diego. To employ the Monge-Elkan measure, first we break x and y into substrings such as Comput., Sci., and Computer. Next we must design a secondary similarity measure s′ that works well for such substrings. In particular, it is clear that s′ must handle matching abbreviations well. For example, s′ may decide that if a substring Ai is a prefix of a substring Bj, such as Comput. and Computer, then they match, that is, their similarity score is 1.
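A direct Python transcription, with a toy secondary measure along the lines suggested above (both function names are ours):

def monge_elkan(x_parts, y_parts, sim):
    """Average, over the substrings of x, of the best-matching substring of y."""
    return sum(max(sim(a, b) for b in y_parts) for a in x_parts) / len(x_parts)

def prefix_sim(a, b):
    """Toy secondary measure: 1 if one token (ignoring case and surrounding
    punctuation) is a prefix of the other, else 0."""
    a, b = a.strip(".,").lower(), b.strip(".,").lower()
    return 1.0 if a.startswith(b) or b.startswith(a) else 0.0

x = ["Comput.", "Sci.", "and", "Eng.", "Dept.", "University", "of",
     "California", "San", "Diego"]
y = ["Department", "of", "Computer", "Science", "Univ.", "Calif.",
     "San", "Diego"]
print(monge_elkan(x, y, prefix_sim))    # 0.8: all but "and" and "Eng." match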

4.2.4 Phonetic Similarity Measures

The similarity measures we have discussed so far match strings based on their appearance. In contrast, phonetic measures match strings based on their sound. These measures have been especially effective in matching names, since names are often spelled in different ways that sound the same. For example, Meyer, Meier, and Mire sound the same, as do Smith, Smithe, and Smythe. We describe the Soundex similarity measure, which is the most commonly used. We mention extensions of the basic Soundex measure in the bibliographic notes.

Soundex is used primarily to match surnames. It maps a surname x into a four-character code that captures the sound of the name. Two surnames are deemed similar if they share the same code. Mapping x to a code proceeds as follows. We use x = Ashcraft as a running example in our description.

1. Keep the first letter of x as the first letter of the code. The first letter of the code for Ashcraft is A. The following steps are performed on the rest of the string x.

2. Remove all occurrences of W and H. Go over the remaining letters and replace them with digits as follows: replace B, F, P, V with 1; C, G, J, K, Q, S, X, Z with 2; D, T with 3; L with 4; M, N with 5; and R with 6. Note that we do not replace the vowels A, E, I, O, U, and Y. Continuing with our example, we convert Ashcraft into A226a13.

3. Replace each sequence of identical digits by the digit itself. So A226a13 becomes A26a13.

4. Drop all the nondigit letters (except the first one, of course). Then return the first four letters as the Soundex code. So A26a13 becomes A2613, and the corresponding Soundex code is A261.

Thus the Soundex code is always a letter followed by three digits, padded by 0 if there are not enough digits. For example, the soundex code for Sue is S000.
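A Python sketch of the four steps above:

def soundex(name):
    """Soundex code of a surname, following the four steps described above."""
    digit = {}
    for chars, d in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                     ("L", "4"), ("MN", "5"), ("R", "6")]:
        for ch in chars:
            digit[ch] = d
    name = name.upper()
    first, rest = name[0], name[1:]
    # Step 2: remove W and H, then map consonants to digits (vowels survive).
    mapped = [digit.get(c, c) for c in rest if c not in "WH"]
    # Step 3: collapse each run of identical digits into a single digit.
    collapsed = []
    for c in mapped:
        if not (collapsed and c == collapsed[-1] and c.isdigit()):
            collapsed.append(c)
    # Step 4: drop the nondigits, keep three digits, pad with 0 if needed.
    digits = "".join(c for c in collapsed if c.isdigit())
    return first + (digits + "000")[:3]

assert soundex("Ashcraft") == "A261"
assert soundex("Sue") == "S000"
assert soundex("Robert") == soundex("Rupert") == "R163"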

As described, the Soundex measure in effect “hashes” similar sounding consonants (such as B, F, P, and V) into the same digit, thereby mapping similar sounding names into the same soundex code. For example, it maps both Robert and Rupert into R163.

Soundex is not perfect. For example, it fails to map the similar-sounding surnames Gough and Goff, or Jawornicki and Yavornitzky (an Americanized spelling of the former), into the same code. Nevertheless, it is a useful tool that has been widely used to match and index names in applications such as census records, vital records, ship passenger lists, and genealogy databases. While Soundex was designed primarily for Caucasian surnames, it has been found to work well for names of many different origins (such as those appearing in the records of the U.S. Immigration and Naturalization Service). However, it does not work as well for names of East Asian origin, because much of the discriminating power of these names resides in the vowel sounds, which the code ignores.

4.3 Scaling Up String Matching

Once we have selected a similarity measure s (x,y ), the next challenge is to match strings efficiently. Let X and Y be two sets of strings to be matched, and t be a similarity threshold. A naive matching solution would be as follows:

for each string x ∈ X do

 for each string y ∈ Y do

  if s(x, y) ≥ t then return (x,y) as a matched pair

 end for

end for

This O(|X| · |Y|) solution is clearly impractical for large data sets. A more commonly employed solution is based on developing a method FindCands that can quickly find the strings in Y that may match a given string x. Given such a method, we employ the following algorithm:

for each string x ∈ X do

 use method FindCands to find a candidate set Z ⊆ Y

 for each string y ∈ Z do

  if s(x, y) ≥ t then return (x, y) as a matched pair

 end for

end for

This solution, often called a blocking solution, takes time proportional to the total size of the candidate sets, which is far less than O(|X| · |Y|) because FindCands is designed so that finding Z is inexpensive and Z is much smaller than Y. The set Z is often called the umbrella set of x. It should contain all true positives (i.e., all strings in Y that can possibly match x) and as few false positives (i.e., strings in Y that do not match x) as possible.

Clearly, the method FindCands lies at the heart of the above solution, and many techniques have been proposed for it. These techniques are typically based on indexing or filtering heuristics. We now discuss the basic ideas that underlie several common techniques for FindCands. In the following, we explain the techniques using the Jaccard and overlap measures; later we discuss how to extend them to other similarity measures.

4.3.1 Inverted Index Over Strings

This technique first converts each string y ∈ Y into a document and then builds an inverted index over these documents. Given a term t, we can use the index to quickly find the list of documents created from Y that contain t, and hence the strings in Y that contain t.

Figure 4.14(a) shows an example of matching the two sets X and Y. We scan the set Y to build the inverted index shown in Figure 4.14(b). For instance, the index shows that the term area appears in just document 5, but the term lake appears in two documents, 4 and 6.


Figure 4.14 An example of using an inverted index to speed up string matching.

Given a string x ∈ X, the method FindCands uses the inverted index to quickly locate the set of strings in Y that share at least one term with x. Continuing with the above example, given x = {lake, mendota}, we use the index in Figure 4.14(b) to find and merge the ID lists for lake and mendota, obtaining the umbrella set Z = {4, 6}.
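A minimal sketch of this variant of FindCands (strings are represented as token sets; the example data is ours):

from collections import defaultdict

def build_inverted_index(Y):
    """Map each term to the IDs of the strings in Y that contain it."""
    index = defaultdict(set)
    for i, tokens in enumerate(Y):
        for t in tokens:
            index[t].add(i)
    return index

def find_cands(x_tokens, index):
    """Umbrella set: IDs of strings sharing at least one term with x."""
    cands = set()
    for t in x_tokens:
        cands |= index.get(t, set())
    return cands

Y = [{"lake", "forest", "drive"}, {"lake", "mendota", "park"}]
index = build_inverted_index(Y)
print(find_cands({"lake", "mendota"}, index))    # {0, 1}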

This method is clearly much better than naively matching x with all strings in Y. Nevertheless, it still suffers from several limitations. First, the inverted list of a popular term (e.g., a stop word) can be very long, so building and manipulating such lists is quite costly. Second, this method requires enumerating all pairs of strings that share at least one term, and the set of such pairs can still be very large in practice. The techniques described below address these issues.

4.3.2 Size Filtering

This technique retrieves only the strings in Y whose sizes make them match candidates. Specifically, given a string x ∈ X, we infer a constraint on the size of the strings in Y that can possibly match x, then use a B-tree index to retrieve only the strings that satisfy the size constraint.

To derive the constraint on the size of strings in Y, recall that the Jaccard measure is defined as follows (where |Bx| refers to the number of tokens in Bx):

J(x, y) = |Bx ∩ By| / |Bx ∪ By|

First, we can show that

J(x, y) ≤ min(|Bx|, |By|) / max(|Bx|, |By|).   (4.2)

To see why, consider the case where |Bx| ≥ |By|. In this case, clearly max(|Bx|, |By|) = |Bx|, so we only have to prove that J(x, y) ≤ |By| / |Bx|, or equivalently that |Bx ∩ By| · |Bx| ≤ |By| · |Bx ∪ By|. This inequality is true because |Bx ∩ By| ≤ |By| and |Bx| ≤ |Bx ∪ By|. The case where |Bx| < |By| can be proven similarly.

Now let t be the prespecified similarity threshold. If x and y match, then it must be that J(x, y) ≥ t. Together with Equation 4.2, this implies that t ≤ min(|Bx|, |By|) / max(|Bx|, |By|) or, equivalently,

t · |Bx| ≤ |By| ≤ |Bx| / t.   (4.3)

Thus, given a string x ∈ X, we know that only the strings y ∈ Y that satisfy Equation 4.3 can possibly match x.

To illustrate, consider again the string x = {lake, mendota} (the first string in set X in Figure 4.14(a)), for which |Bx| = 2. Suppose t = 0.8. Using the above equation, if y ∈ Y matches x, we must have 1.6 = 0.8 · 2 ≤ |By| ≤ 2 / 0.8 = 2.5, and hence |By| = 2. We can immediately see that no string in the set Y in Figure 4.14(a) satisfies this constraint.

Exploiting the above idea, procedure FindCands builds a B-tree index over the sizes of the strings in Y. Given a string x ∈ X, it uses the index to find the strings in Y that satisfy Equation 4.3 and returns them as the umbrella set Z. This technique is effective when there is significant variability in the number of tokens of the strings in X and Y.
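Here is a sketch using a sorted array and binary search in place of the B-tree (the range lookup over Equation 4.3 is the same; names are ours):

import bisect

def build_size_index(Y):
    """Sort the token sets of Y by size."""
    sorted_Y = sorted(Y, key=len)
    sizes = [len(y) for y in sorted_Y]
    return sizes, sorted_Y

def find_cands_by_size(x_tokens, sizes, sorted_Y, t):
    """Strings y with t * |x| <= |y| <= |x| / t (Equation 4.3)."""
    lo = bisect.bisect_left(sizes, t * len(x_tokens))
    hi = bisect.bisect_right(sizes, len(x_tokens) / t)
    return sorted_Y[lo:hi]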

4.3.3 Prefix Filtering

The basic idea underlying this technique is that if two sets share many terms, then large subsets of them also must share terms. Using this principle, we can reduce the number of candidate strings that may match a string x.

We first explain this technique using the overlap similarity measure and then extend it to the Jaccard measure. Suppose that x and y are strings (viewed as sets of tokens) whose overlap is |x ∩ y| ≥ k. Then it is easy to see that any subset x′ ⊆ x of size at least |x| − (k − 1) must overlap y. For example, consider the sets x = {lake, monona, area} and y = {lake, mendota, monona, area} in Figure 4.15(a). We have |x ∩ y| = 3, so in particular |x ∩ y| ≥ k for k = 2. Thus, the subset x′ = {lake, monona} in Figure 4.15(a), of size |x| − (k − 1) = 2, overlaps y (as does any other subset of size 2 of x).


Figure 4.15 An example of using prefix filtering to speed up string matching.

We can exploit this idea in procedure FindCands as follows. Suppose we want to find all pairs (x, y) with overlap O(x, y) ≥ k (recall that O(x, y) is the overlap similarity measure). Given a particular set x, we construct a subset x′ of size |x| − (k − 1), and use an inverted index to find all sets y that overlap x′. Figures 4.15(b-c) illustrate this idea. Suppose we want to match strings in the sets X and Y of Figure 4.15(b), using k = 2. We begin by building an inverted index over the strings in set Y, as shown in Figure 4.15(c). Next, given a string such as x1 = {lake, mendota}, we take the “prefix” of size |x1| − (k − 1) = 1, which is {lake} in this case, and let that be the set x1′. We then use the inverted index to find all strings in Y that contain at least one token in x1′. Note that if we instead used the inverted index to find all strings in Y that contain at least one token of the entire set x1, we would end up with a larger candidate set. Thus, restricting the index lookup to just a subset of x can significantly reduce the size of the resulting set.

Selecting the Subset Intelligently

So far we have arbitrarily selected a subset x′ of x and checked its overlap with the entire set y. We can do even better by selecting a particular subset x′ of x and checking its overlap with only a particular subset y′ of y. Specifically, suppose we have imposed an ordering over the universe of all possible terms. For example, we can order terms in increasing order of their frequency, as computed over the union of the sets X and Y of Figure 4.15(b). Figure 4.16(a) shows all terms found in X and Y in this order.


Figure 4.16 We can build an inverted index over only the prefixes of strings in Y, then use this index to perform string matching.

We reorder the terms in each set x ∈ X and y ∈ Y according to this ordering. Figure 4.16(b) shows the reordered sets X and Y. Given a reordered set x, we refer to the subset x′ that contains the first n terms of x as the prefix of size n (or the nth prefix) of x. For example, the 2nd prefix of x3 = {dane, mendota, monona, lake} is {dane, mendota}. Given the above, we can establish the following property:

Proposition 4.1

Let x and y be two sets such that |x ∩ y| ≥ k. Let x′ be the prefix of size |x| − (k − 1) of x, and let y′ be the prefix of size |y| − (k − 1) of y. Then x′ and y′ overlap.

Proof

Let x″ be the suffix of size k − 1 of x, that is, the rest of x after the prefix x′ (see Figure 4.16(c)); clearly x = x′ ∪ x″. Similarly, let y″ be the suffix of size k − 1 of y. Now suppose x′ and y′ do not overlap. Then we must have x′ ∩ y″ ≠ ∅ (otherwise x′ would not overlap y at all, and since x″ contains only k − 1 elements, |x ∩ y| ≥ k could not hold). So there exists an element u such that u ∈ x′ and u ∈ y″. Similarly, there exists an element v such that v ∈ y′ and v ∈ x″. Since u ∈ x′ and v ∈ x″, the element u precedes v in the term ordering. But since v ∈ y′ and u ∈ y″, the element v also precedes u in the ordering, a clear contradiction. Thus, x′ overlaps y′.

Given the above property, we can revise procedure FindCands as follows. Suppose again that we consider an overlap of at least k (that is, O(x, y) ≥ k).

First, we reorder the terms in each set x ∈ X and y ∈ Y in increasing order of their frequency (as shown in Figure 4.16(b)).

Second, for each y ∈ Y, we create y′, the prefix of size |y| − (k − 1) of y.

Third, we build an inverted index over all the prefixes y′. Figure 4.17(a) shows this index for our example, assuming k = 2.

Finally, for each x ∈ X, we create x′, the prefix of size |x| − (k − 1) of x, then use the above inverted index to find all sets y ∈ Y such that x′ overlaps y′.


Figure 4.17 The inverted indexes over (a) the prefixes of size |y| − (k − 1) of all y ∈ Y and (b) the entire sets y ∈ Y. The former is often significantly smaller than the latter in practice.

Consider for example x = {mendota, lake} (listed here in increasing order of frequency), and therefore x′ = {mendota}. Using mendota to look up the inverted index in Figure 4.17(a) yields y6. Thus, FindCands returns y6 as the sole candidate that may match x. Note that if we instead checked the overlap between x′ and the entire sets y, then y7 would also be returned. Thus, checking the overlap between prefixes can reduce the size of the resulting set; in practice, this reduction can be quite significant.

It is also important to note that the size of the inverted index is much smaller. For comparison purposes, Figure 4.17(b) shows the inverted index for the entire sets y ∈ Y (reproduced from Figure 4.15(c)). The index we create here does not contain an entry for the term lake, and its index list for mendota is also smaller than the corresponding list in the entire-sets index.
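The following sketch implements these steps for the overlap measure (strings are token sets; names are ours):

from collections import Counter, defaultdict

def prefix_filter_cands(X, Y, k):
    """Candidate pairs (i, j) whose overlap may reach k, via prefix filtering."""
    freq = Counter(t for s in X + Y for t in s)
    def ordered(s):                         # rare terms first, ties by name
        return sorted(s, key=lambda t: (freq[t], t))
    index = defaultdict(set)                # inverted index over the prefixes
    for j, y in enumerate(Y):
        for t in ordered(y)[:max(len(y) - (k - 1), 0)]:
            index[t].add(j)
    cands = set()
    for i, x in enumerate(X):
        for t in ordered(x)[:max(len(x) - (k - 1), 0)]:
            cands.update((i, j) for j in index[t])
    return cands

X = [{"lake", "mendota"}]
Y = [{"lake", "mendota", "monona"}, {"lake", "forest"}]
print(prefix_filter_cands(X, Y, k=2))       # {(0, 0)}; (0, 1) is pruned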

Applying Prefix Filtering to the Jaccard Measure

The following equation enables us to extend the prefix filtering method to the Jaccard measure:

J(x, y) ≥ t ⟺ O(x, y) ≥ α, where α = (t / (1 + t)) · (|Bx| + |By|).   (4.4)

The equation shows how to convert the Jaccard measure into the overlap measure, except for one detail: the threshold α as defined above is not a constant, but depends on |Bx| and |By|. Thus, we cannot directly build the inverted index over the prefixes of the sets By ∈ Y using α. To address this, we index the “longest” prefixes that could be needed. In particular, it can be shown that we only have to index the prefixes of length ⌊(1 − t) · |By|⌋ + 1 of the sets By to ensure that we do not miss any correct matching pairs.

4.3.4 Position Filtering

Position filtering further limits the set of candidate matches by deriving an upper bound on the size of the overlap between a pair of strings. As an example, consider the two strings x = {dane, area, mendota, monona, lake} and y = {research, dane, mendota, monona, lake}, and suppose we are matching with Jaccard threshold t = 0.8. In prefix filtering we will index the prefix of length ⌊(1 − t) · |By|⌋ + 1 = 2 of y, which is y′ = {research, dane} in this case. Similarly, the prefix of length 2 of x is x′ = {dane, area}. Since x′ overlaps y′, prefix filtering will return the pair (x, y) as a candidate pair.

However, we can do better than this. Let x″ be the rest of x, after x′, and similarly let y″ be the rest of y, after y′. Then it is easy to see that

O(x, y) ≤ |x′ ∩ y′| + min(|x″|, |y″|).   (4.5)

Applying this inequality to the above example, we have O(x, y) ≤ 1 + min(3, 3) = 4. However, using Equation 4.4, J(x, y) ≥ 0.8 requires O(x, y) ≥ (0.8 / 1.8) · (5 + 5) ≈ 4.44. Hence, we can immediately discard the pair (x, y) from the set of candidate matches. More generally, position filtering combines the constraints of Equations 4.4 and 4.5 to further reduce the set of candidate matches.

4.3.5 Bound Filtering

Bound filtering is an optimization for computing the generalized Jaccard similarity measure. Recall from Section 4.2.3 that the generalized Jaccard measure computes the normalized weight of the maximum-weight matching M in the bipartite graph connecting x and y:

GJ(x, y) = Σ_{(xi, yj) ∈ M} s(xi, yj) / (|Bx| + |By| − |M|)

In this equation, s is a secondary similarity measure, Bx is the set of tokens that corresponds to x, and By is the set that corresponds to y.

Computing GJ(x, y) in a straightforward fashion would require computing the maximum-weight matching M in the bipartite graph, which can be very expensive. To address this problem, given a pair (x, y) we compute an upper bound UB(x, y) and a lower bound LB(x, y) on GJ(x, y). FindCands uses these bounds as follows: if UB(x, y) < t, then we can ignore (x, y), as it cannot be a match; if LB(x, y) ≥ t, then we return (x, y) as a match. Otherwise, we compute GJ(x, y) itself.

The upper and lower bounds are computed as follows. First, for each element xi ∈ Bx, we find an element yj ∈ By with the highest element-level similarity such that s(xi, yj) ≥ α (recall from the description of GJ(x, y) that we consider only matches between elements xi ∈ Bx and yj ∈ By with s(xi, yj) ≥ α). Let S1 be the set of all such pairs.

For example, consider the two strings x and y together with the similarity scores between their elements in Figure 4.18(a) (reproduced from Figure 4.12(a)). Figure 4.18(b) shows the resulting set S1. Note that for one element of Bx there is no element in By whose similarity score equals or exceeds α, which is 0.5 in this case, so that element contributes no pair to S1.


Figure 4.18 An example of computing an upper and lower bound for the generalized Jaccard measure.

Similarly, for each element yj ∈ By, we find an element xi ∈ Bx with the highest element-level similarity such that s(xi, yj) ≥ α. Let S2 be the set of all such pairs. Continuing with our example, Figure 4.18(c) shows the resulting set S2.

The upper bound for GJ(x, y) is given by the following formula:

UB(x, y) = Σ_{(xi, yj) ∈ S1 ∪ S2} s(xi, yj) / (|Bx| + |By| − min(|S1|, |S2|))

Note that the numerator of UB(x, y) is at least as large as that of GJ(x, y) (each pair of M is dominated by the best-scoring pair of its xi in S1), and that the denominator of UB(x, y) is no larger than that of GJ(x, y) (since |M| ≤ min(|S1|, |S2|)). The lower bound is given by the following formula:

LB(x, y) = Σ_{(xi, yj) ∈ S1 ∩ S2} s(xi, yj) / (|Bx| + |By| − |S1 ∩ S2|)

Intuitively, S1 ∩ S2 consists of the mutually best pairs; these pairs form a matching in G, so their normalized weight cannot exceed GJ(x, y).

Continuing with our example, UB(x, y) and LB(x, y) are computed from the sets S1 and S2 shown in Figures 4.18(b-c).

4.3.6 Extending Scaling Techniques to Other Similarity Measures

So far we have discussed scaling techniques for the Jaccard and overlap measures. We now describe how to extend these techniques to other similarity measures. First, as noted earlier, we can easily prove that

J(x, y) = O(x, y) / (|Bx| + |By| − O(x, y))

by replacing |Bx ∪ By| in J(x, y) with |Bx| + |By| − |Bx ∩ By|. Thus, if a technique works for the overlap measure O(x, y), there is a good chance that we can also extend it to work for the Jaccard measure J(x, y), and vice versa. For example, earlier we described how to extend the prefix filtering technique, originally developed for the overlap measure, to the Jaccard measure.

In general, a promising way to extend a technique T to work for a similarity measure s(x, y) is to translate s(x, y) into constraints on a similarity measure that already works well with T. For example, consider edit distance. Let d(x, y) be the edit distance between x and y, and let Bx and By be the corresponding q-gram sets of x and y. Then we can show that

d(x, y) ≤ ε implies O(x, y) ≥ max(|Bx|, |By|) − q · ε,

because a single edit operation can destroy at most q q-grams. Given the above constraint, we can extend prefix filtering to work with edit distance by indexing the prefixes of size q · ε + 1.

As yet another example, consider the TF/IDF cosine similarity C(x, y). Viewing Bx and By as sets, we can show that

C(x, y) ≥ t implies O(x, y) ≥ t · √(|Bx| · |By|).

Given this, we can extend prefix filtering to work with C(x, y) by indexing the prefixes of size ⌊(1 − t²) · |By|⌋ + 1 (this can be further optimized to just indexing the prefixes of size ⌊(1 − t) · |By|⌋ + 1). Finally, the above constraints can also help us extend position filtering to work with the edit distance and cosine similarity measures.

Bibliographic Notes

Durbin et al. [196] provide an excellent description of the various edit distance algorithms, together with HMM-based probabilistic interpretations of these algorithms. Further discussion of string similarity measures and string matching can be found in [146, 204, 280, 370, 455]. Tutorials on string matching include [355, 563]. The Web site [118] describes numerous string similarity measures and provides open-source implementations.

Edit distance was introduced in [376]. The basic dynamic programming algorithm for computing edit distance is described in [455]. Variations of edit distance include Needleman-Wunsch [456], affine gap [566], Smith-Waterman [526], Jaro [331], and Jaro-Winkler [575]. Learning the parameters of edit distance and related similarity measures was discussed in [84, 86, 497].

The Jaccard measure was introduced in [329]. The notion of TF/IDF originated in the information retrieval community [414], and TF/IDF-based string similarity measures are discussed in [31, 120, 145, 264, 352]. Soft TF/IDF was introduced in [86, 148]. Generalized Jaccard was introduced in [469]. The Monge-Elkan hybrid similarity measure was introduced in [442]. Cohen et al. [86, 148] empirically compare the effectiveness of string similarity measures over a range of matching tasks.

The Soundex measure was introduced in [500, 501]. Other phonetic similarity measures include New York State Identification and Intelligence System (NYSIIS) [537], Oxford Name Compression Algorithm (ONCA) [250], Metaphone [477], and Double Metaphone [478].

The material on scaling up string matching in Section 4.3 was adapted from [370]. Building inverted indexes to scale up string matching was discussed in [508]. The technique of size filtering was discussed in [34]. Prefix indexes were introduced in [124]. Bayardo et al. [59] discuss how to combine these indexes with inverted indexes to further scale up string matching. On et al. [469] discuss bound filtering, and Xiao et al. [582] discuss position indexes.

Gravano et al. [265] discuss q-gram-based string matching in RDBMSs. Koudas, Marathe, and Srivastava [354] discuss the trade-offs between accuracy and performance. Vernica, Carey, and Li [559] discuss string matching in the MapReduce framework. Further techniques to scale up string matching are discussed in [388, 423, 548].
